Skip to content
This repository was archived by the owner on Feb 20, 2020. It is now read-only.

Conversation

@aerickson
Copy link
Contributor

@aerickson aerickson commented Oct 23, 2019

g-w should exit if remainingTasks is less than or equal to 0.

Bitbar workers recently had an issue (due to our homegrown supersede detection) where we had a bad worker that failed ~180 tasks because g-w continued to work and didn't exit due to this. We configure g-w to exit after 1 task. We didn't use to delete the completed job counter, so remainingTasks was < 0 and g-w continued to run until it exceeded Bitbar's allowed container run time.

related:

@aerickson aerickson requested a review from petemoore as a code owner October 23, 2019 22:24
@petemoore
Copy link
Member

Hi Andrew,

Many thanks for the PR! We're slap bang in the middle of a migration at the moment which is top priority, so it may be a while before I get back to this, but I appreciate the PR.

I think it may be time for us to make numberOfTasksToRun a disallowed property when the engine is already limited to running a single task before exiting.

At the moment, reboots are required between tasks in the multiuser engine, but at some point we'll have support for headless tasks on multiuser that don't require a graphical interface, and then multiuser engine should be able to run multiple tasks at a time without exiting between them, so then the numberOfTasksToRun config setting is useful. But in that case it would not need to persist this information to a file, so maybe we can abandon reading/writing the tasks-resolved-count.txt file entirely.

This is a throw back to when we didn't have a notion of "engine" in generic-worker and each platform operated slightly differently. In order for numberOfTasksToRun to make sense on Windows where only a single task could run at a time, I chose to persist the total in the tasks-resolved-count.txt file, so you could say "run 100 tasks and then kill the machine" and the worker would run tasks with reboots between, but after 100 tasks had been run, it would have a different exit code to say that the worker had completed everything it needed to do:

    0      Tasks completed successfully; no more tasks to run (see config setting
           numberOfTasksToRun).
...
...
    67     A task user has been created, and the generic-worker needs to reboot in order
           to log on as the new task user. Note, the reboot happens automatically unless
           config setting disableReboots is set to true - in either code this exit code will
           be issued.

But it has caused a lot of confusion, so I think the time to rethink it a little is here.

I'd like to park this while we have our platform migration going on, so it won't be until later in November that I will have much time to come back to this, but it sounds like you have a workaround in place (deleting tasks-resolved-count.txt) so hopefully this won't block you.

Thanks again!

@petemoore petemoore removed their request for review November 18, 2019 15:48
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants