Bug in batch size calculation introduced recently: when there are many workers but few tasks, the task is not properly split between the workers which results in idle workers #1635
Comments
It may be a 'bug' or a 'feature'. I have only observed that 'batch size' has grown. RAF Fenix aka Technologov |
not a bug, just a consideration of what to optimize for. |
Idle workers that appear only because the task is split among too few workers, due to large batches, were probably not intended behavior. |
There is also #1627 |
Can you add a config option that sets the maximum number of games per batch, to mitigate this bug where overly large batches leave workers idle? I have batches as large as 16886 games, so all the games in the test are split among too few workers, and I cannot add more workers: they don't take this test because all the games are already split among the existing workers. 16886 games at TC 160+1.6 take too long to complete on a 64-core worker, so even when there are enough workers, they sit idle. |
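The requested option could look roughly like the sketch below. It is only an illustration: `max_games_per_batch` is a hypothetical setting name, and nothing here is taken from the fishtest code.

```python
def capped_batch_size(computed_batch_size, max_games_per_batch=None):
    """Clamp a computed batch size to an optional upper limit.

    max_games_per_batch is a hypothetical option; None means no cap.
    """
    if max_games_per_batch is None:
        return computed_batch_size
    if max_games_per_batch < 1:
        raise ValueError("max_games_per_batch must be at least 1")
    return min(computed_batch_size, max_games_per_batch)


# 16886 is the batch size reported in the comment above; 2000 is an arbitrary cap.
print(capped_batch_size(16886, max_games_per_batch=2000))
print(capped_batch_size(16886))
```

With a cap in place, the same number of games would be spread over more batches, which is the effect the comment asks for.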
Wouldn't this be solved by the PR that increases the maximum number of games for an SPRT test? |
Note: the number of games after which an SPRT test is stopped and the number used for task size computation do not have to be the same (they currently are). |
Fixed by #1659 |
I have an SPSA test (not an SPRT) with 3000000 games, and the workers are "greedy", so they took all the games and I cannot add more workers because they would be idle, unless I use an option implemented in #1659 |
For an SPSA test, I cannot increase the number of games. According to https://github.com/glinscott/fishtest/wiki/Creating-my-first-test#spsa-tests |
I would propose to close this. It's not clear what should be done, if anything. |
I agree this is probably only a corner case. Usually there are enough tests to distribute workers |
This is not a corner case. I have been getting this message all day today ("No tasks available at this time, waiting..."):
This did not happen in the past, as there was always work to do. Now the worker is getting "No tasks available at this time, waiting..." |
The current workers are too "greedy". They "reserve" too much work so that they do not leave work to the other workers. |
nodestime is not supported at the moment. There is a pending PR, written by me, that fixes this and still needs to be reviewed.
How are your workers configured? Depending on # of cores or memory size, some tests will not be allocated to workers. |
At least one worker, e85c97d3, currently does not receive tasks due to the bug in batch size calculation, so the bug originally reported on Apr 27, 2023 still exists. |
Another solution would be for the Fishtest server to count idle workers and display that number alongside the total number of workers. Then, if we notice that the number of idle workers is frequently above zero, we can adjust the batch size formula accordingly. Another useful feature for the diagnostics is when the server sends to the worker the message |
You are really exaggerating the seriousness of this issue. Depending on the number of submitted tests, a certain number of tasks is available. If there are more workers than tasks (during a fleet visit) then some workers will be idle sometimes. But this is a luxury problem. In 99% of the cases, the problem is the opposite: too few workers instead of too many. So IMHO it is not worth spending a lot of effort on this. One possibility is to raise the number of games for SPRT tests (or abolish the limit) but then we risk hitting the maximum document size in mongod. |
A bug was introduced in /worker/games.py in one of the latest commits: when there are many workers but few tasks, a task is not properly split between the workers, which results in idle workers. The tasks are split into batches that become too large, so they take a very long time to process while plenty of idle workers are available. Please consider reverting the code, or make this option configurable on the server and transmitted to the workers.
batch_size = games_concurrency * 4 * max(1, round(tc_limit_ltc / tc_limit))
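To see what the extra factor does, here is a minimal sketch that evaluates the quoted expression with and without the max(1, round(tc_limit_ltc / tc_limit)) term. Only the expression comes from the line above; the input values are made up for illustration and are not taken from the worker code.

```python
def batch_size_with_tc_factor(games_concurrency, tc_limit_ltc, tc_limit):
    """Expression quoted in this issue."""
    return games_concurrency * 4 * max(1, round(tc_limit_ltc / tc_limit))


def batch_size_without_tc_factor(games_concurrency):
    """Same expression with the TC factor dropped."""
    return games_concurrency * 4


# Hypothetical inputs, chosen only for illustration.
games_concurrency = 16
tc_limit_ltc = 400.0  # assumed reference limit
tc_limit = 100.0      # assumed limit for the test's time control

with_factor = batch_size_with_tc_factor(games_concurrency, tc_limit_ltc, tc_limit)
without_factor = batch_size_without_tc_factor(games_concurrency)
print(with_factor, without_factor)
```

Because the factor is clamped with max(1, ...), it can never shrink the batch: it either leaves the size unchanged or multiplies it by the rounded ratio.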
P.S. The issue of overly large batch sizes was also noticed by @Technologov
The bug is reproducible with SPRT tests, but you can probably also reproduce it with an SPSA test: create an SPSA task with nodestime=600 and 160+1.6, as described in the fishtest FAQ at https://github.com/glinscott/fishtest/wiki/Creating-my-first-test#spsa-tests; set num_games to 3000000 and have 1000 worker machines. It will allocate fewer than 600 machines, and the other machines will be idle unless there are other tasks. With just that single task, the remaining machines will be idle. However, with the old version of this code, when there was no max(1, round(tc_limit_ltc / tc_limit)), all machines were allocated.
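The arithmetic behind this scenario can be sketched as follows. The game count and the number of machines come from the scenario above; the per-worker allocation of 5200 games is an assumed figure picked for illustration, since the report does not state the actual value.

```python
import math


def workers_with_work(num_games, games_per_worker, num_workers):
    """Number of workers that receive games when each takes games_per_worker."""
    needed = math.ceil(num_games / games_per_worker)
    return min(num_workers, needed)


num_games = 3_000_000
num_workers = 1000
games_per_worker = 5200  # hypothetical allocation per worker

busy = workers_with_work(num_games, games_per_worker, num_workers)
idle = num_workers - busy
print(busy, idle)
```

Under these assumptions, any allocation above 3000 games per worker leaves some of the 1000 machines without work when this is the only test running.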