Bug in batch size calculation introduced recently: when there are many workers but few tasks, the task is not properly split between the workers which results in idle workers #1635

Closed
maximmasiutin opened this issue Apr 26, 2023 · 19 comments · May be fixed by #1659

Comments

@maximmasiutin
Contributor

maximmasiutin commented Apr 26, 2023

A bug was introduced in /worker/games.py in one of the latest commits: when there are many workers but few tasks, a task is not properly split between the workers, which leaves workers idle. The tasks are split into batches that are too large, so they take a very long time to process while plenty of idle workers are available. Please consider reverting the code, or making this option configurable on the server and transmitted to the workers.

batch_size = games_concurrency * 4 * max(1, round(tc_limit_ltc / tc_limit))

     # Adjust CPU scaling.
-    _, tc_limit_ltc = adjust_tc("60+0.6", factor)
     scaled_tc, tc_limit = adjust_tc(run["args"]["tc"], factor)
     scaled_new_tc = scaled_tc
     if "new_tc" in run["args"]:
@@ -1313,9 +1312,7 @@ def run_games(worker_info, password, remote, run, task_id, pgn_file):
         tc_limit *= 2

     while games_remaining > 0:
-        # Update frequency for NumGames/SPSA test:
-        # every 4 games at LTC, or a similar time interval at shorter TCs
-        batch_size = games_concurrency * 4 * max(1, round(tc_limit_ltc / tc_limit))
+        batch_size = games_concurrency * 4  # update frequency

P.S. The overly large batch sizes were also noticed by @Technologov.

The bug is reproducible with SPRT tests, but you can probably also reproduce it with an SPSA test: create an SPSA task with nodestime=600 and 160+1.6 as described in the fishtest FAQ at https://github.com/glinscott/fishtest/wiki/Creating-my-first-test#spsa-tests, set num_games to 3000000, and have 1000 worker machines. It will allocate fewer than 600 machines; with just that single task, the remaining machines will be idle. With the old version of this code, which did not have max(1, round(tc_limit_ltc / tc_limit)), all machines were allocated.
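To make the effect of the scaling factor concrete, here is a minimal sketch that evaluates both versions of the formula quoted in the diff above. The function names and the numeric limits are hypothetical, chosen only for illustration; the real values come from adjust_tc, whose internals are not shown in this issue.

```python
def batch_size_without_scaling(games_concurrency: int) -> int:
    """Formula without the TC factor: games_concurrency * 4."""
    return games_concurrency * 4


def batch_size_with_scaling(
    games_concurrency: int, tc_limit_ltc: float, tc_limit: float
) -> int:
    """Formula with the TC factor, as quoted in this issue."""
    return games_concurrency * 4 * max(1, round(tc_limit_ltc / tc_limit))


# Hypothetical limits: the LTC reference limit is 6x the limit of the test's TC.
print(batch_size_without_scaling(16))             # 64
print(batch_size_with_scaling(16, 600.0, 100.0))  # 384
# When the test's TC is longer than the LTC reference, max(1, ...) keeps the factor at 1.
print(batch_size_with_scaling(16, 600.0, 1600.0))  # 64
```

Under these assumed inputs, the factor only ever increases the batch size, and it grows as the time control gets shorter.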

@Technologov

It may be a 'bug' or a 'feature'. I have only observed that the 'batch size' has grown.
What should the proper 'batch size' be?

RAF Fenix aka Technologov

@vondele
Member

vondele commented Apr 27, 2023

not a bug, just a consideration of what to optimize for.

@maximmasiutin
Contributor Author

not a bug, just a consideration of what to optimize for.

The appearance of idle workers, just because the task is split among too few workers due to large batches, was probably not intended behavior.

@maximmasiutin
Contributor Author

See attached screenshot: 2338 games per batch for 16 cores, 3496 games for 24 cores -- that's just too much for 10+0.1s.

[Screenshot: cores]

@vondele
Member

vondele commented Apr 27, 2023

There is also #1627
Creating many small tasks is not a good idea, we've been there and done that.

@maximmasiutin
Contributor Author

maximmasiutin commented May 3, 2023

Can you add a config option to set the maximum number of games per batch, to mitigate this bug that causes idle workers because of overly large batches? I have batches as large as 16886 games, so all the games in the test are split between too few workers, and I cannot add additional workers: they don't take this test because all the games are already split among the existing workers. 16886 games at TC 160+1.6 take too long to complete on a 64-core worker, so even when there are enough workers, they are idle.
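The arithmetic behind this request can be sketched as follows. This is only an illustration under simplifying assumptions: every worker holds exactly one task, all tasks have the same size, and max_games_per_task is a hypothetical option name, not an existing fishtest setting and not necessarily what #1659 implements.

```python
import math


def tasks_available(num_games: int, games_per_task: int) -> int:
    """Number of tasks (and so busy workers, one task each) a test can supply."""
    return math.ceil(num_games / games_per_task)


def capped_task_size(games_per_task: int, max_games_per_task: int) -> int:
    """Apply a hypothetical upper bound on the task size."""
    return min(games_per_task, max_games_per_task)


num_games = 3_000_000  # SPSA test size mentioned in this issue
observed = 16_886      # largest task size reported above

print(tasks_available(num_games, observed))                           # 178
print(tasks_available(num_games, capped_task_size(observed, 2_000)))  # 1500
```

With the assumed cap of 2000 games, the same test could occupy far more workers, at the cost of creating many more tasks.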

@vdbergh
Contributor

vdbergh commented May 3, 2023

Wouldn't this be solved by the PR that increases the maximum number of games for an SPRT test?

@vdbergh
Contributor

vdbergh commented May 3, 2023

Note: the number of games after which an SPRT test is stopped and the number used for the task size computation do not have to be the same (currently they are).

@maximmasiutin
Contributor Author

Fixed by #1659

@maximmasiutin
Contributor Author

SPRT

I have an SPSA test (not an SPRT test) with 3000000 games, and the workers are "greedy": they took all the games, so I cannot add more workers because they would be idle, unless I use the option implemented in #1659.

@maximmasiutin
Contributor Author

Wouldn't this be solved by the PR that increases the maximum number of games for an SPRT test?

For an SPSA test, I cannot increase the number of games. According to https://github.com/glinscott/fishtest/wiki/Creating-my-first-test#spsa-tests
Note: Do not modify the number of games in an SPSA test while it's running. This breaks the algorithm.

@vdbergh
Contributor

vdbergh commented Jul 4, 2024

I would propose to close this. It's not clear what should be done, if anything.

@vondele
Member

vondele commented Jul 4, 2024

I agree this is probably only a corner case. Usually there are enough tests to distribute workers

@vondele closed this as completed Jul 4, 2024
@maximmasiutin
Contributor Author

This is not a corner case. I am now experiencing such a message the whole day today ("No tasks available at this time, waiting..."):

Current time is 2024-07-06 17:12:24.609797+00:00 UTC (local offset: +02:00)
Verify worker version...
Post request http://tests.stockfishchess.org:80/api/request_version handled in 39.89ms (server: 0.16ms)
Remaining number of GitHub api calls = 5000
Fetching task...
Post request http://tests.stockfishchess.org:80/api/request_task handled in 37.84ms (server: 3.40ms)
No tasks available at this time, waiting...
Waiting 30.0 seconds before retrying
Current time is 2024-07-06 17:12:54.874769+00:00 UTC (local offset: +02:00)
Verify worker version...
Post request http://tests.stockfishchess.org:80/api/request_version handled in 35.62ms (server: 0.15ms)
Remaining number of GitHub api calls = 5000
Fetching task...
Post request http://tests.stockfishchess.org:80/api/request_task handled in 35.70ms (server: 1.88ms)
No tasks available at this time, waiting...
Waiting 60.0 seconds before retrying
Current time is 2024-07-06 17:13:55.139832+00:00 UTC (local offset: +02:00)
Verify worker version...
Post request http://tests.stockfishchess.org:80/api/request_version handled in 36.91ms (server: 0.15ms)
Remaining number of GitHub api calls = 5000
Fetching task...
Post request http://tests.stockfishchess.org:80/api/request_task handled in 35.79ms (server: 1.56ms)
No tasks available at this time, waiting...
Waiting 120.0 seconds before retrying
  Send heartbeat for e85c97d3-22ba-46d2-abb5-9c706328bea3 ... Skipping heartbeat ...

This did not happen in the past, as there was always work to do.

Now the worker is getting "No tasks available at this time, waiting..."

@maximmasiutin
Contributor Author

The current workers are too "greedy". They "reserve" too much work so that they do not leave work to the other workers.

@MinetaS
Contributor

MinetaS commented Jul 6, 2024

The bug is reproducible with SPRT tests, but you can probably also reproduce it with an SPSA test: create an SPSA task with nodestime=600 and 160+1.6 as described in the fishtest FAQ at https://github.com/glinscott/fishtest/wiki/Creating-my-first-test#spsa-tests

nodestime is not supported at the moment. There is a pending PR of mine to fix this, which still needs to be reviewed.

This is not a corner case. I am now experiencing such a message the whole day today ("No tasks available at this time, waiting..."):

How are your workers configured? Depending on the number of cores or the memory size, some tests will not be allocated to a worker.

@maximmasiutin
Contributor Author

At least one worker, e85c97d3, currently receives no tasks because of the bug in the batch size calculation, so the problem described in the initial bug report (submitted on Apr 27, 2023) still exists.

@maximmasiutin
Contributor Author

Another solution would be for the Fishtest server to count idle workers and display that number alongside the total number of workers. Then, if we notice that the number of idle workers is frequently above zero, we can adjust the batch size formula accordingly. Another useful diagnostic feature: when the server sends the worker the message No tasks available at this time, waiting..., it could additionally send the number of workers and tasks, to be logged.
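A rough sketch of what such a diagnostic could look like is given below. Everything here is hypothetical: the Worker record, its fields, and the idle_summary helper are invented for illustration and do not reflect the actual Fishtest server data model or API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Worker:
    """Hypothetical worker record; task_id is None while the worker is idle."""
    name: str
    task_id: Optional[str]


def idle_summary(workers: List[Worker], open_tasks: int) -> str:
    """Build the 'no tasks' message extended with worker and task counts."""
    idle = sum(1 for w in workers if w.task_id is None)
    return (
        "No tasks available at this time, waiting... "
        f"(workers: {len(workers)}, idle: {idle}, open tasks: {open_tasks})"
    )


workers = [Worker("w1", "task-a"), Worker("w2", None), Worker("w3", None)]
print(idle_summary(workers, open_tasks=0))
# No tasks available at this time, waiting... (workers: 3, idle: 2, open tasks: 0)
```

Logging these counts on the worker side would show whether idle periods coincide with a shortage of tasks or with tasks being held by other workers.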

@vdbergh
Contributor

vdbergh commented Jul 8, 2024

You are really exaggerating the seriousness of this issue. Depending on the number of submitted tests, a certain number of tasks is available. If there are more workers than tasks (during a fleet visit) then some workers will be idle sometimes. But this is a luxury problem. In 99% of the cases, the problem is the opposite: too few workers instead of too many. So IMHO it is not worth spending a lot of effort on this.

One possibility is to raise the number of games for SPRT tests (or abolish the limit) but then we risk hitting the maximum document size in mongod.
