Bug in batch size calculation introduced recently: when there are many workers but few tasks, the task is not properly split between the workers which results in idle workers #1635

maximmasiutin · 2023-04-26T22:14:37Z

A bug was introduced in /worker/games.py in one of the latest commits: when there are many workers but few tasks, a task is not properly split between the workers, which leaves workers idle. The tasks are split into batches that are too large, so they take a very long time to complete while plenty of idle workers are available. Please consider reverting the code, or make this option configurable on the server and transmitted to the workers.

batch_size = games_concurrency * 4 * max(1, round(tc_limit_ltc / tc_limit))

     # Adjust CPU scaling.
-    _, tc_limit_ltc = adjust_tc("60+0.6", factor)
     scaled_tc, tc_limit = adjust_tc(run["args"]["tc"], factor)
     scaled_new_tc = scaled_tc
     if "new_tc" in run["args"]:
@@ -1313,9 +1312,7 @@ def run_games(worker_info, password, remote, run, task_id, pgn_file):
         tc_limit *= 2

     while games_remaining > 0:
-        # Update frequency for NumGames/SPSA test:
-        # every 4 games at LTC, or a similar time interval at shorter TCs
-        batch_size = games_concurrency * 4 * max(1, round(tc_limit_ltc / tc_limit))
+        batch_size = games_concurrency * 4  # update frequency
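
To make the difference between the two formulas concrete, here is a minimal sketch that only reproduces the arithmetic of the lines quoted above. The function names and the input values are hypothetical; how `adjust_tc` derives `tc_limit` and `tc_limit_ltc` is not shown in this issue, so they are passed in as plain numbers.

```python
def batch_size_scaled(games_concurrency: int, tc_limit_ltc: float, tc_limit: float) -> int:
    """Batch size with the time-control scaling factor, as quoted in the issue."""
    return games_concurrency * 4 * max(1, round(tc_limit_ltc / tc_limit))


def batch_size_fixed(games_concurrency: int) -> int:
    """Batch size without the scaling factor."""
    return games_concurrency * 4


if __name__ == "__main__":
    # Hypothetical inputs: 16 concurrent games, and an LTC limit
    # six times larger than the limit of the test being run.
    concurrency = 16
    print(batch_size_scaled(concurrency, tc_limit_ltc=600.0, tc_limit=100.0))  # 384
    print(batch_size_fixed(concurrency))  # 64
```

With these assumed inputs the scaling factor multiplies the batch size by 6. When the test's limit is at least as long as the LTC limit, `max(1, ...)` keeps the factor at 1 and both formulas agree.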

P.S. The issue of batch sizes being too large was also noticed by @Technologov.

The bug is reproducible with SPRT tests, but you can probably also reproduce it with an SPSA test: create an SPSA task with nodestime=600 and 160+1.6 as described in the fishtest FAQ at https://github.com/glinscott/fishtest/wiki/Creating-my-first-test#spsa-tests, set num_games to 3000000, and have 1000 worker machines. Fewer than 600 machines will be allocated; with just that single task, the remaining machines will be idle unless there are other tasks. However, with the old version of this code, which did not have the max(1, round(tc_limit_ltc / tc_limit)) factor, all machines were allocated.
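
A rough back-of-the-envelope model of the scenario above, assuming (as a simplification not taken from the fishtest code) that every task has the same size and each worker holds exactly one task. The task size of 5200 games is a hypothetical value, chosen only to illustrate how fewer than 600 of 1000 workers can end up busy.

```python
from typing import Tuple


def allocation(num_games: int, games_per_task: int, num_workers: int) -> Tuple[int, int]:
    """Return (number of tasks, number of idle workers) for one test.

    Simplified model: uniform task size, one task per worker.
    """
    if games_per_task < 1:
        raise ValueError("games_per_task must be positive")
    # Ceiling division: a partial last task still counts as a task.
    tasks = (num_games + games_per_task - 1) // games_per_task
    busy = min(tasks, num_workers)
    return tasks, num_workers - busy


if __name__ == "__main__":
    # Hypothetical task size of 5200 games.
    print(allocation(3_000_000, 5200, 1000))  # (577, 423)
    # A smaller hypothetical task size leaves nobody idle.
    print(allocation(3_000_000, 1000, 1000))  # (3000, 0)
```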

Technologov · 2023-04-26T22:44:41Z

It may be a 'bug' or a 'feature'. I have only observed that 'batch size' has grown.
What should be the proper 'batch size' ?

RAF Fenix aka Technologov

vondele · 2023-04-27T04:59:59Z

not a bug, just a consideration of what to optimize for.

maximmasiutin · 2023-04-27T08:12:43Z

> not a bug, just a consideration of what to optimize for.

The appearance of idle workers, just because the task is split among too few workers due to large batches, was probably not intended behavior.

maximmasiutin · 2023-04-27T09:32:22Z

See attached screenshot: 2338 games per batch for 16 cores, 3496 games for 24 cores -- that's just too much for 10+0.1s.

vondele · 2023-04-27T09:35:02Z

There is also #1627
Creating many small tasks is not a good idea, we've been there and done that.

maximmasiutin · 2023-05-03T09:25:49Z

Can you add a config option to set the maximum number of games per batch, to mitigate this bug that causes idle workers because of overly large batches? I have batches as large as 16886 games, so all the games in the test are split between too few workers, and I cannot add additional workers: they don't take this test because all the games are already split among the existing workers. 16886 games at TC 160+1.6 take too long to complete on a 64-core worker, so even when there are enough workers, they are idle.
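
What such an option might look like, as a sketch only: `max_games_per_batch` is a hypothetical setting name, not an existing fishtest option and not necessarily what #1659 implements.

```python
from typing import Optional


def capped_games(requested_games: int, max_games_per_batch: Optional[int]) -> int:
    """Limit the number of games handed to one worker.

    max_games_per_batch is a hypothetical server-side setting;
    None means that no cap is configured.
    """
    if max_games_per_batch is None:
        return requested_games
    if max_games_per_batch < 1:
        raise ValueError("max_games_per_batch must be positive")
    return min(requested_games, max_games_per_batch)


if __name__ == "__main__":
    print(capped_games(16886, 2000))  # 2000
    print(capped_games(16886, None))  # 16886
```

The cap of 2000 is an arbitrary illustration; a suitable value would depend on the time control and the worker's core count.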

vdbergh · 2023-05-03T10:23:10Z

Wouldn't this be solved by the PR that increases the maximum number of games for an SPRT test?

vdbergh · 2023-05-03T10:24:35Z

Note: the number of games after which an SPRT test is stopped and the number used for the task size computation do not have to be the same (they currently are).

maximmasiutin · 2023-05-03T11:40:25Z

Fixed by #1659

maximmasiutin · 2023-05-03T11:41:38Z

> SPRT

I have an SPSA test (not an SPRT test) with 3000000 games. The workers are "greedy": they took all the games, so I cannot add more workers, because they would be idle unless I use the option implemented in #1659.

maximmasiutin · 2023-05-03T11:50:30Z

> Wouldn't this be solved by the PR that increases the maximum number of games for an SPRT test?

For an SPSA test, I cannot increase the number of games. According to https://github.com/glinscott/fishtest/wiki/Creating-my-first-test#spsa-tests:
> Note: Do not modify the number of games in an SPSA test while it's running. This breaks the algorithm.

vdbergh · 2024-07-04T14:26:22Z

I would propose to close this. It's not clear what should be done, if anything.

vondele · 2024-07-04T14:52:27Z

I agree this is probably only a corner case. Usually there are enough tests to distribute workers

maximmasiutin · 2024-07-06T17:16:22Z

This is not a corner case. I have been getting this message all day today ("No tasks available at this time, waiting..."):

Current time is 2024-07-06 17:12:24.609797+00:00 UTC (local offset: +02:00)
Verify worker version...
Post request http://tests.stockfishchess.org:80/api/request_version handled in 39.89ms (server: 0.16ms)
Remaining number of GitHub api calls = 5000
Fetching task...
Post request http://tests.stockfishchess.org:80/api/request_task handled in 37.84ms (server: 3.40ms)
No tasks available at this time, waiting...
Waiting 30.0 seconds before retrying
Current time is 2024-07-06 17:12:54.874769+00:00 UTC (local offset: +02:00)
Verify worker version...
Post request http://tests.stockfishchess.org:80/api/request_version handled in 35.62ms (server: 0.15ms)
Remaining number of GitHub api calls = 5000
Fetching task...
Post request http://tests.stockfishchess.org:80/api/request_task handled in 35.70ms (server: 1.88ms)
No tasks available at this time, waiting...
Waiting 60.0 seconds before retrying
Current time is 2024-07-06 17:13:55.139832+00:00 UTC (local offset: +02:00)
Verify worker version...
Post request http://tests.stockfishchess.org:80/api/request_version handled in 36.91ms (server: 0.15ms)
Remaining number of GitHub api calls = 5000
Fetching task...
Post request http://tests.stockfishchess.org:80/api/request_task handled in 35.79ms (server: 1.56ms)
No tasks available at this time, waiting...
Waiting 120.0 seconds before retrying
  Send heartbeat for e85c97d3-22ba-46d2-abb5-9c706328bea3 ... Skipping heartbeat ...

This did not happen in the past, as there was always work to do.

Now the worker is getting "No tasks available at this time, waiting..."
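
The waits in the log above double on each retry (30, 60, 120 seconds). A minimal sketch of that pattern follows; it is not the worker's actual code, and the 900-second ceiling is an assumed value.

```python
from itertools import islice
from typing import Iterator


def retry_delays(initial: float = 30.0, factor: float = 2.0,
                 max_delay: float = 900.0) -> Iterator[float]:
    """Yield waiting times that grow geometrically up to max_delay.

    max_delay is an assumption; the log does not show where the
    real worker stops increasing the delay.
    """
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, max_delay)


if __name__ == "__main__":
    # The first three delays match the ones visible in the log.
    print(list(islice(retry_delays(), 3)))  # [30.0, 60.0, 120.0]
```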

maximmasiutin · 2024-07-06T17:17:29Z

The current workers are too "greedy". They "reserve" too much work so that they do not leave work to the other workers.

MinetaS · 2024-07-06T17:37:38Z

> The bug is reproducible with SPRT tests, but you can also probably reproduce this bug with an SPSA test: create an SPSA task with nodestime=600 and 160+1.6 as described in fishtest FAQ at https://github.com/glinscott/fishtest/wiki/Creating-my-first-test#spsa-tests

nodestime is not supported at the moment. I have a pending PR that fixes this, but it still needs to be reviewed.

> This is not a corner case. I am now experiencing such a message the whole day today ("No tasks available at this time, waiting..."):

How are your workers configured? Depending on # of cores or memory size, some tests will not be allocated to workers.

maximmasiutin · 2024-07-06T17:56:40Z

At least one worker, e85c97d3, currently does not receive tasks due to the bug in batch size calculation, so the problem from the initial bug report submitted on Apr 27, 2023 still exists.

maximmasiutin · 2024-07-08T05:41:08Z

Another solution would be for the Fishtest server to count idle workers and display that number alongside the total number of workers. Then, if the number of idle workers is frequently above zero, we can adjust the batch size formula accordingly. Another useful diagnostic feature: when the server sends the worker the message "No tasks available at this time, waiting...", it could also send the number of workers and tasks, to be logged.
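
A sketch of the proposed diagnostics, under stated assumptions: the server state is modeled as a plain mapping from worker name to its current task (or `None` when idle), and the function names and the message format are hypothetical, not part of the Fishtest API.

```python
from typing import Dict, Optional


def count_idle(assignments: Dict[str, Optional[str]]) -> int:
    """Count the workers that currently hold no task."""
    return sum(1 for task in assignments.values() if task is None)


def no_task_message(assignments: Dict[str, Optional[str]], open_tasks: int) -> str:
    """Build the 'no tasks' reply, extended with worker and task counts."""
    total = len(assignments)
    idle = count_idle(assignments)
    return (
        "No tasks available at this time, waiting... "
        f"(workers: {total}, idle: {idle}, open tasks: {open_tasks})"
    )


if __name__ == "__main__":
    # Hypothetical snapshot: one busy worker, two idle ones.
    snapshot = {"worker-a": "task-1", "worker-b": None, "worker-c": None}
    print(no_task_message(snapshot, open_tasks=0))
```

Logging these counts on the worker side would show whether idle periods coincide with a shortage of tasks or with tasks that the worker is not eligible for.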

vdbergh · 2024-07-08T06:50:39Z

You are really exaggerating the seriousness of this issue. Depending on the number of submitted tests, a certain number of tasks is available. If there are more workers than tasks (during a fleet visit) then some workers will be idle sometimes. But this is a luxury problem. In 99% of the cases, the problem is the opposite: too few workers instead of too many. So IMHO it is not worth spending a lot of effort on this.

One possibility is to raise the number of games for SPRT tests (or abolish the limit) but then we risk hitting the maximum document size in mongod.

maximmasiutin mentioned this issue May 3, 2023

Config option to set worker cap #1659

Open

vondele closed this as completed Jul 4, 2024
