mc_ensure_workers seems to ensure _all_ workers are ready - shouldn't it ensure if _any_ are ready? #453
Comments
At first I thought so too, but
So if we assign targets greedily, most of the workers will idle while the early birds monopolize all the worms. Not great load balancing. You can see what happens with
Finally trying this with I'm currently running a job and the first two workers are running with 6 others waiting in the SLURM queue. They're both happily running through targets before their friends show up. It seems to be working perfectly. Given this single experience, I think that the default should be
Update: a third worker joined in and started working right away. Do you mean "Not great load balancing" in the sense that the late workers end up doing less? That seems like the ideal behaviour.
Yes, that is the behavior of
Yes. I was encountering a situation in which the most time-consuming targets were assigned before most of the workers had a chance to spin up. For the part of the project where I needed parallelism the most, I had fewer workers than promised.
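To make the load-balancing concern concrete, here is a small Python simulation. It is only a sketch and not drake code: the function names, the four-worker setup, and the target durations are all invented for illustration. It compares queuing every target up front on the workers that happen to be running already against handing out a target only when some worker is idle.

```python
import heapq


def makespan_preassigned(durations, n_early):
    """Queue every target up front on the workers that are already running."""
    loads = [0.0] * n_early
    for d in durations:
        # Greedy: the next target goes to the least-loaded early worker.
        i = loads.index(min(loads))
        loads[i] += d
    return max(loads)


def makespan_idle_only(durations, arrival_times):
    """Hand out a target only when a worker is idle, so late arrivals still get work."""
    # Heap of (time the worker becomes free, worker id).
    # A worker is first free at the moment it arrives.
    free_at = [(t, i) for i, t in enumerate(arrival_times)]
    heapq.heapify(free_at)
    finish = 0.0
    for d in durations:
        t, i = heapq.heappop(free_at)  # earliest idle worker
        done = t + d
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish


# Hypothetical workload: 8 equally slow targets, 2 early workers, 2 late ones.
durations = [10.0] * 8
arrivals = [0.0, 0.0, 5.0, 5.0]

pre = makespan_preassigned(durations, n_early=2)
idle = makespan_idle_only(durations, arrivals)
print(pre, idle)  # 40.0 25.0
```

Under these made-up numbers, pre-assigning leaves the two late workers with nothing to do, while idle-only assignment lets them pick up targets as soon as they start.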
This is confusing. The above task was a set of many embarrassingly parallel (but slow) tasks and the first two workers were around for the first 5 minutes. The third worker showed up and got to work right away. If the master process is assigning targets to non-idle workers, shouldn't I have seen the third worker alive but not doing anything?
Perhaps you see the "worker_ready" message as soon as the SLURM job gets submitted, rather than when it actually starts running on a physical node?
First, persistent Lines 75 to 76 in 21f0ba1
Then, Lines 74 to 85 in 21f0ba1
Hmm... I actually see now that ready targets are only assigned to empty queues (idle workers). That was not always the case. It has been a while since I wrote the code, and I am only now jogging my memory. Lines 102 to 106 in 21f0ba1
Does this help things make more sense?
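The rule described above, where a ready target goes only to a worker whose queue is empty, can be sketched roughly as follows. This is a Python illustration under assumptions, not the code at the referenced lines: `assign_ready_targets`, the worker names, and the use of in-memory deques are all hypothetical.

```python
from collections import deque


def assign_ready_targets(ready_targets, worker_queues):
    """Give each idle worker (empty queue) at most one ready target.

    Busy workers are skipped, so nothing piles up behind a long-running target.
    Returns a dict mapping worker name to the target it was just given.
    """
    assigned = {}
    for worker, queue in worker_queues.items():
        if not ready_targets:
            break  # nothing left to hand out
        if len(queue) == 0:  # idle worker
            target = ready_targets.popleft()
            queue.append(target)
            assigned[worker] = target
    return assigned


# Hypothetical state: w1 is busy with "x"; w2 and w3 are idle.
ready = deque(["a", "b", "c"])
queues = {"w1": deque(["x"]), "w2": deque(), "w3": deque()}
assigned = assign_ready_targets(ready, queues)
print(assigned)  # {'w2': 'a', 'w3': 'b'}
```

Target `"c"` stays in the ready list until some worker's queue empties, which is why a worker that joins late can start on a target right away.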
I am surprised by the behavior you mention in #453 (comment). In any case, I am hoping that someday that
Can you explain your surprise? Sounds consistent with:
I guess I was confused by
Did these two workers exit after 5 minutes and leave the rest to the third worker? Because if so, something could be wrong. A worker should only exit if it is done with its current task and there are no longer any targets left to assign.
No. The first two stayed around after the first 5 minutes.
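The test in the code below, in `fl_master`, looks like it waits for all workers to be "ready" before moving on. Shouldn't it be able to proceed if any of the workers are "ready"? Ref #449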
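The difference between the two checks in question can be shown with a short Python sketch. It is not the `fl_master` code: `wait_for_workers`, its parameters, and the readiness callback are hypothetical stand-ins, and the only point is that the same polling loop behaves differently depending on whether it requires `all` or `any` workers to be ready.

```python
import time


def wait_for_workers(is_ready, worker_ids, require=all, poll_seconds=0.01, timeout=1.0):
    """Poll until the readiness condition holds or the timeout expires.

    require=all blocks until every worker reports ready;
    require=any proceeds as soon as one worker does.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if require(is_ready(w) for w in worker_ids):
            return True
        time.sleep(poll_seconds)
    return False


# Hypothetical cluster state: only w1 has started; w2 is still queued.
started = {"w1"}
workers = ["w1", "w2"]

any_ok = wait_for_workers(lambda w: w in started, workers, require=any, timeout=0.05)
all_ok = wait_for_workers(lambda w: w in started, workers, require=all, timeout=0.05)
print(any_ok, all_ok)  # True False
```

With `require=any` the master could begin assigning targets while other jobs are still waiting in the scheduler queue. With `require=all` it holds off until every worker has reported in, which is the trade-off the thread discusses.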
looks like it waits for all workers to be "ready" before moving on. Shouldn't it be able to go if any of the workers are "ready"? Ref #449The text was updated successfully, but these errors were encountered: