mc_ensure_workers seems to ensure _all_ workers are ready - shouldn't it ensure if _any_ are ready? #453

Closed
kendonB opened this issue Jul 3, 2018 · 11 comments

Comments

kendonB (Contributor) commented Jul 3, 2018

The test in the code below (from fl_master) looks like it waits for all workers to be "ready" before moving on. Shouldn't it be able to proceed if any of the workers are "ready"? Ref #449

mc_ensure_workers <- function(config){
  paths <- vapply(
    X = config$cache$list(namespace = "mc_ready_db"),
    FUN = function(worker){
      config$cache$get(key = worker, namespace = "mc_ready_db")
    },
    FUN.VALUE = character(1)
  )
  while (!all(file.exists(paths))){
    Sys.sleep(mc_wait) # nocov
  }
}
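
For comparison, the change I have in mind would essentially swap all() for any() (a hypothetical sketch, not tested against drake's internals):

mc_ensure_any_worker <- function(config){
  # Same lookup as above: one file path per worker that is expected to post.
  paths <- vapply(
    X = config$cache$list(namespace = "mc_ready_db"),
    FUN = function(worker){
      config$cache$get(key = worker, namespace = "mc_ready_db")
    },
    FUN.VALUE = character(1)
  )
  # Proceed as soon as at least one worker is ready instead of waiting for all.
  while (!any(file.exists(paths))){
    Sys.sleep(mc_wait) # nocov
  }
}
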
wlandau (Member) commented Jul 4, 2018

At first I thought so too, but

  1. I found it best to only delegate to workers that actually showed up, and
  2. The master process can assign targets faster than the workers can post.

So if we assign targets greedily, most of the workers will idle while the early birds monopolize all the worms. Not great load balancing. You can see what happens with make(ensure_workers = FALSE).
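
For a quick experiment, something along these lines shows the effect (a minimal sketch using the mtcars example plan from the drake documentation; any plan and any of the persistent-worker backends will do):

library(drake)
load_mtcars_example()  # provides my_plan
make(
  my_plan,
  parallelism = "mclapply",  # or "parLapply" / "future_lapply"
  jobs = 4,
  ensure_workers = FALSE     # assign targets without waiting for every worker to post
)
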

kendonB (Contributor, Author) commented Oct 9, 2018

Finally trying this with ensure_workers = FALSE. Does the master process assign tasks to non-idle processes? I'm a bit confused as to how the lack of load balancing actually occurs.

I'm currently running a job and the first two workers are running, with 6 others waiting in the SLURM queue. They're both happily running through targets before their friends show up. It seems to be working perfectly. Given this single experience, I think the default should be ensure_workers = FALSE :).

kendonB (Contributor, Author) commented Oct 9, 2018

Update: a third worker joined in and started working right away. Do you mean "Not great load balancing" in the sense that the late workers end up doing less? That seems like the ideal behaviour.

wlandau (Member) commented Oct 9, 2018

Does the master process assign tasks to non-idle processes? I'm a bit confused as to how the lack of load balancing actually occurs.

Yes, that is the behavior of mc_assign_ready_targets().

Update: a third worker joined in and started working right away. Do you mean "Not great load balancing" in the sense that the late workers end up doing less?

Yes. I was encountering a situation in which the most time-consuming targets were assigned before most of the workers had a chance to spin up. For the part of the project where I needed parallelism the most, I had fewer workers than promised.

kendonB (Contributor, Author) commented Oct 9, 2018

This is confusing. The above task was a set of many embarrassingly parallel (but slow) tasks, and the first two workers were around for the first 5 minutes. The third worker showed up and started working right away. If the master process is assigning targets to non-idle workers, shouldn't I have seen the third worker alive but not doing anything?

kendonB (Contributor, Author) commented Oct 9, 2018

Perhaps you see the "worker_ready" message as soon as the SLURM job gets submitted, rather than when it actually starts running on a physical node?

wlandau (Member) commented Oct 12, 2018

First, persistent mclapply/parLapply/future_lapply workers need to "post" by initializing their queues.

drake/R/mclapply.R

Lines 75 to 76 in 21f0ba1

ready_queue <- mc_get_ready_queue(worker, config)
done_queue <- mc_get_done_queue(worker, config)

Then, mc_master() notices the new queues when it calls mc_refresh_queue_lists(). That way, when it comes time to assign targets to workers, all the targets with no outdated dependencies are enqueued.

drake/R/mc_utils.R

Lines 74 to 85 in 21f0ba1

mc_assign_ready_targets <- function(config){
  if (!length(config$mc_ready_queues)){
    return()
  }
  for (target in config$queue$list0()){
    queue <- mc_preferred_queue(target = target, config = config)
    if (!is.null(queue)){
      queue$push(title = target, message = "target")
      config$queue$remove(targets = target)
    }
  }
}

Hmm... I actually see now that ready targets are only assigned to empty queues (idle workers). That was not always the case. It has been a while since I wrote the code, and I am only just now jogging my memory.

drake/R/mc_utils.R

Lines 102 to 106 in 21f0ba1

for (queue in config$mc_ready_queues){
  if (queue$empty()){
    return(queue)
  }
}
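
Putting those pieces together, each iteration of the master's scheduling loop does roughly the following (a simplified paraphrase; refresh_queue_lists() below is a placeholder stand-in for mc_refresh_queue_lists(), whose body is not excerpted here):

# Simplified paraphrase of one iteration of the master's scheduling loop.
refresh_queue_lists <- function(config){
  # Placeholder: scan the cache for ready queues created by newly posted workers.
  config$mc_ready_queues
}

master_iteration <- function(config){
  # 1. Pick up ready queues from any workers that posted since the last pass.
  config$mc_ready_queues <- refresh_queue_lists(config)
  # 2. Assign each buildable target to an idle worker (an empty ready queue),
  #    as in mc_assign_ready_targets() and mc_preferred_queue() above.
  mc_assign_ready_targets(config)
}
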

Does this help things make more sense?

wlandau (Member) commented Oct 12, 2018

I am surprised by the behavior you mention in #453 (comment).

In any case, I am hoping that someday drake can offload all of its scheduling to more sophisticated message/task brokers. clustermq has much better load balancing, so that is a good start.

kendonB (Contributor, Author) commented Oct 12, 2018

Can you explain your surprise? Sounds consistent with:

Hmm... I actually see now that ready targets are only assigned to empty queues (idle workers). That was not always the case. It has been a while since I wrote the code, and I am only just now jogging my memory.

wlandau (Member) commented Oct 13, 2018

I guess I was confused by

the first two workers were around for the first 5 minutes

Did these two workers exit after 5 minutes and leave the rest to the third worker? Because if so, something could be wrong. A worker should only exit if it is done with its current task and there are no longer any targets left to assign.

kendonB (Contributor, Author) commented Oct 13, 2018

No, the first two stayed around after the first 5 minutes.
