
Master process hangs upon oom error on SLURM with clustermq #813

Closed
kendonB opened this issue Apr 2, 2019 · 10 comments


kendonB commented Apr 2, 2019

The content of #549 suggests that I should get a message on the master process upon an oom error.

I run:

library(tidyverse)
library(drake)
options(clustermq.scheduler = "slurm", 
        clustermq.template = "slurm_clustermq.tmpl")

make(
  drake_plan(
    assign_massive_vector = {
      big_vector <- 1:1e15
      
      massive_vector <- c(big_vector, big_vector, big_vector)
    }
  ),
  verbose = 4,
  jobs = 1,
  parallelism = "clustermq",
  template = list(memory = "5M",
                  # minutes
                  walltime = 5,
                  # partition = "prepost",
                  log_file = "make.log"),
  console_log_file = NULL)

On the master I see the following (and it hangs after the oom error):

Submitting 1 worker jobs (ID: 7358) ...
Warning in private$fill_options(...) :
  Add 'CMQ_AUTH={{ auth }}' to template to enable socket authentication

On the worker I see:

/var/spool/slurm/job2958959/slurm_script: line 15: 138755 Killed                  Rscript -e 'clustermq:::worker("tcp://mahuika01:7358", verbose = TRUE)'
slurmstepd: error: Detected 1 oom-kill event(s) in step 2958959.batch cgroup.

wlandau commented Apr 3, 2019

Unfortunately, there is nothing drake can do here. It looks like @mschubert plans to address this problem in mschubert/clustermq#33.

wlandau closed this as completed Apr 3, 2019
mschubert commented:

Also relevant: mschubert/clustermq#110 (comment)

28/400 workers ended with the following out of memory error

Can you try adding a call to ulimit at the beginning of your function and see if this works? This way it will produce an R error that will get sent back to the master process.

I used this in clustermq in the past, but since it's not on CRAN I had to rely on the shell's ulimit instead (which should do the same thing, but sometimes fails).
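
A minimal sketch of that suggestion (assuming the GitHub-only ulimit package, e.g. krlmlr/ulimit, is installed on the workers; the 4096 MB limit is illustrative):

library(drake)

plan <- drake_plan(
  assign_massive_vector = {
    # Cap this R process's memory so the oversized allocation fails with an
    # R error that clustermq can report back, instead of an oom-kill from Slurm.
    ulimit::memory_limit(4096)
    big_vector <- 1:1e15
    c(big_vector, big_vector, big_vector)
  }
)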


wlandau commented Apr 3, 2019

Oh, interesting. @kendonB, maybe in the template (e.g. slurm_clustermq.tmpl) instead of

ulimit -v $(( 1024 * {{ memory | 4096 }} ))
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'

you could try

CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'ulimit::memory_limit({{ memory | 4096 }}); clustermq:::worker("{{ master }}")'
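
For context, a sketch of how a full slurm_clustermq.tmpl might look with that change; the #SBATCH lines are based on the stock clustermq SLURM template and are assumptions that vary by cluster:

#!/bin/sh
#SBATCH --job-name={{ job_name }}
#SBATCH --output={{ log_file | /dev/null }}
#SBATCH --error={{ log_file | /dev/null }}
#SBATCH --mem-per-cpu={{ memory | 4096 }}
#SBATCH --array=1-{{ n_jobs }}

# In-R memory cap replaces the shell's `ulimit -v` line
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'ulimit::memory_limit({{ memory | 4096 }}); clustermq:::worker("{{ master }}")'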


kendonB commented Apr 3, 2019

I tried this and saw no change in behavior. I also thought I had asked about this before but couldn't find the issue!

mschubert commented:

Hm, interesting. I had memory limits fail on occasion, but not in a way that I could reproduce.

It is possible that there is some process overhead that is tracked by SLURM but not the shell/R, or that your job reserves some memory outside of the reach of ulimit.

Can you check how much memory you are over the limit (in the job log), and whether you can get around this by setting e.g.:

ulimit::memory_limit({{ memory | 4096 }}-100)


mschubert commented Apr 4, 2019

Oh never mind, I didn't see you tested with a huge vector.

The issue here is a different one: your Slurm scheduler understands "5M", but ulimit does not. And R already uses more memory than that, so this limit would likely be ineffective too.

I should probably add a warning if memory is not numeric. But your worker log should show this already?

Warning message:
    In ulimit::memory_limit("5M") : NAs introduced by coercion

or on the shell

ulimit: Invalid limit '5M'


kendonB commented Apr 7, 2019

@wlandau @mschubert it turns out I had actually disabled ulimit in the first place, as it had been failing because I was using "5G" etc. instead of 5120.

I switched to just using the numeric argument and can now see the error message get through. Thanks!
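
A sketch of the working combination (values illustrative): pass memory as a plain number of megabytes so that Slurm, the shell's ulimit, and ulimit::memory_limit() can all parse the same value:

make(
  plan,
  parallelism = "clustermq",
  jobs = 1,
  template = list(memory = 5120,   # plain MB, not "5G" or "5M"
                  walltime = 5,
                  log_file = "make.log")
)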

mschubert commented:

@kendonB Do you remember if you already had ulimit disabled for mschubert/clustermq#110?


kendonB commented Apr 8, 2019

It is likely that I didn't have ulimit on, but I can't remember exactly. I wonder if it's worth having some code translate the "5G" etc. that Slurm understands into numeric MBs within the clustermq call?
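
A rough sketch of the kind of translation meant here (a hypothetical helper, not part of clustermq or drake), converting Slurm-style memory strings to numeric megabytes:

# Hypothetical helper: convert "500M", "5G", "1T", or a plain number into MB
parse_memory_mb <- function(x) {
  if (is.numeric(x)) return(x)
  units <- c(K = 1 / 1024, M = 1, G = 1024, T = 1024^2)
  suffix <- toupper(substring(x, nchar(x)))
  if (suffix %in% names(units)) {
    as.numeric(substring(x, 1, nchar(x) - 1)) * units[[suffix]]
  } else {
    as.numeric(x)  # no recognized suffix: assume the value is already MB
  }
}

parse_memory_mb("5G")   # 5120
parse_memory_mb("512M") # 512
parse_memory_mb(4096)   # 4096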

mschubert commented:

Yes, that makes sense. I'll likely add this when I'm moving ulimit to R and don't have to use a lot of bash magic (expected v1.0).
