
Master process hangs upon oom error on SLURM with clustermq #813

Closed
kendonB opened this issue Apr 2, 2019 · 10 comments


kendonB commented Apr 2, 2019

The content of #549 suggests that I should get a message on the master process upon an oom error.

I run:

library(tidyverse)
library(drake)
options(clustermq.scheduler = "slurm", 
        clustermq.template = "slurm_clustermq.tmpl")

make(
  drake_plan(
    assign_massive_vector = {
      big_vector <- 1:1e15
      
      massive_vector <- c(big_vector, big_vector, big_vector)
    }
  ),
  verbose = 4,
  jobs = 1,
  parallelism = "clustermq",
  template = list(memory = "5M",
                  # minutes
                  walltime = 5,
                  # partition = "prepost",
                  log_file = "make.log"),
  console_log_file = NULL)

On the master I see the following (and it hangs after the oom error):

Submitting 1 worker jobs (ID: 7358) ...
Warning in private$fill_options(...) :
  Add 'CMQ_AUTH={{ auth }}' to template to enable socket authentication

On the worker I see:

/var/spool/slurm/job2958959/slurm_script: line 15: 138755 Killed                  Rscript -e 'clustermq:::worker("tcp://mahuika01:7358", verbose = TRUE)'
slurmstepd: error: Detected 1 oom-kill event(s) in step 2958959.batch cgroup.

wlandau commented Apr 3, 2019

Unfortunately, there is nothing drake can do here. It looks like @mschubert plans to address this problem in mschubert/clustermq#33.

wlandau closed this as completed Apr 3, 2019
mschubert commented:

Also relevant: mschubert/clustermq#110 (comment)

28/400 workers ended with the following out of memory error

Can you try adding a call to ulimit at the beginning of your function and see if this works? This way it will produce an R error that will get sent back to the master process.

I used this in clustermq in the past, but since it's not on CRAN I had to rely on the shell's ulimit instead (which should do the same thing, but sometimes fails).
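
A minimal sketch of that suggestion (assuming the GitHub-only ulimit package, e.g. krlmlr/ulimit, is installed on the workers; the 4096 MB limit is illustrative):

library(drake)

plan <- drake_plan(
  assign_massive_vector = {
    # Cap this R process's memory so the oversized allocation fails with an
    # R error that clustermq can report back, instead of an oom-kill from Slurm.
    ulimit::memory_limit(4096)
    big_vector <- 1:1e15
    c(big_vector, big_vector, big_vector)
  }
)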


wlandau commented Apr 3, 2019

Oh, interesting. @kendonB, maybe in the template (e.g. slurm_clustermq.tmpl) instead of

ulimit -v $(( 1024 * {{ memory | 4096 }} ))
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'

you could try

CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'ulimit::memory_limit({{ memory | 4096 }}); clustermq:::worker("{{ master }}")'
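
For context, a sketch of how a full slurm_clustermq.tmpl might look with that change; the #SBATCH lines are based on the stock clustermq SLURM template and are assumptions that vary by cluster:

#!/bin/sh
#SBATCH --job-name={{ job_name }}
#SBATCH --output={{ log_file | /dev/null }}
#SBATCH --error={{ log_file | /dev/null }}
#SBATCH --mem-per-cpu={{ memory | 4096 }}
#SBATCH --array=1-{{ n_jobs }}

# In-R memory cap replaces the shell's `ulimit -v` line
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'ulimit::memory_limit({{ memory | 4096 }}); clustermq:::worker("{{ master }}")'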


kendonB commented Apr 3, 2019

I tried this and saw no change in behavior. I also thought I had asked about this before but couldn't find the issue!

mschubert commented:

Hm, interesting. I had memory limits fail on occasion, but not in a way that I could reproduce.

It is possible that there is some process overhead that is tracked by SLURM but not the shell/R, or that your job reserves some memory outside of the reach of ulimit.

Can you check how much memory you are over the limit (in the job log), and whether you can get around this by setting e.g.:

ulimit::memory_limit({{ memory | 4096 }}-100)


mschubert commented Apr 4, 2019

Oh never mind, I didn't see you tested with a huge vector.

The issue here is a different one: your Slurm scheduler understands "5M", but ulimit does not. And R already uses more memory than that, so this limit would likely be ineffective too.

I should probably add a warning if memory is not numeric. But your worker log should show this already?

Warning message:
    In ulimit::memory_limit("5M") : NAs introduced by coercion

or on the shell

ulimit: Invalid limit '5M'


kendonB commented Apr 7, 2019

@wlandau @mschubert it turns out I had actually disabled ulimit in the first place, as it had been failing because I was using "5G" etc. instead of 5120.

I switched to just using the numeric argument and can now see the error message get through. Thanks!
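
A sketch of the working combination (values illustrative): pass memory as a plain number of megabytes so that Slurm, the shell's ulimit, and ulimit::memory_limit() can all parse the same value:

make(
  plan,
  parallelism = "clustermq",
  jobs = 1,
  template = list(memory = 5120,   # plain MB, not "5G" or "5M"
                  walltime = 5,
                  log_file = "make.log")
)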

mschubert commented:

@kendonB Do you remember if you already had ulimit disabled for mschubert/clustermq#110?


kendonB commented Apr 8, 2019

It is likely that I didn't have ulimit on, but I can't remember exactly. I wonder if it's worth having some code translate the "5G" etc. that Slurm understands into numeric MBs within the clustermq call?
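
A rough sketch of the kind of translation meant here (a hypothetical helper, not part of clustermq or drake), converting Slurm-style memory strings to numeric megabytes:

# Hypothetical helper: convert "500M", "5G", "1T", or a plain number into MB
parse_memory_mb <- function(x) {
  if (is.numeric(x)) return(x)
  units <- c(K = 1 / 1024, M = 1, G = 1024, T = 1024^2)
  suffix <- toupper(substring(x, nchar(x)))
  if (suffix %in% names(units)) {
    as.numeric(substring(x, 1, nchar(x) - 1)) * units[[suffix]]
  } else {
    as.numeric(x)  # no recognized suffix: assume the value is already MB
  }
}

parse_memory_mb("5G")   # 5120
parse_memory_mb("512M") # 512
parse_memory_mb(4096)   # 4096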

mschubert commented:

Yes, that makes sense. I'll likely add this when I'm moving ulimit to R and don't have to use a lot of bash magic (expected v1.0).
