
Uninformative error message when clustermq workers run out of memory #549

Closed
idavydov opened this issue Oct 16, 2018 · 23 comments

idavydov commented Oct 16, 2018

I'm currently running my huge plan (>5k targets) on a cluster using clustermq & SLURM.

When I run make() on the cluster with caching="worker" I randomly get the following message:

R> make(my.plan, parallelism='clustermq', jobs=10, caching='worker')
target a
target b
target b
Error: $ operator is invalid for atomic vectors
R>

After that make() just drops me back to R. This does not seem to depend on the target or number of workers and happens more or less randomly after 20-50 targets.

Do you have any suggestions on how to debug this?

Setting caching="master" seems to resolve the issue. However, this is not an option for me, since I often run out of memory in the main process. I tried setting lazy_load=TRUE, pruning_strategy='memory', and garbage_collection=TRUE, but I still get the memory problem. I think the issue is that saving data from all workers happens in the master process.
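For reference, the combination described above in a single call would look roughly like this (my.plan and jobs = 10 as in the call at the top of this issue):

```r
make(
  my.plan,
  parallelism = "clustermq",
  jobs = 10,
  caching = "master",
  lazy_load = TRUE,
  pruning_strategy = "memory",
  garbage_collection = TRUE
)
```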

R 3.5.1
drake revision b68fc8e
clustermq github commit 82b89e0375a08d49dd85c3f3c5df2c724bff3177 (also tried 0.8.5, same behavior)

wlandau self-assigned this Oct 16, 2018

wlandau commented Oct 16, 2018

I remember encountering this problem early in the development of #532, but not since. Do any of the targets report errors in diagnose()? When I get a chance, I will expose the log_worker argument from clustermq::Q(), which will hopefully help you debug.


wlandau commented Oct 16, 2018

Would you try make(my.plan, parallelism = "clustermq", jobs = 1, caching = "worker", verbose = 4)? The extra verbosity and jobs = 1 could help us tell if the error is occurring before or after the target is being stored.


wlandau commented Oct 16, 2018

And if possible, it would really help to have a reproducible example. It would probably be a lot of work for you to disentangle and post one, but it would help me debug.


wlandau commented Oct 16, 2018

Edit: you can use the template argument to specify a clustermq log file:

make(
  my.plan,
  parallelism = "clustermq",
  jobs = 1,
  caching = "worker",
  verbose = 4,
  template = list(log_file = "my_log_file.txt")
)


wlandau commented Oct 16, 2018

I am trying to reproduce the error myself, and things seem to be running fine so far on an SGE cluster.

library(drake)
plan <- evaluate_plan(
  drake_plan(x = mean(rnorm(1e5) + z__), y = x_z__ + mean(rnorm(1e5))),
  wildcard = "z__",
  values = seq_len(5000)
)
make(plan, parallelism = "clustermq", jobs = 4, caching = "worker", verbose = 4)
#> cache <CENSORED>/.drake
#> analyze environment
#> analyze 6 imports: ld, wd, td, spell_check_ignore, plan, dl
#> analyze 10000 targets: x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10, x_1...
#> construct graph edges
#> construct vertex attributes
#> construct graph
#> Submitting 4 worker jobs (ID: 7883) ...
#> target x_1
#> target x_2
#> target x_3
#> ...
#> target x_1030
#> target x_1031
#> target x_1032
#> ...

@idavydov (Author)

Thanks a lot for working on this. It seems the problem was an out-of-memory error in the workers.

I will provide more details tomorrow, but I was able to find a memory allocation error using diagnose().

Perhaps it's still a bug, since drake exits without a proper diagnostic message.

wlandau removed the type: bug label Oct 16, 2018

wlandau commented Oct 16, 2018

Glad to hear it. Running out of memory on remote workers seems like a common problem, but I think it is outside the scope of drake: per @mschubert's comment at #449 (comment), if workers run out of memory, the optional clustermq log file should say so. The log_worker argument is deprecated, but you can use the template argument of make(). I did not explain it correctly before, so I have just updated #549 (comment) with a correction.

wlandau closed this as completed Oct 16, 2018

wlandau commented Oct 16, 2018

I know I closed this issue, but I am still looking forward to learning about the details.

wlandau changed the title Error: $ operator is invalid for atomic vectors to clustermq workers run out of memory Oct 16, 2018

wlandau commented Oct 16, 2018

Then again, I am not sure why this problem occurs with caching = "worker" and not caching = "master". Perhaps it has something to do with the in-memory caching used by default in storr_rds() objects.
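A minimal sketch of what I mean, assuming storr's API: the RDS storr layers an in-memory environment cache over the files on disk, and flush_cache() drops the in-memory copies without touching the on-disk data.

```r
library(storr)
st <- storr_rds(tempfile())  # on-disk storr with an in-memory layer on top
st$set("a", rnorm(1e6))      # value written to disk and kept in memory
st$flush_cache()             # free the in-memory copies; disk data remains
head(st$get("a"))            # transparently re-read from disk
```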

wlandau changed the title clustermq workers run out of memory to Uninformative error message when clustermq workers run out of memory Oct 16, 2018

idavydov commented Oct 17, 2018

Thanks.

Just to give a bit more information: when I use failed() I am able to get the failed target. When I use diagnose() I get:

R-3.5.1> diagnose('a')
$target
[1] "a"

$error
<simpleError: cannot allocate vector of size 41.2 Mb>

So if drake is able to retrieve the error message, would it be possible to report it directly instead of Error: $ operator is invalid for atomic vectors? The latter is very misleading and does not say which target caused the problem.

P.S. Sorry, I didn't notice you changed the title already.
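A sketch of the kind of fix being requested here (the helper name is hypothetical; the idea is just that the master re-raise the condition stored for the failed target):

```r
# Hypothetical helper: re-raise the stored error with the target's name,
# instead of failing later with "$ operator is invalid for atomic vectors".
reraise_target_error <- function(target, diagnosis) {
  err <- diagnosis$error
  if (inherits(err, "condition")) {
    stop("Target `", target, "` failed: ", conditionMessage(err),
         call. = FALSE)
  }
}
```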

@mschubert

clustermq is working as expected here by raising the allocation error. It looks to me like drake tries to read a result that does not exist (because of this error) and then fails on that, instead of reporting the underlying error?

@idavydov (Author)

I think I have a reproducible example:

==> problem.R <==
library(drake)
options(clustermq.scheduler='slurm', clustermq.template='slurm_clustermq.tmpl')
my.plan <- drake_plan(
        a=rnorm(100000000)
)
make(my.plan, parallelism='clustermq', jobs=1, caching='worker', verbose=4)
==> slurm_clustermq.tmpl <==
#!/bin/sh
#SBATCH --job-name={{ job_name }}
#SBATCH --partition=defq
#SBATCH --output={{ log_file | out.%a }} # you can add .%a for array index
#SBATCH --error={{ log_file | err.%a }}
#SBATCH --mem-per-cpu={{ memory | 1000 }}
#SBATCH --time=10
#SBATCH --array=1-{{ n_jobs }}
ulimit -v $(( 1024 * {{ memory | 1000 }} ))
R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'
$ Rscript problem.R
cache <...>/drake-mem/.drake
analyze environment
Submitting 1 worker jobs (ID: 7991) ...
target a
Error: $ operator is invalid for atomic vectors
Execution halted

The error log looks like this:

R version 3.5.1 (2018-07-02) -- "Feather Spray"
<...>
R-3.5.1> clustermq:::worker("tcp://address:7991")
Master: tcp://address:7991
WORKER_UP to: tcp://address:7991
> DO_SETUP (0.573s wait)
token from msg: set_common_data_token
> DO_CALL (0.000s wait)
fail a
Error : Target `a` failed. Call `diagnose(a)` for details. Error message:
  cannot allocate vector of size 762.9 Mb
eval'd: drake::cmq_build(target, meta, deps, config)
slurmstepd: error: *** JOB 17784553 ON address CANCELLED AT 2018-10-17T09:56:36 ***

@idavydov (Author)

By the way, it seems that errors of any kind are not reported properly in this case. I just encountered a similar situation (Error: $ operator is invalid for atomic vectors) with the following target error message:

$error
<simpleError in kable_styling(.): could not find function "kable_styling">

@mschubert

Yes, that should be reproducible by running drake on

function(...) {
    stop("this is an error")
}

@idavydov (Author)

Yes, @mschubert. I confirm:

library(drake)
options(clustermq.scheduler='slurm', clustermq.template='slurm_clustermq.tmpl')
my.err <- function(...) {
    stop("this is an error")
}

my.plan <- drake_plan(
        a=my.err()
)
make(my.plan, parallelism='clustermq', jobs=1, caching='worker', verbose = 4)
==> err.1 <==
<...>
R-3.5.1> clustermq:::worker("tcp://host:7066")
Master: tcp://host:7066
WORKER_UP to: tcp://host:7066
> DO_SETUP (0.844s wait)
token from msg: set_common_data_token
> DO_CALL (0.000s wait)
fail a
Error : Target `a` failed. Call `diagnose(a)` for details. Error message:
  this is an error
eval'd: drake::cmq_build(target, meta, deps, config)
slurmstepd: error: *** JOB 17785079 ON host CANCELLED AT 2018-10-17T10:33:35 ***
$ Rscript problem.R
cache <...>/drake-mem/.drake
analyze environment
analyze 2 imports: my.err, my.plan
analyze 1 target: a
construct graph edges
construct vertex attributes
construct graph
import my.err
Submitting 1 worker jobs (ID: 7066) ...
target a
Error: $ operator is invalid for atomic vectors
Execution halted


wlandau commented Oct 17, 2018

Thanks! I can now reproduce the issue on SGE.


wlandau commented Oct 17, 2018

The hotfix in c09c4c5 makes sure the error message gets back to the master process.

wlandau-lilly added a commit that referenced this issue Oct 17, 2018

pat-s commented Feb 17, 2019

I also face this error when making multiple targets in parallel with caching = "master". (I haven't tried caching = "worker" yet because I run make() via SSH and the paths differ.)

Making them separately works. Can something be done on the R side here? It looks like the master process runs into memory trouble when the workers try to save intermediate results of jobs. Searching for the error brings up the suggestion to raise the memory limit, but that seems to apply only to Windows.

Lately I tried using garbage_collection = TRUE and lazy_load = "promise" but haven't rerun the target yet.


wlandau commented Feb 18, 2019

If the master process is running out of memory, you could try make(memory_strategy = "memory") or make(memory_strategy = "lookahead").

I also face this error when "making" multiple targets in parallel with caching = "master". (Haven't tried yet caching = "workers" because I do it via SSH and the paths are different).

Where is make() running, and where does the .drake cache live? If both are on the login node of the cluster (recommended) then caching = "worker" could help (though it may be a bit slower).
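Sketched out with the same argument names as above (plan name hypothetical; the comments paraphrase the documented behavior, so check drake's manual for exact semantics):

```r
# Unload targets from memory once they are no longer needed:
make(plan, parallelism = "clustermq", memory_strategy = "memory")
# Additionally avoid loading dependencies no downstream target will need:
make(plan, parallelism = "clustermq", memory_strategy = "lookahead")
```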


pat-s commented Feb 19, 2019

If the master process is running out of memory, you could try make(memory_strategy = "memory") or make(memory_strategy = "lookahead")

I'll try that next time. But do you happen to know what memory limit currently applies to the master process, and how it could be increased?

Where is make() running, and where does the .drake cache live? If both are on the login node of the cluster (recommended) then caching = "worker" could help (though it may be a bit slower).

make() is running on a different server via SSH, and the cache doesn't live on the HPC yet. I am still testing things out :)
In the future the cache will live on the login node, and then I could use caching = "worker". For now, it seems I have to do the large calculations locally on the machine where the cache lives to circumvent these problems.


wlandau commented Feb 19, 2019

I'll try next time. But you also do not know what limit there currently is for the master process and how it can be increased?

Which OS are you using? As far as I know, Linux does not impose a memory limit on processes by default. Maybe check ulimit? Otherwise, you could try make(garbage_collection = TRUE).
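For what it's worth, the ulimit trick from the SLURM template earlier in this thread can be checked in a plain shell, e.g. capping a subshell at roughly 1000 MB of virtual memory (the limit is set and reported in KiB):

```shell
# Set a virtual-memory cap for a subshell, then print the active limit.
( ulimit -v $(( 1024 * 1000 )); ulimit -v )   # prints 1024000
```

Inside that subshell, an allocation exceeding the cap fails with an ordinary error (in R, "cannot allocate vector of size ...") instead of the job being killed silently by the scheduler.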


wlandau commented Feb 19, 2019

Make is running on a different server via SSH. The cache also doesn't yet live on the HPC.

I have found that things work well when the cache and the master process live on the login node. If you base these things on a machine that is part of the cluster, then less time is spent shuffling data over the network.


wlandau commented Feb 19, 2019

Update: in 8b79713, I added optional garbage collection to manage_memory(). So make(garbage_collection = TRUE, parallelism = "clustermq", caching = "master") should do more garbage collection on the master process.
