
Uninformative error message when clustermq workers run out of memory #549

Closed
idavydov opened this issue Oct 16, 2018 · 23 comments

idavydov commented Oct 16, 2018

I'm currently running my huge plan (>5k targets) on a cluster using clustermq & SLURM.

When I run make() on the cluster with caching="worker" I randomly get the following message:

R> make(my.plan, parallelism='clustermq', jobs=10, caching='worker')
target a
target b
target b
Error: $ operator is invalid for atomic vectors
R>

After that make() just drops me back to R. This does not seem to depend on the target or number of workers and happens more or less randomly after 20-50 targets.

Do you have any suggestions on how to debug this?

Setting caching="master" seems to resolve the issue. However, this is not an option for me, since I often run out of memory in the main process. I tried setting lazy_load=TRUE, pruning_strategy='memory', and garbage_collection=TRUE, but I still get the memory problem. I think the issue is that saving data from all workers happens in the master process.
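For reference, the combination described above in a single call would look roughly like this (my.plan and jobs = 10 as in the call at the top of this issue):

```r
make(
  my.plan,
  parallelism = "clustermq",
  jobs = 10,
  caching = "master",
  lazy_load = TRUE,
  pruning_strategy = "memory",
  garbage_collection = TRUE
)
```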

R 3.5.1
drake revision b68fc8e
clustermq github commit 82b89e0375a08d49dd85c3f3c5df2c724bff3177 (also tried 0.8.5, same behavior)

wlandau self-assigned this Oct 16, 2018

wlandau commented Oct 16, 2018

I remember encountering this problem early in the development of #532, but not since. Do any of the targets report errors in diagnose()? When I get a chance, I will expose the log_worker argument from clustermq::Q(), which will hopefully help you debug.


wlandau commented Oct 16, 2018

Would you try make(my.plan, parallelism = "clustermq", jobs = 1, caching = "worker", verbose = 4)? The extra verbosity and jobs = 1 could help us tell if the error is occurring before or after the target is being stored.


wlandau commented Oct 16, 2018

And if possible, it would really help to have a reproducible example. It would probably be a lot of work for you to disentangle and post one, but it would help me debug.


wlandau commented Oct 16, 2018

Edit: you can use the template argument to specify a clustermq log file:

make(
  my.plan,
  parallelism = "clustermq",
  jobs = 1,
  caching = "worker",
  verbose = 4,
  template = list(log_file = "my_log_file.txt")
)


wlandau commented Oct 16, 2018

I am trying to reproduce the error myself, and things seem to be running fine so far on an SGE cluster.

library(drake)
plan <- evaluate_plan(
  drake_plan(x = mean(rnorm(1e5) + z__), y = x_z__ + mean(rnorm(1e5))),
  wildcard = "z__",
  values = seq_len(5000)
)
make(plan, parallelism = "clustermq", jobs = 4, caching = "worker", verbose = 4)
#> cache <CENSORED>/.drake
#> analyze environment
#> analyze 6 imports: ld, wd, td, spell_check_ignore, plan, dl
#> analyze 10000 targets: x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10, x_1...
#> construct graph edges
#> construct vertex attributes
#> construct graph
#> Submitting 4 worker jobs (ID: 7883) ...
#> target x_1
#> target x_2
#> target x_3
#> ...
#> target x_1030
#> target x_1031
#> target x_1032
#> ...

@idavydov (Author)

Thanks a lot for working on this. It seems the problem was an out-of-memory error in the workers.

I will provide more details tomorrow, but I was able to find a memory allocation error using diagnose().

Perhaps it's still a bug, since drake exits without a proper diagnostic message.

wlandau removed the type: bug label Oct 16, 2018

wlandau commented Oct 16, 2018

Glad to hear it. Running out of memory on remote workers seems like a common problem, but I think it is outside the scope of drake: per @mschubert's comment at #449 (comment), if workers run out of memory, the optional clustermq log file should say so. The log_worker argument is deprecated, but you can use the template argument of make(). I did not explain it correctly before, so I have just updated #549 (comment) with a correction.

wlandau closed this as completed Oct 16, 2018

wlandau commented Oct 16, 2018

I know I closed this issue, but I am still looking forward to learning about the details.

wlandau changed the title Error: $ operator is invalid for atomic vectors to clustermq workers run out of memory Oct 16, 2018

wlandau commented Oct 16, 2018

Then again, I am not sure why this problem occurs with caching = "worker" and not caching = "master". Perhaps it has something to do with the in-memory caching used by default in storr_rds() objects.
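A minimal sketch of what I mean, assuming storr's API: the RDS storr layers an in-memory environment cache over the files on disk, and flush_cache() drops the in-memory copies without touching the on-disk data.

```r
library(storr)
st <- storr_rds(tempfile())  # on-disk storr with an in-memory layer on top
st$set("a", rnorm(1e6))      # value written to disk and kept in memory
st$flush_cache()             # free the in-memory copies; disk data remains
head(st$get("a"))            # transparently re-read from disk
```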

wlandau changed the title clustermq workers run out of memory to Uninformative error message when clustermq workers run out of memory Oct 16, 2018

idavydov commented Oct 17, 2018

Thanks.

Just to give a bit more information: when I use failed() I am able to get the failed target. When I use diagnose() I get:

R-3.5.1> diagnose('a')
$target
[1] "a"

$error
<simpleError: cannot allocate vector of size 41.2 Mb>

So if drake is able to retrieve the error message, would it be possible to report it directly instead of Error: $ operator is invalid for atomic vectors? The latter is very misleading and does not say which target caused the problem.

P.S. Sorry, I didn't notice you changed the title already.
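A sketch of the kind of fix being requested here (the helper name is hypothetical; the idea is just that the master re-raise the condition stored for the failed target):

```r
# Hypothetical helper: re-raise the stored error with the target's name,
# instead of failing later with "$ operator is invalid for atomic vectors".
reraise_target_error <- function(target, diagnosis) {
  err <- diagnosis$error
  if (inherits(err, "condition")) {
    stop("Target `", target, "` failed: ", conditionMessage(err),
         call. = FALSE)
  }
}
```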

@mschubert

clustermq is working as expected here by raising the allocation error. It looks to me like drake tries to read a result that does not exist (because of this error) and then fails on that, instead of reporting the underlying error?

@idavydov (Author)

I think I have a reproducible example:

==> problem.R <==
library(drake)
options(clustermq.scheduler='slurm', clustermq.template='slurm_clustermq.tmpl')
my.plan <- drake_plan(
        a=rnorm(100000000)
)
make(my.plan, parallelism='clustermq', jobs=1, caching='worker', verbose=4)
==> slurm_clustermq.tmpl <==
#!/bin/sh
#SBATCH --job-name={{ job_name }}
#SBATCH --partition=defq
#SBATCH --output={{ log_file | out.%a }} # you can add .%a for array index
#SBATCH --error={{ log_file | err.%a }}
#SBATCH --mem-per-cpu={{ memory | 1000 }}
#SBATCH --time=10
#SBATCH --array=1-{{ n_jobs }}
ulimit -v $(( 1024 * {{ memory | 1000 }} ))
R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'
$ Rscript problem.R
cache <...>/drake-mem/.drake
analyze environment
Submitting 1 worker jobs (ID: 7991) ...
target a
Error: $ operator is invalid for atomic vectors
Execution halted

The error log looks like this:

R version 3.5.1 (2018-07-02) -- "Feather Spray"
<...>
R-3.5.1> clustermq:::worker("tcp://address:7991")
Master: tcp://address:7991
WORKER_UP to: tcp://address:7991
> DO_SETUP (0.573s wait)
token from msg: set_common_data_token
> DO_CALL (0.000s wait)
fail a
Error : Target `a` failed. Call `diagnose(a)` for details. Error message:
  cannot allocate vector of size 762.9 Mb
eval'd: drake::cmq_build(target, meta, deps, config)
slurmstepd: error: *** JOB 17784553 ON address CANCELLED AT 2018-10-17T09:56:36 ***

@idavydov (Author)

By the way, it seems that errors of any kind are not reported properly in this case. I just encountered a similar situation (Error: $ operator is invalid for atomic vectors) with the following target error message:

$error
<simpleError in kable_styling(.): could not find function "kable_styling">

@mschubert

Yes, that should be reproducible by running drake on

function(...) {
    stop("this is an error")
}

@idavydov (Author)

Yes, @mschubert. I confirm:

library(drake)
options(clustermq.scheduler='slurm', clustermq.template='slurm_clustermq.tmpl')
my.err <- function(...) {
    stop("this is an error")
}

my.plan <- drake_plan(
        a=my.err()
)
make(my.plan, parallelism='clustermq', jobs=1, caching='worker', verbose = 4)
==> err.1 <==
<...>
R-3.5.1> clustermq:::worker("tcp://host:7066")
Master: tcp://host:7066
WORKER_UP to: tcp://host:7066
> DO_SETUP (0.844s wait)
token from msg: set_common_data_token
> DO_CALL (0.000s wait)
fail a
Error : Target `a` failed. Call `diagnose(a)` for details. Error message:
  this is an error
eval'd: drake::cmq_build(target, meta, deps, config)
slurmstepd: error: *** JOB 17785079 ON host CANCELLED AT 2018-10-17T10:33:35 ***
$ Rscript problem.R
cache <...>/drake-mem/.drake
analyze environment
analyze 2 imports: my.err, my.plan
analyze 1 target: a
construct graph edges
construct vertex attributes
construct graph
import my.err
Submitting 1 worker jobs (ID: 7066) ...
target a
Error: $ operator is invalid for atomic vectors
Execution halted


wlandau commented Oct 17, 2018

Thanks! I can now reproduce the issue on SGE.


wlandau commented Oct 17, 2018

The hotfix in c09c4c5 makes sure the error message gets back to the master process.

wlandau-lilly added a commit that referenced this issue Oct 17, 2018

pat-s commented Feb 17, 2019

I also face this error when making multiple targets in parallel with caching = "master". (I haven't tried caching = "worker" yet because I run make() via SSH and the paths differ.)

Making them separately works. Can something be done on the R side here? It looks like the master process runs into memory trouble when the workers try to save intermediate results of jobs. Searching for the error brings up the suggestion to raise the memory limit, but that seems to apply only to Windows.

Lately I tried using garbage_collection = TRUE and lazy_load = "promise" but haven't rerun the target yet.


wlandau commented Feb 18, 2019

If the master process is running out of memory, you could try make(memory_strategy = "memory") or make(memory_strategy = "lookahead").

I also face this error when "making" multiple targets in parallel with caching = "master". (Haven't tried yet caching = "workers" because I do it via SSH and the paths are different).

Where is make() running, and where does the .drake cache live? If both are on the login node of the cluster (recommended) then caching = "worker" could help (though it may be a bit slower).
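Sketched out with the same argument names as above (plan name hypothetical; the comments paraphrase the documented behavior, so check drake's manual for exact semantics):

```r
# Unload targets from memory once they are no longer needed:
make(plan, parallelism = "clustermq", memory_strategy = "memory")
# Additionally avoid loading dependencies no downstream target will need:
make(plan, parallelism = "clustermq", memory_strategy = "lookahead")
```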


pat-s commented Feb 19, 2019

If the master process is running out of memory, you could try make(memory_strategy = "memory") or make(memory_strategy = "lookahead")

I'll try that next time. But do you happen to know what memory limit currently applies to the master process, and how it could be increased?

Where is make() running, and where does the .drake cache live? If both are on the login node of the cluster (recommended) then caching = "worker" could help (though it may be a bit slower).

make() is running on a different server via SSH, and the cache doesn't live on the HPC yet. I am still testing things out :)
In the future the cache will live on the login node, and then I could use caching = "worker". For now, it seems I have to do the large calculations locally on the machine where the cache lives to circumvent these problems.


wlandau commented Feb 19, 2019

I'll try next time. But you also do not know what limit there currently is for the master process and how it can be increased?

Which OS are you using? As far as I know, Linux does not impose a memory limit on processes by default. Maybe check ulimit? Otherwise, you could try make(garbage_collection = TRUE).
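For what it's worth, the ulimit trick from the SLURM template earlier in this thread can be checked in a plain shell, e.g. capping a subshell at roughly 1000 MB of virtual memory (the limit is set and reported in KiB):

```shell
# Set a virtual-memory cap for a subshell, then print the active limit.
( ulimit -v $(( 1024 * 1000 )); ulimit -v )   # prints 1024000
```

Inside that subshell, an allocation exceeding the cap fails with an ordinary error (in R, "cannot allocate vector of size ...") instead of the job being killed silently by the scheduler.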


wlandau commented Feb 19, 2019

Make is running on a different server via SSH. The cache also doesn't yet live on the HPC.

I have found that things work well when the cache and the master process live on the login node. If you base these things on a machine that is part of the cluster, then less time is spent shuffling data over the network.


wlandau commented Feb 19, 2019

Update: in 8b79713, I added optional garbage collection to manage_memory(). So make(garbage_collection = TRUE, parallelism = "clustermq", caching = "master") should do more garbage collection on the master process.
