Uninformative error message when clustermq workers run out of memory #549
I remember encountering this problem early in the development of #532, but not since. Do any of the targets report errors in

Would you try

And if possible, it would really help to have a reproducible example. Probably a lot of work for you to disentangle one and post it, but it would help me debug.
Edit: you can use the `template` argument of `make()`:

```r
make(
  my.plan,
  parallelism = "clustermq",
  jobs = 1,
  caching = "worker",
  verbose = 4,
  template = list(log_file = "my_log_file.txt")
)
```
I am trying to reproduce the error myself, and things seem to be running fine so far on an SGE cluster.

```r
library(drake)
plan <- evaluate_plan(
  drake_plan(x = mean(rnorm(1e5) + z__), y = x_z__ + mean(rnorm(1e5))),
  wildcard = "z__",
  values = seq_len(5000)
)
make(plan, parallelism = "clustermq", jobs = 4, caching = "worker", verbose = 4)
#> cache <CENSORED>/.drake
#> analyze environment
#> analyze 6 imports: ld, wd, td, spell_check_ignore, plan, dl
#> analyze 10000 targets: x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10, x_1...
#> construct graph edges
#> construct vertex attributes
#> construct graph
#> Submitting 4 worker jobs (ID: 7883) ...
#> target x_1
#> target x_2
#> target x_3
#> ...
#> target x_1030
#> target x_1031
#> target x_1032
#> ...
```
Thanks a lot for working on this. It seems that the problem was an out-of-memory error in the workers. I will provide more details tomorrow. But I could find a memory allocation error by using the worker log files. Perhaps it is still a bug, since drake exits without a proper diagnostic message.
Glad to know. Running out of memory on remote workers seems like a common problem, but I think it is outside the scope of `drake`.
I know I closed this issue, but I am still looking forward to learning about the details.

Then again, I am not sure why this problem occurs with
Thanks. Just to give a bit more information: when I use

So if drake is able to get the error message, would it be possible to raise a simple, informative error instead?

P.S. Sorry, I didn't notice you had already changed the title.
I think I have a reproducible example:

==> problem.R <==

```r
library(drake)
options(clustermq.scheduler = 'slurm', clustermq.template = 'slurm_clustermq.tmpl')
my.plan <- drake_plan(
  a = rnorm(100000000)
)
make(my.plan, parallelism = 'clustermq', jobs = 1, caching = 'worker', verbose = 4)
```

==> slurm_clustermq.tmpl <==

```sh
#!/bin/sh
#SBATCH --job-name={{ job_name }}
#SBATCH --partition=defq
#SBATCH --output={{ log_file | out.%a }} # you can add .%a for array index
#SBATCH --error={{ log_file | err.%a }}
#SBATCH --mem-per-cpu={{ memory | 1000 }}
#SBATCH --time=10
#SBATCH --array=1-{{ n_jobs }}
ulimit -v $(( 1024 * {{ memory | 1000 }} ))
R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'
```

```
$ Rscript problem.R
cache <...>/drake-mem/.drake
analyze environment
Submitting 1 worker jobs (ID: 7991) ...
target a
Error: $ operator is invalid for atomic vectors
Execution halted
```

The error log looks like this:
By the way, it seems that any error in this case is not reported properly. I just encountered a similar situation.
Yes, that should be reproducible by running:

```r
function(...) {
  stop("this is an error")
}
```
Yes, @mschubert. I confirm:

```r
library(drake)
options(clustermq.scheduler = 'slurm', clustermq.template = 'slurm_clustermq.tmpl')
my.err <- function(...) {
  stop("this is an error")
}
my.plan <- drake_plan(
  a = my.err()
)
make(my.plan, parallelism = 'clustermq', jobs = 1, caching = 'worker', verbose = 4)
```
Thanks! I can now reproduce the issue on SGE.

The hotfix in c09c4c5 makes sure the error message gets back to the master process.
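To illustrate the general round trip (a minimal sketch with made-up function names, not drake's or clustermq's actual internals): the worker can catch the error and hand the condition object back as an ordinary value, and the master can then check for it and re-signal a readable message.

```r
# Sketch only: propagate a worker-side error back to the master process.
# Function names are hypothetical, not part of drake or clustermq.
build_on_worker <- function(target_expr) {
  tryCatch(
    eval(target_expr),
    error = function(e) e  # return the condition object instead of crashing the worker
  )
}

collect_on_master <- function(target_name, result) {
  if (inherits(result, "error")) {
    stop("target ", target_name, " failed: ", conditionMessage(result), call. = FALSE)
  }
  result
}

res <- build_on_worker(quote(stop("this is an error")))
collect_on_master("a", res)
#> Error: target a failed: this is an error
```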
I also face this error when "making" multiple targets in parallel; "making" them separately works. Can something be done on the R side here? It looks like the master process gets into memory trouble when the workers try to save intermediate results of jobs. Searching for the error brings up the suggestion to raise the memory limit, but that seems to apply only to Windows. Lately I tried using
If the master process is running out of memory, you could try
Where is
I'll try next time. But you also do not know what limit currently applies to the master process, or how it could be increased?

`make()` is running on a different server via SSH. The cache also does not yet live on the HPC. I am still testing things out :)
Which OS are you using? As far as I know, Linux does not limit the memory of its processes by default. Maybe check
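For example, on Linux one way to see which limits actually apply to the running R process is something like this (a sketch; it assumes a Linux system with `/proc` mounted):

```r
# Sketch: inspect the resource limits of the current R process on Linux.
# "Max address space" is the limit that `ulimit -v` (as in the SLURM template above) lowers.
if (file.exists("/proc/self/limits")) {
  writeLines(readLines("/proc/self/limits"))
}
system("ulimit -v")  # runs via /bin/sh; prints the virtual-memory limit in kB, or "unlimited"
```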
I have found that things work well when the cache and the master process live on the login node. If you base these things on a machine that is part of the cluster, then less time is spent shuffling data over the network.
Update: in 8b79713, I added optional garbage collection to
I'm currently running my huge plan (>5k targets) on a cluster using clustermq & SLURM. When I run `make()` on the cluster with `caching = "worker"`, I randomly get the following message:

After that, `make()` just drops me back to R. This does not seem to depend on the target or the number of workers, and it happens more or less randomly after 20-50 targets. Do you have any suggestions on how to debug this?

Setting `caching = "master"` seems to resolve the issue. However, this is not an option for me since I often run out of memory in the main process. I tried setting `lazy_load = TRUE`, `pruning_strategy = 'memory'`, and `garbage_collection = TRUE`, but I still get the memory problem. I think the problem is that saving data from all the workers happens in the master process.

R 3.5.1
drake revision b68fc8e
clustermq GitHub commit 82b89e0375a08d49dd85c3f3c5df2c724bff3177 (also tried 0.8.5, same behavior)
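For reference, the settings described above correspond roughly to a call like the one below (a sketch only; `my_big_plan` and the worker count are placeholders, while the arguments are the ones named in this report):

```r
# Hypothetical illustration of the configuration described above.
# `my_big_plan` and `jobs = 8` are placeholders, not values from the real project.
make(
  my_big_plan,
  parallelism = "clustermq",
  jobs = 8,
  caching = "worker",           # caching = "master" avoids the crash but exhausts the master's memory
  lazy_load = TRUE,             # memory-saving options that were tried
  pruning_strategy = "memory",
  garbage_collection = TRUE
)
```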