Master process hangs upon oom error on SLURM with clustermq #813
Comments
Unfortunately, there is nothing …

Also relevant: mschubert/clustermq#110 (comment)
Oh, interesting. @kendonB, maybe in the template, instead of e.g.

```
ulimit -v $(( 1024 * {{ memory | 4096 }} ))
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'
```

you could try

```
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'ulimit::memory_limit({{ memory | 4096 }}); clustermq:::worker("{{ master }}")'
```
Tried this and no change in behavior. I also thought I had asked about this before and couldn't find the issue!
Hm, interesting. I had memory limits fail on occasion, but not in a way that I could reproduce. It is possible that there is some process overhead that is tracked by SLURM but not by the shell/R, or that your job reserves some memory outside the reach of ulimit. Can you check how much memory you are over the limit (in the job log), and whether you can get around this by setting e.g. `ulimit::memory_limit({{ memory | 4096 }} - 100)`?
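A small sketch of that workaround as it would appear in the template's R call, assuming the clustermq `memory` value is in megabytes (the literal 4096 below stands in for `{{ memory | 4096 }}`):

```r
# Cap R's memory roughly 100 MB below what Slurm allocates, leaving
# headroom for process overhead that the scheduler tracks but the
# shell/R does not.
ulimit::memory_limit(4096 - 100)

# Quick sanity check: a deliberately oversized allocation should now
# fail inside R with a "cannot allocate" error rather than Slurm
# killing the whole job.
try(x <- numeric(2^31))
```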
Oh never mind, I didn't see you tested with a huge vector. The issue here is a different one: your Slurm scheduler understands "5M", but I should probably add a warning if R reports

```
Warning message:
In ulimit::memory_limit("5M") : NAs introduced by coercion
```

or the shell ulimit reports

```
Invalid limit '5M'
```
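As a side note on why the string form fails: the warning above comes from coercing "5M" to a number, which base R turns into NA, so no limit is applied. A minimal illustration, assuming `ulimit::memory_limit()` takes the limit as a plain number of megabytes, as the template above does:

```r
# Slurm accepts "5M", but numeric coercion in R does not:
as.numeric("5M")
#> [1] NA
#> Warning message:
#> NAs introduced by coercion

# Passing the limit as a number of megabytes works:
ulimit::memory_limit(5120)  # 5 GB, the numeric equivalent of "5G"
```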
@wlandau @mschubert Turns out I had actually disabled ulimit in the first place, as it had been failing because I was using "5G" etc. instead of 5120. I switched to the numeric argument and can now see the error message get through. Thanks!
@kendonB Do you remember if you already had ulimit disabled for mschubert/clustermq#110? |
It is likely that I didn't have ulimit on, but I can't remember exactly. I wonder if it's worth having some code translate the "5G" etc. that Slurm understands into numeric MB within the clustermq call?
Yes, that makes sense. I'll likely add this when I'm moving ulimit to R and don't have to use a lot of bash magic (expected …
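A hypothetical sketch of such a translation helper (the function name and exact behavior are illustrative only, not part of clustermq), converting Slurm-style memory strings into numeric megabytes:

```r
# Hypothetical helper: convert Slurm-style memory strings ("5G",
# "500M", "1024K", "2T") or plain numbers into megabytes.
slurm_mem_to_mb <- function(mem) {
  if (is.numeric(mem)) {
    return(mem)  # already a number; Slurm's default unit is MB
  }
  value <- as.numeric(sub("[KMGTkmgt][Bb]?$", "", mem))
  if (is.na(value)) {
    stop("Cannot parse memory specification: ", mem)
  }
  suffix <- toupper(sub("^[0-9.]+", "", mem))
  if (suffix == "") {
    return(value)  # bare numbers are interpreted as MB
  }
  scale <- c(K = 1 / 1024, M = 1, G = 1024, T = 1024^2)
  unit <- substr(suffix, 1, 1)
  if (!unit %in% names(scale)) {
    stop("Unknown memory unit in: ", mem)
  }
  value * scale[[unit]]
}

slurm_mem_to_mb("5G")    # 5120
slurm_mem_to_mb("500M")  # 500
slurm_mem_to_mb(4096)    # 4096
```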
The content of #549 suggests that I should get a message on the master upon an OOM error.
I run:
And on the master I see (it hangs after the OOM error):
On the worker I see: