Trying out SLURM-Style Job Arrays! #308
Hi @rsignell, thanks for giving this a try! Maybe the problem is … In your cluster logs I'm seeing:
You can use …

It should be fine to use a …
We don't have a perfect one-size-fits-all solution to credentials currently. One option is to use … There isn't currently an option to explicitly include other files to upload (i.e., no …). Maybe a good option for you would be for us to give you a way to set env vars that are "secret". We do have …
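In the meantime, one generic workaround (a minimal sketch, not a Coiled feature; `osn_keys.env` is the file name used later in this thread) is to parse the credentials file inside the script itself, so the keys end up in the environment wherever the script runs:

```python
# Hypothetical helper: load KEY=VALUE pairs from a local credentials file
# into the process environment before any S3 client is created.
import os

def load_env_file(path="osn_keys.env"):
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip()

load_env_file()
```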
@ntabris, thanks for the super speedy response!
Nat, okay, I gave it another go with …

with this CLI call:
Again, I can't see what is happening but it just seems idle, so I killed it. Hopefully getting closer?
Not sure, but based on the logs maybe it's this? https://stackoverflow.com/questions/60232708/dask-fails-with-freeze-support-bug
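For reference, the usual remedy for that freeze_support error (a sketch, in case that is what is happening here) is to make sure the cluster is created under an `if __name__ == "__main__":` guard:

```python
# Sketch: LocalCluster starts worker processes, so on interpreters that spawn
# new processes the cluster/client setup must live under a __main__ guard;
# otherwise each spawned process re-imports the script and hangs or errors.
from dask.distributed import Client, LocalCluster

def main():
    cluster = LocalCluster(n_workers=4, threads_per_worker=1)  # illustrative sizing
    client = Client(cluster)
    print(client)
    # ... rechunking work would go here ...
    client.close()
    cluster.close()

if __name__ == "__main__":
    main()
```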
Hmm, I'm seeing the famous …
Also I see warning logs like
which is inconsistent with using … I've tried a few things but haven't been able to reproduce the issue. @rsignell, could you try the exact same setup with just print("Hello world") and let me know if that hangs too?
Yeah, I'm able to reproduce the same thing you're seeing @rsignell if I just instantiate a …
@rsignell FYI @ntabris mentioned he did a bit of digging and found that the script you're running in fact doesn't have an …
Grrr, indeed, sorry guys. What I thought I was running was indeed not what I was running. :(
FYI we'll push out a change early next week so that code/scripts will be shown in the UI (like it is for dask clusters or …).
Grrr.... I still can't get this to work. I created a little rechunk repo here for the files I'm testing. It works if I try this on a VM:
but when I try on coiled with
@ntabris I'm happy to share the AWS credentials with you on Coiled Slack or something if you would like to test. They are just credentials to write to a specific Open Storage Network pod (not real AWS credentials, so you can't do any damage even if you wanted to! :))
I'm happy to take a look, but it looks like this is a private repo.
Oops! I guess private must be the default now. Fixed!
@rsignell could you try again with the … EDIT: Looking into that failure upstream here: pangeo-data/rechunker#153
I tried with:
Is something getting lost (like the credentials) between the scheduler and the workers?
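One generic way to check (a sketch, assuming the credentials travel as environment variables such as `AWS_ACCESS_KEY_ID`) is to compare what the launching process sees with what each worker sees:

```python
# Sketch: ask each worker whether the expected credential variables are set
# in its environment (only True/False is reported, never the values).
import os
from dask.distributed import Client, LocalCluster

def has_creds():
    names = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]
    return {name: name in os.environ for name in names}

if __name__ == "__main__":
    client = Client(LocalCluster(n_workers=2, threads_per_worker=1))  # illustrative
    print(has_creds())            # what the launching process sees
    print(client.run(has_creds))  # what each worker sees, keyed by worker address
```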
Thanks for trying that out @rsignell. When I look at your most recent run, which is using … I've made this change

```diff
  # For c7g.8xlarge (32cpu, 64GB RAM)
- n_workers=30
+ n_workers = 20
  mem = 64
- cluster = LocalCluster(n_workers=n_workers, threads_per_worker=1)
+ cluster = LocalCluster(n_workers=n_workers, threads_per_worker=1, memory_limit=None)
  client = Client(cluster)
```

to reduce the load on the VM and see if that helps. So far it's been running for ~5 minutes without an issue.
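A related knob, in case it helps (a sketch, assuming the c7g.8xlarge numbers above, 32 CPUs and 64 GB RAM): instead of only lowering the worker count, give each worker an explicit share of the VM's memory so the total stays bounded however `n_workers` is tuned:

```python
# Sketch: size per-worker memory explicitly from the VM's total RAM.
from dask.distributed import Client, LocalCluster

total_mem_gb = 64   # assumed VM size (c7g.8xlarge)
n_workers = 20
memory_limit = f"{total_mem_gb * 0.9 / n_workers:.2f}GB"  # leave ~10% headroom

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=n_workers, threads_per_worker=1,
                           memory_limit=memory_limit)
    client = Client(cluster)
```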
@jrbourbeau Oh jeez, thanks! I didn't realize it was a memory issue! I assigned the …
@jrbourbeau can you tell from this log whether it's still running out of memory? I keep decreasing the …
Hmm that cluster is actually hitting an S3 / boto issue
By chance, did you change anything with your …? Btw, more importantly, I notice you're running on a small VM when you intend to run on a big one. This PR should help: OpenScienceComputing/rechunk-jobarray#1
I'm getting closer here -- everything seems fine when I launch only one job, but when I launch all 19 it seems to be crapping out part way through and I can't tell why! Help? https://cloud.coiled.io/clusters/691886/account/esip-lab/information?workspace=esip-lab
Ah, sorry about that @rsignell. We just deployed a fix -- can you try again?
Tried again and it's running, but I expected it to take 30 minutes and it's at around 90 so far... https://cloud.coiled.io/clusters/693164/account/esip-lab/information
There are tons of …
I was wondering whether the problem might be when it tries to write to the Open Storage Network pod with all those machines, so I tried switching the storage to regular AWS S3 in the same region as the compute. That one finished just fine.

To see if I could replicate the problem with OSN, I tried that run again and it had the same issues as the first time. I'll investigate the problem with OSN, but I guess it's not a Coiled issue since it's working fine with AWS S3!

I did notice something that could be improved with the SLURM-like workflows: if you run with just N=1 (say, for testing), the dashboard says 0 workers instead of 1, and it also seems there are no metrics, which is really too bad!
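For reference, the only storage-side difference between the two runs should be the endpoint; here is a sketch, assuming s3fs is used and that the endpoint URL and keys come from environment variables (the variable name `OSN_ENDPOINT_URL` and the bucket path are illustrative):

```python
# Sketch: point s3fs at an S3-compatible endpoint (e.g. an OSN pod) instead of
# AWS S3; the zarr-writing code that uses this mapper stays the same either way.
import os
import s3fs

fs = s3fs.S3FileSystem(
    key=os.environ["AWS_ACCESS_KEY_ID"],
    secret=os.environ["AWS_SECRET_ACCESS_KEY"],
    client_kwargs={"endpoint_url": os.environ["OSN_ENDPOINT_URL"]},
)
target_store = fs.get_mapper("my-bucket/rechunked/dataset.zarr")  # illustrative path
```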
Good catch, thanks! I've opened an internal issue for this, no need for you to open a separate issue here.
I was very excited to see the blog post on SLURM-Style Job Arrays because we often use a job array approach to rechunking big data on prem: we use Dask with LocalCluster on each machine to rechunk a certain time range of data based on the job array index, and the result is a bunch of rechunked zarr datasets (one generated by each machine). (We then create references for the collection of zarr datasets using kerchunk and save the references to parquet, and create an intake catalog item so users can conveniently and efficiently load the virtual dataset with just one line.)
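To make the index-based splitting concrete, here is a minimal sketch of the on-prem pattern, assuming SLURM's standard `SLURM_ARRAY_TASK_ID` variable and illustrative decade-long time ranges:

```python
# Sketch: each job-array task derives its own time range from the array index,
# then runs the usual rechunking code on just that slice.
import os

# Illustrative: one decade of data per array task (0-based array indices).
time_ranges = [("1980", "1989"), ("1990", "1999"), ("2000", "2009"), ("2010", "2019")]

task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])  # set by SLURM for each array task
start, stop = time_ranges[task_id]
print(f"Task {task_id}: rechunking {start}-{stop}")
```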
On each machine, the rechunker process uses an intermediate zarr dataset that we usually write to `/tmp`, and then the target rechunked zarr dataset is written to object storage (we are using S3-compatible OSN here). We usually use a `LocalCluster` on each machine to parallelize the rechunking process for each dataset.
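A minimal sketch of that per-machine step (assuming the `rechunker` package's `rechunk` API; the source path, variable name, chunk sizes, and memory budget are all illustrative):

```python
# Sketch: rechunk one dataset on one machine, staging the intermediate zarr
# under /tmp and writing the target to S3-compatible object storage.
import fsspec
import xarray as xr
from dask.distributed import Client, LocalCluster
from rechunker import rechunk

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=8, threads_per_worker=1)   # illustrative sizing
    client = Client(cluster)

    ds = xr.open_zarr("source.zarr")                            # illustrative source
    target_chunks = {
        "air_temperature": {"time": 1000, "latitude": 100, "longitude": 100},
        "time": None, "latitude": None, "longitude": None,      # leave coords alone
    }

    plan = rechunk(
        ds,
        target_chunks=target_chunks,
        max_mem="4GB",                                          # per-worker budget
        target_store=fsspec.get_mapper("s3://my-bucket/rechunked.zarr"),  # e.g. the OSN pod
        temp_store=fsspec.get_mapper("/tmp/rechunk-intermediate.zarr"),
    )
    plan.execute()
```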
I'm not sure how to best accomplish this workflow with Coiled: maybe `--file` on the command line, but I'm not sure whether that will work.

I tried just using the code pretty much the way we run it without Coiled, using this script: ERA5-rechunker-AWS.py.
I created a `run_rechunk.sh` script, which I then submitted with:
The `osn_keys.env` file contains the keys needed to write to the S3-compatible Open Storage Network pod; it's just a text file.

I can't really tell what's going wrong -- the process is:
https://cloud.coiled.io/clusters/678117/account/esip-lab/information?workspace=esip-lab