
Trying out SLURM-Style Job Arrays! #308

Open · rsignell opened this issue Dec 5, 2024 · 30 comments

rsignell commented Dec 5, 2024

I was very excited to see the blog post on SLURM-Style Job Arrays because we often use a job array approach to rechunking big data on prem: we use Dask with LocalCluster on each machine to rechunk a certain time range of data based on the job array index, and the result is a bunch of rechunked zarr datasets (one generated by each machine). (We then create references for the collection of zarr datasets using kerchunk and save the references to parquet, and create an intake catalog item so users can conveniently and efficiently load the virtual dataset with just one line.)

On each machine, the rechunker process uses an intermediate zarr dataset that we usually write to /tmp and then the target rechunked zarr dataset is written to object storage (we are using S3-compatible OSN here). We usually use a LocalCluster on each machine to parallelize the rechunking process for each dataset.
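
To make the setup concrete, each machine does roughly what the sketch below shows; the paths, chunk sizes, variable names, worker counts, and OSN endpoint are placeholders, not the exact values from ERA5-rechunker-AWS.py:

# Rough sketch of the per-machine rechunk step (all names below are placeholders).
import os

import s3fs
import xarray as xr
from dask.distributed import Client, LocalCluster
from rechunker import rechunk

if __name__ == "__main__":
    # One LocalCluster per machine; the job-array index selects the time range.
    cluster = LocalCluster(n_workers=20, threads_per_worker=1)
    client = Client(cluster)

    task_id = int(os.environ.get("COILED_ARRAY_TASK_ID", 0))
    ds = xr.open_zarr("s3://source-bucket/era5.zarr")  # placeholder source
    ds = ds.isel(time=slice(task_id * 1000, (task_id + 1) * 1000))

    # Target store on the S3-compatible OSN pod; credentials come from the
    # environment and the endpoint URL is a placeholder.
    fs = s3fs.S3FileSystem(client_kwargs={"endpoint_url": "https://osn-pod.example.org"})
    target = fs.get_mapper(f"osn-bucket/era5-rechunked-{task_id}.zarr")

    plan = rechunk(
        ds,
        target_chunks={
            "t2m": {"time": 744, "latitude": 146, "longitude": 288},  # placeholder chunks
            "time": None,
            "latitude": None,
            "longitude": None,
        },
        max_mem="2GB",
        target_store=target,
        temp_store="/tmp/era5-intermediate.zarr",  # intermediate store on local disk
    )
    plan.execute()
    client.close()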

I'm not sure how to best accomplish this workflow with Coiled:

  • How do we ensure we have the 90 GB we need on /tmp for the intermediate dataset?
  • Is it okay to use a LocalCluster, or should we be using a Coiled cluster?
  • If a LocalCluster is okay, I guess we want to ask Coiled for a 32-CPU machine or something similar
  • How best to pass the credentials needed to write to OSN? We are uploading a file containing the AWS keys using --file on the command line, but we're not sure whether that will work

I tried just using the code pretty much the way we run it without Coiled, using this script: ERA5-rechunker-AWS.py.

I created a run_rechunk.sh script:

#COILED memory 32 GiB
#COILED ntasks 1
#COILED workspace esip-lab
#COILED software pangeo-notebook
#COILED region us-east-1
python run ERA5-rechunker-AWS.py

which I then submitted with:

 coiled batch run ./run_rechunk.sh --file osn_keys.env

The osn_keys.env file contains the keys needed to write to the S3-compatible Open Storage Network pod. It's just a text file that looks like:

AWS_ACCESS_KEY_ID=6023XCXRG7xxxxxxxxxxxx
AWS_SECRET_ACCESS_KEY=YvRyfg5DoCaYbxxxxxxxxxxx

I can't really tell what's going wrong -- the process is:
https://cloud.coiled.io/clusters/678117/account/esip-lab/information?workspace=esip-lab

ntabris (Member) commented Dec 5, 2024

Hi, @rsignell, thanks for giving this a try!

Maybe the problem is python run ERA5-rechunker-AWS.py? Should that just be python ERA5-rechunker-AWS.py?

In your cluster logs I'm seeing

python: can't open file '/scratch/batch/run': [Errno 2] No such file or directory

ntabris (Member) commented Dec 5, 2024

How do we ensure we have the 90 GB we need on /tmp for the intermediate dataset?

You can use --disk-size 120GB (or some other value) to set the disk size per VM.

ntabris (Member) commented Dec 5, 2024

Is it okay to use a LocalCluster, or should we be using a Coiled cluster? If a LocalCluster is okay, I guess we want to ask Coiled for a 32-CPU machine or something similar

It should be fine to use a LocalCluster in each task, and yes, if that's what you're doing then it makes sense to use large VMs.

ntabris (Member) commented Dec 5, 2024

How best to pass the credentials needed to write to OSN? We are uploading a file containing the AWS keys using --file on the command line, but we're not sure whether that will work

We don't have a perfect one-size-fits-all solution to credentials currently.

One option is to use --forward-aws-credentials to forward an STS token based on your local AWS credentials; this won't be refreshed, though, so things will stop working if your job takes longer than the STS token expiration (which could be 12 hours or 15 minutes, depending on how your local AWS session is authenticated).

There isn't currently an option to explicitly include other files to upload (i.e., no --file).

Maybe a good option for you would be for us to give you a way to set env vars that are "secret". We do have --env FOO=bar already, but that will be stored in our database and not deleted. We could easily have --secret-env FOO=bar which would still be stored in our database, but only temporarily. Would that work well for you?
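
For what it's worth, once those variables are on the VM (by whichever mechanism), the script side would look roughly like this; the bucket name and endpoint URL below are placeholders rather than the real OSN pod configuration:

# Hedged sketch: read credentials injected as environment variables and build
# an s3fs filesystem for the S3-compatible OSN pod.
import os

import s3fs

fs = s3fs.S3FileSystem(
    key=os.environ["AWS_ACCESS_KEY_ID"],
    secret=os.environ["AWS_SECRET_ACCESS_KEY"],
    client_kwargs={"endpoint_url": "https://osn-pod.example.org"},  # placeholder endpoint
)

# The mapper can then be used as the zarr/rechunker target store.
target = fs.get_mapper("osn-bucket/era5-rechunked.zarr")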

rsignell (Author) commented Dec 5, 2024

@ntabris, thanks for the super speedy response!
I'll try these adjustments, and yes, --secret-env would work great.

ntabris (Member) commented Dec 6, 2024

pip install coiled==1.67.1.dev4 will give you --secret-env if you want to give that a try (or you can wait for a non-pre-release version in the next couple of days)

rsignell (Author) commented Dec 6, 2024

Nat, okay, I gave it another go with this SLURM-like script:

#COILED memory 32GiB
#COILED disk-size 120GB
#COILED ntasks 1
#COILED workspace esip-lab
#COILED software rechunk_arm
#COILED region us-east-1
python ERA5-rechunker-AWS.py

ERA5-rechunker-AWS.py

with this CLI call:

 coiled batch run ./run_rechunk.sh  --secret-env AWS_ACCESS_KEY_ID=6023XCXxxxxxxxxxxxxxxxxx --secret-env AWS_SECRET_ACCESS_KEY=YvRyfg5DoCxxxxxxxxxxxxxx  --worker_vm_types=["c7g.8xlarge"] --arm

Again, I can't see what is happening but it just seems idle, so I killed it. Hopefully getting closer?
https://cloud.coiled.io/clusters/679178/account/esip-lab/information?workspace=esip-lab

mrocklin (Member) commented Dec 6, 2024

Not sure, but based on the logs maybe it's this? https://stackoverflow.com/questions/60232708/dask-fails-with-freeze-support-bug

jrbourbeau (Member) commented:

Hmm I'm seeing the famous if __name__ == '__main__' error when spawning a subprocess in the logs

RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

        To fix this issue, refer to the "Safe importing of main module"
        section in https://docs.python.org/3/library/multiprocessing.html

Also I see warning logs like

distributed.nanny.memory - WARNING - Ignoring provided memory limit 100GB due to system memory limit of 30.13 GiB

which is inconsistent with using memory_limit='64GB' for the LocalCluster in ERA5-rechunker-AWS.py. Can you confirm that https://gist.github.com/rsignell/d772a6d334365134709d5393a49db51b is in fact the script you were running?

I've tried a few things but haven't been able to reproduce the issue. @rsignell could you try the exact same coiled batch run command but have your Python script just be

print("Hello world")

and let me know if that hangs too?

jrbourbeau (Member) commented:

Yeah, I'm able to reproduce the same thing you're seeing @rsignell if I just instantiate a LocalCluster w/o including it in an if __name__ == "__main__": block. Was there maybe a previous version of your script that didn't have that? The 100GB vs 64GB mismatch also points to some other Python script being used

jrbourbeau (Member) commented:

@rsignell FYI @ntabris mentioned he did a bit of digging and found that the script you're running in fact doesn't have an if __name__ == "__main__" block (looks like it's this previous iteration of your gist https://gist.github.com/rsignell/d772a6d334365134709d5393a49db51b/d181f4acc4cd7413aaf7c4cd72686fdb355c9c18). You should have a better time if you use your most recent version with the __main__ block
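
For anyone finding this later, the required structure is roughly the sketch below: anything that starts worker processes (the LocalCluster, the rechunk call) has to sit under the __main__ guard so the spawn start method can safely re-import the module. The worker counts and helper name here are illustrative, not taken from the gist.

# Minimal sketch of the structure the batch script needs.
from dask.distributed import Client, LocalCluster


def do_rechunk(client):
    """Placeholder for the actual rechunker logic in ERA5-rechunker-AWS.py."""
    ...


if __name__ == "__main__":
    # Creating the LocalCluster at import time (outside this guard) is what triggers
    # the "start a new process before ... bootstrapping phase" RuntimeError above.
    cluster = LocalCluster(n_workers=20, threads_per_worker=1)
    client = Client(cluster)
    try:
        do_rechunk(client)
    finally:
        client.close()
        cluster.close()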

rsignell (Author) commented Dec 6, 2024

Grrr, indeed, sorry guys. What I thought I was running was indeed not what I was running. :(
Trying again. Thanks!

ntabris (Member) commented Dec 6, 2024

FYI we'll push out a change early next week so that code/scripts will be shown in the UI (like they are for Dask clusters or coiled run), which should make answering "what code did I run?" much easier.

rsignell (Author) commented Dec 9, 2024

Grrr.... I still can't get this to work. I created a little rechunk repo here for the files I'm testing.

It works if I try this on a VM:

export AWS_ACCESS_KEY_ID=xxxxxxxxxxxxxxxxHF
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxCJ
export COILED_ARRAY_TASK_ID=1
python ERA5-rechunker-AWS.py

but when I try on coiled with ./submit_coiled_batch.sh, it fails, and the logs contain errors from the scheduler I don't understand, like:

Exception: 'AttributeError("\'str\' object has no attribute \'intermediate\'")'

@ntabris I'm happy to share the AWS credentials with you on Coiled Slack or something if you would like to test. They are just credentials to write to a specific Open Storage Network pod (not real AWS credentials, so you can't do any damage even if you wanted to! :))

jrbourbeau (Member) commented:

I created a little rechunk repo here for the files I'm testing

I'm happy to take a look but it looks like this is a private repo

rsignell (Author) commented Dec 9, 2024

Oops! I guess that private must be the default now. Fixed!

jrbourbeau (Member) commented Dec 9, 2024

@rsignell could you try again with the dask=2024.11.2 release instead of the latest dask=2024.12.0 release? I'm able to reproduce your Exception: 'AttributeError("\'str\' object has no attribute \'intermediate\'")' error with 2024.12.0 but not when I roll back one release to 2024.11.2 (haven't root caused the issue yet though)

EDIT: Looking into that failure upstream here pangeo-data/rechunker#153
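
If it helps while testing pins, building a Coiled software environment with dask held at 2024.11.2 could look roughly like the sketch below; the environment name and the rest of the package list are illustrative, not what rechunk_arm actually contains:

# Hedged sketch: a software environment with dask pinned to 2024.11.2.
import coiled

coiled.create_software_environment(
    name="rechunk-dask-2024-11-2",  # illustrative name
    conda={
        "channels": ["conda-forge"],
        "dependencies": [
            "python=3.11",
            "dask=2024.11.2",
            "distributed=2024.11.2",
            "xarray",
            "rechunker",
            "zarr",
            "s3fs",
        ],
    },
)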

rsignell (Author) commented Dec 9, 2024

I tried with:

  • dask=2024.11.2 and xarray=2024.7.0
  • dask (latest) and xarray=2024.7.0
  • dask=2024.11.2 and xarray (latest)

None of these worked -- they all errored out in under 2 minutes.

rsignell (Author) commented:

Is something getting lost (like the credentials) between the scheduler and the workers?

jrbourbeau (Member) commented:

Thanks for trying that out @rsignell. When I look at your most recent run, which is using dask=2024.11.2, I see that rechunking is actually running, but after about a minute we start seeing high memory warnings coming from the LocalCluster (running on the coiled batch run VM) where the rechunking is happening. So, things are still breaking, but now it's from the load on the VM and not library incompatibility issues (xref pangeo-data/rechunker#153 (comment)).

I've made this change

     # For c7g.8xlarge (32cpu, 64GB RAM)
-    n_workers=30
+    n_workers = 20
     mem = 64
-    cluster = LocalCluster(n_workers=n_workers, threads_per_worker=1)
+    cluster = LocalCluster(n_workers=n_workers, threads_per_worker=1, memory_limit=None)
     client = Client(cluster)

to reduce the load on the VM to see if that helps. So far it's been running for ~5 minutes without an issue.

rsignell (Author) commented:

@jrbourbeau Oh jeez, thanks! I didn't realize it was a memory issue! I assigned the max_mem for rechunk to be total_memory/n_workers*0.7 but I guess that wasn't good enough!
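
For the record, plugging the thread's numbers into that formula (a back-of-the-envelope sketch assuming the c7g.8xlarge's 64 GB of RAM and the worker counts from the diff above):

# total_memory / n_workers * 0.7 on a 64 GB c7g.8xlarge
total_memory_gb = 64

for n_workers in (30, 20):
    max_mem_gb = total_memory_gb / n_workers * 0.7
    print(f"{n_workers} workers -> max_mem ≈ {max_mem_gb:.2f} GB per worker")

# 30 workers -> max_mem ≈ 1.49 GB per worker
# 20 workers -> max_mem ≈ 2.24 GB per worker
# max_mem only sizes the chunks rechunker holds at once; actual worker memory
# (buffers, copies, the zarr write path) can exceed it, which is presumably how
# the VM still ran hot with 30 workers.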

rsignell (Author) commented Dec 11, 2024

@jrbourbeau can you tell from this log whether it's still running out of memory? I keep decreasing the max_mem passed to rechunker and decreasing the number of workers and it's still bombing out:
https://cloud.coiled.io/clusters/684396/account/esip-lab/information?workspace=esip-lab&tab=Logs&filterPattern=

jrbourbeau (Member) commented:

https://cloud.coiled.io/clusters/684396/account/esip-lab/information?workspace=esip-lab&tab=Logs&filterPattern=

Hmm that cluster is actually hitting an S3 / boto issue

botocore.exceptions.ClientError: An error occurred (SignatureDoesNotMatch) when calling the ListObjectsV2 operation: None

By chance, did you change anything with your --secret-env credentials forwarding?

Btw, more importantly, I notice you're running on a small VM when you intend to run on a big one. This PR OpenScienceComputing/rechunk-jobarray#1 should help
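
A quick sanity check along these lines, run at the top of the batch script, can confirm whether the credentials actually arrived on the VM and can sign requests against the pod; bucket and endpoint names below are placeholders:

# Check the secret env vars are present (without printing values) and that
# a signed listing against the placeholder OSN bucket succeeds.
import os

import s3fs

for var in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
    print(f"{var} set: {var in os.environ}")

fs = s3fs.S3FileSystem(client_kwargs={"endpoint_url": "https://osn-pod.example.org"})
print(fs.ls("osn-bucket"))  # SignatureDoesNotMatch here points at keys/endpoint, not rechunker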

rsignell (Author) commented:

I'm getting closer here -- everything seems fine when I launch only one job, but when I launch all 19 it seems to be crapping out partway through and I can't tell why! Help?

https://cloud.coiled.io/clusters/691886/account/esip-lab/information?workspace=esip-lab

jrbourbeau (Member) commented:

Ah, sorry about that @rsignell. We just deployed a fix -- can you try again?

rsignell (Author) commented:

Tried again and it's running but I expected it to take 30 min, and it's at around 90 so far...

https://cloud.coiled.io/clusters/693164/account/esip-lab/information

phofl commented Dec 19, 2024

There are tons of ReadTimeoutErrors and restarted workers. Memory usage is very high.

mrocklin (Member) commented:

Looks like high memory use happened at around the same time network bandwidth mostly shut down.

[Screenshot of cluster metrics, 2024-12-19 12:09 PM: memory climbing as network bandwidth drops]

Makes one wonder what rechunker was doing at this moment

rsignell (Author) commented Dec 20, 2024

I was wondering whether the problem might be when it tries to write to the Open Storage Network pod with all those machines, so I tried switching the storage to regular AWS S3 in the same region as the compute. That one finished just fine:
https://cloud.coiled.io/clusters/694266/account/esip-lab/information?workspace=esip-lab

To see if I could replicate the problem with OSN I tried that run again and it had the same issues as the first time:
https://cloud.coiled.io/clusters/694353/account/esip-lab/information?workspace=esip-lab

I'll investigate the problem with OSN, but I guess it's not a Coiled issue since it's working fine with AWS S3!

I did notice something that could be improved with the SLURM-like workflows. If you run with just N=1 (say, for testing), then the dashboard says 0 workers instead of 1, and it also seems there are no metrics, which is really too bad!
https://cloud.coiled.io/clusters/694131/account/esip-lab/information?workspace=esip-lab&tab=Metrics
Should I raise this as a separate issue?

ntabris (Member) commented Jan 2, 2025

I did notice something that could be improved with the SLURM-like workflows. If you run with just N=1 (say, for testing), then the dashboard says 0 workers instead of 1, and it also seems there are no metrics, which is really too bad!
https://cloud.coiled.io/clusters/694131/account/esip-lab/information?workspace=esip-lab&tab=Metrics
Should I raise this as a separate issue?

Good catch, thanks! I've opened an internal issue for this, so no need for you to open a separate one here.
