
Trying out SLURM-Style Job Arrays! #308

Open · rsignell opened this issue Dec 5, 2024 · 30 comments

rsignell commented Dec 5, 2024

I was very excited to see the blog post on SLURM-Style Job Arrays because we often use a job array approach to rechunking big data on prem: we use Dask with LocalCluster on each machine to rechunk a certain time range of data based on the job array index, and the result is a bunch of rechunked zarr datasets (one generated by each machine). (We then create references for the collection of zarr datasets using kerchunk and save the references to parquet, and create an intake catalog item so users can conveniently and efficiently load the virtual dataset with just one line.)

On each machine, the rechunker process uses an intermediate zarr dataset that we usually write to /tmp and then the target rechunked zarr dataset is written to object storage (we are using S3-compatible OSN here). We usually use a LocalCluster on each machine to parallelize the rechunking process for each dataset.
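
To make the setup concrete, each machine does roughly what the sketch below shows; the paths, chunk sizes, variable names, worker counts, and OSN endpoint are placeholders, not the exact values from ERA5-rechunker-AWS.py:

# Rough sketch of the per-machine rechunk step (all names below are placeholders).
import os

import s3fs
import xarray as xr
from dask.distributed import Client, LocalCluster
from rechunker import rechunk

if __name__ == "__main__":
    # One LocalCluster per machine; the job-array index selects the time range.
    cluster = LocalCluster(n_workers=20, threads_per_worker=1)
    client = Client(cluster)

    task_id = int(os.environ.get("COILED_ARRAY_TASK_ID", 0))
    ds = xr.open_zarr("s3://source-bucket/era5.zarr")  # placeholder source
    ds = ds.isel(time=slice(task_id * 1000, (task_id + 1) * 1000))

    # Target store on the S3-compatible OSN pod; credentials come from the
    # environment and the endpoint URL is a placeholder.
    fs = s3fs.S3FileSystem(client_kwargs={"endpoint_url": "https://osn-pod.example.org"})
    target = fs.get_mapper(f"osn-bucket/era5-rechunked-{task_id}.zarr")

    plan = rechunk(
        ds,
        target_chunks={
            "t2m": {"time": 744, "latitude": 146, "longitude": 288},  # placeholder chunks
            "time": None,
            "latitude": None,
            "longitude": None,
        },
        max_mem="2GB",
        target_store=target,
        temp_store="/tmp/era5-intermediate.zarr",  # intermediate store on local disk
    )
    plan.execute()
    client.close()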

I'm not sure how to best accomplish this workflow with Coiled:

  • How do we ensure we have the 90 GB we need on /tmp for the intermediate dataset?
  • Is it okay to use a LocalCluster, or should we be using a Coiled cluster?
  • If a LocalCluster is okay, I guess we want to ask Coiled for a 32-CPU machine or something similar
  • How best to pass the credentials needed to write to OSN? We are uploading a file containing the AWS keys using --file on the command line, but we're not sure whether that will work

I tried just using the code pretty much the way we run it without Coiled, using this script: ERA5-rechunker-AWS.py.

I created a run_rechunk.sh script:

#COILED memory 32 GiB
#COILED ntasks 1
#COILED workspace esip-lab
#COILED software pangeo-notebook
#COILED region us-east-1
python run ERA5-rechunker-AWS.py

which I then submitted with:

 coiled batch run ./run_rechunk.sh --file osn_keys.env

The osn_keys.env file contains the keys needed to write to the S3-compatible Open Storage Network pod. It's just a text file that looks like:

AWS_ACCESS_KEY_ID=6023XCXRG7xxxxxxxxxxxx
AWS_SECRET_ACCESS_KEY=YvRyfg5DoCaYbxxxxxxxxxxx

I can't really tell what's going wrong -- the process is:
https://cloud.coiled.io/clusters/678117/account/esip-lab/information?workspace=esip-lab

ntabris (Member) commented Dec 5, 2024

Hi, @rsignell, thanks for giving this a try!

Maybe the problem is python run ERA5-rechunker-AWS.py? Should that just be python ERA5-rechunker-AWS.py?

In your cluster logs I'm seeing

python: can't open file '/scratch/batch/run': [Errno 2] No such file or directory

ntabris (Member) commented Dec 5, 2024

How do we ensure we have the 90 GB we need on /tmp for the intermediate dataset?

You can use --disk-size 120GB (or some other value) to set the disk size per VM.

ntabris (Member) commented Dec 5, 2024

Is it okay to use a LocalCluster, or should we be using a Coiled cluster? If a LocalCluster is okay, I guess we want to ask Coiled for a 32-CPU machine or something similar

It should be fine to use a LocalCluster in each task, and yes, if that's what you're doing then it makes sense to use large VMs.

ntabris (Member) commented Dec 5, 2024

How best to pass the credentials needed to write to OSN? We are uploading a file containing the AWS keys using --file on the command line, but we're not sure whether that will work

We don't have a perfect one-size-fits-all solution to credentials currently.

One option is to use --forward-aws-credentials to forward an STS token based on your local AWS credentials; this won't be refreshed, though, so things will stop working if your job takes longer than the STS token expiration (which could be 12 hours or 15 minutes, depending on how your local AWS session is authenticated).

There isn't currently an option to explicitly include other files to upload (i.e., no --file).

Maybe a good option for you would be for us to give you a way to set env vars that are "secret". We do have --env FOO=bar already, but that will be stored in our database and not deleted. We could easily have --secret-env FOO=bar which would still be stored in our database, but only temporarily. Would that work well for you?
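
For what it's worth, once those variables are on the VM (by whichever mechanism), the script side would look roughly like this; the bucket name and endpoint URL below are placeholders rather than the real OSN pod configuration:

# Hedged sketch: read credentials injected as environment variables and build
# an s3fs filesystem for the S3-compatible OSN pod.
import os

import s3fs

fs = s3fs.S3FileSystem(
    key=os.environ["AWS_ACCESS_KEY_ID"],
    secret=os.environ["AWS_SECRET_ACCESS_KEY"],
    client_kwargs={"endpoint_url": "https://osn-pod.example.org"},  # placeholder endpoint
)

# The mapper can then be used as the zarr/rechunker target store.
target = fs.get_mapper("osn-bucket/era5-rechunked.zarr")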

rsignell (Author) commented Dec 5, 2024

@ntabris, thanks for the super speedy response!
I'll try these adjustments, and yes, --secret-env would work great.

ntabris (Member) commented Dec 6, 2024

pip install coiled==1.67.1.dev4 will give you --secret-env if you want to give that a try (or you can wait for a non-pre-release version in the next couple of days)

rsignell (Author) commented Dec 6, 2024

Nat, okay, I gave it another go with this SLURM-like script:

#COILED memory 32GiB
#COILED disk-size 120GB
#COILED ntasks 1
#COILED workspace esip-lab
#COILED software rechunk_arm
#COILED region us-east-1
python ERA5-rechunker-AWS.py

ERA5-rechunker-AWS.py

with this CLI call:

 coiled batch run ./run_rechunk.sh  --secret-env AWS_ACCESS_KEY_ID=6023XCXxxxxxxxxxxxxxxxxx --secret-env AWS_SECRET_ACCESS_KEY=YvRyfg5DoCxxxxxxxxxxxxxx  --worker_vm_types=["c7g.8xlarge"] --arm

Again, I can't see what is happening but it just seems idle, so I killed it. Hopefully getting closer?
https://cloud.coiled.io/clusters/679178/account/esip-lab/information?workspace=esip-lab

mrocklin (Member) commented Dec 6, 2024

Not sure, but based on the logs maybe it's this? https://stackoverflow.com/questions/60232708/dask-fails-with-freeze-support-bug

jrbourbeau (Member) commented:

Hmm I'm seeing the famous if __name__ == '__main__' error when spawning a subprocess in the logs

RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

        To fix this issue, refer to the "Safe importing of main module"
        section in https://docs.python.org/3/library/multiprocessing.html

Also I see warning logs like

distributed.nanny.memory - WARNING - Ignoring provided memory limit 100GB due to system memory limit of 30.13 GiB

which is inconsistent with using memory_limit='64GB' for the LocalCluster in ERA5-rechunker-AWS.py. Can you confirm that https://gist.github.com/rsignell/d772a6d334365134709d5393a49db51b is in fact the script you were running?

I've tried a few things but haven't been able to reproduce the issue. @rsignell could you try the exact same coiled batch run command but have your Python script just be

print("Hello world")

and let me know if that hangs too?

jrbourbeau (Member) commented:

Yeah, I'm able to reproduce the same thing you're seeing @rsignell if I just instantiate a LocalCluster w/o including it in an if __name__ == "__main__": block. Was there maybe a previous version of your script that didn't have that? The 100GB vs 64GB mismatch also points to some other Python script being used

jrbourbeau (Member) commented:

@rsignell FYI @ntabris mentioned he did a bit of digging and found that the script you're running in fact doesn't have an if __name__ == "__main__" block (looks like it's this previous iteration of your gist https://gist.github.com/rsignell/d772a6d334365134709d5393a49db51b/d181f4acc4cd7413aaf7c4cd72686fdb355c9c18). You should have a better time if you use your most recent version with the __main__ block
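
For anyone finding this later, the required structure is roughly the sketch below: anything that starts worker processes (the LocalCluster, the rechunk call) has to sit under the __main__ guard so the spawn start method can safely re-import the module. The worker counts and helper name here are illustrative, not taken from the gist.

# Minimal sketch of the structure the batch script needs.
from dask.distributed import Client, LocalCluster


def do_rechunk(client):
    """Placeholder for the actual rechunker logic in ERA5-rechunker-AWS.py."""
    ...


if __name__ == "__main__":
    # Creating the LocalCluster at import time (outside this guard) is what triggers
    # the "start a new process before ... bootstrapping phase" RuntimeError above.
    cluster = LocalCluster(n_workers=20, threads_per_worker=1)
    client = Client(cluster)
    try:
        do_rechunk(client)
    finally:
        client.close()
        cluster.close()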

rsignell (Author) commented Dec 6, 2024

Grrr, indeed, sorry guys. What I thought I was running was indeed not what I was running. :(
Trying again. Thanks!

ntabris (Member) commented Dec 6, 2024

FYI we'll push out a change early next week so that code/scripts will be shown in the UI (like they are for Dask clusters or coiled run), which should make answering "what code did I run?" much easier.

rsignell (Author) commented Dec 9, 2024

Grrr.... I still can't get this to work. I created a little rechunk repo here for the files I'm testing.

It works if I try this on a VM:

export AWS_ACCESS_KEY_ID=xxxxxxxxxxxxxxxxHF
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxCJ
export COILED_ARRAY_TASK_ID=1
python ERA5-rechunker-AWS.py

but when I try on coiled with ./submit_coiled_batch.sh, it fails, and the logs contain errors from the scheduler I don't understand, like:

Exception: 'AttributeError("\'str\' object has no attribute \'intermediate\'")'

@ntabris I'm happy to share the AWS credentials with you on Coiled Slack or something if you would like to test. They are just credentials to write to a specific Open Storage Network pod (not real AWS credentials, so you can't do any damage even if you wanted to! :))

jrbourbeau (Member) commented:

I created a little rechunk repo here for the files I'm testing

I'm happy to take a look but it looks like this is a private repo

rsignell (Author) commented Dec 9, 2024

Oops! I guess that private must be the default now. Fixed!

jrbourbeau (Member) commented Dec 9, 2024

@rsignell could you try again with the dask=2024.11.2 release instead of the latest dask=2024.12.0 release? I'm able to reproduce your Exception: 'AttributeError("\'str\' object has no attribute \'intermediate\'")' error with 2024.12.0 but not when I roll back one release to 2024.11.2 (haven't root caused the issue yet though)

EDIT: Looking into that failure upstream here pangeo-data/rechunker#153
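
If it helps while testing pins, building a Coiled software environment with dask held at 2024.11.2 could look roughly like the sketch below; the environment name and the rest of the package list are illustrative, not what rechunk_arm actually contains:

# Hedged sketch: a software environment with dask pinned to 2024.11.2.
import coiled

coiled.create_software_environment(
    name="rechunk-dask-2024-11-2",  # illustrative name
    conda={
        "channels": ["conda-forge"],
        "dependencies": [
            "python=3.11",
            "dask=2024.11.2",
            "distributed=2024.11.2",
            "xarray",
            "rechunker",
            "zarr",
            "s3fs",
        ],
    },
)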

rsignell (Author) commented Dec 9, 2024

I tried with:

  • dask=2024.11.2 and xarray=2024.7.0
  • dask (latest) and xarray=2024.7.0
  • dask=2024.11.2 and xarray (latest)

None of these worked -- they all errored out in under 2 minutes.

rsignell (Author) commented:

Is something getting lost (like the credentials) between the scheduler and the workers?

jrbourbeau (Member) commented:

Thanks for trying that out @rsignell. When I look at your most recent run, which is using dask=2024.11.2, I see that rechunking is actually running, but after about a minute we start seeing high memory warnings coming from the LocalCluster (running on the coiled batch run VM) where the rechunking is happening. So, things are still breaking, but now it's from the load on the VM and not library incompatibility issues (xref pangeo-data/rechunker#153 (comment)).

I've made this change

     # For c7g.8xlarge (32cpu, 64GB RAM)
-    n_workers=30
+    n_workers = 20
     mem = 64
-    cluster = LocalCluster(n_workers=n_workers, threads_per_worker=1)
+    cluster = LocalCluster(n_workers=n_workers, threads_per_worker=1, memory_limit=None)
     client = Client(cluster)

to reduce the load on the VM to see if that helps. So far it's been running for ~5 minutes without an issue.

rsignell (Author) commented:

@jrbourbeau Oh jeez, thanks! I didn't realize it was a memory issue! I assigned the max_mem for rechunk to be total_memory/n_workers*0.7 but I guess that wasn't good enough!
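
For the record, plugging the thread's numbers into that formula (a back-of-the-envelope sketch assuming the c7g.8xlarge's 64 GB of RAM and the worker counts from the diff above):

# total_memory / n_workers * 0.7 on a 64 GB c7g.8xlarge
total_memory_gb = 64

for n_workers in (30, 20):
    max_mem_gb = total_memory_gb / n_workers * 0.7
    print(f"{n_workers} workers -> max_mem ≈ {max_mem_gb:.2f} GB per worker")

# 30 workers -> max_mem ≈ 1.49 GB per worker
# 20 workers -> max_mem ≈ 2.24 GB per worker
# max_mem only sizes the chunks rechunker holds at once; actual worker memory
# (buffers, copies, the zarr write path) can exceed it, which is presumably how
# the VM still ran hot with 30 workers.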

rsignell (Author) commented Dec 11, 2024

@jrbourbeau can you tell from this log whether it's still running out of memory? I keep decreasing the max_mem passed to rechunker and decreasing the number of workers and it's still bombing out:
https://cloud.coiled.io/clusters/684396/account/esip-lab/information?workspace=esip-lab&tab=Logs&filterPattern=

jrbourbeau (Member) commented:

https://cloud.coiled.io/clusters/684396/account/esip-lab/information?workspace=esip-lab&tab=Logs&filterPattern=

Hmm that cluster is actually hitting an S3 / boto issue

botocore.exceptions.ClientError: An error occurred (SignatureDoesNotMatch) when calling the ListObjectsV2 operation: None

By chance, did you change anything with your --secret-env credentials forwarding?

Btw, more importantly, I notice you're running on a small VM when you intend to run on a big one. This PR OpenScienceComputing/rechunk-jobarray#1 should help
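
A quick sanity check along these lines, run at the top of the batch script, can confirm whether the credentials actually arrived on the VM and can sign requests against the pod; bucket and endpoint names below are placeholders:

# Check the secret env vars are present (without printing values) and that
# a signed listing against the placeholder OSN bucket succeeds.
import os

import s3fs

for var in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
    print(f"{var} set: {var in os.environ}")

fs = s3fs.S3FileSystem(client_kwargs={"endpoint_url": "https://osn-pod.example.org"})
print(fs.ls("osn-bucket"))  # SignatureDoesNotMatch here points at keys/endpoint, not rechunker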

rsignell (Author) commented:

I'm getting closer here -- everything seems fine when I launch only one job, but when I launch all 19 it seems to be crapping out partway through and I can't tell why! Help?

https://cloud.coiled.io/clusters/691886/account/esip-lab/information?workspace=esip-lab

jrbourbeau (Member) commented:

Ah, sorry about that @rsignell. We just deployed a fix -- can you try again?

rsignell (Author) commented:

Tried again and it's running but I expected it to take 30 min, and it's at around 90 so far...

https://cloud.coiled.io/clusters/693164/account/esip-lab/information

phofl commented Dec 19, 2024

There are tons of ReadTimeoutErrors and restarted workers. Memory usage is very high.

mrocklin (Member) commented:

Looks like high memory use happened at around the same time network bandwidth mostly shut down.

[Screenshot of cluster metrics, 2024-12-19 12:09 PM: memory climbing as network bandwidth drops]

Makes one wonder what rechunker was doing at this moment

rsignell (Author) commented Dec 20, 2024

I was wondering whether the problem might be when it tries to write to the Open Storage Network pod with all those machines, so I tried switching the storage to regular AWS S3 in the same region as the compute. That one finished just fine:
https://cloud.coiled.io/clusters/694266/account/esip-lab/information?workspace=esip-lab

To see if I could replicate the problem with OSN I tried that run again and it had the same issues as the first time:
https://cloud.coiled.io/clusters/694353/account/esip-lab/information?workspace=esip-lab

I'll investigate the problem with OSN, but I guess it's not a Coiled issue since it's working fine with AWS S3!

I did notice something that could be improved with the SLURM-like workflows. If you run with just N=1 (say, for testing), then the dashboard says 0 workers instead of 1, and it also seems there are no metrics, which is really too bad!
https://cloud.coiled.io/clusters/694131/account/esip-lab/information?workspace=esip-lab&tab=Metrics
Should I raise this as a separate issue?

ntabris (Member) commented Jan 2, 2025

I did notice something that could be improved with the SLURM-like workflows. If you run with just N=1 (say, for testing), then the dashboard says 0 workers instead of 1, and it also seems there are no metrics, which is really too bad!
https://cloud.coiled.io/clusters/694131/account/esip-lab/information?workspace=esip-lab&tab=Metrics
Should I raise this as a separate issue?

Good catch, thanks! I've opened an internal issue for this, so no need for you to open a separate one here.
