3.11.0 start up time longer than 3.9.1 #6479

Open
gwolski opened this issue Oct 17, 2024 · 11 comments

@gwolski

gwolski commented Oct 17, 2024

With 3.9.1, starting a compute node based on my custom AMI takes 4:11 (four minutes, 11 seconds).
After moving to 3.11.0, the same custom AMI configured with 3.11.0 takes 4:47 (four minutes, 47 seconds).
These start up times come from starting an m7a.medium.

My users are already complaining.

Is there any "performance work" being done to improve these start up times?
The actual machine is up at around the 3-minute mark, IIRC; it would be nice if we could get under four minutes before the job starts.

@gwolski gwolski added the 3.x label Oct 17, 2024
@hanwen-pcluste
Contributor

Hi Guntram,

To help us reproduce the issue, can you provide your cluster configuration file without sensitive information?

Performance work is being done, but we were not aware of any scaling speed difference between 3.9.1 and 3.11.0.

Thank you,
Hanwen

@gwolski
Author

gwolski commented Oct 18, 2024

Hello Hanwen,
I would be happy to provide it, but before I do and you dig into my config file, could you run a comparison with the setup you have? I based my times on how long the machine stays in the CF state until the job goes to RUNNING, as displayed by squeue. I'd hate for you to dig through my config file without first confirming at your end with your vanilla setups. Let me know what you see.
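
For anyone who wants to repeat this measurement without watching squeue by hand, here is a minimal sketch of the CF-to-RUNNING timing described above. It is an illustration, not the method actually used in this thread. It assumes `squeue` is on the PATH and uses its `-h`, `-j`, and `-o %T` options; the function names `squeue_state` and `time_configuring` are made up for this example, and the result is only as precise as the polling interval.

```python
import subprocess
import time
from typing import Callable, Optional


def squeue_state(job_id: str) -> str:
    """Return the job state printed by squeue, or "" if the job is not listed."""
    result = subprocess.run(
        ["squeue", "-h", "-j", job_id, "-o", "%T"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()


def time_configuring(
    get_state: Callable[[], str],
    poll_seconds: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Seconds from the first CONFIGURING sample to the first RUNNING sample.

    Returns None if the job reaches RUNNING without ever being seen in
    CONFIGURING, or if it disappears from the queue first.
    """
    started = None
    while True:
        state = get_state()
        now = clock()
        if state == "CONFIGURING" and started is None:
            started = now  # first time the job is observed in CF
        elif state == "RUNNING":
            return None if started is None else now - started
        elif state == "":
            return None  # job no longer listed by squeue
        sleep(poll_seconds)
```

A call such as `time_configuring(lambda: squeue_state("12345"))` (the job ID is a placeholder) blocks until the job is RUNNING and returns the elapsed seconds.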

@hanwen-pcluste
Contributor

Hi Guntram,

I am not able to reproduce a scaling time difference between 3.11.0 and 3.9.1, so a cluster configuration file would help us reproduce the issue.

Thank you,
Hanwen

@gwolski
Author

gwolski commented Oct 22, 2024

Hi Hanwen,
I will do some more benchmarking on Tuesday and get back to you with the results and the cluster configuration file.
--G

@gwolski
Author

gwolski commented Oct 24, 2024

I've got nothing definitive. I ran some tests by submitting jobs with srun. I watched the output of squeue and noted how long each job took to go from CONFIGURING to RUNNING. I even had some outliers that confuse me more. Here are the startup times for various instance types:

| Instance | 3.9.1 (m:ss) | 3.11.0 (m:ss) |
|---|---|---|
| r7i.large | 4:49 | 5:03 |
| m7a.medium | 4:08 | 4:16 |
| m7a.large | 3:17 | 5:03 |
| r7a.xlarge | 3:24 | 4:25 |
| m7a.4xlarge | 6:28 | 5:04 |

I even had the m7a.large in 3.11.0 take 7:09 in one attempt. Go figure. I wish things were consistent; I don't understand why there should be such a strange variation.
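
To make the spread in the table above easier to see, here is a small sketch that converts the m:ss values to seconds and computes the per-instance difference (3.11.0 minus 3.9.1). The numbers are copied from the table; the helper names `to_seconds` and `deltas` are invented for this example.

```python
def to_seconds(mmss: str) -> int:
    """Convert an "m:ss" string such as "4:49" to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Startup times copied from the table above: (3.9.1, 3.11.0)
STARTUP_TIMES = {
    "r7i.large": ("4:49", "5:03"),
    "m7a.medium": ("4:08", "4:16"),
    "m7a.large": ("3:17", "5:03"),
    "r7a.xlarge": ("3:24", "4:25"),
    "m7a.4xlarge": ("6:28", "5:04"),
}


def deltas(times: dict) -> dict:
    """Seconds added (positive) or saved (negative) by 3.11.0 per instance type."""
    return {
        instance: to_seconds(new) - to_seconds(old)
        for instance, (old, new) in times.items()
    }


for instance, delta in deltas(STARTUP_TIMES).items():
    print(f"{instance:12s} {delta:+d} s")
```

The differences range from -84 s (m7a.4xlarge) to +106 s (m7a.large). With one sample per cell, that spread is too wide to attribute to the version change alone.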

If you have any articles/wiki/instructions on how to ensure I have the fastest startup times, I'd appreciate a link.
Someday, I hope we'll be able to hibernate systems and then revive them so start up times are on the order of seconds (under a minute).

@gwolski gwolski closed this as completed Oct 24, 2024
@joehellmersNOAA

@gwolski Thanks for collecting this data. This is very useful. @hanwen-pcluste It would be nice if AWS could somehow break down those times into the constituent parts to diagnose what the differences are.

@gmarciani
Contributor

@gwolski thank you for collecting the startup time.
Can you please share the cluster config file with private data redacted?

@gwolski
Author

gwolski commented Nov 13, 2024

cluster config and scontrol show info attached.

I went back and reviewed my data. Most of the startup times I show above are from spot machine launches. I have one OnDemand launch of an r7a.medium in 3.11.1 that took 5:52.

I have been reviewing the slurmctld.log file and I think I can parse out launch-to-start times from that file. That is on my to-do list once I resolve #6529.
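
A rough sketch of what such parsing could look like follows. It is not a tested tool. It assumes the log contains lines of the form `[timestamp] sched: Allocate JobId=... NodeList=...` and `[timestamp] Node <name> now responding`; those line shapes, the timestamp format, and the function names are assumptions to check against a real slurmctld.log. It handles single-node allocations only, and "node now responding" is merely a proxy for the job reaching RUNNING.

```python
import re
from datetime import datetime

# ASSUMPTION: these patterns reflect commonly seen slurmctld messages;
# adjust them to match the lines in your own slurmctld.log.
ALLOC_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] sched: Allocate JobId=(?P<job>\d+) NodeList=(?P<node>\S+)"
)
RESP_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] Node (?P<node>\S+) now responding")


def parse_ts(text: str) -> datetime:
    """Parse a timestamp such as 2024-11-13T10:00:00.000 (assumed format)."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f")


def launch_to_ready(lines):
    """Return (job_id, node, seconds) from allocation until the node responds."""
    allocated = {}  # node name -> (job id, allocation time)
    results = []
    for line in lines:
        match = ALLOC_RE.match(line)
        if match:
            allocated[match["node"]] = (match["job"], parse_ts(match["ts"]))
            continue
        match = RESP_RE.match(line)
        if match and match["node"] in allocated:
            job, t0 = allocated.pop(match["node"])
            seconds = (parse_ts(match["ts"]) - t0).total_seconds()
            results.append((job, match["node"], seconds))
    return results
```

Passing an open file handle, for example `launch_to_ready(open("slurmctld.log"))` with whatever path the log has on your head node, yields one tuple per allocation that was later followed by the node responding.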

I also started a cluster using just your supported rhel8 x86_64 AMI yesterday. The first machine I started up with srun was a spot-based r7a.medium. It took 7:08 to go from CF to RUNNING.
tsi4_config_files.tar.gz

(If you find any private data in there that I failed to redact, please let me know so we can delete this attachment and reshare).

@gwolski gwolski reopened this Nov 13, 2024
@gwolski
Author

gwolski commented Nov 13, 2024

I forgot to mention: my cluster config file is generated from config files I write for https://github.com/aws-samples/aws-eda-slurm-cluster, and the cluster is created by the same tool.

(More anecdotal info: I just had 11 jobs that all requested spot r7a.medium take about 3:58 to boot. Nice.)

@hanwen-cluster
Contributor

Apologies for the late reply. We are still looking into this issue.

@hanwen-cluster
Contributor

hanwen-cluster commented Jan 24, 2025

Hi Guntram,

  1. Regarding the data here, are they OnDemand instances or SPOT instances?
  2. How many instances are you launching at once?
  3. Can you share the logs from the compute node that took the longest to launch? The logs should be available in the CloudWatch log group, with the instance ID in their names. See details here

Thank you,
Hanwen
