add GPU accounting for SMHP #462
base: main
d53d46f to f152b80
Please test auto-resume with this configuration enabled to confirm there is no conflict, and post the results here. I don't think there should be one, since you are only modifying accounting.conf; however, I have seen funky behavior when modifying or using gres attributes in |
Left comments.
1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/setup_mariadb_accounting.sh
f152b80 to 840c226
@@ -99,6 +99,8 @@ JobAcctGatherFrequency=30
 AccountingStorageType=accounting_storage/slurmdbd
 AccountingStorageHost=$DBD_HOST
 AccountingStoragePort=6819
+AccountingStorageTRES=gres/gpu
+GresTypes=gpu
Let me copy-paste my comment from Slack.
GresTypes=gpu doesn't allow customers to resume a faulty job if they want to; the AutoResume plugin will always requeue the job in this case. Personally, I'm fine with this because the result is the same: we recover a faulty job from a hardware failure. But this is a behaviour change for existing customers. We can start a discussion on whether gres should be on by default, because in my opinion it has more benefits for customers than resuming jobs.
We have CPU-only instances. Even trn1/trn2 are marked as "CPU only" and don't have any Gres attached now, and they definitely will not have GPUs. Maybe it makes sense to configure these values via ClusterAgent depending on the instance types in the cluster.
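To make the second point concrete, here is a rough sketch of what conditional configuration could look like in the lifecycle script. It is only an illustration under stated assumptions: the function names `has_gpu_instance_type` and `gres_accounting_lines` are hypothetical, the rule that `ml.p*` and `ml.g*` families are the GPU types is an assumption, and nothing here reflects how ClusterAgent actually works.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: emit the GPU accounting settings only when at least
# one instance type in the cluster looks like a GPU type.

# Returns 0 if any argument looks like a GPU instance type.
# ASSUMPTION: ml.p* and ml.g* families are treated as GPU types; trn1/trn2
# and CPU families fall through and are treated as having no GPUs.
has_gpu_instance_type() {
    local t
    for t in "$@"; do
        case "$t" in
            ml.p[0-9]*|ml.g[0-9]*) return 0 ;;
        esac
    done
    return 1
}

# Prints the two settings from this PR's diff, or nothing for clusters
# without GPU instance types.
gres_accounting_lines() {
    if has_gpu_instance_type "$@"; then
        printf '%s\n' "AccountingStorageTRES=gres/gpu" "GresTypes=gpu"
    fi
}
```

For example, `gres_accounting_lines ml.trn1.32xlarge ml.c5.xlarge` prints nothing, while a list that includes `ml.p4d.24xlarge` prints both settings, which the script could then append to the generated config.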
Discussing the gres setup internally.
This PR proposes to add GPU accounting to setup_mariadb_accounting.sh.