-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use process_retry on amd-smi #22
base: main
Are you sure you want to change the base?
Conversation
@@ -67,7 +67,7 @@ jobs: | |||
options: --device /dev/kfd --device /dev/dri --env ROCR_VISIBLE_DEVICES --env HIP_VISIBLE_DEVICES --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ | |||
steps: | |||
- name: AMD-SMI | |||
run: amd-smi list | |||
run: python3 utils/process_retry.py 'amd-smi list' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to update the repository before running this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't forget to update this one before merge.
@@ -120,7 +120,7 @@ jobs: | |||
echo "slice_ids=$(python3 -c 'd = list(range(${{ env.NUM_SLICES }})); print(d)')" >> $GITHUB_OUTPUT | |||
|
|||
- name: AMD-SMI | |||
run: amd-smi list | |||
run: python3 utils/process_retry.py 'amd-smi list' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
like this one being after name: Update clone
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
(I would prefer, in general, opening separate RPs for different purpose. I believe there are some changes here are just about styles.)
Yeah, most of those are from my IDE actually. It might be a bit too opinionated. |
Both
rocm-smi
andamd-smi list
randomly fail on our runners.This should increase the odds of getting past those failures.