[Linux H100] Containers with images considered "legacy" fail on docker run #6238
Comments
The bug might be related to some pet instances having legacy images, which causes the docker run error due to NVIDIA/nvidia-container-toolkit#797
I think we mitigated this a while ago by pinning
I manually install the older version of https://github.com/pytorch/ao/actions/runs/13190616812/job/36824369092
@jeanschmidt do we need to change how H100s are set up/refreshed to ensure we use an appropriate
Oops, I moved this one to cold storage. This seems like an item we want to fix sooner rather than later, now that the compiler team is using H100 for their benchmarks: pytorch/pytorch#146868
If pinning would solve the issue, I believe we can pin the version. AFAIK this package already comes pre-installed in the image version we use, so we might want to downgrade it.
This PR should pin it: https://github.com/pytorch-labs/pytorch-gha-infra/pull/618. We might want to merge those and run cattle spa on all instances during the weekend.
description
When the AO test runs on H100, the Linux test job is not consistent: when the image is considered 'legacy', docker run fails.
Error peek
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: error parsing IMEX info: unsupported IMEX channel value: all: unknown.
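For triage, a job log can be checked for this specific failure signature. The snippet below is only a minimal sketch: the function names `is_legacy_imex_failure` and `imex_channel_value` are hypothetical helpers, not part of any existing tooling, and the matched strings are taken directly from the error quoted above.

```python
import re
from typing import Optional

# Both markers are copied from the error message quoted in this issue.
LEGACY_MODE_MARKER = "Auto-detected mode as 'legacy'"
IMEX_ERROR_RE = re.compile(
    r"error parsing IMEX info: unsupported IMEX channel value: (?P<value>[^:\s]+)"
)


def imex_channel_value(stderr: str) -> Optional[str]:
    """Return the rejected IMEX channel value, or None if the error is absent."""
    match = IMEX_ERROR_RE.search(stderr)
    return match.group("value") if match else None


def is_legacy_imex_failure(stderr: str) -> bool:
    """True when the output shows both legacy mode and the IMEX parsing error."""
    return LEGACY_MODE_MARKER in stderr and imex_channel_value(stderr) is not None
```

Passing the captured output of a failed `docker run` to `is_legacy_imex_failure` would separate this failure from unrelated docker errors. It does not identify which runner instance or toolkit version produced it.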
example
AO test
failure example: https://github.com/pytorch/ao/actions/runs/13042725250/job/36387841392
success example: https://github.com/pytorch/ao/actions/runs/12999348107/job/36254475921