[Linux H100] containers with images considered "legacy" failed to do docker run #6238

Open
yangw-dev opened this issue Jan 30, 2025 · 7 comments

Comments

@yangw-dev
Contributor

yangw-dev commented Jan 30, 2025

description

When the AO test runs on H100, the Linux test job is inconsistent: when the image is considered 'legacy', docker run fails.

Error Peek

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: error parsing IMEX info: unsupported IMEX channel value: all: unknown.
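
The error above has a recognizable signature, so a job or runner health check could detect it automatically. A minimal sketch, assuming the stderr text of a failed docker run is available as a string; the function name `classify_docker_error` is hypothetical, and the matched substrings are copied from the message quoted in this issue:

```python
import re
from typing import Optional

# Substrings taken verbatim from the error message quoted in this issue.
_LEGACY_MODE = "Auto-detected mode as 'legacy'"
_IMEX_ERROR = re.compile(
    r"error parsing IMEX info: unsupported IMEX channel value: (\S+?):"
)


def classify_docker_error(stderr: str) -> Optional[str]:
    """Return the rejected IMEX channel value if stderr matches this issue, else None."""
    if _LEGACY_MODE not in stderr:
        return None
    match = _IMEX_ERROR.search(stderr)
    return match.group(1) if match else None
```

For the message above, this returns the rejected channel value ("all"); for unrelated docker failures it returns None.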

example

AO test
failure example: https://github.com/pytorch/ao/actions/runs/13042725250/job/36387841392
success example: https://github.com/pytorch/ao/actions/runs/12999348107/job/36254475921

@yangw-dev
Contributor Author

The bug might be related to some pet instances having legacy images, which causes the docker run error due to NVIDIA/nvidia-container-toolkit#797

@huydhn
Contributor

huydhn commented Feb 7, 2025

I think we mitigated this a while ago by pinning nvidia-container-toolkit in #5852; maybe we need to do the same here

cc @jeanschmidt @ZainRizvi

@huydhn
Contributor

huydhn commented Feb 7, 2025

I manually installed the older version of nvidia-container-toolkit on the broken runner i-0d3ed1ff3ccbeec77 with apt-get install nvidia-container-toolkit=1.16.2-1 nvidia-container-toolkit-base=1.16.2-1 to get the runner back up for now

https://github.com/pytorch/ao/actions/runs/13190616812/job/36824369092
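
The manual workaround above could be scripted for other affected runners. The following is only a sketch: the package names and the version 1.16.2-1 come from the command in this comment, while the `--allow-downgrades` flag and the `apt-mark hold` step are assumptions added here so that a later upgrade does not silently undo the pin. The helper name `pin_commands` is hypothetical.

```python
from typing import List, Sequence

# Version and package names are the ones used in the manual workaround above.
PINNED_VERSION = "1.16.2-1"
PACKAGES = ("nvidia-container-toolkit", "nvidia-container-toolkit-base")


def pin_commands(
    version: str = PINNED_VERSION, packages: Sequence[str] = PACKAGES
) -> List[List[str]]:
    """Build the argv lists that install the pinned version and hold it."""
    pinned = [f"{pkg}={version}" for pkg in packages]
    return [
        # Assumption: --allow-downgrades is needed because a newer version is installed.
        ["apt-get", "install", "-y", "--allow-downgrades", *pinned],
        # Assumption: hold the packages so unattended upgrades do not replace them.
        ["apt-mark", "hold", *packages],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them; they require root on the runner.
    for argv in pin_commands():
        print(" ".join(argv))
```

Printing rather than executing keeps the sketch safe to run anywhere; on a real runner the argv lists could be passed to `subprocess.run` with root privileges.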

@ZainRizvi
Contributor

@jeanschmidt do we need to change how H100s are set up/refreshed to ensure we use an appropriate nvidia-container-toolkit version?

@huydhn
Contributor

huydhn commented Feb 21, 2025

Oops, I moved this one to cold storage. This seems like an item we want to fix sooner rather than later, now that the compiler team is using H100 for their benchmarks (pytorch/pytorch#146868)

@huydhn huydhn moved this from Cold Storage to Prioritized in PyTorch OSS Dev Infra Feb 21, 2025
@jeanschmidt
Contributor

If pinning would solve the issue, I believe we can pin the version. AFAIK this package already comes pre-installed in the image version we use. So we might want to downgrade it.
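
Since the package comes pre-installed, a setup script would first need to know which version the image shipped with before deciding to downgrade. A rough sketch, assuming a Debian-based runner where `dpkg-query` is available; the helper names `installed_version` and `needs_repin` are hypothetical, and the comparison is a plain inequality rather than a full Debian version ordering:

```python
import subprocess
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a Debian package, or None if unknown."""
    try:
        result = subprocess.run(
            ["dpkg-query", "-W", "-f=${Version}", package],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        # dpkg-query is not present (for example, on a non-Debian host).
        return None
    version = result.stdout.strip()
    return version if result.returncode == 0 and version else None


def needs_repin(installed: Optional[str], pinned: str) -> bool:
    """True when a version is installed and it differs from the pinned one."""
    return installed is not None and installed != pinned
```

A setup step could call `needs_repin(installed_version("nvidia-container-toolkit"), "1.16.2-1")` and only downgrade when it returns True, using the version from the manual workaround earlier in this thread.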

@jeanschmidt
Contributor

This PR should pin it:

https://github.com/pytorch-labs/pytorch-gha-infra/pull/618

We might want to merge those and run cattle spa on all instances over the weekend
