[Linux H100] containers with images considered "legacy" failed to do docker run #6238

yangw-dev · 2025-01-30T18:18:11Z

description

When the AO test runs on H100, the Linux test job is not consistent: when the container image is detected as 'legacy', docker run fails.

Error Peak

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: error parsing IMEX info: unsupported IMEX channel value: all: unknown.
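To tell this failure apart from other docker run errors in a job log, a minimal sketch is shown here. It only matches the error text quoted above; the function name `find_imex_error` is hypothetical and not part of any existing tooling in this repo.

```python
import re

# Signature copied from the stderr quoted in this issue. The captured group is
# the IMEX channel value that nvidia-container-cli rejected ("all" in the log above).
IMEX_ERROR = re.compile(
    r"nvidia-container-cli: error parsing IMEX info: "
    r"unsupported IMEX channel value: (\S+?):"
)


def find_imex_error(stderr: str):
    """Return the rejected IMEX channel value, or None if the signature is absent.

    Hypothetical helper for illustration only.
    """
    match = IMEX_ERROR.search(stderr)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = (
        "stderr: Auto-detected mode as 'legacy' nvidia-container-cli: "
        "error parsing IMEX info: unsupported IMEX channel value: all: unknown."
    )
    print(find_imex_error(sample))
```

A check like this could be run over the captured stderr of a failed step to decide whether the runner needs the toolkit downgrade rather than a plain retry.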

example

AO test
failure example: https://github.com/pytorch/ao/actions/runs/13042725250/job/36387841392
success example: https://github.com/pytorch/ao/actions/runs/12999348107/job/36254475921


yangw-dev · 2025-01-30T18:18:25Z

The bug might be related to some pet instances having legacy images, which causes the docker run error due to NVIDIA/nvidia-container-toolkit#797

huydhn · 2025-02-07T01:07:29Z

I think we mitigated this a while ago by pinning nvidia-container-toolkit (#5852); maybe we need to do the same here

cc @jeanschmidt @ZainRizvi

huydhn · 2025-02-07T01:26:46Z

I manually installed the older version of nvidia-container-toolkit on the broken runner i-0d3ed1ff3ccbeec77 with apt-get install nvidia-container-toolkit=1.16.2-1 nvidia-container-toolkit-base=1.16.2-1 to get the runner back up

https://github.com/pytorch/ao/actions/runs/13190616812/job/36824369092
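For reference, the manual downgrade above could be scripted roughly as follows. This is a sketch under assumptions: the package names and the 1.16.2-1 version come from the command above, while the `-y` and `--allow-downgrades` flags and the `apt-mark hold` step are additions not mentioned in this thread, and the function names are hypothetical.

```python
import subprocess

# Package names taken from the apt-get command quoted in this thread.
PACKAGES = ("nvidia-container-toolkit", "nvidia-container-toolkit-base")


def pin_commands(version: str):
    """Build the commands that install and hold one toolkit version.

    Assumption: holding the packages with apt-mark is desirable so a later
    upgrade does not undo the pin; the thread itself only shows the install.
    """
    specs = [f"{pkg}={version}" for pkg in PACKAGES]
    return [
        ["apt-get", "install", "-y", "--allow-downgrades", *specs],
        ["apt-mark", "hold", *PACKAGES],
    ]


def apply_pin(version: str, dry_run: bool = True) -> None:
    """Print the commands, or run them (requires root) when dry_run is False."""
    for cmd in pin_commands(version):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    apply_pin("1.16.2-1")  # dry run: only prints the commands
```

Running it with `dry_run=True` only prints the two commands, which makes it easy to review them before applying anything to a runner.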

ZainRizvi · 2025-02-21T19:33:11Z

@jeanschmidt do we need to change how H100s are set up/refreshed to ensure we use an appropriate nvidia-container-toolkit version?

huydhn · 2025-02-21T19:43:52Z

Oops, I moved this one to cold storage. This seems like an item we want to fix sooner rather than later, now that the compiler team is using H100 for their benchmarks (pytorch/pytorch#146868)

jeanschmidt · 2025-02-21T20:44:53Z

If pinning would solve the issue, I believe we can pin the version. AFAIK this package already comes pre-installed in the image version we use. So we might want to downgrade it.

jeanschmidt · 2025-02-21T20:51:07Z

This PR should pin it:

https://github.com/pytorch-labs/pytorch-gha-infra/pull/618

We might want to merge those and run cattle spa on all instances over the weekend

github-project-automation bot added this to PyTorch OSS Dev Infra Jan 30, 2025

yangw-dev mentioned this issue Jan 30, 2025

linux-job sets NVIDIA_IMEX_CHANNELS=0 when running with CUDA gpus #6236

Closed

huydhn moved this to Cold Storage in PyTorch OSS Dev Infra Feb 4, 2025

huydhn moved this from Cold Storage to Prioritized in PyTorch OSS Dev Infra Feb 21, 2025
