Preflight+Post-mortem machine health checks #386

Open
JamesKunstle opened this issue Jan 8, 2025 · 0 comments


It'd be nice to be able to verify that a machine is in a heuristically OK state before kicking off a long-running training job. There are some common things to check:

  1. Host -> Card throughput
  2. Card -> Card ring throughput
  3. Card power throttling
  4. Card memory row remapping
  5. Card memory page tainting

to name a few broadly.
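
A couple of these checks (power throttling, row remapping) could be scripted on top of `nvidia-smi` CSV queries. The following is a minimal sketch, not a settled design: the helper names (`flag_rows`, `run_query`) and the `OK_VALUES` set are hypothetical, and the query field names and value spellings are assumptions that should be checked against `nvidia-smi --help-query-gpu` and `nvidia-smi --help-query-remapped-rows` for the installed driver.

```python
import csv
import io
import subprocess

# Assumed query fields; confirm them for your driver version before relying on this.
THROTTLE_FIELDS = ["clocks_throttle_reasons.sw_power_cap",
                   "clocks_throttle_reasons.hw_slowdown"]
REMAP_FIELDS = ["remapped_rows.pending", "remapped_rows.failure"]

THROTTLE_CMD = ["nvidia-smi",
                "--query-gpu=index," + ",".join(THROTTLE_FIELDS),
                "--format=csv,noheader"]
REMAP_CMD = ["nvidia-smi",
             "--query-remapped-rows=gpu_bus_id," + ",".join(REMAP_FIELDS),
             "--format=csv,noheader"]

# Assumption: values that mean "nothing wrong". Anything else gets flagged,
# including "not supported" style values, so this errs on the noisy side.
OK_VALUES = {"not active", "no", "0"}


def flag_rows(csv_text, field_names):
    """Return (device, field, value) for each field whose value is not in OK_VALUES.

    The first CSV column is treated as the device identifier.
    """
    problems = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        device = row[0].strip()
        for name, value in zip(field_names, row[1:]):
            if value.strip().lower() not in OK_VALUES:
                problems.append((device, name, value.strip()))
    return problems


def run_query(cmd):
    """Run an nvidia-smi query and return its stdout (raises on non-zero exit)."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

A preflight step could then call `flag_rows(run_query(THROTTLE_CMD), THROTTLE_FIELDS)` and `flag_rows(run_query(REMAP_CMD), REMAP_FIELDS)` and refuse to launch if either list is non-empty. Throughput checks (host -> card, card -> card ring) need an actual bandwidth benchmark and are not covered by this sketch.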

Occasionally, our training runs fail mid-run, possibly because some collectives time out (NCCL, RCCL, or HCCL, depending on the backend). Watchdog timers currently kill training in these situations so that it can be restarted, but it'd be great to do better than that.

NVIDIA DCGM diagnostic levels 1 and 2 can provide card-level diagnostics for pre-flight checks, and levels 3 and 4 can do some post-mortem info logging. However, these currently have to be run manually.
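
One way to avoid running these by hand would be to wrap the training entry point so that `dcgmi diag -r <level>` runs before launch and again after a failure. This is a rough sketch under stated assumptions: `guarded_train`, `run_dcgm_diag`, and `dcgm_diag_command` are hypothetical names, the default levels (2 for preflight, 3 for post-mortem) are illustrative, and treating a non-zero `dcgmi` exit code as "diagnostics failed" is an assumption that should be verified against the DCGM documentation.

```python
import subprocess


def dcgm_diag_command(level):
    """Build the dcgmi command line for a diagnostic run level (1-4)."""
    if level not in (1, 2, 3, 4):
        raise ValueError(f"unsupported DCGM diag level: {level}")
    return ["dcgmi", "diag", "-r", str(level)]


def run_dcgm_diag(level, runner=subprocess.run):
    """Run the diagnostic and return (passed, report_text).

    Assumption: exit code 0 means every test passed.
    """
    result = runner(dcgm_diag_command(level), capture_output=True, text=True)
    return result.returncode == 0, result.stdout


def guarded_train(train_fn, preflight_level=2, postmortem_level=3,
                  diag=run_dcgm_diag):
    """Run a preflight diagnostic, then training, then a post-mortem on failure."""
    ok, report = diag(preflight_level)
    if not ok:
        raise RuntimeError("preflight diagnostics failed:\n" + report)
    try:
        return train_fn()
    except Exception:
        # Training died: collect a deeper diagnostic report before re-raising.
        _, report = diag(postmortem_level)
        print("post-mortem diagnostics:\n" + report)
        raise
```

The `diag` parameter is injectable so the wrapper can be exercised without DCGM installed. Note that this only covers failures that surface as Python exceptions; a process killed by a watchdog would need the post-mortem step to run from the supervising launcher instead.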
