It'd be nice to be able to verify that a machine is in a heuristically OK state before kicking off a long-running training job. There are some common things to check:
Host -> Card throughput
Card -> Card ring throughput
Card power throttling
Card memory row remapping
Card memory page tainting
to name a few broad categories.
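As a rough illustration of what an automated check for two of these items (power throttling and pending page retirement) might look like, here is a minimal sketch that parses `nvidia-smi` CSV output. The function names are hypothetical. The query fields and the `Active` / `Yes` value spellings are assumptions about the installed driver, and `nvidia-smi --help-query-gpu` lists what a given version supports. Throughput and row-remapping checks are not covered here.

```python
import csv
import io
import subprocess

# Assumed query fields; verify against `nvidia-smi --help-query-gpu`.
QUERY_FIELDS = [
    "index",
    "clocks_throttle_reasons.sw_power_cap",
    "clocks_throttle_reasons.hw_slowdown",
    "retired_pages.pending",
]


def query_gpus():
    """Run nvidia-smi and return its raw CSV output (one row per GPU)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_gpu_report(csv_text):
    """Return (gpu_index, problem) tuples for every flagged condition."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(QUERY_FIELDS):
            raise ValueError(f"unexpected row: {row!r}")
        index, sw_power_cap, hw_slowdown, pages_pending = (c.strip() for c in row)
        gpu = int(index)
        # Assumed spellings: "Active" / "Not Active" and "Yes" / "No".
        if sw_power_cap == "Active":
            problems.append((gpu, "power-cap throttling"))
        if hw_slowdown == "Active":
            problems.append((gpu, "hardware slowdown"))
        if pages_pending == "Yes":
            problems.append((gpu, "retired pages pending"))
    return problems
```

A launcher could call `parse_gpu_report(query_gpus())` and refuse to start the job if the returned list is non-empty.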
Occasionally, our training runs fail mid-run, possibly because some collectives time out (NCCL, RCCL, HCCL implied). Watchdog timers currently kill the training job so it can be restarted in these situations, but it'd be great to do better than that.
NVIDIA DCGM diagnostics at levels 1 and 2 can provide card-level checks for pre-flight, and levels 3 and 4 can log some post-mortem information. However, these currently have to be run manually.
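One way to remove the manual step would be to have the job launcher invoke the diagnostic itself. The sketch below wraps `dcgmi diag -r <level>` in a small Python helper. The helper names are hypothetical, and treating a non-zero exit status as a failed diagnostic is an assumption that may not hold for every DCGM version, in which case the command output would need to be parsed instead.

```python
import subprocess


def build_diag_command(level):
    """Build the dcgmi command line for a diagnostic run level (1 to 4)."""
    if level not in (1, 2, 3, 4):
        raise ValueError(f"unsupported DCGM diag level: {level}")
    return ["dcgmi", "diag", "-r", str(level)]


def run_preflight(level=1, runner=subprocess.run):
    """Run the diagnostic and report whether it appears to have passed.

    `runner` is injectable so the logic can be exercised without a GPU.
    Assumption: a non-zero exit status means the diagnostic failed.
    """
    result = runner(build_diag_command(level), capture_output=True, text=True)
    return result.returncode == 0
```

A launcher could then gate the training job on `run_preflight(level=2)` and log the diagnostic output when it returns `False`.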