Power monitoring (and, more generally, shaping a power profile according to e.g. cost or power mix) is a thing for NAIC and may become a thing for others.
Slurm has a power-monitoring capability, but it requires access to IPMI or similar; here are some ancient slides from when that was introduced.
For sonar, we can read GPU power from the cards, and we can also read blade or node power (not sure which) using IPMI/DCMI:
```
[root@gpu-2 ~]# ipmi-dcmi --get-system-power-statistics
Current Power                        : 1190 Watts
Minimum Power over sampling duration : 1189 watts
Maximum Power over sampling duration : 1190 watts
Average Power over sampling duration : 1189 watts
Time Stamp                           : 11/26/2024 - 14:21:55
Statistics reporting time period     : 688000 milliseconds
Power Measurement                    : Active
```
And one that's not so busy:
```
[root@gpu-4 ~]# ipmi-dcmi --get-system-power-statistics
Current Power                        : 1123 Watts
Minimum Power over sampling duration : 961 watts
Maximum Power over sampling duration : 1123 watts
Average Power over sampling duration : 1065 watts
Time Stamp                           : 11/26/2024 - 14:23:07
Statistics reporting time period     : 2362000 milliseconds
Power Measurement                    : Active
```
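If sonar (or a helper) captures this output, turning it into numbers is straightforward. Below is a minimal Python sketch that assumes the `Label : value` layout shown in the outputs above; `PowerStats` and `parse_power_statistics` are hypothetical names. It also assumes the average is taken over the reported statistics period, in which case average power times that period gives the energy used in the window. For gpu-4 that would be 1065 W × 2362 s ≈ 2.52 MJ, or about 0.70 kWh.

```python
import re
from dataclasses import dataclass


@dataclass
class PowerStats:
    current_w: float
    minimum_w: float
    maximum_w: float
    average_w: float
    period_ms: int

    def window_energy_joules(self) -> float:
        # Assumption: average power (W) applies to the whole reporting
        # period, so W * seconds gives energy in joules for that window.
        return self.average_w * self.period_ms / 1000.0


# Maps the ipmi-dcmi labels we care about onto PowerStats fields.
_FIELDS = {
    "Current Power": "current_w",
    "Minimum Power over sampling duration": "minimum_w",
    "Maximum Power over sampling duration": "maximum_w",
    "Average Power over sampling duration": "average_w",
    "Statistics reporting time period": "period_ms",
}


def parse_power_statistics(text: str) -> PowerStats:
    """Parse `ipmi-dcmi --get-system-power-statistics` output (hypothetical helper)."""
    values = {}
    for line in text.splitlines():
        # Split on the first colon only; the time stamp contains more colons.
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        attr = _FIELDS.get(key.strip())
        if attr is None:
            continue  # e.g. "Time Stamp", "Power Measurement"
        # Take the first number; units vary in case ("Watts" vs "watts").
        match = re.search(r"\d+(?:\.\d+)?", rest)
        if match:
            values[attr] = float(match.group())
    missing = set(_FIELDS.values()) - set(values)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    values["period_ms"] = int(values["period_ms"])
    return PowerStats(**values)
```

Keying on the labels rather than on line positions keeps the parser tolerant of extra or reordered lines.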
Unfortunately, this does require root access, which probably means we don't want sonar to do it directly. But we may be able to go via an audited suid-root executable that only invokes ipmi-dcmi or similar.
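The policy such a wrapper would enforce can be sketched in a few lines. Linux ignores the setuid bit on interpreted scripts, so the real audited wrapper would have to be a small compiled program; this Python sketch only illustrates the rules it would follow: accept no caller-controlled arguments, run one fixed absolute path, and scrub the environment. The install path `/usr/sbin/ipmi-dcmi` and the names `build_command` and `run` are assumptions, not an existing tool.

```python
import subprocess

# Hypothetical install location; check where FreeIPMI lives on the target system.
IPMI_DCMI = "/usr/sbin/ipmi-dcmi"
# The only operation the wrapper will ever perform.
FIXED_ARGS = ("--get-system-power-statistics",)


def build_command(caller_argv):
    """Return the fixed argv to execute, refusing any caller-supplied arguments."""
    if list(caller_argv):
        raise PermissionError("this wrapper accepts no arguments")
    return [IPMI_DCMI, *FIXED_ARGS]


def run(caller_argv, timeout=10.0):
    cmd = build_command(caller_argv)
    # Fixed minimal environment so the caller cannot influence PATH, LD_*, locale, etc.
    env = {"PATH": "/usr/sbin:/usr/bin:/sbin:/bin", "LC_ALL": "C"}
    # No shell, absolute path, bounded runtime; raises on non-zero exit.
    result = subprocess.run(
        cmd, env=env, capture_output=True, text=True, timeout=timeout, check=True
    )
    return result.stdout
```

The wrapper's entry point would just call `run(sys.argv[1:])` and print the result; keeping the argument check separate from execution makes it easy to audit and test.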
- Uncertain whether the GPU cards are included in the IPMI readings or not; probably "it depends".
- IPMI readings tend to be "pretty good" relative to board draw.
- Beyond the board, there are often many nodes per PSU; even on Saga there appear to be multiple nodes per IPMI sensor.
- Generally, "it's complicated".
- ipmitool lets one play with the sensors and understand the system architecture.
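When exploring with ipmitool, a quick way to see which sensors report power is to filter its sensor listing by unit. A minimal sketch, assuming the pipe-separated layout that `ipmitool sensor` prints (sensor name, reading, unit, status, then thresholds); sensor names vary by vendor, so it filters on the unit column rather than on names. `power_sensors` is a hypothetical helper.

```python
def power_sensors(sensor_output: str) -> dict:
    """Return {sensor name: watts} for rows whose unit column is Watts."""
    readings = {}
    for line in sensor_output.splitlines():
        cols = [c.strip() for c in line.split("|")]
        # Assumed column order: name | reading | unit | status | thresholds...
        if len(cols) < 3 or cols[2].lower() != "watts":
            continue
        try:
            readings[cols[0]] = float(cols[1])
        except ValueError:
            # Readings such as "na" mean the sensor currently has no value.
            continue
    return readings
```

Comparing these per-PSU readings with the `ipmi-dcmi` system figure is one way to probe which nodes share a sensor.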
There are interesting use cases for determining how much energy a job uses: a job can be billed for energy, or a job whose excess heat goes back into some heating system can be credited for the heat it produces.
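For per-job accounting, the energy is the integral of power over the job's lifetime. A minimal sketch, assuming sonar collects timestamped node-power samples: it applies the trapezoidal rule and, as a hypothetical simplification for the case where several nodes share one IPMI sensor, splits the reading evenly between them (the real split is, as noted above, complicated). `energy_joules` and `nodes_sharing_sensor` are illustrative names.

```python
from typing import Sequence, Tuple

JOULES_PER_KWH = 3.6e6


def energy_joules(
    samples: Sequence[Tuple[float, float]], nodes_sharing_sensor: int = 1
) -> float:
    """Trapezoidal integral of power over time.

    samples: (unix_time_seconds, watts) pairs in increasing time order.
    nodes_sharing_sensor: hypothetical even split when one sensor covers several nodes.
    """
    if nodes_sharing_sensor < 1:
        raise ValueError("nodes_sharing_sensor must be >= 1")
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t1 < t0:
            raise ValueError("samples must be in time order")
        # Area of the trapezoid between consecutive samples: mean power * interval.
        total += (p0 + p1) / 2.0 * (t1 - t0)
    return total / nodes_sharing_sensor
```

For example, samples of 1000 W, 1200 W and 1100 W taken a minute apart give 135000 J (0.0375 kWh); with two nodes behind the sensor, each would be charged half of that.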