System power monitoring #215

Open
lars-t-hansen opened this issue Nov 26, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@lars-t-hansen
Collaborator

Power monitoring (and, more generally, shaping a power profile according to e.g. cost or power mix) is a thing for NAIC and may become a thing for others.

Slurm has a power monitoring capability but requires access to IPMI or similar for it; here are some ancient slides from when that was introduced.

For sonar, we can read GPU power from the cards, and we can also read blade or node power (not sure which) using IPMI / DCMI:

[root@gpu-2 ~]# ipmi-dcmi --get-system-power-statistics
Current Power                        : 1190 Watts
Minimum Power over sampling duration : 1189 watts
Maximum Power over sampling duration : 1190 watts
Average Power over sampling duration : 1189 watts
Time Stamp                           : 11/26/2024 - 14:21:55
Statistics reporting time period     : 688000 milliseconds
Power Measurement                    : Active

And one that's not so busy:

[root@gpu-4 ~]# ipmi-dcmi --get-system-power-statistics
Current Power                        : 1123 Watts
Minimum Power over sampling duration : 961 watts
Maximum Power over sampling duration : 1123 watts
Average Power over sampling duration : 1065 watts
Time Stamp                           : 11/26/2024 - 14:23:07
Statistics reporting time period     : 2362000 milliseconds
Power Measurement                    : Active
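
If sonar (or a helper) ends up consuming this output, the wattage fields are easy to pull out. The following is only a minimal sketch in Python, assuming the output keeps the "name : value watts" layout shown above; `parse_dcmi_power` is a hypothetical helper name, not part of sonar or of ipmi-dcmi.

```python
import re

# Matches lines such as "Current Power                        : 1190 Watts".
# The unit appears as both "Watts" and "watts" in the output, so ignore case.
_POWER_LINE = re.compile(
    r"^\s*(?P<key>[^:]+?)\s*:\s*(?P<value>\d+)\s+watts\s*$",
    re.IGNORECASE,
)


def parse_dcmi_power(text):
    """Return a dict mapping each power field name to its value in watts.

    Lines that do not carry a wattage (time stamp, reporting period,
    measurement state) are skipped.
    """
    readings = {}
    for line in text.splitlines():
        match = _POWER_LINE.match(line)
        if match:
            readings[match.group("key")] = int(match.group("value"))
    return readings


# Sample taken from the gpu-4 output above.
sample = """\
Current Power                        : 1123 Watts
Minimum Power over sampling duration : 961 watts
Maximum Power over sampling duration : 1123 watts
Average Power over sampling duration : 1065 watts
Time Stamp                           : 11/26/2024 - 14:23:07
Statistics reporting time period     : 2362000 milliseconds
Power Measurement                    : Active
"""

readings = parse_dcmi_power(sample)
```

Here `readings["Current Power"]` is 1123 and the time stamp and reporting period are ignored. Other BMCs or tool versions may format the output differently, so a real implementation would need to tolerate missing fields.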

Unfortunately this does require root access, which probably means we don't want sonar to be able to do it directly. But we may be able to go via an audited suid-root executable that only invokes ipmi-dcmi or similar.

lars-t-hansen added the enhancement label Nov 26, 2024
@lars-t-hansen
Collaborator Author

Some notes from a chat with Einar:

  • an IPMI read-only unprivileged user may be possible
  • uncertain whether GPU cards are included in IPMI readings or not, probably "it depends"
  • IPMI readings tend to be "pretty good" relative to board draw
  • outside the board there are often many nodes per PSU; even on Saga there appear to be multiple nodes per IPMI sensor
  • generally, "it's complicated"
  • ipmitool will allow one to play with the sensors and understand the system architecture
  • there are interesting use cases for determining how much energy a job uses: a job can be billed for its energy, or a job whose excess heat goes back into some heating system can be credited for the heat it produces
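
For the energy-per-job use case, the arithmetic is just integrating sampled power over the job's lifetime. A minimal sketch in Python, assuming we have periodic (timestamp in seconds, watts) samples for a node and that the job has the node to itself; both function names are hypothetical, and attributing a shared sensor reading to individual jobs is not addressed here.

```python
def energy_joules(samples):
    """Integrate (timestamp_seconds, watts) samples with the trapezoidal rule.

    The samples must be sorted by timestamp. Returns energy in joules
    (watt-seconds); fewer than two samples yields 0.0.
    """
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


def joules_to_kwh(joules):
    """Convert joules to kilowatt-hours (1 kWh = 3.6e6 J)."""
    return joules / 3.6e6


# Three made-up samples taken one minute apart.
samples = [(0, 1000), (60, 1200), (120, 1100)]
job_energy = energy_joules(samples)   # 66000 + 69000 = 135000 J
job_kwh = joules_to_kwh(job_energy)
```

The trapezoidal rule is only one choice; if the BMC already reports an average over a known period, average power times the period length gives the same kind of estimate with less sampling.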
