Error Loading eBPF Objects (softirq_entry) #1765

Closed
marvin-steinke opened this issue Sep 5, 2024 · 2 comments

Labels
kind/bug report bug issue

marvin-steinke commented Sep 5, 2024

What happened?

Kepler seems to have problems with eBPF on my current setup. Kepler logs state:

failed to create eBPF exporter: error loading eBPF objects: field KeplerIrqTrace: program kepler_irq_trace: attach Tracing/TraceRawTp: raw_tp softirq_entry not supported

However, softirq_entry is present at /sys/kernel/debug/ on the host. I found the similar issue #727, which points to a permission problem. Do I need to configure my host differently?

What did you expect to happen?

Installation succeeds.

How can we reproduce it (as minimally and precisely as possible)?

helm install kepler kepler/kepler --namespace kepler --create-namespace

Anything else we need to know?

OS: Ubuntu 20.04.3 LTS x86_64
Host: SYS-1019GP-TT 0123456789
Kernel: 5.4.0-192-generic
CPU: Intel Xeon Silver 4208 (16) @ 3.200GHz
GPU: NVIDIA Quadro RTX 5000
GPU: NVIDIA Quadro RTX 5000
Memory: 95208MiB

Kepler image tag

I0905 08:13:52.281482       1 gpu.go:38] Trying to initialize GPU collector using dcgm
W0905 08:13:52.281702       1 gpu_dcgm.go:104] There is no DCGM daemon running in the host: libdcgm.so not Found
W0905 08:13:52.281727       1 gpu_dcgm.go:108] Could not start DCGM. Error: libdcgm.so not Found
I0905 08:13:52.281733       1 gpu.go:45] Error initializing dcgm: not able to connect to DCGM: libdcgm.so not Found
I0905 08:13:52.281739       1 gpu.go:38] Trying to initialize GPU collector using nvidia-nvml
I0905 08:13:52.281789       1 gpu.go:45] Error initializing nvidia-nvml: failed to init nvml. ERROR_LIBRARY_NOT_FOUND
I0905 08:13:52.281798       1 gpu.go:38] Trying to initialize GPU collector using dummy
I0905 08:13:52.281803       1 gpu.go:42] Using dummy to obtain gpu power
I0905 08:13:52.285110       1 exporter.go:100] Kepler running on version: v0.7.11
I0905 08:13:52.285158       1 config.go:284] using gCgroup ID in the BPF program: true
I0905 08:13:52.285182       1 config.go:286] kernel version: 5.4
I0905 08:13:52.285247       1 config.go:311] The Idle power will be exposed. Are you running on Baremetal or using single VM per node?
I0905 08:13:52.285302       1 power.go:53] use sysfs to obtain power
I0905 08:13:52.285315       1 redfish.go:169] failed to get redfish credential file path
I0905 08:13:52.289657       1 power.go:73] using acpi to obtain power
I0905 08:13:52.292851       1 exporter.go:89] Number of CPUs: 16
F0905 08:13:52.412014       1 exporter.go:140] failed to create eBPF exporter: error loading eBPF objects: field KeplerIrqTrace: program kepler_irq_trace: attach Tracing/TraceRawTp: raw_tp softirq_entry not supported

Kubernetes version

$ kubectl version
Client Version: v1.30.4+k3s1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.4+k3s1

Cloud provider or bare metal

Bare Metal

OS version

# On Linux:
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

$ uname -a
Linux gpu01 5.4.0-192-generic #212-Ubuntu SMP Fri Jul 5 09:47:39 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Install tools

Helm, following the docs, with default values

Kepler deployment config

$ kubectl describe pod -l app.kubernetes.io/name=kepler -n kepler
Name:             kepler-9qknx
Namespace:        kepler
Priority:         0
Service Account:  kepler
Node:             gpu01/130.149.248.50
Start Time:       Thu, 05 Sep 2024 08:20:19 +0000
Labels:           app.kubernetes.io/component=exporter
                  app.kubernetes.io/name=kepler
                  controller-revision-hash=5d8c546465
                  pod-template-generation=1
Annotations:      <none>
Status:           Running
IP:               130.149.248.50
IPs:
  IP:           130.149.248.50
Controlled By:  DaemonSet/kepler
Containers:
  kepler-exporter:
    Container ID:  containerd://9d1f5d14b5ee7e74dc723dc9734efdf1ad4f1d10eb548da8a2631240406107d2
    Image:         quay.io/sustainable_computing_io/kepler:release-0.7.11
    Image ID:      quay.io/sustainable_computing_io/kepler@sha256:72e7cd2e866c696900b9b9a33a72fc61a77d06e1c0300b08074784510da4013a
    Port:          9102/TCP
    Host Port:     9102/TCP
    Args:
      -v=$(KEPLER_LOG_LEVEL)
    State:          Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Thu, 05 Sep 2024 08:21:57 +0000
      Finished:     Thu, 05 Sep 2024 08:21:57 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Thu, 05 Sep 2024 08:21:09 +0000
      Finished:     Thu, 05 Sep 2024 08:21:09 +0000
    Ready:          False
    Restart Count:  4
    Liveness:       http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
    Environment:
      NODE_IP:                      (v1:status.hostIP)
      NODE_NAME:                    (v1:spec.nodeName)
      METRIC_PATH:                 /metrics
      BIND_ADDRESS:                0.0.0.0:9102
      CGROUP_METRICS:              *
      CPU_ARCH_OVERRIDE:           
      ENABLE_EBPF_CGROUPID:        true
      ENABLE_GPU:                  true
      ENABLE_PROCESS_METRICS:      false
      ENABLE_QAT:                  false
      EXPOSE_CGROUP_METRICS:       false
      EXPOSE_HW_COUNTER_METRICS:   true
      EXPOSE_IRQ_COUNTER_METRICS:  true
      KEPLER_LOG_LEVEL:            1
    Mounts:
      /lib/modules from lib-modules (rw)
      /proc from proc (rw)
      /sys from tracing (rw)
      /usr/src from usr-src (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jvxx5 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  DirectoryOrCreate
  tracing:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
  proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:  Directory
  usr-src:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/src
    HostPathType:  Directory
  kube-api-access-jvxx5:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  107s                default-scheduler  Successfully assigned kepler/kepler-9qknx to gpu01
  Normal   Pulled     107s                kubelet            Successfully pulled image "quay.io/sustainable_computing_io/kepler:release-0.7.11" in 564ms (564ms including waiting). Image size: 117793827 bytes.
  Normal   Pulled     106s                kubelet            Successfully pulled image "quay.io/sustainable_computing_io/kepler:release-0.7.11" in 459ms (459ms including waiting). Image size: 117793827 bytes.
  Normal   Pulled     90s                 kubelet            Successfully pulled image "quay.io/sustainable_computing_io/kepler:release-0.7.11" in 552ms (552ms including waiting). Image size: 117793827 bytes.
  Normal   Created    58s (x4 over 107s)  kubelet            Created container kepler-exporter
  Normal   Started    58s (x4 over 107s)  kubelet            Started container kepler-exporter
  Normal   Pulled     58s                 kubelet            Successfully pulled image "quay.io/sustainable_computing_io/kepler:release-0.7.11" in 490ms (490ms including waiting). Image size: 117793827 bytes.
  Warning  BackOff    23s (x8 over 105s)  kubelet            Back-off restarting failed container kepler-exporter in pod kepler-9qknx_kepler(63598b16-bf4d-4d1b-af97-55672ac817b4)
  Normal   Pulling    11s (x5 over 107s)  kubelet            Pulling image "quay.io/sustainable_computing_io/kepler:release-0.7.11"

Container runtime (CRI) and version (if applicable)

Containerd v1.7.20-k3s1

Related plugins (CNI, CSI, ...) and versions (if applicable)

No response

@marvin-steinke marvin-steinke added the kind/bug report bug issue label Sep 5, 2024
marvin-steinke (Author) commented

I installed a newer kernel version and this fixed the issue. I think the minimum kernel requirements in the docs should be updated (or maybe I overlooked something?). I'd be happy to do this. Where would this best be stated, and what is the minimum version based on the eBPF features used?

dave-tucker (Collaborator) commented

Relates to: #1483

5.12 is the minimum supported kernel version: #1483 (comment)
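
For anyone who wants to check a host against that minimum before installing, here is a minimal sketch. The helper name `kernel_at_least` is hypothetical and not part of Kepler or its Helm chart; the 5.12 threshold is taken from the comment above, and the script only compares the major and minor numbers reported by `uname -r`.

```shell
#!/bin/sh
# Sketch: compare a kernel release string against a minimum major.minor version.
# kernel_at_least is an illustrative helper, not something Kepler ships.

kernel_at_least() {
    # $1 = release string such as "5.4.0-192-generic"
    # $2 = minimum major version, $3 = minimum minor version
    release=$1
    major=${release%%.*}        # text before the first dot
    rest=${release#*.}          # text after the first dot
    minor=${rest%%.*}           # text before the next dot, if any
    minor=${minor%%[!0-9]*}     # drop a non-numeric suffix such as "-rc1"
    if [ "$major" -gt "$2" ]; then
        return 0
    fi
    if [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; then
        return 0
    fi
    return 1
}

# 5.12 is the minimum stated in this thread.
if kernel_at_least "$(uname -r)" 5 12; then
    echo "kernel $(uname -r) meets the 5.12 minimum"
else
    echo "kernel $(uname -r) is older than 5.12"
fi
```

Run on the node itself rather than inside a container image with a different userland, since `uname -r` reports the running host kernel either way. With the 5.4.0-192-generic kernel from the report, the check takes the "older than 5.12" branch.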
