
When using a precompiled driver and all GPU nodes are not ready, gpu-operator repeatedly deletes and recreates nvidia-driver-daemonset #715

Closed
Levi080513 opened this issue May 6, 2024 · 7 comments
Labels
bug Issue/PR to expose/discuss/fix a bug

Comments

@Levi080513

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): RHEL8.9
  • Kernel Version: 4.18.0-513.24.1.el8.9
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd 1.6.31
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): 1.25.16
  • GPU Operator Version: 23.6.2

2. Issue or feature description

Briefly explain the issue in terms of expected behavior and current behavior.

When using a precompiled driver and all GPU nodes are not ready, gpu-operator repeatedly deletes and recreates the nvidia-driver-daemonset.

3. Steps to reproduce the issue

Detailed steps to reproduce the issue.

  1. Create a k8s cluster with one GPU node.
  2. Install gpu-operator and configure driver.usePrecompiled = true (see the command sketch after this list).
  3. Log in to the GPU node and make it NotReady by running systemctl stop kubelet.
  4. nvidia-driver-daemonset will be deleted and recreated, and this process will repeat until the GPU node is Ready again.
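
For reference, a minimal command-level sketch of these steps follows. The Helm repository, release name, and namespace are illustrative assumptions; adjust them to your environment.

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.usePrecompiled=true

# On the GPU node: make the node go NotReady.
sudo systemctl stop kubelet

# From a machine with cluster access: watch the driver DaemonSet
# being deleted and recreated in a loop.
kubectl get daemonsets -n gpu-operator --watch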

When the node is not ready, its taints look like this:

Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule

But the nvidia-driver-daemonset pod tolerations are as follows:

  tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists

The node taint node.kubernetes.io/unreachable:NoSchedule is not tolerated, so the nvidia-driver-daemonset's .status.desiredNumberScheduled is 0.

Following the logic of cleanupStalePrecompiledDaemonsets, the nvidia-driver-daemonset is therefore deleted and then created again, because the cluster still has GPU nodes.

// cleanupStalePrecompiledDaemonsets deletes stale driver daemonsets which can happen
// 1. If all nodes upgraded to the latest kernel
// 2. no GPU nodes are present
func (n ClusterPolicyController) cleanupStalePrecompiledDaemonsets(ctx context.Context) error {
    opts := []client.ListOption{
        client.MatchingLabels{
            precompiledIdentificationLabelKey: precompiledIdentificationLabelValue,
        },
    }
    list := &appsv1.DaemonSetList{}
    err := n.client.List(ctx, list, opts...)
    if err != nil {
        n.logger.Error(err, "could not get daemonset list")
        return err
    }
    for idx := range list.Items {
        name := list.Items[idx].ObjectMeta.Name
        desiredNumberScheduled := list.Items[idx].Status.DesiredNumberScheduled
        n.logger.V(1).Info("Driver DaemonSet found",
            "Name", name,
            "desiredNumberScheduled", desiredNumberScheduled)
        if desiredNumberScheduled != 0 {
            n.logger.Info("Driver DaemonSet active, keep it.",
                "Name", name, "Status.DesiredNumberScheduled", desiredNumberScheduled)
            continue
        }
        n.logger.Info("Delete Driver DaemonSet", "Name", name)
        err = n.client.Delete(ctx, &list.Items[idx])
        if err != nil {
            n.logger.Info("ERROR: Could not get delete DaemonSet",
                "Name", name, "Error", err)
        }
    }
    return nil
}

This does not appear to be normal behavior.
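
As a quick check (a sketch, assuming the operator is installed in the gpu-operator namespace), the deletion trigger can be observed directly: once the unreachable taints are applied, desiredNumberScheduled drops to 0, which is exactly the condition cleanupStalePrecompiledDaemonsets treats as stale.

kubectl get daemonset -n gpu-operator \
  -o custom-columns=NAME:.metadata.name,DESIRED:.status.desiredNumberScheduled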

The temporary solution is to add the following configuration when installing gpu-operator:

daemonsets:
  tolerations:
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoSchedule
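
Applied at install or upgrade time, this looks roughly like the following (a sketch; the values file name, release name, and namespace are assumptions, and the file contains the daemonsets.tolerations block shown above):

helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  -f values-tolerations.yaml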

4. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs journalctl -u containerd > containerd.log

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]

@cdesiniotis added the bug (Issue/PR to expose/discuss/fix a bug) label on May 6, 2024
@cdesiniotis
Contributor

@Levi080513 thanks for the detailed issue! I think our logic which detects stale DaemonSets and cleans them up can be improved to avoid the behavior you are experiencing.

@Levi080513
Author

Is there any recent progress on this issue?

@cdesiniotis
Contributor

@Levi080513 This change was merged into master and should fix the issue you reported: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/1085. It will be included in the next release. If you are willing to try out these changes before then and confirm it resolves your issue that would be helpful as well.

@Levi080513
Author

I cherry-picked this MR onto version 23.6.2 and it works well.
Thanks!

@charanteja333

charanteja333 commented Jun 19, 2024

Hi @cdesiniotis

We are facing a similar issue, but without a precompiled driver. Whenever NFD restarts, the driver daemonset restarts as well. When the driver restarts, it gets stuck, leaving us to either restart the GPU pods or drain the node. We are using chart 24.3.0. Will the fix be cherry-picked to 24.3.0 as well?

@cdesiniotis
Contributor

@charanteja333 what you described is different than what is being reported in this issue. Can you create a new issue with more details on the behavior you are observing?

@cdesiniotis
Contributor

@Levi080513 closing as you confirmed the fix addresses this issue.
