libnvidia-ml.so.1 not found under /usr. #1149

Open · Shehjad-Ishan opened this issue Feb 12, 2025 · 7 comments
Shehjad-Ishan commented Feb 12, 2025

I am using time slicing and have replicated the GPU to 4:

 microk8s kubectl describe node sigmind-survey | grep -A8 Capacity
Capacity:
  cpu:                8
  ephemeral-storage:  459850824Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32813884Ki
  nvidia.com/gpu:     4
  pods:               110
Allocatable:
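
(For reference, a minimal sketch of the kind of GPU Operator time-slicing setup that produces 4 replicas of a single physical GPU; the ConfigMap name time-slicing-config and the config key any are assumptions, not copied from this cluster.)

```sh
# Sketch: time-slicing config that advertises nvidia.com/gpu: 4 on a single-GPU node
cat <<'EOF' | microk8s kubectl apply -n gpu-operator -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
EOF

# Point the ClusterPolicy's device plugin at that config
microk8s kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  -n gpu-operator --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'
```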

Why am I getting this error now?

sudo microk8s kubectl logs nemo-embedding-embedding-deployment-566969dc9-5kfw9

===================================
== NVIDIA NIM for Text Embedding ==
===================================

NVIDIA Release 1.3.0
Model: nvidia/llama-3.2-nv-embedqa-1b-v2

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-ai-foundation-models-community-license-agreement/).
Third Party Software Attributions and Licenses can be found under /opt/nim/NOTICE

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

libnvidia-ml.so.1 not found under /usr.

The gpu-operator pods are running fine:

microk8s kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS      RESTARTS      AGE
gpu-feature-discovery-f7fgs                                       2/2     Running     2 (24h ago)   25h
gpu-operator-1738668402-node-feature-discovery-gc-654f4bf5p5swz   1/1     Running     1 (24h ago)   25h
gpu-operator-1738668402-node-feature-discovery-master-567d5ztzs   1/1     Running     1 (24h ago)   25h
gpu-operator-1738668402-node-feature-discovery-worker-lkwmk       1/1     Running     1 (24h ago)   25h
gpu-operator-5cff5bbd9d-vzlrw                                     1/1     Running     1 (24h ago)   25h
nvidia-container-toolkit-daemonset-6b66l                          1/1     Running     1 (24h ago)   25h
nvidia-cuda-validator-ljdjs                                       0/1     Completed   0             24h
nvidia-dcgm-exporter-jmwc7                                        1/1     Running     1 (24h ago)   25h
nvidia-device-plugin-daemonset-tdttc                              2/2     Running     0             50m
nvidia-operator-validator-t9tmt                                   1/1     Running     1 (24h ago)   25h



nvidia-smi
Wed Feb 12 15:36:10 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A2000 12GB          Off |   00000000:01:00.0  On |                    0 |
| 30%   56C    P8             13W /   70W |     532MiB /  11514MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|

This is the .yaml I am trying to deploy:

global: {ngcImagePullSecretName: ""}
nvcf:
  dockerRegSecrets: []
  additionalSecrets: []
  localStorageProvisioner: []
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          env:
          - name: FRONTEND_PORT
            value: '9000'
          - name: BACKEND_PORT
            value: '8000'
          - name: GRAPH_DB_URI
            value: bolt://neo-4-j-service:7687
          - name: GRAPH_DB_USERNAME
            value: neo4j
          - name: GRAPH_DB_PASSWORD
            value: password
          - name: MILVUS_DB_HOST
            value: milvus-milvus-deployment-milvus-service
          - name: MILVUS_DB_PORT
            value: '19530'
          - name: VLM_MODEL_TO_USE
            # value: vila-1.5
            value: openai-compat
          - name: OPENAI_API_KEY
            valueFrom:
              secretKeyRef:
                name: openai-api-key-secret
                key: OPENAI_API_KEY
          # - name: MODEL_PATH
          #   value: ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8
          - name: DISABLE_GUARDRAILS
            value: 'false'
          - name: OPENAI_API_KEY_NAME
            value: VSS_OPENAI_API_KEY
          - name: NVIDIA_API_KEY_NAME
            value: VSS_NVIDIA_API_KEY
          - name: NGC_API_KEY_NAME
            value: VSS_NGC_API_KEY
          - name: TRT_LLM_MODE
            value: int4_awq
          - name: VLM_BATCH_SIZE
            value: ''
          - name: VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME
            value: ''
          - name: VIA_VLM_ENDPOINT
            value: ''
          - name: VIA_VLM_API_KEY
            value: ''
          - name: OPENAI_API_VERSION
            value: ''
          - name: AZURE_OPENAI_API_VERSION
            value: ''
          # - name: NVIDIA_VISIBLE_DEVICES
          #   value: "0"
      initContainers:
      - command:
        - sh
        - -c
        - until nc -z -w 2 milvus-milvus-deployment-milvus-service 19530; do echo
          waiting for milvus; sleep 2; done
        image: busybox:1.28
        imagePullPolicy: IfNotPresent
        name: check-milvus-up
      - command:
        - sh
        - -c
        - until nc -z -w 2 neo-4-j-service 7687; do echo waiting for neo4j; sleep
          2; done
        image: busybox:1.28
        imagePullPolicy: IfNotPresent
        name: check-neo4j-up
      - args:
        - "while ! curl -s -f -o /dev/null http://llm-nim-svc:8000/v1/health/live;\
          \ do\n  echo \"Waiting for LLM...\"\n  sleep 2\ndone\n"
        command:
        - sh
        - -c
        image: curlimages/curl:latest
        name: check-llm-up
  llmModel: meta/llama-3.1-8b-instruct
  llmModelChat: meta/llama-3.1-8b-instruct
  resources:
    limits:
      nvidia.com/gpu: 1

  # vlmModelPath: ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8
  # vlmModelType: vila-1.5
  configs:
    ca_rag_config.yaml:
      chat:
        embedding:
          base_url: http://nemo-embedding-embedding-deployment-embedding-service:8000/v1
        llm:
          base_url: http://llm-nim-svc:8000/v1
          model: meta/llama-3.1-8b-instruct
        reranker:
          base_url: http://nemo-rerank-ranking-deployment-ranking-service:8000/v1
      summarization:
        embedding:
          base_url: http://nemo-embedding-embedding-deployment-embedding-service:8000/v1
        llm:
          base_url: http://llm-nim-svc:8000/v1
          model: meta/llama-3.1-8b-instruct
    guardrails_config.yaml:
      models:
      - engine: nim
        model: meta/llama-3.1-8b-instruct
        parameters:
          base_url: http://llm-nim-svc:8000/v1
        type: main
      - engine: nim_patch
        model: nvidia/llama-3.2-nv-embedqa-1b-v2
        parameters:
          base_url: http://nemo-embedding-embedding-deployment-embedding-service:8000/v1
        type: embeddings
  extraPodVolumes:
  - name: secret-ngc-api-key-volume
    secret:
      secretName: ngc-api-key-secret
      items:
      - key: NGC_API_KEY
        path: ngc-api-key
  - name: secret-graph-db-username-volume
    secret:
      secretName: graph-db-creds-secret
      items:
      - key: username
        path: graph-db-username
  - name: secret-graph-db-password-volume
    secret:
      secretName: graph-db-creds-secret
      items:
      - key: password
        path: graph-db-password
  extraPodVolumeMounts:
  - name: secret-ngc-api-key-volume
    mountPath: /secrets/ngc-api-key
    subPath: ngc-api-key
    readOnly: true
  - name: secret-graph-db-username-volume
    mountPath: /secrets/graph-db-username
    subPath: graph-db-username
    readOnly: true
  - name: secret-graph-db-password-volume
    mountPath: /secrets/graph-db-password
    subPath: graph-db-password
    readOnly: true
  egress:
    milvus:
      address: milvus-milvus-deployment-milvus-service
      port: 19530
    neo4j-bolt:
      address: neo-4-j-service
      port: 7687
    llm-openai-api:
      address: llm-nim-svc
      port: 8000
    nemo-embed:
      address: nemo-embedding-embedding-deployment-embedding-service
      port: 8000
    nemo-rerank:
      address: nemo-rerank-ranking-deployment-ranking-service
      port: 8000
milvus:
  applicationSpecs:
    milvus-deployment:
      containers:
        milvus-container:
          env:
          - name: ETCD_ENDPOINTS
            value: etcd-etcd-deployment-etcd-service:2379
          - name: MINIO_ADDRESS
            value: minio-minio-deployment-minio-service:9010
          - name: KNOWHERE_GPU_MEM_POOL_SIZE
            value: 2048;4096
  egress:
    etcd:
      address: etcd-etcd-deployment-etcd-service
      port: 2379
    minio:
      address: minio-minio-deployment-minio-service
      port: 9010
neo4j:
  extraPodVolumes:
  - name: secret-db-username-volume
    secret:
      secretName: graph-db-creds-secret
      items:
      - key: username
        path: db-username
  - name: secret-db-password-volume
    secret:
      secretName: graph-db-creds-secret
      items:
      - key: password
        path: db-password
  extraPodVolumeMounts:
  - name: secret-db-username-volume
    mountPath: /secrets/db-username
    subPath: db-username
    readOnly: true
  - name: secret-db-password-volume
    mountPath: /secrets/db-password
    subPath: db-password
    readOnly: true
nim-llm:
  env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: "compute,utility"
    - name: LD_LIBRARY_PATH
      value: "/usr/local/lib:/usr/lib/i386-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64"
  image:
    repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
    tag: 1.1.0
  resources:
    limits:
      nvidia.com/gpu: 1
  runtimeClassName: nvidia
  securityContext:
    capabilities:
      add: ["SYS_ADMIN"]
  model:
    name: meta/llama-3.1-8b-instruct
    ngcAPISecret: ngc-api-key-secret
  persistence:
    enabled: true
  hostPath:
    enabled: true
  service:
    name: llm-nim-svc
  llmModel: meta/llama-3.1-8b-instruct

nemo-embedding:
  env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: "compute,utility"
    - name: LD_LIBRARY_PATH
      value: "/usr/local/lib:/usr/lib/i386-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64"
  resources:
    limits:
      nvidia.com/gpu: 1
  runtimeClassName: nvidia
  securityContext:
    capabilities:
      add: ["SYS_ADMIN"]

nemo-rerank:
  env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: "compute,utility"
    - name: LD_LIBRARY_PATH
      value: "/usr/local/lib:/usr/lib/i386-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64"
  resources:
    limits:
      nvidia.com/gpu: 1
  runtimeClassName: nvidia  # Added this line
  securityContext:
    capabilities:
      add: ["SYS_ADMIN"]

klueska (Contributor) commented Feb 12, 2025

I'm not familiar with whatever YAML spec this is, so it's a bit hard to reason about how the request for GPUs is mapped to your container.

That said, when you exec into the container, are you able to run nvidia-smi and/or see any GPUs injected under /dev/nvidia*?

Did this used to work without time-slicing enabled?

Shehjad-Ishan (Author) commented:

I am trying to deploy this blueprint. Before enabling time slicing I used to get an insufficient-GPU scheduling error.

Exec'ing into the gpu-operator gpu-feature-discovery container, I see this:

microk8s kubectl exec -it -n gpu-operator gpu-feature-discovery-f7fgs -- /bin/bash
Defaulted container "gpu-feature-discovery" out of: gpu-feature-discovery, config-manager, toolkit-validation (init), gpu-feature-discovery-imex-init (init), config-manager-init (init)
[root@gpu-feature-discovery-f7fgs /]# ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195, 254 Feb 11 09:24 /dev/nvidia-modeset
crw-rw-rw- 1 root root 505,   0 Feb 10 10:12 /dev/nvidia-uvm
crw-rw-rw- 1 root root 505,   1 Feb 10 10:12 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root root 195,   0 Feb 10 10:12 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Feb 10 10:12 /dev/nvidiactl

/dev/nvidia-caps:
total 0
cr-------- 1 root root 508, 1 Feb 11 09:24 nvidia-cap1
cr--r--r-- 1 root root 508, 2 Feb 11 09:24 nvidia-cap2
[root@gpu-feature-discovery-f7fgs /]# nvidia-smi
Wed Feb 12 10:42:00 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A2000 12GB          Off |   00000000:01:00.0  On |                    0 |
| 30%   56C    P8             13W /   70W |     531MiB /  11514MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

Shehjad-Ishan (Author) commented:

@klueska anything?

klueska (Contributor) commented Feb 13, 2025

I meant exec'ing into the container where you are seeing the error that libnvidia-ml.so isn't found.

Shehjad-Ishan (Author) commented:

> I meant exec'ing into the container where you are seeing the error that libnvidia-ml.so isn't found.

I understand, but the pod is in CrashLoopBackOff, so I can't exec into it.

microk8s kubectl get pod nemo-embedding-embedding-deployment-566969dc9-fdnfl -o jsonpath="{.spec.containers[*].name}"
embedding-container

(base) sigmind@sigmind-survey:/media/sigmind/URSTP_HDD1416/vss$ microk8s kubectl exec -it nemo-embedding-embedding-deployment-566969dc9-fdnfl -c embedding-container -- /bin/bash
error: Internal error occurred: unable to upgrade connection: container not found ("embedding-container")

(base) sigmind@sigmind-survey:/media/sigmind/URSTP_HDD1416/vss$ microk8s kubectl get pod nemo-embedding-embedding-deployment-566969dc9-fdnfl -o wide
NAME                                                  READY   STATUS             RESTARTS        AGE    IP            NODE             NOMINATED NODE   READINESS GATES
nemo-embedding-embedding-deployment-566969dc9-fdnfl   0/1     CrashLoopBackOff   5 (2m35s ago)   6m1s   10.1.87.250   sigmind-survey   <none>           <none>
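
(Even for a crash-looping pod, the logs of the previous attempt are still retrievable without exec'ing in; a small sketch using the pod and container names above.)

```sh
# Logs from the last crashed instance of the embedding container
microk8s kubectl logs nemo-embedding-embedding-deployment-566969dc9-fdnfl \
  -c embedding-container --previous
```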

klueska (Contributor) commented Feb 13, 2025

Can you change its entrypoint to just sleep 9999 and do it from there? Also, can you show the pod's YAML from kubectl get?
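
(One way to do that, sketched here rather than quoted from the thread: patch the Deployment instead of the Pod, since the ReplicaSet recreates pods; the deployment name is inferred from the pod name above.)

```sh
# Override the container entrypoint with "sleep 9999" so the pod stays up for debugging
microk8s kubectl patch deployment nemo-embedding-embedding-deployment --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/command", "value": ["sleep", "9999"]}]'

# The liveness probe will keep killing a sleeping container (exit 137),
# so it may need to be removed as well while debugging
microk8s kubectl patch deployment nemo-embedding-embedding-deployment --type='json' \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'
```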

Shehjad-Ishan (Author) commented:

> Can you change its entrypoint to just sleep 9999 and do it from there? Also, can you show the pod's YAML from kubectl get?

After changing the entrypoint to sleep 9999:

microk8s kubectl exec -it nemo-embedding-embedding-deployment-6c7567d84c-mrnrq -- /bin/sh
I have no name!@nemo-embedding-embedding-deployment-6c7567d84c-mrnrq:/opt/nim$ nvidia-smi
bash: nvidia-smi: command not found
I have no name!@nemo-embedding-embedding-deployment-6c7567d84c-mrnrq:/opt/nim$ ls /dev/nvidia*
/dev/nvidia-modeset  /dev/nvidia-uvm  /dev/nvidia-uvm-tools  /dev/nvidia0  /dev/nvidiactl
I have no name!@nemo-embedding-embedding-deployment-6c7567d84c-mrnrq:/opt/nim$ command terminated with exit code 137
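
(From inside that sleeping container, the useful check is whether the toolkit injected the driver userspace libraries at all; a sketch, with the paths being the usual injection locations rather than anything confirmed from this pod.)

```sh
# Look for the driver userspace library the NIM entrypoint is complaining about
find /usr /usr/local/nvidia -name 'libnvidia-ml.so*' 2>/dev/null

# What the dynamic linker cache inside the container knows about
ldconfig -p | grep -i nvidia-ml
```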

microk8s kubectl get pod nemo-embedding-embedding-deployment-6c7567d84c-mrnrq -o yaml

apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: 9a71fe2f4dd16b767c04eaaba05c56d39b469871d7895c738774357aec23556b
    cni.projectcalico.org/podIP: 10.1.87.205/32
    cni.projectcalico.org/podIPs: 10.1.87.205/32
  creationTimestamp: "2025-02-13T11:41:13Z"
  generateName: nemo-embedding-embedding-deployment-6c7567d84c-
  labels:
    app: nemo-embedding-embedding-deployment
    app.kubernetes.io/instance: vss-blueprint
    app.kubernetes.io/name: nemo-embedding
    generated_with: helm_builder
    hb_version: 1.0.0
    microservice_version: 2.1.0
    msb_version: 2.5.0
    pod-template-hash: 6c7567d84c
  name: nemo-embedding-embedding-deployment-6c7567d84c-mrnrq
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: nemo-embedding-embedding-deployment-6c7567d84c
    uid: ba375391-7e57-4de4-b7ab-ed4337b8f05a
  resourceVersion: "3885491"
  uid: 3ba9c06d-681d-408a-809f-a8916086a54a
spec:
  containers:
  - command:
    - sleep
    - "9999"
    env:
    - name: NGC_API_KEY
      valueFrom:
        secretKeyRef:
          key: NGC_API_KEY
          name: ngc-api-key-secret
    - name: LD_LIBRARY_PATH
      value: /usr/local/lib:/usr/lib/i386-linux-gnu:$LD_LIBRARY_PATH
    image: nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.3.0
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /v1/health/ready
        port: http
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 30
      successThreshold: 1
      timeoutSeconds: 20
    name: embedding-container
    ports:
    - containerPort: 8000
      name: http
      protocol: TCP
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        nvidia.com/gpu: "1"
    securityContext:
      capabilities:
        add:
        - SYS_ADMIN
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /opt/workload-config
      name: workload-cm-volume
    - mountPath: /opt/configs
      name: configs-volume
    - mountPath: /opt/scripts
      name: scripts-cm-volume
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-827kj
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: ngc-docker-reg-secret
  nodeName: sigmind-survey
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1000
    runAsGroup: 1000
    runAsUser: 1000
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - configMap:
      defaultMode: 420
      name: nemo-embedding-workload-cm
    name: workload-cm-volume
  - configMap:
      defaultMode: 420
      name: nemo-embedding-configs-cm
    name: configs-volume
  - configMap:
      defaultMode: 420
      name: nemo-embedding-scripts-cm
    name: scripts-cm-volume
  - name: kube-api-access-827kj
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2025-02-13T11:41:18Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2025-02-13T11:41:13Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2025-02-13T11:41:18Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2025-02-13T11:41:18Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2025-02-13T11:41:13Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://782498ab5679e750a4d0d8dec67a188443d4010927d154f1d5b157c6f42a2457
    image: nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.3.0
    imageID: nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2@sha256:b7ff8b85e9661699f6803ad85a48ed41a5c52212284ca4632a3bd240aee61859
    lastState:
      terminated:
        containerID: containerd://602b74cb7ff00a6b95bdb1ee2647e3894a2a92d35450ab1458a1e5b4058361f8
        exitCode: 137
        finishedAt: "2025-02-13T11:45:13Z"
        reason: Error
        startedAt: "2025-02-13T11:43:15Z"
    name: embedding-container
    ready: true
    restartCount: 2
    started: true
    state:
      running:
        startedAt: "2025-02-13T11:45:15Z"
    volumeMounts:
    - mountPath: /opt/workload-config
      name: workload-cm-volume
    - mountPath: /opt/configs
      name: configs-volume
    - mountPath: /opt/scripts
      name: scripts-cm-volume
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-827kj
      readOnly: true
      recursiveReadOnly: Disabled
  hostIP: 192.168.0.101
  hostIPs:
  - ip: 192.168.0.101
  phase: Running
  podIP: 10.1.87.205
  podIPs:
  - ip: 10.1.87.205
  qosClass: BestEffort
  startTime: "2025-02-13T11:41:13Z"

