ODF operator on nerc-ocp-infra is degraded #688

Closed
larsks opened this issue Aug 19, 2024 · 8 comments

@larsks
Contributor

larsks commented Aug 19, 2024

RH support case: https://access.redhat.com/support/cases/#/case/03908442

Thorsten and Chris were experiencing issues with the acm-metrics-backing-store. They deleted the pods associated with this backing store, and the pods failed to come back. Upon investigation, the odf-operator-controller-manager pod is in a failed state:

$ k get pod odf-operator-controller-manager-5d9ccf4488-w2jz6
NAME                                               READY   STATUS                       RESTARTS   AGE
odf-operator-controller-manager-5d9ccf4488-w2jz6   1/2     CreateContainerConfigError   0          6d16h

Inspecting the container statuses, we see:

$ k get pod odf-operator-controller-manager-5d9ccf4488-w2jz6 -o yaml | yq .status.containerStatuses
[
  {
    "containerID": "cri-o://6998dfb1f50bac0704f946db9e234b9236246b0945f63c1613e626622b9de813",
    "image": "registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:77df668a9591bbaae675d0553f8dca5423c0f257317bc08fe821d965f44ed019",
    "imageID": "registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:0bf40df05a3599b6ef8706e78bb1914b9f988946543a685449110aaf8b59e8bc",
    "lastState": {},
    "name": "kube-rbac-proxy",
    "ready": true,
    "restartCount": 0,
    "started": true,
    "state": {
      "running": {
        "startedAt": "2024-08-12T23:08:20Z"
      }
    }
  },
  {
    "image": "registry.redhat.io/odf4/odf-rhel9-operator@sha256:b569fbd91f664e952e646d940dd85727db8568658950cb33619a469737d1bbef",
    "imageID": "",
    "lastState": {},
    "name": "manager",
    "ready": false,
    "restartCount": 0,
    "started": false,
    "state": {
      "waiting": {
        "message": "configmap \"odf-operator-manager-config\" not found",
        "reason": "CreateContainerConfigError"
      }
    }
  }
]

And indeed, the odf-operator-manager-config ConfigMap does not exist.
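
The `1/2` in the READY column above is the first hint that one container never started. As a rough sketch, the same check can be run over a whole namespace by filtering the table that `kubectl get pods` prints (`k` in this thread is an alias for `kubectl`). The function name `not_fully_ready` is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): reads `kubectl get pods`
# table output on stdin and prints NAME, READY and STATUS for every pod
# whose READY column (ready/total) shows fewer ready containers than total.
not_fully_ready() {
  awk 'NR > 1 {
    split($2, r, "/")                      # "1/2" -> r[1]=1, r[2]=2
    if (r[1] + 0 < r[2] + 0) print $1, $2, $3
  }'
}

# Assumed usage against a live cluster (namespace taken from the
# ConfigMap metadata shown later in this thread):
#   kubectl get pods -n openshift-storage | not_fully_ready
```

Pods of finished Jobs (for example `0/1 Completed`) would be listed too, so the STATUS column still needs a human look.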

@schwesig
Member

/CC @computate @schwesig

@schwesig schwesig added bug Something isn't working openshift This issue pertains to NERC OpenShift labels Aug 19, 2024
@larsks
Contributor Author

larsks commented Aug 19, 2024

It looks like the ODF operator is stuck installing:

$ k get csv odf-operator.v4.15.5-rhodf
NAME                         DISPLAY                     VERSION        REPLACES                     PHASE
odf-operator.v4.15.5-rhodf   OpenShift Data Foundation   4.15.5-rhodf   odf-operator.v4.15.4-rhodf   Installing
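
To spot other operators in the same state, one option is to filter the `kubectl get csv` table for anything whose PHASE is not `Succeeded`. This is only a sketch: the function name `stuck_csvs` is hypothetical, and it assumes the default table layout shown above, where PHASE is the last column.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): reads `kubectl get csv`
# table output on stdin and prints NAME and PHASE for every
# ClusterServiceVersion whose last column is not "Succeeded".
# $NF is used for PHASE because the DISPLAY column can contain spaces.
stuck_csvs() {
  awk 'NR > 1 && $NF != "Succeeded" { print $1, $NF }'
}

# Assumed usage against a live cluster:
#   kubectl get csv -n openshift-storage | stuck_csvs
```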

@larsks
Contributor Author

larsks commented Aug 19, 2024

My theory is that we can grab the missing ConfigMap from the production cluster, where it has this data:

apiVersion: v1
data:
  CSIADDONS_SUBSCRIPTION_CATALOGSOURCE: redhat-operators
  CSIADDONS_SUBSCRIPTION_CATALOGSOURCE_NAMESPACE: openshift-marketplace
  CSIADDONS_SUBSCRIPTION_CHANNEL: stable-4.15
  CSIADDONS_SUBSCRIPTION_NAME: odf-csi-addons-operator
  CSIADDONS_SUBSCRIPTION_PACKAGE: odf-csi-addons-operator
  CSIADDONS_SUBSCRIPTION_STARTINGCSV: odf-csi-addons-operator.v4.15.5-rhodf
  IBM_SUBSCRIPTION_CATALOGSOURCE: certified-operators
  IBM_SUBSCRIPTION_CATALOGSOURCE_NAMESPACE: openshift-marketplace
  IBM_SUBSCRIPTION_CHANNEL: stable-v1.4
  IBM_SUBSCRIPTION_NAME: ibm-storage-odf-operator
  IBM_SUBSCRIPTION_PACKAGE: ibm-storage-odf-operator
  IBM_SUBSCRIPTION_STARTINGCSV: ibm-storage-odf-operator.v1.4.1
  NOOBAA_SUBSCRIPTION_CATALOGSOURCE: redhat-operators
  NOOBAA_SUBSCRIPTION_CATALOGSOURCE_NAMESPACE: openshift-marketplace
  NOOBAA_SUBSCRIPTION_CHANNEL: stable-4.15
  NOOBAA_SUBSCRIPTION_NAME: mcg-operator
  NOOBAA_SUBSCRIPTION_PACKAGE: mcg-operator
  NOOBAA_SUBSCRIPTION_STARTINGCSV: mcg-operator.v4.15.5-rhodf
  OCS_SUBSCRIPTION_CATALOGSOURCE: redhat-operators
  OCS_SUBSCRIPTION_CATALOGSOURCE_NAMESPACE: openshift-marketplace
  OCS_SUBSCRIPTION_CHANNEL: stable-4.15
  OCS_SUBSCRIPTION_NAME: ocs-operator
  OCS_SUBSCRIPTION_PACKAGE: ocs-operator
  OCS_SUBSCRIPTION_STARTINGCSV: ocs-operator.v4.15.5-rhodf
  controller_manager_config.yaml: |
    apiVersion: controller-runtime.sigs.k8s.io/v1alpha1
    kind: ControllerManagerConfig
    health:
      healthProbeBindAddress: :8081
    metrics:
      bindAddress: 127.0.0.1:8080
    leaderElection:
      leaderElect: true
      resourceName: 4fd470de.openshift.io
kind: ConfigMap
metadata:
  labels:
    olm.managed: "true"
    operators.coreos.com/odf-operator.openshift-storage: ""
  name: odf-operator-manager-config
  namespace: openshift-storage

It looks like we will also need the 4fd470de.openshift.io configmap.
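
For reference, here is a minimal sketch of how a manifest like the one above could be exported from one cluster in a form that another cluster would accept. It strips the metadata fields that the source API server fills in. The function name `strip_cluster_metadata` and the context names `prod` and `infra` are assumptions for illustration, and whether applying the result is a good idea is exactly the open question in this thread.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): reads a manifest produced by
# `kubectl get ... -o yaml` on stdin and drops server-populated fields under
# the top-level "metadata:" key. Keys with the same names elsewhere
# (for example inside "data:") are left alone.
strip_cluster_metadata() {
  awk '
    # A line starting in column 0 is a new top-level key.
    /^[^ ]/ { section = $1; skipping = 0 }
    # A key at two-space indent under metadata ends any block being skipped.
    section == "metadata:" && /^  [A-Za-z]/ { skipping = 0 }
    # Multi-line blocks to drop entirely.
    section == "metadata:" && /^  (ownerReferences|managedFields):/ { skipping = 1 }
    skipping { next }
    # Single-line fields to drop.
    section == "metadata:" && /^  (creationTimestamp|resourceVersion|uid|selfLink|generation):/ { next }
    { print }
  '
}

# Assumed usage; "prod" and "infra" are made-up kubeconfig context names.
# --dry-run=server validates against the target cluster without persisting:
#   kubectl --context prod -n openshift-storage \
#       get configmap odf-operator-manager-config -o yaml \
#     | strip_cluster_metadata \
#     | kubectl --context infra apply --dry-run=server -f -
```

This is a line-based filter rather than a YAML parser, so it relies on the two-space indentation that `kubectl` emits.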

@larsks
Contributor Author

larsks commented Aug 19, 2024

@schwesig is going to open a customer support case and ask (a) if they can help figure out how things got into this state in the first place, and (b) if the suggestion in my previous comment seems reasonable.

@larsks larsks changed the title ODF operator on nerc-ocp-infra is degraged ODF operator on nerc-ocp-infra is degraded Aug 19, 2024
@schwesig
Member

schwesig commented Aug 19, 2024

An earlier problem, https://access.redhat.com/support/cases/#/case/03861871, was the trigger for digging deeper into this one:
https://access.redhat.com/support/cases/#/case/03908442

@schwesig
Member

The ODF operator update seems to be successful now (after the maintenance restart on Sept 5th).
The NooBaa acm-metrics backing store is still causing issues.

Update on the RH support ticket: the problem (ODF update failing) does not seem to be widely known.

@schwesig schwesig self-assigned this Sep 11, 2024
@schwesig
Member

Putting this in the icebox until #745 is solved.

schwesig added a commit to schwesig/OCP-on-NERC_nerc-ocp-config that referenced this issue Sep 26, 2024
partly fixing nerc-project/operations#745

- New ExternalSecret for ACM metrics 2Ti object storage in `open-cluster-management-observability` namespace.
- Updated the ACM observability bundle to include the new resource in the kustomization.
- Patch for the infra overlay to map the bucket, endpoint, access_key, and secret_key for ACM metrics 2Ti object storage.
- Manually, outside of github: added bucket, endpoint, access_key, and secret_key to vault infra.

This new storage is for testing our problems with nooba, scaling storage pools, and ODF operator.
nerc-project/operations#688
To reproduce the error or solve it.

Signed-off-by: /Thor(sten)?/ Schwesig <[email protected]>
schwesig added a commit to schwesig/OCP-on-NERC_nerc-ocp-config that referenced this issue Sep 26, 2024
partly fixing nerc-project/operations#745

Update the configuration for ACM metrics object storage to use
- the acm-metrics-2ti.yaml file
- and the acm-metrics-2ti-object-storage name.

This new storage is for testing our problems with nooba, scaling storage pools, and ODF operator.
nerc-project/operations#688
To reproduce the error or solve it.

Signed-off-by: /Thor(sten)?/ Schwesig <[email protected]>
@schwesig
Member

schwesig commented Oct 1, 2024

The degradation part is solved.
We are now focused on the NooBaa and scaling-down problems.
RH support also wants us to open a new ticket because this problem is solved, so I am closing this issue.
If we get back to this problem after solving the others, we can reopen it or open a new one.

@schwesig schwesig closed this as completed Oct 1, 2024