Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-48340: skip OperatorHubSourceError metric checking when disableAllDefaultSources is true #29435

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jianzhangbjz
Copy link
Contributor

When disableAllDefaultSources is true, all default catalogSource are disabled, so there is no metric.

    [
      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "OperatorHubSourceError",
          "alertstate": "firing",
          "container": "catalog-operator",
          "endpoint": "https-metrics",
          "exported_namespace": "openshift-marketplace",
          "instance": "[fd01:0:0:1::1a]:8443",
          "job": "catalog-operator-metrics",
          "name": "community-operators",
          "namespace": "openshift-operator-lifecycle-manager",
          "pod": "catalog-operator-6c446dcbbb-sxvjz",
          "prometheus": "openshift-monitoring/k8s",
          "service": "catalog-operator-metrics",
          "severity": "warning"
        },
        "value": [
          1736751753.045,
          "1"
        ]
      }
    ]

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 14, 2025
@openshift-ci-robot
Copy link

@jianzhangbjz: This pull request references Jira Issue OCPBUGS-48340, which is invalid:

  • expected the bug to target the "4.19.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

When disableAllDefaultSources is true, all default catalogSource are disabled, so there is no metric.

   [
     {
       "metric": {
         "__name__": "ALERTS",
         "alertname": "OperatorHubSourceError",
         "alertstate": "firing",
         "container": "catalog-operator",
         "endpoint": "https-metrics",
         "exported_namespace": "openshift-marketplace",
         "instance": "[fd01:0:0:1::1a]:8443",
         "job": "catalog-operator-metrics",
         "name": "community-operators",
         "namespace": "openshift-operator-lifecycle-manager",
         "pod": "catalog-operator-6c446dcbbb-sxvjz",
         "prometheus": "openshift-monitoring/k8s",
         "service": "catalog-operator-metrics",
         "severity": "warning"
       },
       "value": [
         1736751753.045,
         "1"
       ]
     }
   ]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jianzhangbjz
Copy link
Contributor Author

/pj-rehearse periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-upgrade-ovn-ipv6

@simonpasquier
Copy link
Contributor

this isn't the correct approach, you should rather add the alert name to AllowedAlertNames in pkg/monitortestlibrary/allowedalerts/types.go (with a link to the bug) to unblock the situation. Then work to fix the alert expression so it doesn't fire in the first place when OperatorHub is disabled.

@jianzhangbjz
Copy link
Contributor Author

Hi @simonpasquier , thanks for the help! Could you help review it again? Thanks!

@jianzhangbjz
Copy link
Contributor Author

/pj-rehearse periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-upgrade-ovn-ipv6

@@ -47,4 +47,5 @@ var AllowedAlertNames = []string{
"CDIDefaultStorageClassDegraded", // Installing openshift virt with RWX storage fire an alarm, that is not relevant for most of the tests.
"VirtHandlerRESTErrorsHigh", // https://issues.redhat.com/browse/CNV-50418
"VirtControllerRESTErrorsHigh", // https://issues.redhat.com/browse/CNV-50418
"OperatorHubSourceError", // https://issues.redhat.com/browse/OCPBUGS-48340, should skip this metric checking when disableAllDefaultSources is true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"OperatorHubSourceError", // https://issues.redhat.com/browse/OCPBUGS-48340, should skip this metric checking when disableAllDefaultSources is true
"OperatorHubSourceError", // https://issues.redhat.com/browse/OCPBUGS-48340

Comment on lines 750 to 753
// https://issues.redhat.com/browse/OCPBUGS-48340
if SkipOperatorHubMetricsCheck(oc) {
allowedAlertNames = removeElement(allowedAlertNames, "OperatorHubSourceError")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for this

Comment on lines 1021 to 1038

func SkipOperatorHubMetricsCheck(oc *exutil.CLI) bool {
stdout, stderr, err := oc.AsAdmin().Run("get").Args("operatorhub", "cluster", "-o=jsonpath={.spec.disableAllDefaultSources}").Outputs()
if err != nil {
fmt.Printf("command failed: %v\nstderr: %s\nstdout:%s", err, stderr, stdout)
}
return stdout == "true"
}

func removeElement(slice []string, element string) []string {
var result []string
for _, v := range slice {
if v != element {
result = append(result, v)
}
}
return result
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

@jianzhangbjz
Copy link
Contributor Author

Hi @simonpasquier , could you help review it again? Thanks! The logic is that add OperatorHubSourceError to allowedAlertNames when SkipOperatorHubMetricsCheck is true.

@joelanford
Copy link
Member

@jianzhangbjz Can you explain the rationale here more? This alert is only firing 7% of the time right now. Where are the default sources being disabled in CI clusters?

@jianzhangbjz
Copy link
Contributor Author

Hi @joelanford , I believe they disable it in the baremetalds-devscripts-setup step, see the https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-upgrade-ovn-ipv6/1877545344619778048/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/baremetalds-devscripts-setup/artifacts/root/dev-scripts/logs/06_create_cluster-2025-01-09-205310.log

2025-01-09 21:45:15 +(utils.sh:754): add_local_certificate_as_trusted(): oc patch image.config.openshift.io/cluster --patch '{"spec":{"additionalTrustedCA":{"name":"registry-config"}}}' --type=merge
2025-01-09 21:45:15 image.config.openshift.io/cluster patched
2025-01-09 21:45:15 +(./06_create_cluster.sh:81): [[ -n true ]]
2025-01-09 21:45:15 +(./06_create_cluster.sh:81): [[ true != \f\a\l\s\e ]]
2025-01-09 21:45:15 +(./06_create_cluster.sh:82): oc patch OperatorHub cluster --type json -p '[{"op": "add", "path": "/spec/disableAllDefaultSources", "value": true}]'
2025-01-09 21:45:16 operatorhub.config.openshift.io/cluster patched
2025-01-09 21:45:16 +(./06_create_cluster.sh:86): [[ -n '' ]]

@jianzhangbjz
Copy link
Contributor Author

/pj-rehearse periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-upgrade-ovn-ipv6

@jianzhangbjz
Copy link
Contributor Author

/test e2e-aws-ovn-microshift-serial

@joelanford
Copy link
Member

So as soon as a catalog is disabled, the catalog operator will remove the catalogsource_ready time series from its metrics endpoint.

The rule for this alert is defined here: https://github.com/operator-framework/operator-marketplace/blob/5776ce8e796910c2dfc98a42062f97ed98e81b2c/manifests/12_prometheus_rule.yaml#L23

I'm not a prometheus expert, but I think the expression there will just not have results for the time ranges that the default catalogs are disabled. If there are no results, I would think that would mean the expression would not evaluate to true. But I haven't tested this theory.

One possibility to consider. If the catalog source is unhealthy before the catalog source is disabled, the the most recent sample for a given catalogsource_ready time series will be 0. In these clusters where catalog sources are not expected to work (and are thus disabled) it seems likely that they will be unhealthy prior to being disabled, right?

@jianzhangbjz
Copy link
Contributor Author

jianzhangbjz commented Jan 15, 2025

If there are no results, I would think that would mean the expression would not evaluate to true.

Yes, I think so.

In these clusters where catalog sources are not expected to work (and are thus disabled) it seems likely that they will be unhealthy prior to being disabled, right?

Yes, these catalogsource unhealthy due to the image pulling issue, so they disbaled operatorhub.

@jianzhangbjz
Copy link
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 15, 2025
@jan--f
Copy link
Contributor

jan--f commented Jan 16, 2025

/lgtm
Imho this here will work to unblock the readiness check. The alerting rule should and/or metric export can be then be fixed.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 16, 2025
Copy link
Contributor

openshift-ci bot commented Jan 16, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jan--f, jianzhangbjz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 16, 2025
@jianzhangbjz
Copy link
Contributor Author

/test e2e-aws-ovn-edge-zones

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD e356b61 and 2 for PR HEAD 0c45942 in total

1 similar comment
@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD e356b61 and 2 for PR HEAD 0c45942 in total

@simonpasquier
Copy link
Contributor

The logic is that add OperatorHubSourceError to allowedAlertNames when SkipOperatorHubMetricsCheck is true.

Late to review but i think that it's simpler to exclude OperatorHubSourceError in all tests: the command checking whether OperatorHub is installed could fail or return unexpected results. The root cause for the alert firing despite OperatorHub not installed should be quickly addressed and the exclusion only a temporary counter-measure.

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 2d391bf and 1 for PR HEAD 0c45942 in total

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 3709116 and 0 for PR HEAD 0c45942 in total

@openshift-ci-robot
Copy link

/hold

Revision 0c45942 was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 18, 2025
@jianzhangbjz
Copy link
Contributor Author

/retest-required
/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 20, 2025
@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD a92836d and 2 for PR HEAD 0c45942 in total

@jianzhangbjz
Copy link
Contributor Author

/retest-required

Copy link

openshift-trt bot commented Jan 20, 2025

Job Failure Risk Analysis for sha: 0c45942

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-serial Medium
[sig-imageregistry][Serial] Image signature workflow can push a signed image to openshift registry and verify it [apigroup:user.openshift.io][apigroup:image.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/serial]
This test has passed 96.88% of 352 runs on release 4.19 [Overall] in the last week.

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 0f2d600 and 1 for PR HEAD 0c45942 in total

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 4c63717 and 0 for PR HEAD 0c45942 in total

Copy link

openshift-trt bot commented Jan 20, 2025

Job Failure Risk Analysis for sha: 0c45942

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-serial Medium
[sig-imageregistry][Serial] Image signature workflow can push a signed image to openshift registry and verify it [apigroup:user.openshift.io][apigroup:image.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/serial]
This test has passed 96.94% of 359 runs on release 4.19 [Overall] in the last week.

1 similar comment
Copy link

openshift-trt bot commented Jan 21, 2025

Job Failure Risk Analysis for sha: 0c45942

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-serial Medium
[sig-imageregistry][Serial] Image signature workflow can push a signed image to openshift registry and verify it [apigroup:user.openshift.io][apigroup:image.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/serial]
This test has passed 96.94% of 359 runs on release 4.19 [Overall] in the last week.

@openshift-ci-robot
Copy link

/hold

Revision 0c45942 was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 21, 2025
@jianzhangbjz
Copy link
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 21, 2025
@jianzhangbjz
Copy link
Contributor Author

/retest-required

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 58ddec8 and 2 for PR HEAD 0c45942 in total

@jianzhangbjz
Copy link
Contributor Author

/retest

Copy link

openshift-trt bot commented Jan 21, 2025

Job Failure Risk Analysis for sha: 0c45942

Job Name Failure Risk
pull-ci-openshift-origin-master-okd-scos-e2e-aws-ovn High
[sig-arch] Only known images used by tests
This test has passed 100.00% of 129 runs on jobs [periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn] in the last 14 days.
pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade IncompleteTests
Tests for this run (15) are below the historical average (1060): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 58ddec8 and 2 for PR HEAD 0c45942 in total

Copy link

openshift-trt bot commented Jan 21, 2025

Job Failure Risk Analysis for sha: 0c45942

Job Name Failure Risk
pull-ci-openshift-origin-master-okd-scos-e2e-aws-ovn High
[sig-arch] Only known images used by tests
This test has passed 100.00% of 129 runs on jobs [periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn] in the last 14 days.
pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade IncompleteTests
Tests for this run (15) are below the historical average (1107): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-serial Medium
[sig-imageregistry][Serial] Image signature workflow can push a signed image to openshift registry and verify it [apigroup:user.openshift.io][apigroup:image.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/serial]
This test has passed 97.01% of 401 runs on release 4.19 [Overall] in the last week.

Copy link

openshift-trt bot commented Jan 21, 2025

Job Failure Risk Analysis for sha: 0c45942

Job Name Failure Risk
pull-ci-openshift-origin-master-okd-scos-e2e-aws-ovn High
[sig-arch] Only known images used by tests
This test has passed 100.00% of 129 runs on jobs [periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn] in the last 14 days.
pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade IncompleteTests
Tests for this run (15) are below the historical average (1121): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-serial Medium
[sig-imageregistry][Serial] Image signature workflow can push a signed image to openshift registry and verify it [apigroup:user.openshift.io][apigroup:image.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/serial]
This test has passed 96.92% of 389 runs on release 4.19 [Overall] in the last week.

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD a87dfc8 and 2 for PR HEAD 0c45942 in total

Copy link
Contributor

openshift-ci bot commented Jan 21, 2025

@jianzhangbjz: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-kube-apiserver-rollout 0c45942 link false /test e2e-aws-ovn-kube-apiserver-rollout
ci/prow/e2e-aws-ovn-serial 0c45942 link true /test e2e-aws-ovn-serial
ci/prow/e2e-gcp-ovn-rt-upgrade 0c45942 link false /test e2e-gcp-ovn-rt-upgrade
ci/prow/okd-scos-e2e-aws-ovn 0c45942 link false /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants