Enhance restart proxy feature to differentiate between user workloads and kyma workloads #1249

strekm · 2025-01-16T07:42:45Z · edited by mluk-sap

Description

Bumping the Istio version requires restarting every Pod with an injected Istio sidecar so that the proxy stays in sync with istiod. If a customer's workload is broken for any reason and its Pods cannot start, we cannot reconcile the data plane of the service mesh to the required state. The Istio operator then retries the proxy restart indefinitely, and the Istio CR loops through processing -> error -> processing. The root cause is outside the Istio Module team's control, but because it surfaces as a failure to restart a Pod, it is the Istio Module team that receives the alerts and, with them, the responsibility. To solve this, we should limit the number of retries during the proxy restart. A failure to restart a workload should put the Istio CR in the Warning state, indicating that the customer must fix the workload so that the proxies can be updated.

Consider the case of a single Pod without a parent Deployment, which we cannot restart. We should decide whether to ignore it or to set the Istio CR status to Warning.

Consider the Istio CR status when a node is draining and a Deployment's Pods are in the Evicted state. This should not cause the Error status.
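The two edge cases above could be handled by a small per-Pod decision step. The following is a minimal sketch, not the module's actual code: the `Pod` struct is a simplified stand-in for the Kubernetes object, and `decide`, `isEvicted`, and the `Action` values are hypothetical names. It assumes that a Pod without a parent results in a warning (the option the ACs later settle on) and relies on the fact that an evicted Pod reports `status.phase: Failed` with `status.reason: Evicted`.

```go
package main

// Pod is a simplified stand-in for the Kubernetes Pod object (assumption:
// only the fields needed for this sketch are modelled).
type Pod struct {
	Name       string
	OwnerKinds []string // kinds taken from metadata.ownerReferences
	Phase      string   // status.phase
	Reason     string   // status.reason
}

// Action is the hypothetical outcome of inspecting a single Pod.
type Action string

const (
	ActionRestart Action = "restart" // restart through the parent resource
	ActionSkip    Action = "skip"    // ignore, must not affect the CR status
	ActionWarn    Action = "warn"    // cannot restart, report a warning
)

// isEvicted reports whether the Pod was evicted, for example during a node drain.
func isEvicted(p Pod) bool {
	return p.Phase == "Failed" && p.Reason == "Evicted"
}

// decide picks the action for one Pod whose sidecar is outdated.
func decide(p Pod) Action {
	if isEvicted(p) {
		// Evicted Pods are leftovers of a drain and are not running a proxy.
		return ActionSkip
	}
	if len(p.OwnerKinds) == 0 {
		// A bare Pod has no controller that would recreate it after deletion.
		return ActionWarn
	}
	return ActionRestart
}
```

Checking eviction before ownership means that a bare Pod that was evicted is skipped rather than reported, which keeps a node drain from changing the Istio CR status.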

TODOs:

Implement logic that detects whether the workload is the customer's or Kyma's (try to look at the annotations/labels)
If the proxy cannot be restarted on the customer workload, put Istio CR in the Warning state
If the proxy cannot be restarted on the Kyma workload, put Istio CR in the Error state
Configure the reconciliation retry to use exponential backoff when the proxy restart fails
Provide documentation
Provide release notes
Discuss with the team how to adjust the pagination mechanism in the proxy restart to avoid paginating through requeueing (we'll create a follow-up ticket for this)
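The first three TODO items can be illustrated with a short sketch. It is only an illustration under stated assumptions: the label key `kyma-project.io/module` is used here as an assumed marker of Kyma-owned workloads (the issue only says to look at annotations/labels), and `RestartFailure`, `isKymaWorkload`, and `stateFor` are hypothetical names, not the module's API.

```go
package main

// kymaModuleLabel is an ASSUMED marker for Kyma-owned workloads; the issue
// does not name the label or annotation that is actually used.
const kymaModuleLabel = "kyma-project.io/module"

// State mirrors the Istio CR states mentioned in the issue.
type State string

const (
	StateReady   State = "Ready"
	StateWarning State = "Warning"
	StateError   State = "Error"
)

// RestartFailure describes one workload whose proxy could not be restarted.
type RestartFailure struct {
	Workload string
	Labels   map[string]string
}

// isKymaWorkload reports whether the labels carry the assumed Kyma marker.
func isKymaWorkload(labels map[string]string) bool {
	_, ok := labels[kymaModuleLabel]
	return ok
}

// stateFor maps restart failures to a CR state: any failed Kyma workload
// yields Error, failures on customer workloads only yield Warning.
func stateFor(failures []RestartFailure) State {
	state := StateReady
	for _, f := range failures {
		if isKymaWorkload(f.Labels) {
			return StateError
		}
		state = StateWarning
	}
	return state
}
```

In this sketch Error takes precedence over Warning, so a mix of failing Kyma and customer workloads is still reported as Error.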

PRs

Refactor proxy reset #1253

ACs [PO]

Issues with restarting Kyma workloads should be reported as Error
Issues with restarting user workloads should be reported as Warning
A Pod without a parent should put the Istio CR in the Warning state
Retries should wait increasingly longer between attempts, up to a maximum backoff time
Cover the case of a single Pod without a Deployment
Documentation updated

DoD [Developer & Reviewer]

Provide unit and integration tests.
Provide documentation. https://github.com/kyma-project/istio/pull/1253/files#diff-e8d8116a9ba401406778788b086bcdf5060a630ccf2bd87f78dfae6739e0cbd3
Verify if the solution works for both open-source Kyma and SAP BTP, Kyma runtime.
If you changed the resource limits, explain why it was needed.
If the default configuration of Istio Operator has been changed, you performed a manual upgrade test to verify that the change can be rolled out correctly.
Verify that your contributions don't decrease code coverage. If they do, explain why this is the case. (coverage increased)
Add release notes: https://github.com/kyma-project/istio/pull/1253/files#diff-76e722c92cee2db6326bfb69b5bce037c6a4fbfcab19461914b169c8b6bd76e8

mluk-sap · 2025-01-30T10:30:48Z

Done
PR: #1253

strekm added the kind/feature label Jan 16, 2025

strekm added this to the 1.15.0 milestone Jan 16, 2025

strekm modified the milestones: 1.15.0, 1.14.0 Jan 24, 2025

videlov mentioned this issue Jan 27, 2025

Refactor proxy reset #1253

Merged

1 task

mluk-sap closed this as completed Jan 30, 2025
