
[k8s] Fix logical race conditions in kubernetes_secrets provider #6623

Merged

Conversation

Contributor

@pkoutsovasilis pkoutsovasilis commented Jan 29, 2025

What does this PR do?

This PR refactors the kubernetes_secrets provider to eliminate logical race conditions and adds a brand-new set of unit tests.

Initially, the issue seemed to stem from misuse or lack of synchronisation primitives, but after deeper analysis, it became evident that the "race" conditions were logical rather than concurrency-related. The existing implementation was structured in a way that led to inconsistencies due to overlapping responsibilities of different actors managing the secret lifecycle.

To address this, I restructured the logic while keeping in mind the constraints of the existing provider, specifically:

  • Using a Kubernetes reflector (watch-based mechanism) is not an option because it would require listing and watching all secrets, which is often a non-starter for users.
  • Instead, we must maintain our own caching mechanism that periodically refreshes only the referenced Kubernetes secrets.

With this in mind, the provider behaviour is now as follows:

Cache Disabled Mode:

  • When caching is disabled, the provider simply reads secrets directly from the Kubernetes API server.

Cache Enabled Mode:

  • When caching is enabled, the provider stores secrets in a cache where entries expire based on the configured TTL (time-to-live) and a lastAccess field of each cache entry.
  • The provider has two primary actors, the cache actor and the fetch actor, each with well-defined responsibilities.
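
To make the cached state concrete, a cache entry along the following lines is enough to reason about the behaviour below. This is only an illustrative sketch: value, apiFetchTime and lastAccess show up in the review discussion, everything else (package name, field types, the expired helper) is assumed and may differ from the actual provider code.

package kubernetessecrets // illustrative sketch, not the actual provider code

import "time"

type cacheEntry struct {
    value        string    // resolved secret value
    apiFetchTime time.Time // when the value was last read from the API server
    lastAccess   time.Time // bumped on fetch-actor access (and on a detected change)
}

// expired reports whether the entry has aged past the configured TTL
// without being accessed.
func (e cacheEntry) expired(ttl time.Duration, now time.Time) bool {
    return now.Sub(e.lastAccess) > ttl
}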

Cache Actor Responsibilities:

  1. Signal expiration of items: When a secret expires, the cache actor signals that a fetch should occur to reinsert the key into the cache, ensuring continued refreshing.
  2. Detect secret updates and signal changes: When the cache actor detects a secret value change, it signals the ContextProviderComm.
  3. Conditionally update lastAccess:
    • If the secret has changed, update lastAccess to prevent premature expiration and give the fetch actor time to pick up the new value.
    • In any other case, do not update lastAccess and let the entry "age" as it should.
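
A rough sketch of the cache-actor loop described above is shown here; the refresh and signal callbacks, the function name and the loop shape are assumptions standing in for the real implementation.

package kubernetessecrets // illustrative sketch

import (
    "context"
    "time"
)

// runCacheActor periodically refreshes the cached secrets. refresh is assumed to
// re-fetch every referenced secret from the API server, conditionally update the
// cache, expire stale entries, and report whether anything changed or expired.
// signal is assumed to notify the ContextProviderComm so consumers re-resolve
// their configuration, which in turn triggers a fetch that reinserts expired keys.
func runCacheActor(ctx context.Context, interval time.Duration, refresh func() bool, signal func()) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            if refresh() {
                signal()
            }
        }
    }
}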

Fetch Actor Responsibilities:

  1. Retrieve secrets from the cache:
    • If present, return the value.
    • If missing, fetch from the Kubernetes API.
  2. Insert fetched secrets into the cache only if a more recent version of the secret is not already there (a newer entry can be written by the cache actor or by a parallel fetch actor).
  3. Always update lastAccess when an entry is accessed to prevent unintended expiration.
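
Put together, the fetch-actor path looks roughly like the sketch below. The store interface, the readAPI callback and the exact AddConditionally signature are assumptions made for the example; the real provider may also bump lastAccess through a dedicated operation rather than a conditional write-back.

package kubernetessecrets // illustrative sketch; names and signatures are assumed

import (
    "context"
    "time"
)

// store is a stand-in for the provider's expiration cache.
type store interface {
    Get(key string) (cacheEntry, bool)
    AddConditionally(key string, entry cacheEntry, cond func(existing cacheEntry, exists bool) bool)
}

type cacheEntry struct {
    value        string
    apiFetchTime time.Time
    lastAccess   time.Time
}

// fetchSecret sketches the fetch-actor path: serve from the cache when possible,
// otherwise read from the API server, then insert conditionally so a newer entry
// written by the cache actor or a parallel fetch is never overwritten.
func fetchSecret(ctx context.Context, s store, readAPI func(context.Context, string) (string, error), key string) (string, error) {
    now := time.Now()
    if existing, ok := s.Get(key); ok {
        existing.lastAccess = now // refresh lastAccess on access
        s.AddConditionally(key, existing, func(cur cacheEntry, exists bool) bool {
            // only write back if nothing newer landed in the meantime
            return !exists || !cur.apiFetchTime.After(existing.apiFetchTime)
        })
        return existing.value, nil
    }
    value, err := readAPI(ctx, key)
    if err != nil {
        return "", err
    }
    entry := cacheEntry{value: value, apiFetchTime: now, lastAccess: now}
    s.AddConditionally(key, entry, func(cur cacheEntry, exists bool) bool {
        // skip the insert if a more recent version is already cached
        return !exists || !cur.apiFetchTime.After(entry.apiFetchTime)
    })
    return value, nil
}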

Considerations:

  • No global locks: store operations are the only critical path, ensuring that the cache and fetch actors never block each other.
  • Conditional updates: Since cache state can change between the time an actor reads and writes, all updates use conditional store operations that are part of the critical path.
  • Custom store implementation: The existing ExpirationCache from k8s.io/client-go/tools/cache does not suit our needs, as it lacks the aforementioned conditional insertion required to handle these interactions correctly.
  • Optimized memory management: The prior implementation copied the cache map on every update to prevent Golang map bucket retention. However, I believe this was a misunderstanding of Golang internals and a premature optimisation. If needed in the future, this can be revisited in a more controlled manner.
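
The conditional store itself can be as small as a map behind a single sync.Mutex; the following is only a sketch of the idea (the provider's actual expiration cache and the real AddConditionally signature differ in detail, and the nil-condition behaviour is assumed here to mean "always store").

package kubernetessecrets // illustrative sketch

import "sync"

// conditionalCache is a minimal conditional store guarded by a single sync.Mutex.
// The read-check-write in AddConditionally happens under one lock acquisition,
// which is the only critical path shared by the cache and fetch actors.
type conditionalCache[T any] struct {
    mu    sync.Mutex
    items map[string]T
}

func newConditionalCache[T any]() *conditionalCache[T] {
    return &conditionalCache[T]{items: map[string]T{}}
}

// AddConditionally stores entry under key only when cond returns true for the
// entry currently held (if any). A nil cond always stores.
func (c *conditionalCache[T]) AddConditionally(key string, entry T, cond func(existing T, exists bool) bool) {
    c.mu.Lock()
    defer c.mu.Unlock()
    existing, exists := c.items[key]
    if cond == nil || cond(existing, exists) {
        c.items[key] = entry
    }
}

// Get returns the entry for key, if present.
func (c *conditionalCache[T]) Get(key string) (T, bool) {
    c.mu.Lock()
    defer c.mu.Unlock()
    e, ok := c.items[key]
    return e, ok
}

In this shape, both actors funnel every write through AddConditionally, so the read-check-write is atomic without any global lock around the rest of their work.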

PS: As the main changes of this PR are captured by commit a549728, I consider this PR to be aligned with the Pull Requests policy.

Why is it important?

This refactor significantly improves the correctness of the kubernetes_secrets provider by ensuring:

  • Secrets do not expire prematurely due to logical race conditions.
  • Updates are properly signaled to consuming components.
  • Performance is improved through minimal locking and by avoiding unnecessary memory allocations.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the change-log tool
  • I have added an integration test or an E2E test

Disruptive User Impact

This change does not introduce breaking changes but ensures that the kubernetes_secrets provider operates correctly in cache-enabled mode. Users relying on cache behaviour may notice improved stability in secret retrieval.

How to test this PR locally

  1. Run unit tests to validate the new caching behaviour:
    go test ./internal/pkg/composable/providers/kubernetessecrets/...
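  2. Optionally, run the same tests with Go's race detector enabled (a standard go test flag, not something added by this PR) to also exercise the concurrent paths:
    go test -race ./internal/pkg/composable/providers/kubernetessecrets/...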

Related issues

@pkoutsovasilis pkoutsovasilis added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team backport-8.x Automated backport to the 8.x branch with mergify labels Jan 29, 2025
@pkoutsovasilis pkoutsovasilis self-assigned this Jan 29, 2025
@pkoutsovasilis pkoutsovasilis force-pushed the k8s/secret_provider_cache_tmp branch 3 times, most recently from e001b10 to 9093b52 Compare January 29, 2025 09:08
@pkoutsovasilis pkoutsovasilis marked this pull request as ready for review January 30, 2025 07:05
@pkoutsovasilis pkoutsovasilis requested a review from a team as a code owner January 30, 2025 07:05
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pkoutsovasilis pkoutsovasilis force-pushed the k8s/secret_provider_cache_tmp branch from 9093b52 to 3e3788e Compare January 30, 2025 17:48
// no existing secret in the cache thus add it
return true
}
if existing.value != apiSecretValue && !existing.apiFetchTime.After(now) {
Contributor

could we compare (just for readability)
sd.apiFetchTime.After(existing.apiFetchTime)
this reads better: update if the current value was fetched after the existing one was

Contributor Author

sd.apiFetchTime.After(existing.apiFetchTime): if I go with that and the timestamps of sd and existing are identical (I know it is very unlikely, but it can happen), then I would lose a secret update even though the value has changed. Makes sense? 🙂

Contributor

then we could also reverse it and use !Before, but this is just a readability nit
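
For illustration of the corner case discussed in this thread (identifiers are made up; this is not code from the PR), Go's time package treats equal timestamps as neither After nor Before:

package main

import (
    "fmt"
    "time"
)

func main() {
    t := time.Now()
    existingFetch, newFetch := t, t // both values fetched at the exact same instant

    // "update only if strictly newer": false here, so a changed value would be dropped
    fmt.Println(newFetch.After(existingFetch)) // false

    // the "reverse and use !Before" variants accept the equal-timestamp case
    fmt.Println(!newFetch.Before(existingFetch)) // true
    fmt.Println(!existingFetch.After(newFetch))  // true
}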

@pkoutsovasilis pkoutsovasilis added the backport-9.0 Automated backport to the 9.0 branch label Feb 5, 2025
Contributor

@swiatekm swiatekm left a comment

Overall this looks good to me, had some questions about the implementation details.

@pkoutsovasilis pkoutsovasilis requested a review from pchila February 7, 2025 15:51
Member

@pchila pchila left a comment

LGTM

@pkoutsovasilis
Contributor Author

@cmacknz since I am not sure that this PR should be categorised as a bug fix, please advise on the backport labels we should go with 🙂

@cmacknz
Member

cmacknz commented Feb 7, 2025

This looks great, and given it's an action item from a production incident, we should backport to the branches that allow it to be deployed to guard against those incidents occurring again. Backport to the oldest branch that infra observability will deploy and any newer branches.

@pkoutsovasilis pkoutsovasilis added backport-8.17 Automated backport with mergify backport-8.18 Automated backport to the 8.18 branch labels Feb 7, 2025
@pkoutsovasilis pkoutsovasilis enabled auto-merge (squash) February 10, 2025 15:20
@pkoutsovasilis pkoutsovasilis merged commit 6d4b91c into elastic:main Feb 10, 2025
14 checks passed
@pkoutsovasilis pkoutsovasilis deleted the k8s/secret_provider_cache_tmp branch February 10, 2025 20:36
mergify bot pushed a commit that referenced this pull request Feb 10, 2025
* fix: refactor kubernetes_secrets provider to eliminate race conditions

* fix: add changelog fragment and unit-tests for kubernetes_secrets provider

* fix: replace RWMutex with Mutex

* fix: rename newExpirationStore to newExpirationCache

* fix: introduce kubernetes_secrets provider name as a const

* fix: extend AddConditionally doc string to describe the case of condition is nil

* fix: gosec lint

(cherry picked from commit 6d4b91c)
mergify bot pushed a commit that referenced this pull request Feb 10, 2025
(cherry picked from commit 6d4b91c)

# Conflicts:
#	internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go
mergify bot pushed a commit that referenced this pull request Feb 10, 2025
(cherry picked from commit 6d4b91c)
mergify bot pushed a commit that referenced this pull request Feb 10, 2025
(cherry picked from commit 6d4b91c)
pkoutsovasilis added a commit that referenced this pull request Feb 11, 2025
…) (#6797)

(cherry picked from commit 6d4b91c)

Co-authored-by: Panos Koutsovasilis <[email protected]>
pkoutsovasilis added a commit that referenced this pull request Feb 11, 2025
…) (#6796)

(cherry picked from commit 6d4b91c)

Co-authored-by: Panos Koutsovasilis <[email protected]>
pkoutsovasilis added a commit that referenced this pull request Feb 13, 2025
…) (#6794)

(cherry picked from commit 6d4b91c)

Co-authored-by: Panos Koutsovasilis <[email protected]>
pkoutsovasilis added a commit that referenced this pull request Feb 14, 2025
…s_secrets provider (#6795)

* [k8s] Fix logical race conditions in kubernetes_secrets provider (#6623)

(cherry picked from commit 6d4b91c)

# Conflicts:
#	internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go

* fix: resolve conflicts

* fix: implicit memory aliasing in for loop

* fix: make Run not to fail if getK8sClientFunc returns an err which is appropriate only for 8.17.x

---------

Co-authored-by: Panos Koutsovasilis <[email protected]>
Labels
backport-8.x Automated backport to the 8.x branch with mergify backport-8.17 Automated backport with mergify backport-8.18 Automated backport to the 8.18 branch backport-9.0 Automated backport to the 9.0 branch bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Development

Successfully merging this pull request may close these issues.

Deletion race condition when fetching secret from the Kubernetes secrets provider cache
7 participants