[k8s] Fix logical race conditions in kubernetes_secrets provider #6623
Conversation
Force-pushed from e001b10 to 9093b52
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
Force-pushed from 9093b52 to 3e3788e
// no existing secret in the cache thus add it
return true
}
if existing.value != apiSecretValue && !existing.apiFetchTime.After(now) {
Could we compare (just for readability)
sd.apiFetchTime.After(existing.apiFetchTime)
This reads better: update if the current value was fetched after the existing one was.
sd.apiFetchTime.After(existing.apiFetchTime)
If I go with that and the fetch times of sd and existing happen to be identical (I know it is very unlikely, but it can happen), then I would lose a secret even though the value has changed. Makes sense? 🙂
Then we could also reverse it and use !Before. But this is just a readability nit.
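For readers following this thread, here is a minimal, self-contained sketch of the trade-off being discussed. The secretData fields mirror the snippet above; the shouldUpdate function and everything else in the file are illustrative assumptions, not the provider's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// secretData mirrors the fields visible in the review snippet above.
type secretData struct {
	value        string
	apiFetchTime time.Time
}

// shouldUpdate reports whether a freshly fetched secret sd should replace the
// cached one. Comparing existing.apiFetchTime against now (instead of against
// sd.apiFetchTime) keeps the update even when both fetch timestamps are
// identical but the value changed, which is the edge case raised above.
func shouldUpdate(existing *secretData, sd secretData, now time.Time) bool {
	if existing == nil {
		// no existing secret in the cache, thus add it
		return true
	}
	return sd.value != existing.value && !existing.apiFetchTime.After(now)
}

func main() {
	now := time.Now()
	existing := &secretData{value: "old", apiFetchTime: now}
	// Same timestamp, different value: the update is still applied.
	fmt.Println(shouldUpdate(existing, secretData{value: "new", apiFetchTime: now}, now)) // true
}
```

With the suggested sd.apiFetchTime.After(existing.apiFetchTime) comparison, the same-timestamp case above would evaluate to false and the changed value would be dropped.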
Overall this looks good to me; I had some questions about the implementation details.
LGTM
@cmacknz since I am not sure that this PR should be categorised as a bug fix, please advise on the backport labels we should go with 🙂
This looks great, and given it's an action item from a production incident, we should backport to the branches that allow it to be deployed to guard against those incidents occurring again. Backport to the oldest branch that infra observability will deploy and any newer branches.
Squashed commit message (cherry picked from commit 6d4b91c; backported in #6794, #6796, #6797 and, for 8.17.x, #6795):

* fix: refactor kubernetes_secrets provider to eliminate race conditions
* fix: add changelog fragment and unit-tests for kubernetes_secrets provider
* fix: replace RWMutex with Mutex
* fix: rename newExpirationStore to newExpirationCache
* fix: introduce kubernetes_secrets provider name as a const
* fix: extend AddConditionally doc string to describe the case of condition is nil
* fix: gosec lint

The 8.17.x backport (#6795) additionally resolves conflicts in internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets.go, fixes implicit memory aliasing in a for loop, and makes Run not fail if getK8sClientFunc returns an error (appropriate only for 8.17.x).

Co-authored-by: Panos Koutsovasilis <[email protected]>
What does this PR do?
This PR refactors the implementation of the kubernetes_secrets provider to eliminate logical race conditions, alongside brand new unit-tests. Initially, the issue seemed to stem from misuse or lack of synchronisation primitives, but after deeper analysis it became evident that the "race" conditions were logical rather than concurrency-related. The existing implementation was structured in a way that led to inconsistencies due to overlapping responsibilities of the different actors managing the secret lifecycle.

To address this, I restructured the logic while keeping in mind the constraints of the existing provider.

With this in mind, the provider behaviour is now as follows (a rough sketch of the cache interaction is included right after this description):

Cache Disabled Mode:

Cache Enabled Mode:

Cache Actor Responsibilities:
- When the refreshed value differs from the cached one, bump lastAccess to prevent premature expiration and give the fetch actor time to pick up the new value.
- Otherwise, do not touch lastAccess and let the entry "age" as it should.

Fetch Actor Responsibilities:
- Update lastAccess when an entry is accessed to prevent unintended expiration.

Considerations:
- ExpirationCache from k8s.io/client-go/tools/cache does not suit our needs, as it lacks the aforementioned conditional insertion required to handle these interactions correctly.

PS: as the main changes of this PR are captured by commit a549728, I consider this PR to be aligned with the Pull Requests policy.
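To make the actor split more concrete, below is a rough, hypothetical sketch of an expiration cache with conditional insertion. The names (expirationCache, AddConditionally, entry, lastAccess) echo terms used in this PR, but the code is an illustration under assumptions, not the provider's actual implementation; it also shows one possible reading of the "condition is nil" case mentioned in the commit list.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type entry struct {
	value      string
	lastAccess time.Time
}

type expirationCache struct {
	mu    sync.Mutex // a plain Mutex, echoing the "replace RWMutex with Mutex" commit
	ttl   time.Duration
	items map[string]entry
}

// AddConditionally inserts or overwrites key only if condition returns true
// for the existing entry; a nil condition always allows the write.
// updateAccess controls whether lastAccess is refreshed, so the cache actor
// can either keep an entry alive or let it age out.
func (c *expirationCache) AddConditionally(key, value string, updateAccess bool, condition func(existing entry, exists bool) bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	existing, exists := c.items[key]
	if condition != nil && !condition(existing, exists) {
		return // condition rejected the write; keep the current entry untouched
	}
	lastAccess := existing.lastAccess
	if !exists || updateAccess {
		lastAccess = time.Now()
	}
	c.items[key] = entry{value: value, lastAccess: lastAccess}
}

// Get returns the cached value and refreshes lastAccess, which is what the
// fetch actor does to prevent unintended expiration of entries still in use.
func (c *expirationCache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.items[key]
	if !ok || time.Since(e.lastAccess) > c.ttl {
		delete(c.items, key)
		return "", false
	}
	e.lastAccess = time.Now()
	c.items[key] = e
	return e.value, true
}

func main() {
	c := &expirationCache{ttl: time.Minute, items: map[string]entry{}}

	// Fetch actor: store a freshly resolved secret and keep the entry "warm".
	c.AddConditionally("namespace/secret/key", "v1", true, nil)

	// Cache actor: overwrite only when the refreshed API value differs from the
	// cached one, without bumping lastAccess, so unused entries still age out.
	refreshed := "v2"
	c.AddConditionally("namespace/secret/key", refreshed, false, func(existing entry, exists bool) bool {
		return !exists || existing.value != refreshed
	})

	v, ok := c.Get("namespace/secret/key") // Get refreshes lastAccess
	fmt.Println(v, ok)                     // prints: v2 true
}
```

The key design point is that the overwrite decision and the lastAccess update are taken under one lock and per call, which is what a stock ExpirationCache without conditional insertion cannot express.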
Why is it important?
This refactor significantly improves the correctness of the kubernetes_secrets provider by ensuring that the cache and fetch actors have clear, non-overlapping responsibilities.

Checklist
- I have added an entry in ./changelog/fragments using the change-log tool.

Disruptive User Impact
This change does not introduce breaking changes, but ensures that the kubernetes_secrets provider operates correctly in cache-enabled mode. Users relying on cache behaviour may notice improved stability in secret retrieval.

How to test this PR locally
go test ./internal/pkg/composable/providers/kubernetessecrets/...
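As an illustration of the kind of table-driven unit test the new suite could contain, here is a self-contained example; the shouldUpdate helper and the package name are hypothetical stand-ins that inline the overwrite condition from the review thread, not the provider's actual test code.

```go
package kubernetessecrets

import (
	"testing"
	"time"
)

// shouldUpdate inlines the overwrite condition discussed in the review thread
// so that this example file is self-contained.
func shouldUpdate(existingValue, newValue string, existingFetchTime, now time.Time) bool {
	return newValue != existingValue && !existingFetchTime.After(now)
}

func TestShouldUpdate(t *testing.T) {
	now := time.Now()
	cases := []struct {
		name     string
		existing string
		updated  string
		fetched  time.Time
		want     bool
	}{
		{"value changed, same timestamp", "old", "new", now, true},
		{"value unchanged", "same", "same", now, false},
		{"existing entry fetched after now", "old", "new", now.Add(time.Second), false},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			if got := shouldUpdate(tc.existing, tc.updated, tc.fetched, now); got != tc.want {
				t.Errorf("shouldUpdate() = %v, want %v", got, tc.want)
			}
		})
	}
}
```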
Related issues