
Improve ClusterCacheTracker to address fundamental limitations #11272

Open · 8 of 9 tasks
sbueringer opened this issue Oct 8, 2024 · 10 comments
Assignees: sbueringer
Labels: area/clustercache · kind/feature · priority/important-soon · triage/accepted

Comments

sbueringer (Member) commented Oct 8, 2024

Over time, the ClusterCacheTracker has evolved significantly. While it generally improved, there are a few systematic issues that we should solve:

  • The current locking and client creation behavior is problematic: whenever a reconciler calls GetClient, it can be stuck for 10 seconds if the workload cluster is unreachable (GetClient may try to create a client and only time out after 10s). While one reconciler is trying to create a client, all other reconcilers permanently requeue (currently with requeueAfter 1m). See also Improve ClusterCacheTracker TryLock behavior #10819, and the sketch below this list.
  • In general, there is huge potential to make the locking smarter.
  • The current health checking code starts one goroutine per cluster (so 1k clusters => 1k health check goroutines).
  • ClusterCacheTracker requires users to also add the ClusterCacheReconciler to the manager; because it is a separate component, it is easily forgotten.

I probably missed a few :)
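To make the first point concrete, here is a minimal self-contained Go sketch of the pattern described above (all names are hypothetical; this is not the actual ClusterCacheTracker code): client creation happens while a lock is held, so one unreachable workload cluster stalls its caller for the full dial timeout while every other caller fails the TryLock and requeues.

package main

import (
	"errors"
	"sync"
	"time"
)

// tracker is a stand-in for the ClusterCacheTracker; clients caches one
// client per workload cluster.
type tracker struct {
	mu      sync.Mutex
	clients map[string]*workloadClient
}

type workloadClient struct{} // stand-in for a real workload cluster client

var errRequeue = errors.New("another reconciler holds the lock, requeue")

func (t *tracker) GetClient(cluster string) (*workloadClient, error) {
	// TryLock models the current behavior: while one reconciler creates a
	// client, every other caller fails fast and requeues (requeueAfter 1m).
	if !t.mu.TryLock() {
		return nil, errRequeue
	}
	defer t.mu.Unlock()

	if c, ok := t.clients[cluster]; ok {
		return c, nil
	}

	// Creating the client dials the workload cluster. If the cluster is
	// unreachable, this blocks the caller, with the lock held, until the
	// timeout fires.
	time.Sleep(10 * time.Second) // stand-in for the ~10s dial timeout

	c := &workloadClient{}
	t.clients[cluster] = c
	return c, nil
}

func main() {
	t := &tracker{clients: map[string]*workloadClient{}}
	_, _ = t.GetClient("cluster-a")
}

The "smarter locking" point above would presumably mean not holding the lock across the dial, e.g. performing the slow connection attempt outside the critical section.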

We also have some additional requirements:

Tasks:

Follow-up Tasks:

Follow-up ideas:

  • Consider exposing and using the ReaderFailOnMissingInformer cache option
  • Consider more sophisticated backoff behavior for connection creation (unclear if necessary / wanted)
  • Consider making ClusterCache a generic per-cluster cache (see the sketch after this list), e.g. for:
    • etcd client certs (to replace GetClientCertificatePrivateKey)
      • Potentially also use just one private key for KCP instead of one per Cluster (we can also consider rotating this one key if necessary)
    • but maybe also other things like the drain cache
    • Important! Must be safe against deadlocks!
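For illustration, a minimal sketch of what such a generic per-cluster cache could look like (the API shape, names, and single-mutex design are assumptions, not the actual design). The deadlock-safety rule here is simply: never acquire any other lock while holding the cache's own mutex.

package main

import "sync"

// clusterAccessor is a stand-in for the per-cluster accessor; cache holds
// arbitrary per-cluster values (etcd client certs, drain state, ...).
type clusterAccessor struct {
	mu    sync.Mutex
	cache map[string]any
}

// Store and Load do nothing but touch the map while holding mu; never
// calling out to other locks while mu is held keeps this deadlock-safe.
func (a *clusterAccessor) Store(key string, value any) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.cache == nil {
		a.cache = map[string]any{}
	}
	a.cache[key] = value
}

func (a *clusterAccessor) Load(key string) (any, bool) {
	a.mu.Lock()
	defer a.mu.Unlock()
	v, ok := a.cache[key]
	return v, ok
}

func main() {
	a := &clusterAccessor{}
	a.Store("etcd-client-cert-key", []byte("..."))
	if v, ok := a.Load("etcd-client-cert-key"); ok {
		_ = v
	}
}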
@k8s-ci-robot k8s-ci-robot added needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 8, 2024
@sbueringer sbueringer self-assigned this Oct 8, 2024
sbueringer (Member Author):

/area clustercachetracker
/triage accepted

@k8s-ci-robot k8s-ci-robot added area/clustercache Issues or PRs related to the clustercachetracker triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 8, 2024
@sbueringer sbueringer added kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence. labels Oct 16, 2024
@k8s-ci-robot k8s-ci-robot removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. labels Oct 16, 2024
@sbueringer sbueringer added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Oct 16, 2024
sbueringer (Member Author):

@cahillsf Are you interested in implementing some more metrics? Metrics for clustercache could be really nice.

(I didn't think about the format yet, maybe you can take a look and propose something)

cahillsf (Member):

@sbueringer yeah sure 👍 will take a look

cahillsf (Member):

@sbueringer to confirm, this would be for the new ClusterCache introduced in PR #11247, right?

sbueringer (Member Author):

Yes! The old ClusterCacheTracker will be removed soon

cahillsf (Member):

@sbueringer so for the health check metrics i was thinking:

then for connection metrics we could do something like https://demo.promlabs.com/metrics

# HELP net_conntrack_dialer_conn_established_total Total number of connections successfully established by the given dialer a given name.
# TYPE net_conntrack_dialer_conn_established_total counter
net_conntrack_dialer_conn_established_total{dialer_name="alertmanager"} 1
...
# HELP net_conntrack_dialer_conn_failed_total Total number of connections failed to dial by the dialer a given name.
# TYPE net_conntrack_dialer_conn_failed_total counter
net_conntrack_dialer_conn_failed_total{dialer_name="alertmanager",reason="refused"} 0
...
  • labels would be something like cluster_name rather than dialer_name in this case
  • observation would take place right above that health check branch here

wdyt?

sbueringer (Member Author) commented Jan 30, 2025:

I think I would observe in the switch in the HealthCheck method; apart from that, sounds good.

> then for connection metrics we could do something like demo.promlabs.com/metrics
> [net_conntrack_dialer_conn_established_total / net_conntrack_dialer_conn_failed_total examples as above]
>   • labels would be something like cluster_name rather than dialer_name in this case
>   • observation would take place right above that health check branch here

We only have one connection per cluster, so I think a counter with a cluster_name dimension doesn't make sense.
I wonder if we should do something like the up metric in Prometheus:

capi_cluster_cache_connection_up{cluster=<>} 0 (or 1)

Maybe we should observe in the Connect / Disconnect methods, just to be sure we cover all the cases where these are called.
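A rough sketch of that up-style gauge using client_golang (the metric and label names come from this thread; the recordConnect / recordDisconnect helpers and their exact call sites in Connect / Disconnect are assumptions):

package metrics

import "github.com/prometheus/client_golang/prometheus"

// connectionUp mirrors the Prometheus "up" pattern: 1 while the connection
// to the workload cluster is established, 0 otherwise.
var connectionUp = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "capi_cluster_cache_connection_up",
	Help: "Whether the connection to the workload cluster is up (1) or down (0).",
}, []string{"cluster"})

func init() {
	prometheus.MustRegister(connectionUp)
}

// Called from Connect / Disconnect so every connect/disconnect path is covered.
func recordConnect(cluster string)    { connectionUp.WithLabelValues(cluster).Set(1) }
func recordDisconnect(cluster string) { connectionUp.WithLabelValues(cluster).Set(0) }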

cahillsf (Member) commented Feb 3, 2025:

@sbueringer 👋 is there an issue with this defer statement?

defer func() {
	if retErr != nil {
		log.Error(retErr, "Connect failed")
		ca.lockedState.lastConnectionCreationErrorTimestamp = time.Now()
	}
}()

from what i can tell retErr is declared in the function signature but never actually assigned any error return parameters within the rest of the method

sbueringer (Member Author) commented Feb 4, 2025:

> from what i can tell retErr is declared in the function signature but never actually assigned any error return parameters within the rest of the method

I don't see any issue there. We don't have to explicitly set retErr: since it is a named return value, every error returned via return err is automatically assigned to retErr.
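For reference, a tiny standalone example of that named-return behavior (hypothetical connect function, not the actual Connect code):

package main

import (
	"errors"
	"fmt"
)

// retErr is a named return value: a plain `return err` assigns err to
// retErr before deferred functions run, so the defer sees it without any
// explicit assignment in the body.
func connect() (retErr error) {
	defer func() {
		if retErr != nil {
			fmt.Println("Connect failed:", retErr)
		}
	}()
	return errors.New("dial tcp: i/o timeout") // assigned to retErr automatically
}

func main() {
	_ = connect() // prints: Connect failed: dial tcp: i/o timeout
}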

cahillsf (Member) commented Feb 4, 2025:

> > from what i can tell retErr is declared in the function signature but never actually assigned any error return parameters within the rest of the method
>
> I don't see any issue there. We don't have to explicitly set retErr: since it is a named return value, every error returned via return err is automatically assigned to retErr.

ah interesting -- i need to go back to my Go books 😄 . thanks
