
Improve ClusterCacheTracker to address fundamental limitations #11272

Open · 8 of 9 tasks
sbueringer opened this issue Oct 8, 2024 · 10 comments
Assignees: sbueringer
Labels: area/clustercache · kind/feature · priority/important-soon · triage/accepted

Comments

sbueringer (Member) commented Oct 8, 2024

Over time, the ClusterCacheTracker has evolved significantly. While it generally improved, there are a few systematic issues that we should solve:

  • The current locking and client creation behavior is problematic: whenever a reconciler calls GetClient, it can be stuck for 10 seconds if the workload cluster is unreachable (GetClient may try to create a client and only time out after 10s). While one reconciler is trying to create a client, all other reconcilers permanently requeue (currently with requeueAfter 1m). See also Improve ClusterCacheTracker TryLock behavior #10819, and the sketch below this list.
  • In general, there is huge potential to make the locking smarter.
  • The current health checking code starts one goroutine per cluster (so 1k clusters => 1k health check goroutines).
  • ClusterCacheTracker requires users to also add the ClusterCacheReconciler to the manager; because it is a separate component, it is easily forgotten.

I probably missed a few :)
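To make the first point concrete, here is a minimal self-contained Go sketch of the pattern described above (all names are hypothetical; this is not the actual ClusterCacheTracker code): client creation happens while a lock is held, so one unreachable workload cluster stalls its caller for the full dial timeout while every other caller fails the TryLock and requeues.

package main

import (
	"errors"
	"sync"
	"time"
)

// tracker is a stand-in for the ClusterCacheTracker; clients caches one
// client per workload cluster.
type tracker struct {
	mu      sync.Mutex
	clients map[string]*workloadClient
}

type workloadClient struct{} // stand-in for a real workload cluster client

var errRequeue = errors.New("another reconciler holds the lock, requeue")

func (t *tracker) GetClient(cluster string) (*workloadClient, error) {
	// TryLock models the current behavior: while one reconciler creates a
	// client, every other caller fails fast and requeues (requeueAfter 1m).
	if !t.mu.TryLock() {
		return nil, errRequeue
	}
	defer t.mu.Unlock()

	if c, ok := t.clients[cluster]; ok {
		return c, nil
	}

	// Creating the client dials the workload cluster. If the cluster is
	// unreachable, this blocks the caller, with the lock held, until the
	// timeout fires.
	time.Sleep(10 * time.Second) // stand-in for the ~10s dial timeout

	c := &workloadClient{}
	t.clients[cluster] = c
	return c, nil
}

func main() {
	t := &tracker{clients: map[string]*workloadClient{}}
	_, _ = t.GetClient("cluster-a")
}

The "smarter locking" point above would presumably mean not holding the lock across the dial, e.g. performing the slow connection attempt outside the critical section.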

We also have some additional requirements:

Tasks:

Follow-up Tasks:

Follow-up ideas:

  • Consider exposing and using the ReaderFailOnMissingInformer cache option
  • Consider more sophisticated backoff behavior for connection creation (unclear if necessary / wanted)
  • Consider making ClusterCache a generic per-cluster cache (see the sketch after this list), e.g. for:
    • etcd client certs (to replace GetClientCertificatePrivateKey)
      • Potentially also use just one private key for KCP instead of one per Cluster (we can also consider rotating this one key if necessary)
    • but maybe also other things like the drain cache
    • Important! Must be safe against deadlocks!
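For illustration, a minimal sketch of what such a generic per-cluster cache could look like (the API shape, names, and single-mutex design are assumptions, not the actual design). The deadlock-safety rule here is simply: never acquire any other lock while holding the cache's own mutex.

package main

import "sync"

// clusterAccessor is a stand-in for the per-cluster accessor; cache holds
// arbitrary per-cluster values (etcd client certs, drain state, ...).
type clusterAccessor struct {
	mu    sync.Mutex
	cache map[string]any
}

// Store and Load do nothing but touch the map while holding mu; never
// calling out to other locks while mu is held keeps this deadlock-safe.
func (a *clusterAccessor) Store(key string, value any) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.cache == nil {
		a.cache = map[string]any{}
	}
	a.cache[key] = value
}

func (a *clusterAccessor) Load(key string) (any, bool) {
	a.mu.Lock()
	defer a.mu.Unlock()
	v, ok := a.cache[key]
	return v, ok
}

func main() {
	a := &clusterAccessor{}
	a.Store("etcd-client-cert-key", []byte("..."))
	if v, ok := a.Load("etcd-client-cert-key"); ok {
		_ = v
	}
}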
@k8s-ci-robot k8s-ci-robot added needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 8, 2024
@sbueringer sbueringer self-assigned this Oct 8, 2024
sbueringer (Member Author):

/area clustercachetracker
/triage accepted

@k8s-ci-robot k8s-ci-robot added area/clustercache Issues or PRs related to the clustercachetracker triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 8, 2024
@sbueringer sbueringer added kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence. labels Oct 16, 2024
@k8s-ci-robot k8s-ci-robot removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. labels Oct 16, 2024
@sbueringer sbueringer added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Oct 16, 2024
sbueringer (Member Author):

@cahillsf Are you interested in implementing some more metrics? Metrics for clustercache could be really nice.

(I didn't think about the format yet, maybe you can take a look and propose something)

cahillsf (Member):

@sbueringer yeah sure 👍 will take a look

cahillsf (Member):

@sbueringer to confirm, this would be for the new ClusterCache introduced in PR #11247, right?

sbueringer (Member Author):

Yes! The old ClusterCacheTracker will be removed soon

cahillsf (Member):

@sbueringer so for the health check metrics i was thinking:

then for connection metrics we could do something like https://demo.promlabs.com/metrics

# HELP net_conntrack_dialer_conn_established_total Total number of connections successfully established by the given dialer a given name.
# TYPE net_conntrack_dialer_conn_established_total counter
net_conntrack_dialer_conn_established_total{dialer_name="alertmanager"} 1
...
# HELP net_conntrack_dialer_conn_failed_total Total number of connections failed to dial by the dialer a given name.
# TYPE net_conntrack_dialer_conn_failed_total counter
net_conntrack_dialer_conn_failed_total{dialer_name="alertmanager",reason="refused"} 0
...
  • labels would be something like cluster_name rather than dialer_name in this case
  • observation would take place right above that health check branch here

wdyt?

sbueringer (Member Author) commented Jan 30, 2025:

I think I would observe in the switch in the HealthCheck method; apart from that, sounds good.

> then for connection metrics we could do something like demo.promlabs.com/metrics
> [net_conntrack_dialer_conn_established_total / net_conntrack_dialer_conn_failed_total examples as above]
>   • labels would be something like cluster_name rather than dialer_name in this case
>   • observation would take place right above that health check branch here

We only have one connection per cluster, so I think a counter with a cluster_name dimension doesn't make sense.
I wonder if we should do something like the up metric in Prometheus:

capi_cluster_cache_connection_up{cluster=<>} 0 (or 1)

Maybe we should observe in the Connect / Disconnect methods, just to be sure we cover all the cases where these are called.
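A rough sketch of that up-style gauge using client_golang (the metric and label names come from this thread; the recordConnect / recordDisconnect helpers and their exact call sites in Connect / Disconnect are assumptions):

package metrics

import "github.com/prometheus/client_golang/prometheus"

// connectionUp mirrors the Prometheus "up" pattern: 1 while the connection
// to the workload cluster is established, 0 otherwise.
var connectionUp = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "capi_cluster_cache_connection_up",
	Help: "Whether the connection to the workload cluster is up (1) or down (0).",
}, []string{"cluster"})

func init() {
	prometheus.MustRegister(connectionUp)
}

// Called from Connect / Disconnect so every connect/disconnect path is covered.
func recordConnect(cluster string)    { connectionUp.WithLabelValues(cluster).Set(1) }
func recordDisconnect(cluster string) { connectionUp.WithLabelValues(cluster).Set(0) }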

cahillsf (Member) commented Feb 3, 2025:

@sbueringer 👋 is there an issue with this defer statement?

defer func() {
	if retErr != nil {
		log.Error(retErr, "Connect failed")
		ca.lockedState.lastConnectionCreationErrorTimestamp = time.Now()
	}
}()

from what i can tell retErr is declared in the function signature but never actually assigned any error return parameters within the rest of the method

sbueringer (Member Author) commented Feb 4, 2025:

> from what i can tell retErr is declared in the function signature but never actually assigned any error return parameters within the rest of the method

I don't see any issue there. We don't have to explicitly set retErr: since it is a named return value, every error returned via return err is automatically assigned to retErr.
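For reference, a tiny standalone example of that named-return behavior (hypothetical connect function, not the actual Connect code):

package main

import (
	"errors"
	"fmt"
)

// retErr is a named return value: a plain `return err` assigns err to
// retErr before deferred functions run, so the defer sees it without any
// explicit assignment in the body.
func connect() (retErr error) {
	defer func() {
		if retErr != nil {
			fmt.Println("Connect failed:", retErr)
		}
	}()
	return errors.New("dial tcp: i/o timeout") // assigned to retErr automatically
}

func main() {
	_ = connect() // prints: Connect failed: dial tcp: i/o timeout
}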

cahillsf (Member) commented Feb 4, 2025:

> > from what i can tell retErr is declared in the function signature but never actually assigned any error return parameters within the rest of the method
>
> I don't see any issue there. We don't have to explicitly set retErr: since it is a named return value, every error returned via return err is automatically assigned to retErr.

ah interesting -- i need to go back to my Go books 😄 . thanks
