-
Hello, I'm trying to use linkerd (version edge-24.7.5) on a two-node bare-metal k3s cluster. The linkerd control plane pods are running on the same node as the k3s control plane, and the workloads I'm trying to inject are running on the worker node. The pods I'm trying to inject get stuck at
I have installed linkerd using Helm and configured it to rotate certs with cert-manager using the following values:
Here is the output of
The problem usually happens after the k3s control plane has been running for some time, and restarting the control plane sometimes fixes the issue. I haven't encountered the same problem when the linkerd control plane runs on the worker node, but I assumed it's better to host the linkerd control plane on the k3s control plane node; is that right? I've tested pod connectivity across nodes using iperf3 with both TCP and UDP, and in both cases the transfer rates are good compared to what I get going host-to-host outside k3s. I've also changed my Flannel backend from vxlan to host-gw; that improved the raw transfer speeds, but it hasn't fixed this issue. At this point I'm not sure what else to try. Is this an issue with linkerd, or should I do more to rule out the CNI?
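For anyone reproducing this, a rough sketch of how I've been pulling identity/certificate errors out of the proxy logs. The namespace and workload names are placeholders, and the sample log line below is made up to show the filter; in a real cluster the input would come from `kubectl logs -n <namespace> deploy/<workload> -c linkerd-proxy`:

```shell
#!/bin/sh
# Sketch: filter linkerd-proxy log lines for identity/certificate errors.
# The sample lines below are stand-ins for real proxy output.
filter_identity_errors() {
  grep -iE 'NotValidYet|InvalidCertificate|identity'
}

printf '%s\n' \
  'WARN ThreadId(1) identity: certificate is NotValidYet' \
  'INFO ThreadId(1) outbound: connection closed' |
  filter_identity_errors
```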
Replies: 1 comment
-
Found the source of the problem: clock drift between the nodes (Raspberry Pis) exceeded the lifetime of the certificates issued by the control plane (from the logs, it looks to be on the order of 20 seconds). This meant the workloads couldn't validate the generated certificates, resulting in the `NotValidYet` error in the logs above.
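Since the fix turned out to be clock sync, here's a minimal sketch of the check that would have caught this: compare epoch timestamps from the two nodes against the ~20 s certificate lifetime seen in the logs. Gathering the remote timestamp over ssh and the node names are assumptions; on the Pis themselves, enabling NTP (e.g. `timedatectl set-ntp true`, or installing chrony) keeps the clocks in sync:

```shell
#!/bin/sh
# Sketch: flag clock drift larger than the ~20 s proxy certificate lifetime.
# On a real cluster, gather the timestamps with something like:
#   ssh pi-node-1 date +%s   and   ssh pi-node-2 date +%s
# (node names are placeholders).
MAX_DRIFT=20

# drift_ok T1 T2 -> success if |T1 - T2| <= MAX_DRIFT seconds
drift_ok() {
  d=$(( $1 - $2 ))
  if [ "$d" -lt 0 ]; then d=$(( -d )); fi
  [ "$d" -le "$MAX_DRIFT" ]
}

# Example using the local clock twice (drift is effectively zero):
if drift_ok "$(date +%s)" "$(date +%s)"; then
  echo "clocks within ${MAX_DRIFT}s of each other"
else
  echo "drift exceeds ${MAX_DRIFT}s; fix NTP before trusting short-lived certs"
fi
```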