-
Notifications
You must be signed in to change notification settings - Fork 510
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
EKS Hybrid Nodes best practices for network disconnections (#617)
- Loading branch information
Showing
9 changed files
with
530 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
//!!NODE_ROOT <chapter> | ||
[[hybrid,hybrid.title]] | ||
= Best Practices for Hybrid Deployments | ||
:doctype: book | ||
:sectnums: | ||
:toc: left | ||
:icons: font | ||
:experimental: | ||
:idprefix: | ||
:idseparator: - | ||
:sourcedir: . | ||
:info_doctype: chapter | ||
:info_title: Best Practices for Hybrid Deployments | ||
:info_abstract: Best Practices for Hybrid Deployments | ||
:info_titleabbrev: Hybrid | ||
:imagesdir: images/hybrid/ | ||
|
||
This guide provides guidance on running deployments in on-premise or edge environments with EKS Hybrid Nodes or EKS Anywhere. | ||
|
||
We currently have published guides for the following topics: | ||
|
||
- xref:hybrid-nodes-network-disconnections[Best Practices for EKS Hybrid Nodes and network disconnections] | ||
|
||
include::network-disconnections/index.adoc[leveloffset=+1] |
133 changes: 133 additions & 0 deletions
133
latest/bpg/hybrid/network-disconnections/app-network-traffic.adoc
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,133 @@ | ||
[.topic] | ||
[[hybrid-nodes-app-network-traffic,hybrid-nodes-app-network-traffic.title]] | ||
= Application network traffic through network disconnections | ||
:info_doctype: section | ||
:info_title: Application network traffic through network disconnections | ||
:info_titleabbrev: Application network traffic | ||
:info_abstract: Application network traffic through network disconnections | ||
|
||
The topics on this page are related to Kubernetes cluster networking and the application traffic during network disconnections between nodes and the Kubernetes control plane. | ||
|
||
== Cilium | ||
|
||
Cilium has several modes for IP address management (IPAM), encapsulation, load balancing, and cluster routing. The modes validated in this guide used Cluster Scope IPAM, VXLAN overlay, BGP load balancing, and kube-proxy. Cilium was also used without BGP load balancing, replacing it with MetalLB L2 load balancing. | ||
|
||
The base of the Cilium install consists of the Cilium operator and Cilium agents. The Cilium operator runs as a Deployment and registers the Cilium Custom Resource Definitions (CRDs), manages IPAM, and synchronizes cluster objects with the Kubernetes API server among https://docs.cilium.io/en/stable/internals/cilium_operator/[other capabilities]. The Cilium agents run on each node as a DaemonSet and manage the eBPF programs to control the network rules for workloads running on the cluster. | ||
|
||
Generally, the in-cluster routing configured by Cilium remains available and in-place during network disconnections, which can be confirmed by observing the in-cluster traffic flows and IP table (iptables) rules for the pod network. | ||
|
||
[source,bash,subs="verbatim,attributes,quotes"] | ||
---- | ||
ip route show table all | grep cilium | ||
---- | ||
|
||
[source,bash,subs="verbatim,attributes,quotes"] | ||
---- | ||
10.86.2.0/26 via 10.86.3.16 dev cilium_host proto kernel src 10.86.3.16 mtu 1450 | ||
10.86.2.64/26 via 10.86.3.16 dev cilium_host proto kernel src 10.86.3.16 mtu 1450 | ||
10.86.2.128/26 via 10.86.3.16 dev cilium_host proto kernel src 10.86.3.16 mtu 1450 | ||
10.86.2.192/26 via 10.86.3.16 dev cilium_host proto kernel src 10.86.3.16 mtu 1450 | ||
10.86.3.0/26 via 10.86.3.16 dev cilium_host proto kernel src 10.86.3.16 | ||
10.86.3.16 dev cilium_host proto kernel scope link | ||
... | ||
---- | ||
|
||
However, during network disconnections, the Cilium operator and Cilium agents restart due to the coupling of their health checks with the health of the connection with the Kubernetes API server. It is expected to see the following in the logs of the Cilium operator and Cilium agents during network disconnections. During the network disconnections, you can use tools such as the `crictl` CLI to observe the restarts of these components including their logs. | ||
|
||
[source,bash,subs="verbatim,attributes,quotes"] | ||
---- | ||
msg="Started gops server" address="127.0.0.1:9890" subsys=gops | ||
msg="Establishing connection to apiserver" host="https://<k8s-cluster-ip>:443" subsys=k8s-client | ||
msg="Establishing connection to apiserver" host="https://<k8s-cluster-ip>:443" subsys=k8s-client | ||
msg="Unable to contact k8s api-server" error="Get \"https://<k8s-cluster-ip>:443/api/v1/namespaces/kube-system\": dial tcp <k8s-cluster-ip>:443: i/o timeout" ipAddr="https://<k8s-cluster-ip>:443" subsys=k8s-client | ||
msg="Start hook failed" function="client.(*compositeClientset).onStart (agent.infra.k8s-client)" error="Get \"https://<k8s-cluster-ip>:443/api/v1/namespaces/kube-system\": dial tcp <k8s-cluster-ip>:443: i/o timeout" | ||
msg="Start failed" error="Get \"https://<k8s-cluster-ip>:443/api/v1/namespaces/kube-system\": dial tcp <k8s-cluster-ip>:443: i/o timeout" duration=1m5.003834026s | ||
msg=Stopping | ||
msg="Stopped gops server" address="127.0.0.1:9890" subsys=gops | ||
msg="failed to start: Get \"https://<k8s-cluster-ip>:443/api/v1/namespaces/kube-system\": dial tcp <k8s-cluster-ip>:443: i/o timeout" subsys=daemon | ||
---- | ||
|
||
If you are using Cilium's BGP Control Plane capability for application load balancing, the BGP session for your pods and services might be down during network disconnections because the BGP speaker functionality is integrated with the Cilium agent, and the Cilium agent will continuously restart when disconnected from the Kubernetes control plane. For more information, see the Cilium BGP Control Plane Operation Guide in the Cilium documentation. Additionally, if you experience a simultaneous failure during a network disconnection such as a power cycle or machine reboot, the Cilium routes will not be preserved through these actions, though the routes are recreated when the node reconnects to the Kubernetes control plane and Cilium starts up again. | ||
|
||
== Calico | ||
|
||
_Coming soon_ | ||
|
||
== MetalLB | ||
|
||
MetalLB has two modes for load balancing: https://metallb.universe.tf/concepts/layer2/[L2 mode] and https://metallb.universe.tf/concepts/bgp/[BGP mode]. Reference the MetalLB documentation for details of how these load balancing modes work and their limitations. The validation for this guide used MetalLB in L2 mode, where one machine in the cluster takes ownership of the Kubernetes Service, and uses ARP for IPv4 to make the load balancer IP addresses reachable on the local network. When running MetalLB there is a controller that is responsible for the IP assignment and speakers that run on each node which are responsible for advertising services with assigned IP addresses. The MetalLB controller runs as a Deployment and the MetalLB speakers run as a DaemonSet. During network disconnections, the MetalLB controller and speakers fail to watch the Kubernetes API server for cluster resources but continue running. Most importantly, the Services that are using MetalLB for external connectivity remain available and accessible during network disconnections. | ||
|
||
== kube-proxy | ||
|
||
In EKS clusters, kube-proxy runs as a DaemonSet on each node and is responsible for managing network rules to enable communication between services and pods by translating service IP addresses to the IP addresses of the underlying pods. The IP tables (iptables) rules configured by kube-proxy are maintained during network disconnections and in-cluster routing continues to function and the kube-proxy pods continue to run. | ||
|
||
You can observe the kube-proxy rules with the following iptables commands. The first command shows packets going through the `PREROUTING` chain get directed to the `KUBE-SERVICES` chain. | ||
|
||
[source,bash,subs="verbatim,attributes,quotes"] | ||
---- | ||
iptables -t nat -L PREROUTING | ||
---- | ||
|
||
[source,bash,subs="verbatim,attributes,quotes"] | ||
---- | ||
Chain PREROUTING (policy ACCEPT) | ||
target prot opt source destination | ||
KUBE-SERVICES all -- anywhere anywhere /* kubernetes service portals */ | ||
---- | ||
|
||
Inspecting the `KUBE-SERVICES` chain we can see the rules for the various cluster services. | ||
|
||
[source,bash,subs="verbatim,attributes,quotes"] | ||
---- | ||
Chain KUBE-SERVICES (2 references) | ||
target prot opt source destination | ||
KUBE-SVL-NZTS37XDTDNXGCKJ tcp -- anywhere 172.16.189.136 /* kube-system/hubble-peer:peer-service cluster IP */ | ||
KUBE-SVC-2BINP2AXJOTI3HJ5 tcp -- anywhere 172.16.62.72 /* default/metallb-webhook-service cluster IP */ | ||
KUBE-SVC-LRNEBRA3Z5YGJ4QC tcp -- anywhere 172.16.145.111 /* default/redis-leader cluster IP */ | ||
KUBE-SVC-I7SKRZYQ7PWYV5X7 tcp -- anywhere 172.16.142.147 /* kube-system/eks-extension-metrics-api:metrics-api cluster IP */ | ||
KUBE-SVC-JD5MR3NA4I4DYORP tcp -- anywhere 172.16.0.10 /* kube-system/kube-dns:metrics cluster IP */ | ||
KUBE-SVC-TCOU7JCQXEZGVUNU udp -- anywhere 172.16.0.10 /* kube-system/kube-dns:dns cluster IP */ | ||
KUBE-SVC-ERIFXISQEP7F7OF4 tcp -- anywhere 172.16.0.10 /* kube-system/kube-dns:dns-tcp cluster IP */ | ||
KUBE-SVC-ENODL3HWJ5BZY56Q tcp -- anywhere 172.16.7.26 /* default/frontend cluster IP */ | ||
KUBE-EXT-ENODL3HWJ5BZY56Q tcp -- anywhere <LB-IP> /* default/frontend loadbalancer IP */ | ||
KUBE-SVC-NPX46M4PTMTKRN6Y tcp -- anywhere 172.16.0.1 /* default/kubernetes:https cluster IP */ | ||
KUBE-SVC-YU5RV2YQWHLZ5XPR tcp -- anywhere 172.16.228.76 /* default/redis-follower cluster IP */ | ||
KUBE-NODEPORTS all -- anywhere anywhere /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ | ||
---- | ||
|
||
Inspecting the chain of the frontend service for the application we can see the pod IP addresses backing the service. | ||
|
||
[source,bash,subs="verbatim,attributes,quotes"] | ||
---- | ||
iptables -t nat -L KUBE-SVC-ENODL3HWJ5BZY56Q | ||
---- | ||
|
||
[source,bash,subs="verbatim,attributes,quotes"] | ||
---- | ||
Chain KUBE-SVC-ENODL3HWJ5BZY56Q (2 references) | ||
target prot opt source destination | ||
KUBE-SEP-EKXE7ASH7Y74BGBO all -- anywhere anywhere /* default/frontend -> 10.86.2.103:80 */ statistic mode random probability 0.33333333349 | ||
KUBE-SEP-GCY3OUXWSVMSEAR6 all -- anywhere anywhere /* default/frontend -> 10.86.2.179:80 */ statistic mode random probability 0.50000000000 | ||
KUBE-SEP-6GJJR3EF5AUP2WBU all -- anywhere anywhere /* default/frontend -> 10.86.3.47:80 */ | ||
---- | ||
|
||
The following kube-proxy log messages are expected during network disconnections as it attempts to watch the Kubernetes API server for updates to node and endpoint resources. | ||
|
||
[source,bash,subs="verbatim,attributes,quotes"] | ||
---- | ||
"Unhandled Error" err="k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.Node: failed to list *v1.Node: Get \"https://<k8s-endpoint>/api/v1/nodes?fieldSelector=metadata.name%3D<node-name>&resourceVersion=2241908\": dial tcp <k8s-ip>:443: i/o timeout" logger="UnhandledError" | ||
"Unhandled Error" err="k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Get \"https://<k8s-endpoint>/apis/discovery.k8s.io/v1/endpointslices?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&resourceVersion=2242090\": dial tcp <k8s-ip>:443: i/o timeout" logger="UnhandledError" | ||
---- | ||
|
||
== CoreDNS | ||
|
||
By default, pods in EKS clusters use the CoreDNS cluster IP address as the name server for in-cluster DNS queries. In EKS clusters, CoreDNS runs as a Deployment on nodes. With hybrid nodes, pods are able to continue communicating with the CoreDNS during network disconnections when there are CoreDNS replicas running locally on hybrid nodes. If you have an EKS cluster with nodes in the cloud and hybrid nodes in your on-premises environment, it is recommended to have at least one CoreDNS replica in each environment. CoreDNS continues serving DNS queries for records that were created before the network disconnection and continues running through the network reconnection for static stability. | ||
|
||
The following CoreDNS log messages are expected during network disconnections as it attempts to list objects from the Kubernetes API server. | ||
|
||
[source,bash,subs="verbatim,attributes,quotes"] | ||
---- | ||
Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://<k8s-cluster-ip>:443/api/v1/namespaces?resourceVersion=2263964": dial tcp <k8s-cluster-ip>:443: i/o timeout | ||
Failed to watch *v1.Service: failed to list *v1.Service: Get "https://<k8s-cluster-ip>:443/api/v1/services?resourceVersion=2263966": dial tcp <k8s-cluster-ip>:443: i/o timeout | ||
Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Get "https://<k8s-cluster-ip>:443/apis/discovery.k8s.io/v1/endpointslices?resourceVersion=2263896": dial tcp <k8s-cluster-ip>: i/o timeout | ||
---- |
Oops, something went wrong.