Skip to content

Commit

Permalink
Add ability to configure client-go's QPS and Burst settings (#11644)
Browse files Browse the repository at this point in the history
* Add ability to configure client-go's `QPS` and `Burst` settings

## Problem and Symptoms

When having a very large number of proxies request identity in a short period of time (e.g. during large node scaling events), the identity controller will attempt to validate the tokens sent by the proxies at a rate surpassing client-go's the default request rate threshold, triggering client-side throttling, which will delay the proxies initialization, and even failing their startup (after a 2m timeout). The identity controller will surface this through log entries like this:

```
time="2023-11-08T19:50:45Z" level=error msg="error validating token for web.emojivoto.serviceaccount.identity.linkerd.cluster.local: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline"
```

## Solution

Client-go's default `QPS` is 5 and `Burst` is 10. This PR exposes those settings as entries in `values.yaml` with defaults of 100 and 200 respectively. Note this only applies to the identity controller, as it's the only controller performing direct requests to the `kube-apiserver` in a hot path. The other controllers mostly rely in informers, and direct calls are sporadic.

## Observability

The `QPS` and `Burst` settings used are exposed both as a log entry as soon as the controller starts, and as in the new metric gauges `http_client_qps` and `http_client_burst`

## Testing

You can use the following K6 script, which simulates 6k calls to the `Certify` service during one minute from emojivoto's web pod. Before running this you need to:

- Put the identity.proto and [all the other proto files](https://github.com/linkerd/linkerd2-proxy-api/tree/v0.11.0/proto) in the same directory.
- Edit the [checkRequest](https://github.com/linkerd/linkerd2/blob/edge-23.11.3/pkg/identity/service.go#L266) function and add logging statements to figure the `token` and `csr` entries you can use here, that will be shown as soon as a web pod starts.

```javascript
import { Client, Stream } from 'k6/experimental/grpc';
import { sleep } from 'k6';

const client = new Client();
client.load(['.'], 'identity.proto');

// This always holds:
// req_num = (1 / req_duration ) * duration * VUs
// Given req_duration (0.5s) test duration (1m) and the target req_num (6k), we
// can solve for the required VUs:
// VUs = req_num * req_duration / duration
// VUs = 6000 * 0.5 / 60 = 50
export const options = {
  scenarios: {
    identity: {
      executor: 'constant-vus',
      vus: 50,
      duration: '1m',
    },
  },
};

export default () => {
  client.connect('localhost:8080', {
    plaintext: true,
  });

  const stream = new Stream(client, 'io.linkerd.proxy.identity.Identity/Certify');

  // Replace with your own token
  let token = "ZXlKaGJHY2lPaUpTVXpJMU5pSXNJbXRwWkNJNkluQjBaV1pUZWtaNWQyVm5OMmxmTTBkV2VUTlhWSFpqTmxwSmJYRmtNMWRSVEhwNVNHWllhUzFaZDNNaWZRLmV5SmhkV1FpT2xzaWFXUmxiblJwZEhrdWJEVmtMbWx2SWwwc0ltVjRjQ0k2TVRjd01EWTRPVFk1TUN3aWFXRjBJam94TnpBd05qQXpNamt3TENKcGMzTWlPaUpvZEhSd2N6b3ZMMnQxWW1WeWJtVjBaWE11WkdWbVlYVnNkQzV6ZG1NdVkyeDFjM1JsY2k1c2IyTmhiQ0lzSW10MVltVnlibVYwWlhNdWFXOGlPbnNpYm1GdFpYTndZV05sSWpvaVpXMXZhbWwyYjNSdklpd2ljRzlrSWpwN0ltNWhiV1VpT2lKM1pXSXRPRFUxT1dJNU4yWTNZeTEwYldJNU5TSXNJblZwWkNJNklqaGlZbUV5WWpsbExXTXdOVGN0TkRnMk1TMWhNalZsTFRjelpEY3dOV1EzWmpoaU1TSjlMQ0p6WlhKMmFXTmxZV05qYjNWdWRDSTZleUp1WVcxbElqb2lkMlZpSWl3aWRXbGtJam9pWm1JelpUQXlNRE10TmpZMU55MDBOMk0xTFRoa09EUXRORGt6WXpBM1lXUTJaak0zSW4xOUxDSnVZbVlpT2pFM01EQTJNRE15T1RBc0luTjFZaUk2SW5ONWMzUmxiVHB6WlhKMmFXTmxZV05qYjNWdWREcGxiVzlxYVhadmRHODZkMlZpSW4wLnlwMzAzZVZkeHhpamxBOG1wVjFObGZKUDB3SC03RmpUQl9PcWJ3NTNPeGU1cnNTcDNNNk96VWR6OFdhYS1hcjNkVVhQR2x2QXRDRVU2RjJUN1lKUFoxVmxxOUFZZTNvV2YwOXUzOWRodUU1ZDhEX21JUl9rWDUxY193am9UcVlORHA5ZzZ4ZFJNcW9reGg3NE9GNXFjaEFhRGtENUJNZVZ6a25kUWZtVVZwME5BdTdDMTZ3UFZWSlFmNlVXRGtnYkI1SW9UQXpxSmcyWlpyNXBBY3F5enJ0WE1rRkhSWmdvYUxVam5sN1FwX0ljWm8yYzJWWk03T2QzRjIwcFZaVzJvejlOdGt3THZoSEhSMkc5WlNJQ3RHRjdhTkYwNVR5ZC1UeU1BVnZMYnM0ZFl1clRYaHNORjhQMVk4RmFuNjE4d0x6ZUVMOUkzS1BJLUctUXRUNHhWdw==";
  // Replace with your own CSR
  let csr = "MIIBWjCCAQECAQAwRjFEMEIGA1UEAxM7d2ViLmVtb2ppdm90by5zZXJ2aWNlYWNjb3VudC5pZGVudGl0eS5saW5rZXJkLmNsdXN0ZXIubG9jYWwwWTATBgcqhkjOPQIBBggqhkjOPQMBBwNCAATKjgVXu6F+WCda3Bbq2ue6m3z6OTMfQ4Vnmekmvirip/XGyi2HbzRzjARnIzGlG8wo4EfeYBtd2MBCb50kP8F8oFkwVwYJKoZIhvcNAQkOMUowSDBGBgNVHREEPzA9gjt3ZWIuZW1vaml2b3RvLnNlcnZpY2VhY2NvdW50LmlkZW50aXR5LmxpbmtlcmQuY2x1c3Rlci5sb2NhbDAKBggqhkjOPQQDAgNHADBEAiAM7aXY8MRs/EOhtPo4+PRHuiNOV+nsmNDv5lvtJt8T+QIgFP5JAq0iq7M6ShRNkRG99ZquJ3L3TtLWMNVTPvqvvUE=";

  const data = {
		identity:                     "web.emojivoto.serviceaccount.identity.linkerd.cluster.local",
    token:                        token,
		certificate_signing_request:  csr,
  };
  stream.write(data);

  // This request takes around 2ms, so this sleep will mostly determine its final duration
  sleep(0.5);
};
```

This results in the following report:

```
scenarios: (100.00%) 1 scenario, 50 max VUs, 1m30s max duration (incl. graceful stop):
           * identity: 50 looping VUs for 1m0s (gracefulStop: 30s)

     data_received................: 6.3 MB 104 kB/s
     data_sent....................: 9.4 MB 156 kB/s
     grpc_req_duration............: avg=2.14ms   min=873.93µs med=1.9ms    max=12.89ms  p(90)=3.13ms   p(95)=3.86ms
     grpc_streams.................: 6000   99.355331/s
     grpc_streams_msgs_received...: 6000   99.355331/s
     grpc_streams_msgs_sent.......: 6000   99.355331/s
     iteration_duration...........: avg=503.16ms min=500.8ms  med=502.64ms max=532.36ms p(90)=504.05ms p(95)=505.72ms
     iterations...................: 6000   99.355331/s
     vus..........................: 50     min=50      max=50
     vus_max......................: 50     min=50      max=50

running (1m00.4s), 00/50 VUs, 6000 complete and 0 interrupted iterations
```

With the old defaults (QPS=5 and Burst=10), the latencies would be much higher and number of complete requests much lower.
  • Loading branch information
alpeb authored Nov 28, 2023
1 parent 6965162 commit 2d71629
Show file tree
Hide file tree
Showing 29 changed files with 176 additions and 13 deletions.
4 changes: 4 additions & 0 deletions charts/linkerd-control-plane/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -175,11 +175,15 @@ Kubernetes: `>=1.21.0-0`
| identity.issuer.tls | object | `{"crtPEM":"","keyPEM":""}` | Which scheme is used for the identity issuer secret format |
| identity.issuer.tls.crtPEM | string | `""` | Issuer certificate (ECDSA). It must be provided during install. |
| identity.issuer.tls.keyPEM | string | `""` | Key for the issuer certificate (ECDSA). It must be provided during install |
| identity.kubeAPI.clientBurst | int | `200` | Burst value over clientQPS |
| identity.kubeAPI.clientQPS | int | `100` | Maximum QPS sent to the kube-apiserver before throttling. See [token bucket rate limiter implementation](https://github.com/kubernetes/client-go/blob/v12.0.0/util/flowcontrol/throttle.go) |
| identity.serviceAccountTokenProjection | bool | `true` | Use [Service Account token Volume projection](https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#service-account-token-volume-projection) for pod validation instead of the default token |
| identityTrustAnchorsPEM | string | `""` | Trust root certificate (ECDSA). It must be provided during install. |
| identityTrustDomain | string | clusterDomain | Trust domain used for identity |
| imagePullPolicy | string | `"IfNotPresent"` | Docker image pull policy |
| imagePullSecrets | list | `[]` | For Private docker registries, authentication is needed. Registry secrets are applied to the respective service accounts |
| kubeAPI.clientBurst | int | `200` | Burst value over clientQPS |
| kubeAPI.clientQPS | int | `100` | Maximum QPS sent to the kube-apiserver before throttling. See [token bucket rate limiter implementation](https://github.com/kubernetes/client-go/blob/v12.0.0/util/flowcontrol/throttle.go) |
| linkerdVersion | string | `"linkerdVersionValue"` | control plane version. See Proxy section for proxy version |
| networkValidator.connectAddr | string | `"1.1.1.1:20001"` | Address to which the network-validator will attempt to connect. we expect this to be rewritten |
| networkValidator.enableSecurityContext | bool | `true` | Include a securityContext in the network-validator pod spec |
Expand Down
2 changes: 2 additions & 0 deletions charts/linkerd-control-plane/templates/identity.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -159,6 +159,8 @@ spec:
- -identity-clock-skew-allowance={{.Values.identity.issuer.clockSkewAllowance}}
- -identity-scheme={{.Values.identity.issuer.scheme}}
- -enable-pprof={{.Values.enablePprof | default false}}
- -kube-apiclient-qps={{.Values.identity.kubeAPI.clientQPS}}
- -kube-apiclient-burst={{.Values.identity.kubeAPI.clientBurst}}
{{- include "partials.linkerd.trace" . | nindent 8 -}}
env:
- name: LINKERD_DISABLED
Expand Down
10 changes: 10 additions & 0 deletions charts/linkerd-control-plane/values.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,13 @@ identityTrustAnchorsPEM: |
# -- Trust domain used for identity
# @default -- clusterDomain
identityTrustDomain: ""
kubeAPI: &kubeapi
# -- Maximum QPS sent to the kube-apiserver before throttling.
# See [token bucket rate limiter
# implementation](https://github.com/kubernetes/client-go/blob/v12.0.0/util/flowcontrol/throttle.go)
clientQPS: 100
# -- Burst value over clientQPS
clientBurst: 200
# -- Additional annotations to add to all pods
podAnnotations: {}
# -- Additional labels to add to all pods
Expand Down Expand Up @@ -335,6 +342,7 @@ identity:

# -- Use [Service Account token Volume projection](https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#service-account-token-volume-projection) for pod validation instead of the default token
serviceAccountTokenProjection: true

issuer:
scheme: linkerd.io/tls

Expand All @@ -353,6 +361,8 @@ identity:
# install
keyPEM: |
kubeAPI: *kubeapi

# -|- CPU, Memory and Ephemeral Storage resources required by the identity controller (see `proxy.resources` for sub-fields)
#identityResources:
# -|- CPU, Memory and Ephemeral Storage resources required by proxy injected into identity pod (see `proxy.resources` for sub-fields)
Expand Down
5 changes: 5 additions & 0 deletions cli/cmd/testdata/install_controlplane_tracing_output.golden

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions cli/cmd/testdata/install_custom_domain.golden

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions cli/cmd/testdata/install_custom_registry.golden

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions cli/cmd/testdata/install_default.golden

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions cli/cmd/testdata/install_default_override_dst_get_nets.golden

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions cli/cmd/testdata/install_default_token.golden

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions cli/cmd/testdata/install_ha_output.golden

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions cli/cmd/testdata/install_ha_with_overrides_output.golden

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions cli/cmd/testdata/install_heartbeat_disabled_output.golden

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions cli/cmd/testdata/install_helm_control_plane_output.golden

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions cli/cmd/testdata/install_helm_control_plane_output_ha.golden

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions cli/cmd/testdata/install_helm_output_ha_labels.golden

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions cli/cmd/testdata/install_no_init_container.golden

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions cli/cmd/testdata/install_output.golden

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions cli/cmd/testdata/install_proxy_ignores.golden

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions cli/cmd/testdata/install_values_file.golden

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading

0 comments on commit 2d71629

Please sign in to comment.