
Azure disk snapshot 429 rate limit with velero-plugin-for-microsoft-azure #7393

Open · behroozam opened this issue Feb 6, 2024 · 12 comments

behroozam commented Feb 6, 2024

We have a couple of AKS clusters in Azure, and we enabled the CSI feature to snapshot/back up PVCs.
It works fine for the clusters with fewer PVCs, but for a cluster with 65 PVCs Azure starts returning a 429 response code.

The issue is that whenever the snapshotter tries to take a snapshot or fetch an existing backup, it calls the Azure API to list the storage account keys, which exhausts the Azure API rate limit.

This is the error message:

 failed with storage.FileSharesClient#Get: Failure responding to request: StatusCode=429 -- Original Error: autorest/azure: Service returned an error. Status=429 Code=\"TooManyRequests\" Message=\"The request is being throttled as the limit has been reached for operation type - Read_ObservationWindow_00:05:00. For more information, see - https://aka.ms/srpthrottlinglimits\", accountName: \"<storageaccountname>\" backup=<veleronamespace>/<veleroPod> cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/util/util.go:259" pluginName=velero-plugin-for-csi

And this is the Azure activity on the storage account that is currently being throttled:

[Screenshot 2024-02-06: Azure activity log on the throttled storage account]

A possible fix: cache the storage account key instead of listing the keys on every request.

  • Velero version: 1.13.0
  • Velero features: EnableCSI
  • Kubernetes version: v1.27.3
  • Cloud provider or hardware configuration: Azure AKS
  • OS: Azure Linux
@Lyndon-Li (Contributor)

It looks like it was the external-snapshotter that called the Azure API. If so, caching the storage account key in Velero wouldn't help. Am I missing anything?

@behroozam (Author)

> It looks like it was the external-snapshotter that called the Azure API. If so, caching the storage account key in Velero wouldn't help. Am I missing anything?

We are using the Velero snapshot class:

apiVersion: snapshot.storage.k8s.io/v1
deletionPolicy: Retain
driver: disk.csi.azure.com
kind: VolumeSnapshotClass
metadata:
  generation: 1
  labels:
    velero.io/csi-volumesnapshot-class: "true"
  name: velero-csi-disk-volume-snapshot-class

@Lyndon-Li (Contributor)

This snapshot class is labeled for Velero with velero.io/csi-volumesnapshot-class, but that doesn't mean Velero drives the snapshot creation; the components that actually take the snapshot are still the external-snapshotter and the Azure Disk CSI driver (disk.csi.azure.com).

@behroozam (Author)

Thank you for your reply @Lyndon-Li.
Perhaps we could add a delay option between snapshots of each PVC, given that Velero is the controller that triggers the external-snapshotter.

@ywk253100 (Contributor)

You can set useAAD=true in the BSL config to avoid calling the list-storage-account-keys API from the Velero Azure plugin; search for useAAD in https://github.com/vmware-tanzu/velero-plugin-for-microsoft-azure for more information.
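For reference, a minimal sketch of a BackupStorageLocation with that setting (the name default and the bracketed values are placeholders; see the plugin README for the full list of config keys):

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: azure
  objectStorage:
    bucket: <blob-container-name>
  config:
    resourceGroup: <resource-group>
    storageAccount: <storage-account-name>
    subscriptionId: <subscription-id>
    useAAD: "true"    # authenticate to blob storage with Azure AD instead of listing account keys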

@behroozam (Author) commented Feb 19, 2024

I've tried both useAAD and storageAccountKeyEnvVar as a fallback, together and individually.
It seems that the snapshotter ignores this configuration and still tries to fetch the existing snapshots by listing the access keys on the backup storage account.
I'm also getting 429 error messages on the existing storage account in the AKS cluster for the current PVCs:

Read_ObservationWindow_00:05:00

which is fairly similar to this issue on Kubernetes, and also to this one on the OpenShift platform.

@ywk253100 (Contributor)

The useAAD setting doesn't affect the behavior of the snapshotter; it only affects the Velero Azure plugin, which also lists storage account access keys when useAAD=false. I thought decreasing the requests made from the Velero Azure plugin side would mitigate the throttling issue, but it seems it doesn't. That makes sense, because the Velero Azure plugin and the snapshotter use two different credentials.

Could you run the velero debug command and provide us with the debug bundle?

As @Lyndon-Li said, this may be an issue in the snapshotter and the Azure CSI driver, in which case there is nothing we can do on the Velero side. But let's gather the debug bundle and check again.
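For reference, the bundle is generated with the Velero CLI's debug command, e.g. (the backup name below is a placeholder):

velero debug --backup <backup-name>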

@anshulahuja98 (Collaborator) commented Jul 4, 2024

If the CSI flow is being used, this might be related to #7978.

@jfmulero

Hi, any idea on this? Since we activated CSI snapshots we have been seeing these warnings in the Velero backup:

message: /VolumeSnapshotContent snapcontent-92bf032f-b5ce-4fa9-a159 has error: Failed to check and update snapshot content: failed to take snapshot of the volume: "rpc error: code = Internal desc = failed to get file share quota: storage.FileSharesClient#Get: Failure responding to request: StatusCode=429 -- Original Error: autorest/azure: Service returned an error. Status=429 Code=\"Unknown\" Message=\"Unknown service error\""

In the Azure portal we see these errors on the storage account:

[Screenshot: storage account errors in the Azure portal]

@anshulahuja98 (Collaborator)

Are you using the latest version of the Azure Files CSI driver for taking VolumeSnapshots of AFS volumes?
Can you read through #7978 and check for an abnormal count of VolumeSnapshot/VolumeSnapshotContent objects in your cluster?

FYI @mayankagg9722, please see the last comment only; the AFS CSI driver seems to be getting throttled.

@mayankagg9722 (Contributor)

Hello @behroozam @jfmulero, this is an already-known issue. To address it, add useDataPlane: true in your volume snapshot class so that it uses the storage data-plane SDK, which doesn't have such throttling limits; this will solve your concern. (A minimal sketch follows the references below.)

Ref:
https://github.com/kubernetes-sigs/azurefile-csi-driver/blob/master/docs/driver-parameters.md

kubernetes-sigs/azurefile-csi-driver#1687

Azure/AKS#804
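For illustration, a minimal sketch of an Azure Files VolumeSnapshotClass with that option set (the class name is a placeholder, and the exact parameter key should be double-checked against the driver-parameters doc linked above):

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: azurefile-dataplane-snapclass    # placeholder name
  labels:
    velero.io/csi-volumesnapshot-class: "true"    # keep the label so Velero selects this class
driver: file.csi.azure.com
deletionPolicy: Retain
parameters:
  useDataPlane: "true"    # route snapshot operations through the data-plane SDK instead of ARM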

@jfmulero commented Feb 4, 2025

> Hello @behroozam @jfmulero, this is an already-known issue. To address it, add useDataPlane: true in your volume snapshot class so that it uses the storage data-plane SDK, which doesn't have such throttling limits; this will solve your concern.
>
> Ref: https://github.com/kubernetes-sigs/azurefile-csi-driver/blob/master/docs/driver-parameters.md
>
> kubernetes-sigs/azurefile-csi-driver#1687
>
> Azure/AKS#804

NAME                                STATUS            ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
velero-devschedule-20250204010006   PartiallyFailed   2        0          2025-02-04 01:00:06 +0000 UTC   4d        veleroaksdev       <none>
velero-devschedule-20250203010033   PartiallyFailed   20       1218       2025-02-03 01:00:33 +0000 UTC   3d        veleroaksdev       <none>
velero-devschedule-20250202010032   PartiallyFailed   20       1200       2025-02-02 01:00:32 +0000 UTC   2d        veleroaksdev       <none>
velero-devschedule-20250201010031   PartiallyFailed   20       1217       2025-02-01 01:00:31 +0000 UTC   1d        veleroaksdev       <none>
velero-devschedule-20250131010028   PartiallyFailed   18       1247       2025-01-31 01:00:29 +0000 UTC   16h       veleroaksdev       <none>

Hi! In my case, I applied the indicated configuration and I haven't had warnings or errors in my AKS backups since. The two remaining errors are from orphaned disks left over from a test in the development cluster. Thank you!
