Fatal error: concurrent map read and map write - CrashLoopBackOff #689
Comments
+1, seeing this error with v23.9.1 & v23.9.2 in Vanilla K8s
+1, also seeing this error with v23.9.1
How many workers are in the cluster?
About 400 GPUs
@ujjwal @age9990 @CecileRobertMichon thanks for reporting this issue. A fix for this has been merged and will be included in our next release: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/commit/59802314ef1bb947ff45978c9163db5b7c9f7e93
If you would like to test the fix out beforehand, you can use the gpu-operator image built from this commit:
CNT-4912 NVIDIA/gpu-operator#689 https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/1057 Signed-off-by: Mike McKiernan <[email protected]>
Hi all -- GPU Operator 24.3.0 has been released and contains a fix for this issue. I am closing this issue, but please re-open it if you are still encountering this with 24.3.0.
1. Quick Debug Information
2. Issue or feature description
GPU-Operator has been reporting the error
fatal error: concurrent map read and map write
and crash looping. This happens sporadically and prevents new GPU nodes from being added to the cluster.
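For readers unfamiliar with this message: it comes from the Go runtime, which aborts the process when it detects a map being read and written from different goroutines without synchronization. The abort cannot be caught with recover, which is consistent with the pod ending up in CrashLoopBackOff. The sketch below shows one common way to guard a shared map with a sync.RWMutex. It is not the code from the linked fix; the type nodeStateCache, its methods, and fillConcurrently are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeStateCache is a hypothetical stand-in for any map shared between
// goroutines. It is not taken from the gpu-operator source.
type nodeStateCache struct {
	mu     sync.RWMutex
	states map[string]string
}

func newNodeStateCache() *nodeStateCache {
	return &nodeStateCache{states: make(map[string]string)}
}

// set takes the write lock, so no reader or other writer touches the map
// while it is being modified.
func (c *nodeStateCache) set(node, state string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.states[node] = state
}

// get takes the read lock, so many readers may run in parallel, but never
// at the same time as a writer.
func (c *nodeStateCache) get(node string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	s, ok := c.states[node]
	return s, ok
}

// size returns the number of entries under the read lock.
func (c *nodeStateCache) size() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return len(c.states)
}

// fillConcurrently starts n writer goroutines and n reader goroutines that
// all hit the same cache. With a plain unguarded map this access pattern can
// trigger "fatal error: concurrent map read and map write".
func fillConcurrently(c *nodeStateCache, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		name := fmt.Sprintf("node-%d", i)
		wg.Add(2)
		go func() {
			defer wg.Done()
			c.set(name, "ready")
		}()
		go func() {
			defer wg.Done()
			c.get(name)
		}()
	}
	wg.Wait()
}

func main() {
	c := newNodeStateCache()
	fillConcurrently(c, 400)
	fmt.Println("entries:", c.size())
}
```

Running a build with the Go race detector enabled (go run -race, or go test -race) is a common way to surface this kind of unsynchronized access before it shows up as a sporadic crash on a large cluster.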