Fatal error: concurrent map read and map write - CrashLoopBackOff #689

Closed
ujjwal opened this issue Mar 28, 2024 · 6 comments
Labels
bug Issue/PR to expose/discuss/fix a bug

Comments


ujjwal commented Mar 28, 2024

1. Quick Debug Information

  • OS/Version: Ubuntu 22.04
  • Kernel Version: 5.15.0-1045-gke
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): GKE
  • GPU Operator Version: 22.9.1

2. Issue or feature description

The GPU Operator has been reporting fatal error: concurrent map read and map write and crash looping. This is happening sporadically and preventing new GPU nodes from being added to the cluster.

{"level":"info","ts":1711652616.7035823,"logger":"controllers.ClusterPolicy","msg":"Reconciliate ClusterPolicies after node label update","nb":1}
{"level":"info","ts":1711652616.703655,"logger":"controllers.ClusterPolicy","msg":"Kubernetes version detected","version":"v1.27.10-gke.1055000"}
fatal error: concurrent map read and map write

goroutine 216 [running]:
k8s.io/apimachinery/pkg/runtime.(*Scheme).New(0xc0002401c0, {{0x1d7bcdf, 0xa}, {0x1d762e6, 0x2}, {0x1905e27, 0xd}})
	/workspace/vendor/k8s.io/apimachinery/pkg/runtime/scheme.go:296 +0x65
sigs.k8s.io/controller-runtime/pkg/cache.(*informerCache).objectTypeForListObject(0xc00049d710, {0x2073490?, 0xc0002cbb90})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/cache/informer_cache.go:119 +0x3dd
sigs.k8s.io/controller-runtime/pkg/cache.(*informerCache).List(0xc00049d710, {0x206a408, 0xc00024cdc0}, {0x2073490, 0xc0002cbb90}, {0x2f8bbc0, 0x0, 0x0})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/cache/informer_cache.go:75 +0x65
sigs.k8s.io/controller-runtime/pkg/client.(*client).List(0xc0004b86c0, {0x206a408, 0xc00024cdc0}, {0x2073490?, 0xc0002cbb90?}, {0x2f8bbc0, 0x0, 0x0})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/client/client.go:365 +0x4c5
github.com/NVIDIA/gpu-operator/controllers.addWatchNewGPUNode.func1({0x206a408, 0xc00024cdc0}, {0xc001a09e20?, 0x424f05?})
	/workspace/controllers/clusterpolicy_controller.go:264 +0x8c
sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).mapAndEnqueue(0xc00160db40?, {0x206a408?, 0xc00024cdc0?}, {0x2073cc0, 0xc0007463a0}, {0x20821a8?, 0xc000e2a440?}, 0xc00160dbc8?)
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:81 +0x59
sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).Create(0x206a408?, {0x206a408, 0xc00024cdc0}, {{0x20821a8?, 0xc000e2a440?}}, {0x2073cc0, 0xc0007463a0})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:58 +0xe5
sigs.k8s.io/controller-runtime/pkg/internal/source.(*EventHandler).OnAdd(0xc0003c4140, {0x1d402e0?, 0xc000e2a440})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/event_handler.go:88 +0x27c
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
	/workspace/vendor/k8s.io/client-go/tools/cache/controller.go:243
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
	/workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:973 +0x13e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0005ddf38?, {0x204fec0, 0xc001602000}, 0x1, 0xc001600000)
	/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x3a6b222c7d7d7b3a?, 0x3b9aca00, 0x0, 0x69?, 0x227b3a227d225c67?)
	/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0005d4990)
	/workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:967 +0x69
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
	/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x4f
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 181
	/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73

goroutine 1 [select]:
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).Start(0xc000622820, {0x206a408, 0xc000482aa0})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:509 +0x825
main.main()
	/workspace/main.go:176 +0xea8
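
For context, this fatal error comes from the Go runtime's built-in map race check: one goroutine reads a plain Go map while another writes to it without synchronization, and the runtime aborts the whole process rather than continue with corrupted state. A minimal, generic sketch of the failure mode (illustrative only, not the operator's code) is:

```go
package main

// Generic illustration of Go's "fatal error: concurrent map read and map write".
// A plain map is read and written from two goroutines with no synchronization,
// so the runtime's map access check aborts the process, just like in the trace
// above. Running this is expected to crash.
func main() {
	m := map[string]int{}

	go func() {
		for {
			m["key"] = 1 // unsynchronized writer
		}
	}()

	for {
		_ = m["key"] // unsynchronized reader in the main goroutine
	}
}
```

In the trace above, the read side is inside runtime.Scheme (scheme.go:296), reached from the operator's node-watch handler through the controller-runtime cache.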

age9990 commented Mar 29, 2024

+1, seeing this error with v23.9.1 & v23.9.2 in Vanilla K8s

@CecileRobertMichon

+1, also seeing this error with v23.9.1

2024-04-03 16:29:36.737Z fatal error: concurrent map read and map write
2024-04-03 16:29:36.740Z 
2024-04-03 16:29:36.740Z goroutine 451 [running]:
2024-04-03 16:29:36.740Z k8s.io/apimachinery/pkg/runtime.(*Scheme).New(0xc0003642a0, {{0x1d7bcdf, 0xa}, {0x1d762e6, 0x2}, {0x1905e27, 0xd}})
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/apimachinery/pkg/runtime/scheme.go:296 +0x65
2024-04-03 16:29:36.740Z sigs.k8s.io/controller-runtime/pkg/cache.(*informerCache).objectTypeForListObject(0xc00080a340, {0x2073490?, 0xc00259f7a0})
2024-04-03 16:29:36.740Z        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/cache/informer_cache.go:119 +0x3dd
2024-04-03 16:29:36.740Z sigs.k8s.io/controller-runtime/pkg/cache.(*informerCache).List(0xc00080a340, {0x206a408, 0xc002523d60}, {0x2073490, 0xc00259f7a0}, {0x2f8bbc0, 0x0, 0x0})
2024-04-03 16:29:36.740Z        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/cache/informer_cache.go:75 +0x65
2024-04-03 16:29:36.740Z sigs.k8s.io/controller-runtime/pkg/client.(*client).List(0xc0005345a0, {0x206a408, 0xc002523d60}, {0x2073490?, 0xc00259f7a0?}, {0x2f8bbc0, 0x0, 0x0})
2024-04-03 16:29:36.740Z        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/client/client.go:365 +0x4c5
2024-04-03 16:29:36.740Z github.com/NVIDIA/gpu-operator/controllers.getClusterPoliciesToReconcile({0x206a408, 0xc002523d60}, {0x2075e60, 0xc0005345a0})
2024-04-03 16:29:36.740Z        /workspace/controllers/upgrade_controller.go:321 +0xaf
2024-04-03 16:29:36.740Z github.com/NVIDIA/gpu-operator/controllers.(*UpgradeReconciler).SetupWithManager.func1({0x206a408, 0xc002523d60}, {0xc001a66060?, 0x424f05?})
2024-04-03 16:29:36.740Z        /workspace/controllers/upgrade_controller.go:248 +0x3e
2024-04-03 16:29:36.740Z sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).mapAndEnqueue(0xc007a9bb40?, {0x206a408?, 0xc002523d60?}, {0x2073cc0, 0xc0004bbe00}, {0x20821a8?, 0xc005380dc0?}, 0xc007a9bbc8?)
2024-04-03 16:29:36.740Z        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:81 +0x59
2024-04-03 16:29:36.740Z sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).Create(0x206a408?, {0x206a408, 0xc002523d60}, {{0x20821a8?, 0xc005380dc0?}}, {0x2073cc0, 0xc0004bbe00})
2024-04-03 16:29:36.740Z        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:58 +0xe5
2024-04-03 16:29:36.740Z sigs.k8s.io/controller-runtime/pkg/internal/source.(*EventHandler).OnAdd(0xc00001fae0, {0x1d402e0?, 0xc005380dc0})
2024-04-03 16:29:36.740Z        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/event_handler.go:88 +0x27c
2024-04-03 16:29:36.740Z k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/client-go/tools/cache/controller.go:243
2024-04-03 16:29:36.740Z k8s.io/client-go/tools/cache.(*processorListener).run.func1()
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:973 +0x13e
2024-04-03 16:29:36.740Z k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
2024-04-03 16:29:36.740Z k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000acef38?, {0x204fec0, 0xc007a82000}, 0x1, 0xc007a80000)
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
2024-04-03 16:29:36.740Z k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
2024-04-03 16:29:36.740Z k8s.io/apimachinery/pkg/util/wait.Until(...)
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
2024-04-03 16:29:36.740Z k8s.io/client-go/tools/cache.(*processorListener).run(0xc0079e4090)
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:967 +0x69
2024-04-03 16:29:36.740Z k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x4f
2024-04-03 16:29:36.740Z created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 178
2024-04-03 16:29:36.741Z        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73

@cdesiniotis cdesiniotis added the needs-triage issue or PR has not been assigned a priority-px label label Apr 3, 2024
@cdesiniotis cdesiniotis added bug Issue/PR to expose/discuss/fix a bug and removed needs-triage issue or PR has not been assigned a priority-px label labels Apr 4, 2024
@Devin-Yue

How many workers are in the cluster?
From the error log, it looks similar to the large-scale issue.


ujjwal commented Apr 5, 2024

How many workers are in the cluster? From the error log, it looks similar to the large-scale issue.

About 400 GPUs

@cdesiniotis
Contributor

@ujjwal @age9990 @CecileRobertMichon thanks for reporting this issue. A fix for this has been merged and will be included in our next release: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/commit/59802314ef1bb947ff45978c9163db5b7c9f7e93

If you would like to test the fix out beforehand, you can use the gpu-operator image built from this commit: registry.gitlab.com/nvidia/kubernetes/gpu-operator/staging/gpu-operator:59802314-ubi8
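
For anyone curious about the general class of fix: the crash happens because shared map state is read by concurrently running event handlers while something else mutates it. In Go this is usually resolved either by finishing all writes to the shared map before concurrent readers start, or by guarding the map with a lock. A small, hedged sketch of the locking pattern (a generic illustration of the failure class, not the code in the linked commit) is:

```go
package main

import (
	"fmt"
	"sync"
)

// safeRegistry wraps a map with a sync.RWMutex so concurrent readers and a
// writer cannot trip Go's "concurrent map read and map write" fatal error.
// Hypothetical illustration only; not taken from the gpu-operator fix.
type safeRegistry struct {
	mu    sync.RWMutex
	types map[string]int
}

func (r *safeRegistry) set(name string, v int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.types[name] = v
}

func (r *safeRegistry) get(name string) (int, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	v, ok := r.types[name]
	return v, ok
}

func main() {
	r := &safeRegistry{types: map[string]int{}}
	var wg sync.WaitGroup

	// One writer updating the map under the write lock.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			r.set("key", i)
		}
	}()

	// Several readers; with a plain, unguarded map this mix would crash.
	for g := 0; g < 4; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				r.get("key")
			}
		}()
	}

	wg.Wait()
	fmt.Println("finished without a concurrent map fault")
}
```

The alternative approach, completing all shared-state registration before the controller manager starts its informers, avoids the lock entirely; which route the merged commit takes is best confirmed from the diff itself.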

mikemckiernan added a commit to NVIDIA/cloud-native-docs that referenced this issue Apr 30, 2024
@cdesiniotis
Contributor

Hi all -- GPU Operator 24.3.0 has been released and contains a fix for this issue.
https://github.com/NVIDIA/gpu-operator/releases/tag/v24.3.0

I am closing this issue, but please re-open it if you are still encountering this with 24.3.0.
