Fatal error: concurrent map read and map write - CrashLoopBackOff #689

Closed
ujjwal opened this issue Mar 28, 2024 · 6 comments
Labels
bug Issue/PR to expose/discuss/fix a bug

Comments


ujjwal commented Mar 28, 2024

1. Quick Debug Information

  • OS/Version: Ubuntu 22.04
  • Kernel Version: 5.15.0-1045-gke
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): GKE
  • GPU Operator Version: 22.9.1

2. Issue or feature description

The GPU Operator has been reporting fatal error: concurrent map read and map write and crash looping. This is happening sporadically and preventing new GPU nodes from being added to the cluster.

{"level":"info","ts":1711652616.7035823,"logger":"controllers.ClusterPolicy","msg":"Reconciliate ClusterPolicies after node label update","nb":1}
{"level":"info","ts":1711652616.703655,"logger":"controllers.ClusterPolicy","msg":"Kubernetes version detected","version":"v1.27.10-gke.1055000"}
fatal error: concurrent map read and map write

goroutine 216 [running]:
k8s.io/apimachinery/pkg/runtime.(*Scheme).New(0xc0002401c0, {{0x1d7bcdf, 0xa}, {0x1d762e6, 0x2}, {0x1905e27, 0xd}})
	/workspace/vendor/k8s.io/apimachinery/pkg/runtime/scheme.go:296 +0x65
sigs.k8s.io/controller-runtime/pkg/cache.(*informerCache).objectTypeForListObject(0xc00049d710, {0x2073490?, 0xc0002cbb90})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/cache/informer_cache.go:119 +0x3dd
sigs.k8s.io/controller-runtime/pkg/cache.(*informerCache).List(0xc00049d710, {0x206a408, 0xc00024cdc0}, {0x2073490, 0xc0002cbb90}, {0x2f8bbc0, 0x0, 0x0})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/cache/informer_cache.go:75 +0x65
sigs.k8s.io/controller-runtime/pkg/client.(*client).List(0xc0004b86c0, {0x206a408, 0xc00024cdc0}, {0x2073490?, 0xc0002cbb90?}, {0x2f8bbc0, 0x0, 0x0})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/client/client.go:365 +0x4c5
github.com/NVIDIA/gpu-operator/controllers.addWatchNewGPUNode.func1({0x206a408, 0xc00024cdc0}, {0xc001a09e20?, 0x424f05?})
	/workspace/controllers/clusterpolicy_controller.go:264 +0x8c
sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).mapAndEnqueue(0xc00160db40?, {0x206a408?, 0xc00024cdc0?}, {0x2073cc0, 0xc0007463a0}, {0x20821a8?, 0xc000e2a440?}, 0xc00160dbc8?)
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:81 +0x59
sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).Create(0x206a408?, {0x206a408, 0xc00024cdc0}, {{0x20821a8?, 0xc000e2a440?}}, {0x2073cc0, 0xc0007463a0})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:58 +0xe5
sigs.k8s.io/controller-runtime/pkg/internal/source.(*EventHandler).OnAdd(0xc0003c4140, {0x1d402e0?, 0xc000e2a440})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/event_handler.go:88 +0x27c
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
	/workspace/vendor/k8s.io/client-go/tools/cache/controller.go:243
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
	/workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:973 +0x13e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0005ddf38?, {0x204fec0, 0xc001602000}, 0x1, 0xc001600000)
	/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x3a6b222c7d7d7b3a?, 0x3b9aca00, 0x0, 0x69?, 0x227b3a227d225c67?)
	/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0005d4990)
	/workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:967 +0x69
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
	/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x4f
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 181
	/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73

goroutine 1 [select]:
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).Start(0xc000622820, {0x206a408, 0xc000482aa0})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:509 +0x825
main.main()
	/workspace/main.go:176 +0xea8
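
For context, this fatal error comes from the Go runtime's built-in map race check: one goroutine reads a plain Go map while another writes to it without synchronization, and the runtime aborts the whole process rather than continue with corrupted state. A minimal, generic sketch of the failure mode (illustrative only, not the operator's code) is:

```go
package main

// Generic illustration of Go's "fatal error: concurrent map read and map write".
// A plain map is read and written from two goroutines with no synchronization,
// so the runtime's map access check aborts the process, just like in the trace
// above. Running this is expected to crash.
func main() {
	m := map[string]int{}

	go func() {
		for {
			m["key"] = 1 // unsynchronized writer
		}
	}()

	for {
		_ = m["key"] // unsynchronized reader in the main goroutine
	}
}
```

In the trace above, the read side is inside runtime.Scheme (scheme.go:296), reached from the operator's node-watch handler through the controller-runtime cache.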

age9990 commented Mar 29, 2024

+1, seeing this error with v23.9.1 & v23.9.2 in Vanilla K8s

@CecileRobertMichon

+1, also seeing this error with v23.9.1

2024-04-03 16:29:36.737Z fatal error: concurrent map read and map write
2024-04-03 16:29:36.740Z 
2024-04-03 16:29:36.740Z goroutine 451 [running]:
2024-04-03 16:29:36.740Z k8s.io/apimachinery/pkg/runtime.(*Scheme).New(0xc0003642a0, {{0x1d7bcdf, 0xa}, {0x1d762e6, 0x2}, {0x1905e27, 0xd}})
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/apimachinery/pkg/runtime/scheme.go:296 +0x65
2024-04-03 16:29:36.740Z sigs.k8s.io/controller-runtime/pkg/cache.(*informerCache).objectTypeForListObject(0xc00080a340, {0x2073490?, 0xc00259f7a0})
2024-04-03 16:29:36.740Z        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/cache/informer_cache.go:119 +0x3dd
2024-04-03 16:29:36.740Z sigs.k8s.io/controller-runtime/pkg/cache.(*informerCache).List(0xc00080a340, {0x206a408, 0xc002523d60}, {0x2073490, 0xc00259f7a0}, {0x2f8bbc0, 0x0, 0x0})
2024-04-03 16:29:36.740Z        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/cache/informer_cache.go:75 +0x65
2024-04-03 16:29:36.740Z sigs.k8s.io/controller-runtime/pkg/client.(*client).List(0xc0005345a0, {0x206a408, 0xc002523d60}, {0x2073490?, 0xc00259f7a0?}, {0x2f8bbc0, 0x0, 0x0})
2024-04-03 16:29:36.740Z        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/client/client.go:365 +0x4c5
2024-04-03 16:29:36.740Z github.com/NVIDIA/gpu-operator/controllers.getClusterPoliciesToReconcile({0x206a408, 0xc002523d60}, {0x2075e60, 0xc0005345a0})
2024-04-03 16:29:36.740Z        /workspace/controllers/upgrade_controller.go:321 +0xaf
2024-04-03 16:29:36.740Z github.com/NVIDIA/gpu-operator/controllers.(*UpgradeReconciler).SetupWithManager.func1({0x206a408, 0xc002523d60}, {0xc001a66060?, 0x424f05?})
2024-04-03 16:29:36.740Z        /workspace/controllers/upgrade_controller.go:248 +0x3e
2024-04-03 16:29:36.740Z sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).mapAndEnqueue(0xc007a9bb40?, {0x206a408?, 0xc002523d60?}, {0x2073cc0, 0xc0004bbe00}, {0x20821a8?, 0xc005380dc0?}, 0xc007a9bbc8?)
2024-04-03 16:29:36.740Z        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:81 +0x59
2024-04-03 16:29:36.740Z sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).Create(0x206a408?, {0x206a408, 0xc002523d60}, {{0x20821a8?, 0xc005380dc0?}}, {0x2073cc0, 0xc0004bbe00})
2024-04-03 16:29:36.740Z        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:58 +0xe5
2024-04-03 16:29:36.740Z sigs.k8s.io/controller-runtime/pkg/internal/source.(*EventHandler).OnAdd(0xc00001fae0, {0x1d402e0?, 0xc005380dc0})
2024-04-03 16:29:36.740Z        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/event_handler.go:88 +0x27c
2024-04-03 16:29:36.740Z k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/client-go/tools/cache/controller.go:243
2024-04-03 16:29:36.740Z k8s.io/client-go/tools/cache.(*processorListener).run.func1()
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:973 +0x13e
2024-04-03 16:29:36.740Z k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
2024-04-03 16:29:36.740Z k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000acef38?, {0x204fec0, 0xc007a82000}, 0x1, 0xc007a80000)
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
2024-04-03 16:29:36.740Z k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
2024-04-03 16:29:36.740Z k8s.io/apimachinery/pkg/util/wait.Until(...)
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
2024-04-03 16:29:36.740Z k8s.io/client-go/tools/cache.(*processorListener).run(0xc0079e4090)
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:967 +0x69
2024-04-03 16:29:36.740Z k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
2024-04-03 16:29:36.740Z        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x4f
2024-04-03 16:29:36.740Z created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 178
2024-04-03 16:29:36.741Z        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73

@cdesiniotis cdesiniotis added the needs-triage issue or PR has not been assigned a priority-px label label Apr 3, 2024
@cdesiniotis cdesiniotis added bug Issue/PR to expose/discuss/fix a bug and removed needs-triage issue or PR has not been assigned a priority-px label labels Apr 4, 2024
@Devin-Yue

How many workers are in the cluster?
From the error log, it looks similar to the large-scale issue.


ujjwal commented Apr 5, 2024

How many workers are in the cluster? From the error log, it looks similar to the large-scale issue.

About 400 GPUs

@cdesiniotis
Contributor

@ujjwal @age9990 @CecileRobertMichon thanks for reporting this issue. A fix for this has been merged and will be included in our next release: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/commit/59802314ef1bb947ff45978c9163db5b7c9f7e93

If you would like to test the fix out beforehand, you can use the gpu-operator image built from this commit: registry.gitlab.com/nvidia/kubernetes/gpu-operator/staging/gpu-operator:59802314-ubi8
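
For anyone curious about the general class of fix: the crash happens because shared map state is read by concurrently running event handlers while something else mutates it. In Go this is usually resolved either by finishing all writes to the shared map before concurrent readers start, or by guarding the map with a lock. A small, hedged sketch of the locking pattern (a generic illustration of the failure class, not the code in the linked commit) is:

```go
package main

import (
	"fmt"
	"sync"
)

// safeRegistry wraps a map with a sync.RWMutex so concurrent readers and a
// writer cannot trip Go's "concurrent map read and map write" fatal error.
// Hypothetical illustration only; not taken from the gpu-operator fix.
type safeRegistry struct {
	mu    sync.RWMutex
	types map[string]int
}

func (r *safeRegistry) set(name string, v int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.types[name] = v
}

func (r *safeRegistry) get(name string) (int, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	v, ok := r.types[name]
	return v, ok
}

func main() {
	r := &safeRegistry{types: map[string]int{}}
	var wg sync.WaitGroup

	// One writer updating the map under the write lock.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			r.set("key", i)
		}
	}()

	// Several readers; with a plain, unguarded map this mix would crash.
	for g := 0; g < 4; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				r.get("key")
			}
		}()
	}

	wg.Wait()
	fmt.Println("finished without a concurrent map fault")
}
```

The alternative approach, completing all shared-state registration before the controller manager starts its informers, avoids the lock entirely; which route the merged commit takes is best confirmed from the diff itself.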

mikemckiernan added a commit to NVIDIA/cloud-native-docs that referenced this issue Apr 30, 2024
@cdesiniotis
Contributor

Hi all -- GPU Operator 24.3.0 has been released and contains a fix for this issue.
https://github.com/NVIDIA/gpu-operator/releases/tag/v24.3.0

I am closing this issue, but please re-open it if you are still encountering this with 24.3.0.
