Document how to troubleshoot a crash loop backoff in Kubernetes (#322)
Co-authored-by: Rafal Korepta <[email protected]>
Co-authored-by: Joyce Fee <[email protected]>
//end::deployment-retries-exhausted[]

//tag::crashloopbackoff[]
=== Crash loop backoffs

If a broker crashes after startup, or gets stuck in a crash loop, it can accumulate progressively more stored state, which consumes additional disk space and makes each restart take longer to process.

To prevent infinite crash loops, the Redpanda Helm chart sets the `crash_loop_limit` node property to 5. The crash loop limit is the maximum number of consecutive crashes allowed within one hour of each other. After Redpanda reaches this limit, it does not start again until its internal consecutive crash counter is reset to zero. In Kubernetes, the Pod running Redpanda remains in a `CrashLoopBackOff` state until that counter is reset.
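If the default limit is too strict for your environment, you can raise it through the Helm chart. The following is a minimal sketch, assuming your release is named `redpanda`, the chart comes from the `redpanda/redpanda` repository, and your chart version passes properties set under `config.node` through to `redpanda.yaml`:

[,bash]
----
# Raise the consecutive-crash limit from the chart default of 5 to 10.
# Changing a property under config.node also triggers a configuration update.
helm upgrade redpanda redpanda/redpanda --namespace <namespace> \
  --reuse-values \
  --set config.node.crash_loop_limit=10
----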

To troubleshoot a crash loop backoff:

. Check the Redpanda logs from the most recent crashes:
+
[,bash]
----
kubectl logs <pod-name> --namespace <namespace>
----
+
NOTE: Kubernetes retains logs only for the current and the previous instance of a container. This limitation makes it difficult to access logs from earlier crashes, which may contain vital clues about the root cause of the issue. Given these log retention limitations, setting up a centralized logging system is crucial. Systems such as https://grafana.com/docs/loki/latest/[Loki] or https://www.datadoghq.com/product/log-management/[Datadog] can capture and store logs from all containers, ensuring you have access to historical data.
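+
To view the logs from the previous (crashed) instance of the container, add the `--previous` flag:
+
[,bash]
----
# Logs from the last terminated container in the Pod
kubectl logs <pod-name> --namespace <namespace> --previous
----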

. Resolve the issue that led to the crash loop backoff.

. Reset the crash counter to zero to allow Redpanda to restart. You can do any of the following to reset the counter:
+
- Update the `redpanda.yaml` configuration file. You can make changes to any of the following sections in the Redpanda Helm chart to trigger an update:
* `config.cluster`
* `config.node`
* `config.tunable`

- Delete the `startup_log` file in the broker's data directory.
+
[,bash]
----
kubectl exec <pod-name> --namespace <namespace> -- rm /var/lib/redpanda/data/startup_log
----
+
NOTE: It might be challenging to execute this command within a Pod that is in a `CrashLoopBackOff` state because of the limited time during which the Pod is available before it restarts. Wrapping the command in a loop might work, as shown in the sketch below.
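+
A minimal sketch of such a loop, which retries the deletion until one attempt succeeds during a window in which the container is running (adjust the data directory path if your deployment mounts it elsewhere):
+
[,bash]
----
# Keep retrying until the file is removed during one of the short
# windows in which the crashing container is up
until kubectl exec <pod-name> --namespace <namespace> -- rm /var/lib/redpanda/data/startup_log; do
  sleep 1
done
----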

- Wait one hour since the last crash. The crash counter resets after one hour.

To avoid future crash loop backoffs and manage the accumulation of small segments effectively:

* xref:manage:kubernetes/monitoring/k-monitor-redpanda.adoc[Monitor] the size and number of segments regularly.
* Optimize your Redpanda configuration for segment management.
* Consider implementing xref:manage:kubernetes/storage/tiered-storage/k-tiered-storage.adoc[Tiered Storage] to manage data more efficiently.
//end::crashloopbackoff[]

//tag::deployment-pod-pending[]
=== StatefulSet never rolls out
