Ignore repair_run_by_cluster_v2 rows with no corresponding repair #1478
Conversation
When bad things (like Reaper filling the drive with logs) happen, it's possible to end up with repair_run_by_cluster_v2 rows with no corresponding repair row, which breaks Reaper. So, just skip them.
No linked issues found. Please add the corresponding issues in the pull request description.
I'm not actually sure if anything will clean up the repair_run_by_cluster_v2 rows, but I'm not sure it's a problem. The mismatch between repair_run and repair_run_by_cluster_v2 is frequent enough to be a pain (basically, a percentage of the time when anything goes wrong with the cluster and Reaper logs until the drive fills), but not often enough to generate measurable load.
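To make the failure mode concrete, here's a minimal stand-alone sketch (not Reaper's actual code; the table is modeled as a plain `Map` and the names are hypothetical): when an id from `repair_run_by_cluster_v2` has no matching `repair_run` row, the lookup yields `null`, and any downstream dereference throws the NPE.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.stream.Collectors;

public class OrphanedRowNpe {

  // Stand-in for the repair_run table: an orphaned id has no row here.
  static final Map<UUID, String> REPAIR_RUN = new HashMap<>();

  // Naive listing: map each id straight to its repair_run row, then use it.
  // Returns true if an orphaned id caused an NPE.
  static boolean triggersNpe(List<UUID> idsFromByClusterTable) {
    try {
      idsFromByClusterTable.stream()
          .map(REPAIR_RUN::get)   // null for an orphaned id
          .map(String::length)    // dereferencing that null throws
          .collect(Collectors.toList());
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    UUID orphaned = UUID.randomUUID(); // present only in repair_run_by_cluster_v2
    System.out.println(triggersNpe(List.of(orphaned))); // prints true
  }
}
```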
src/server/src/main/java/io/cassandrareaper/storage/repairrun/CassandraRepairRunDao.java
Per the comment, rewrote it to stick a filter in between instead of shoving a ternary into the map. Cleaner and easier to read.
Looks like there's an issue for this already as well: #1463
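The shape of the refactor can be sketched like this (a minimal illustration under assumed names; `runsForCluster` and the `Map`-backed table are stand-ins, not Reaper's actual API): a `filter` between the two `map` stages lets orphaned ids simply drop out of the stream instead of threading a ternary through the mapping function.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.UUID;
import java.util.stream.Collectors;

public class SkipOrphanedRuns {

  // Stand-in for repair_run: only one of the two ids below has a row.
  static final UUID PRESENT = UUID.fromString("00000000-0000-0000-0000-000000000001");
  static final UUID ORPHANED = UUID.fromString("00000000-0000-0000-0000-000000000002");
  static final Map<UUID, String> REPAIR_RUN = Map.of(PRESENT, "run-1");

  // Filter between the maps instead of a ternary inside map():
  // ids with no repair_run row are skipped, so nothing downstream sees null.
  static List<String> runsForCluster(List<UUID> idsFromByClusterTable) {
    return idsFromByClusterTable.stream()
        .map(REPAIR_RUN::get)        // null for repair_run_by_cluster_v2 orphans
        .filter(Objects::nonNull)    // skip them
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(runsForCluster(List.of(PRESENT, ORPHANED))); // prints [run-1]
  }
}
```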
I'm looking at this, having some issues with my local setup as far as testing goes, so bear with me while I try to work around them.
In terms of testing, I've followed the procedure below.

Try to repro the issue from main by:
- First commenting out the `testSSLHotReload` test (which is a problem to run locally), then building the repo into a Docker image, `kind load`ing it into a `kind` cluster, and spinning up a K8ssandraCluster with the new Reaper image.
- Creating a new repair via the UI.
- Confirming the repair exists in both `repair_run_by_cluster_v2` and `repair_run`.
- Deleting from `repair_run` via `delete from repair_run where id = d03c24a0-e59e-11ee-9a67-5dd395ae716d`.
When reloading the UI I then see:

Which I think is what we want based on the error repro'd here.
I then go and rebuild the image and re-deploy everything using the branch from this PR. Following the same procedure, I no longer see the NPE visible in Reaper's logs. Instead, the UI simply returns no results for repair runs in the DONE or RUNNING state.
I'm going to chat to @adejanovski about whether we need some more work to ensure that repair runs are purged from repair_run_by_cluster_v2, and whether this data inconsistency should be logged as a warning. But I'll approve this for now on the basis that it eliminates the NPE.
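If the inconsistency does end up being logged as a warning, one possible shape is to move the existence check into the filter predicate and warn before dropping the id. This is only a sketch under assumed names (`runsForCluster`, a `Map`-backed stand-in for the table), not Reaper's actual code:

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.logging.Logger;
import java.util.stream.Collectors;

public class WarnOnOrphanedRuns {

  private static final Logger LOG = Logger.getLogger(WarnOnOrphanedRuns.class.getName());

  // Stand-in for repair_run; real code would query Cassandra.
  static final UUID PRESENT = UUID.fromString("00000000-0000-0000-0000-000000000001");
  static final Map<UUID, String> REPAIR_RUN = Map.of(PRESENT, "run-1");

  // Same skip-the-orphans filter, but the repair_run /
  // repair_run_by_cluster_v2 mismatch is made visible in the logs.
  static List<String> runsForCluster(List<UUID> idsFromByClusterTable) {
    return idsFromByClusterTable.stream()
        .filter(id -> {
          boolean exists = REPAIR_RUN.containsKey(id);
          if (!exists) {
            LOG.warning("repair_run_by_cluster_v2 row " + id
                + " has no matching repair_run; skipping");
          }
          return exists;
        })
        .map(REPAIR_RUN::get)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    UUID orphaned = UUID.fromString("00000000-0000-0000-0000-000000000002");
    System.out.println(runsForCluster(List.of(PRESENT, orphaned))); // prints [run-1]
  }
}
```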