-
Notifications
You must be signed in to change notification settings - Fork 308
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-16876 vos: Improve error reporting on corruption crash #15747
Conversation
If the records referenced by a DTX have been removed from the tree, the data will no longer match. We can't detect this in all cases but we can avoid overwriting when we do detect it. Signed-off-by: Jeff Olivier <[email protected]>
Ticket title is 'LRZ: m02r01s07dao engine coredumps with vos EMRG src/vos/ilog.c:411 ilog_open() Assertion' |
Signed-off-by: Jeff Olivier <[email protected]>
Signed-off-by: Jeff Olivier <[email protected]>
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15747/3/testReport/ |
Test stage Unit Test bdev on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15747/3/testReport/ |
Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15747/3/testReport/ |
Test stage Unit Test bdev with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15747/3/testReport/ |
Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15747/3/testReport/ |
DP_DTI(&DAE_XID(dae)), dtx_umoff_flag2type(rec)); | ||
d_tm_inc_counter(vos_tls_get(false)->vtl_dtx_rec_missing, 1); | ||
rc = 0; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The user may miss the warning message. As we discussed, there is risk for silent data corruption. So when we hit recognisable corruption, I would suggest to trigger assertion that will make the user to be aware of potential data corruption in stead of make the system to go ahead without explicitly blocking the system.
On the other hand, in your current patch, if we do not access some corrupted DTX entry, the corruption will be always there, longer time then more dangerous since related area maybe reused by others as time going.
We can move your patch logic into some new DDB functionality to repair the dangling DTX records via removing them. Since it is inconvenient to directly access "epoch" via DTX record reference, then we can do that as your patch does with checking lid+magic (if have). That is some kind of "quick mode" or "quick scan".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In our case, we would monitor the metric and alert someone that it happened. I still don't think quick scan buys us anything. A full scan would be able to find all of the missing entries but a quick scan can only find entries like the ones in this patch.
I think maybe having this behavior be configurable would be better. Until we have full scan, this patch is really all we can really do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the main concern for this patch is that we passively handle the corrupted DTX entries when we access them, that may leave more windows as to more risks of data corruption. @janekmi is working on the "quick scan" via ddb as offline mode. Maybe we need both (part).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seem since we can't be 100% sure the root cause of the missing records we can't actually tell the user everything is ok. TBH "some corruption" sounds really bad and this, in my understanding, is the state of the system since this discrepancies are being discovered. It seems not fit to continue normal operation.
IMHO we should not automatically decide to continue despite the DTX transaction is obviously incomplete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, I tend to agree with @janekmi and @Nasf-Fan .
- If we don't know exactly the root case, It's not safe to keep engine running, otherwise, more severe data corruption could be introduced;
- If we know the root case for sure, we should fix the bug;
If the original assert can only be triggered by bug (it was a proper assert), it looks to me there is no way of getting round of it (I don't think software is able to gracefully handle it's own bug).
src/vos/vos_obj_index.c
Outdated
@@ -321,6 +322,7 @@ vos_oi_find_alloc(struct vos_container *cont, daos_unit_oid_t oid, | |||
if (log) { | |||
vos_ilog_desc_cbs_init(&cbs, vos_cont2hdl(cont)); | |||
rc = ilog_open(vos_cont2umm(cont), &obj->vo_ilog, &cbs, dth == NULL, &loh); | |||
D_ASSERTF(rc == -DER_NONEXIST, "Uncorrectable incarnation log corruption detected"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rc != -DER_NONEXIST
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right, I fixed the other one when a unit test failed
Signed-off-by: Jeff Olivier <[email protected]>
Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15747/4/testReport/ |
Signed-off-by: Jeff Olivier <[email protected]>
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15747/5/execution/node/1199/log |
@@ -207,6 +208,7 @@ class TelemetryUtils(): | |||
_gen_stats_metrics("engine_io_dtx_committable") | |||
ENGINE_IO_DTX_COMMITTED_METRICS = \ | |||
_gen_stats_metrics("engine_io_dtx_committed") | |||
ENGINE_IO_DTX_RECORD_MISSING_METRICS=["engine_io_dtx_record_missing"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
E225 missing whitespace around operator
ENGINE_IO_DTX_RECORD_MISSING_METRICS=["engine_io_dtx_record_missing"] | |
ENGINE_IO_DTX_RECORD_MISSING_METRICS = ["engine_io_dtx_record_missing"] |
@@ -310,6 +312,7 @@ class TelemetryUtils(): | |||
ENGINE_IO_METRICS = ENGINE_IO_DTX_ASYNC_CMT_LAT_METRICS +\ | |||
ENGINE_IO_DTX_COMMITTABLE_METRICS +\ | |||
ENGINE_IO_DTX_COMMITTED_METRICS +\ | |||
ENGINE_IO_DTX_RECORD_MISSING_METRIC +\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
undefined-variable, Undefined variable 'ENGINE_IO_DTX_RECORD_MISSING_METRIC'
Missing a list definition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missed the S at the end
Signed-off-by: Jeff Olivier <[email protected]>
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15747/6/execution/node/1477/log |
src/vos/vos_dtx.c
Outdated
@@ -1,5 +1,6 @@ | |||
/** | |||
/**RY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please clean it up.
DP_DTI(&DAE_XID(dae)), dtx_umoff_flag2type(rec)); | ||
d_tm_inc_counter(vos_tls_get(false)->vtl_dtx_rec_missing, 1); | ||
rc = 0; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seem since we can't be 100% sure the root cause of the missing records we can't actually tell the user everything is ok. TBH "some corruption" sounds really bad and this, in my understanding, is the state of the system since this discrepancies are being discovered. It seems not fit to continue normal operation.
IMHO we should not automatically decide to continue despite the DTX transaction is obviously incomplete.
src/vos/vos_dtx.c
Outdated
D_WARN("record no longer exists, may indicate some corruption: " DF_DTI | ||
" type %u\n", | ||
DP_DTI(&DAE_XID(dae)), dtx_umoff_flag2type(rec)); | ||
d_tm_inc_counter(vos_tls_get(false)->vtl_dtx_rec_missing, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there anything the user can do with this counter? How many potential corruptions the user can accept before felt not comfortable storing data in the system?
It seems pointless for the end user. Did you envision it as a feature for developers to track the issue when the system is online?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will probably remove the metric since we decided abort is better but, in general, metrics are for admins. We use them all the time to try to diagnose issues with running systems.
src/vos/ilog.c
Outdated
@@ -1213,7 +1221,7 @@ ilog_fetch(struct umem_instance *umm, struct ilog_df *root_df, | |||
int rc = 0; | |||
bool retry; | |||
|
|||
ILOG_ASSERT_VALID(root_df); | |||
ILOG_CHECK_VALID(root_df); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why just a warning in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I think we have to assert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will assert in the caller
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nevermind, not sure why I did that.
@@ -571,7 +572,7 @@ dtx_ilog_rec_release(struct umem_instance *umm, struct vos_container *cont, | |||
|
|||
ilog_close(loh); | |||
|
|||
if (rc != 0) | |||
if (rc != 0 && rc != -DER_NONEXIST) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the DTX transaction still can't be committed nor aborted in this case, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is good point. Considering the scenarios of generating the dangling DTX reference: the modification was made against some akey, but before its committed/aborted, related dkey was removed; or the modification was made against some dkey, but related object was removed. So under these scenarios, the subsequent modifications under related {d,a}key became meaningless, but we may cannot detect all dangling DTX records if there is no magic information (such as for SV) or "lid" maybe reused. So once we detect and discard the dangling DTX record for ilog, then all the subsequent modification(s) after such ilog DTX record should be discarded (until another valid ilog DTX record was found for distributed transaction case in the future).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it too late to check DER_NONEXIST here? It should have returned error on ilog_open() above, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see why we want to avoid error messages on DER_NONEXIST errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it too late to check DER_NONEXIST here? It should have returned error on ilog_open() above, right?
The caller - do_dtx_rec_release()
will print the log message.
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15747/12/testReport/ |
Signed-off-by: Jeff Olivier <[email protected]>
Allow-unstable-test: true Signed-off-by: Jeff Olivier <[email protected]>
When commit or abort a DTX, we will check whether it is a valid entry or not. For invalid case, we will discard it with warning message and increase related metrics counter. It may be not perfect solution, but it is efficient to help the user to cleanup system efficiently. Signed-off-by: Jeff Olivier <[email protected]> Signed-off-by: Fan Yong <[email protected]>
Signed-off-by: Jeff Olivier <[email protected]> Signed-off-by: Fan Yong <[email protected]>
Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15747/18/testReport/ |
Signed-off-by: Fan Yong <[email protected]>
Signed-off-by: Jeff Olivier <[email protected]> Signed-off-by: Fan Yong <[email protected]>
Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15747/20/display/redirect |
Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15747/20/display/redirect |
Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15747/20/display/redirect |
Test stage Unit Test on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15747/20/display/redirect |
Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15747/20/display/redirect |
Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15747/21/display/redirect |
Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15747/21/display/redirect |
Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15747/21/display/redirect |
Ping reviewers, Thanks! |
Ping @NiuYawei , thanks! |
@@ -363,7 +366,8 @@ ilog_open(struct umem_instance *umm, struct ilog_df *root, | |||
struct ilog_context *lctx; | |||
int rc; | |||
|
|||
ILOG_ASSERT_VALID(root); | |||
if (!ILOG_CHECK_VALID(root)) | |||
return -DER_NONEXIST; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to add an error message.
(opc == ILOG_OP_PERSIST || opc == ILOG_OP_ABORT)) { | ||
if (rc == 0 && opc != ILOG_OP_UPDATE) { | ||
if (version == ilog_mag2ver(lctx->ic_root->lr_magic)) { | ||
D_WARN("ilog entry on %s doesn't exist\n", opc_str[opc]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
D_ERROR() instead of D_WARN().
@@ -571,7 +572,7 @@ dtx_ilog_rec_release(struct umem_instance *umm, struct vos_container *cont, | |||
|
|||
ilog_close(loh); | |||
|
|||
if (rc != 0) | |||
if (rc != 0 && rc != -DER_NONEXIST) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see why we want to avoid error messages on DER_NONEXIST errors.
If the records referenced by a DTX have been removed from the tree, the data will no longer match.
We can't detect this in all cases but we can avoid overwriting when we do detect it.
Before requesting gatekeeper:
Features:
(orTest-tag*
) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.Gatekeeper: