Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-16876 vos: Improve error reporting on corruption crash #15747

Merged
merged 21 commits into from
Feb 19, 2025

Conversation

jolivier23
Copy link
Contributor

@jolivier23 jolivier23 commented Jan 20, 2025

If the records referenced by a DTX have been removed from the tree, the data will no longer match.
We can't detect this in all cases but we can avoid overwriting when we do detect it.

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follows the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

If the records referenced by a DTX have been removed from the
tree, the data will no longer match. We can't detect this in
all cases but we can avoid overwriting when we do detect it.

Signed-off-by: Jeff Olivier <[email protected]>
@jolivier23 jolivier23 requested review from a team as code owners January 20, 2025 04:26
Copy link

github-actions bot commented Jan 20, 2025

Ticket title is 'LRZ: m02r01s07dao engine coredumps with vos EMRG src/vos/ilog.c:411 ilog_open() Assertion'
Status is 'Awaiting Verification'
Labels: 'ALCF,LRZ,alcf_track,google-cloud-daos'
Job should run at elevated priority (1)
https://daosio.atlassian.net/browse/DAOS-16876

@github-actions github-actions bot added the priority Ticket has high priority (automatically managed) label Jan 20, 2025
Signed-off-by: Jeff Olivier <[email protected]>
@daosbuild1
Copy link
Collaborator

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15747/3/testReport/

@daosbuild1
Copy link
Collaborator

Test stage Unit Test bdev on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15747/3/testReport/

@daosbuild1
Copy link
Collaborator

Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15747/3/testReport/

@daosbuild1
Copy link
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15747/3/testReport/

@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15747/3/testReport/

DP_DTI(&DAE_XID(dae)), dtx_umoff_flag2type(rec));
d_tm_inc_counter(vos_tls_get(false)->vtl_dtx_rec_missing, 1);
rc = 0;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user may miss the warning message. As we discussed, there is risk for silent data corruption. So when we hit recognisable corruption, I would suggest to trigger assertion that will make the user to be aware of potential data corruption in stead of make the system to go ahead without explicitly blocking the system.

On the other hand, in your current patch, if we do not access some corrupted DTX entry, the corruption will be always there, longer time then more dangerous since related area maybe reused by others as time going.

We can move your patch logic into some new DDB functionality to repair the dangling DTX records via removing them. Since it is inconvenient to directly access "epoch" via DTX record reference, then we can do that as your patch does with checking lid+magic (if have). That is some kind of "quick mode" or "quick scan".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our case, we would monitor the metric and alert someone that it happened. I still don't think quick scan buys us anything. A full scan would be able to find all of the missing entries but a quick scan can only find entries like the ones in this patch.

I think maybe having this behavior be configurable would be better. Until we have full scan, this patch is really all we can really do

Copy link
Contributor

@Nasf-Fan Nasf-Fan Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the main concern for this patch is that we passively handle the corrupted DTX entries when we access them, that may leave more windows as to more risks of data corruption. @janekmi is working on the "quick scan" via ddb as offline mode. Maybe we need both (part).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seem since we can't be 100% sure the root cause of the missing records we can't actually tell the user everything is ok. TBH "some corruption" sounds really bad and this, in my understanding, is the state of the system since this discrepancies are being discovered. It seems not fit to continue normal operation.

IMHO we should not automatically decide to continue despite the DTX transaction is obviously incomplete.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I tend to agree with @janekmi and @Nasf-Fan .

  • If we don't know exactly the root case, It's not safe to keep engine running, otherwise, more severe data corruption could be introduced;
  • If we know the root case for sure, we should fix the bug;

If the original assert can only be triggered by bug (it was a proper assert), it looks to me there is no way of getting round of it (I don't think software is able to gracefully handle it's own bug).

@@ -321,6 +322,7 @@ vos_oi_find_alloc(struct vos_container *cont, daos_unit_oid_t oid,
if (log) {
vos_ilog_desc_cbs_init(&cbs, vos_cont2hdl(cont));
rc = ilog_open(vos_cont2umm(cont), &obj->vo_ilog, &cbs, dth == NULL, &loh);
D_ASSERTF(rc == -DER_NONEXIST, "Uncorrectable incarnation log corruption detected");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rc != -DER_NONEXIST

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right, I fixed the other one when a unit test failed

@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15747/4/testReport/

Signed-off-by: Jeff Olivier <[email protected]>
@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15747/5/execution/node/1199/log

@@ -207,6 +208,7 @@ class TelemetryUtils():
_gen_stats_metrics("engine_io_dtx_committable")
ENGINE_IO_DTX_COMMITTED_METRICS = \
_gen_stats_metrics("engine_io_dtx_committed")
ENGINE_IO_DTX_RECORD_MISSING_METRICS=["engine_io_dtx_record_missing"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E225 missing whitespace around operator

Suggested change
ENGINE_IO_DTX_RECORD_MISSING_METRICS=["engine_io_dtx_record_missing"]
ENGINE_IO_DTX_RECORD_MISSING_METRICS = ["engine_io_dtx_record_missing"]

@@ -310,6 +312,7 @@ class TelemetryUtils():
ENGINE_IO_METRICS = ENGINE_IO_DTX_ASYNC_CMT_LAT_METRICS +\
ENGINE_IO_DTX_COMMITTABLE_METRICS +\
ENGINE_IO_DTX_COMMITTED_METRICS +\
ENGINE_IO_DTX_RECORD_MISSING_METRIC +\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

undefined-variable, Undefined variable 'ENGINE_IO_DTX_RECORD_MISSING_METRIC'

Missing a list definition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missed the S at the end

Signed-off-by: Jeff Olivier <[email protected]>
@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15747/6/execution/node/1477/log

@@ -1,5 +1,6 @@
/**
/**RY
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please clean it up.

DP_DTI(&DAE_XID(dae)), dtx_umoff_flag2type(rec));
d_tm_inc_counter(vos_tls_get(false)->vtl_dtx_rec_missing, 1);
rc = 0;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seem since we can't be 100% sure the root cause of the missing records we can't actually tell the user everything is ok. TBH "some corruption" sounds really bad and this, in my understanding, is the state of the system since this discrepancies are being discovered. It seems not fit to continue normal operation.

IMHO we should not automatically decide to continue despite the DTX transaction is obviously incomplete.

D_WARN("record no longer exists, may indicate some corruption: " DF_DTI
" type %u\n",
DP_DTI(&DAE_XID(dae)), dtx_umoff_flag2type(rec));
d_tm_inc_counter(vos_tls_get(false)->vtl_dtx_rec_missing, 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything the user can do with this counter? How many potential corruptions the user can accept before felt not comfortable storing data in the system?

It seems pointless for the end user. Did you envision it as a feature for developers to track the issue when the system is online?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will probably remove the metric since we decided abort is better but, in general, metrics are for admins. We use them all the time to try to diagnose issues with running systems.

src/vos/ilog.c Outdated
@@ -1213,7 +1221,7 @@ ilog_fetch(struct umem_instance *umm, struct ilog_df *root_df,
int rc = 0;
bool retry;

ILOG_ASSERT_VALID(root_df);
ILOG_CHECK_VALID(root_df);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why just a warning in this case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I think we have to assert.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will assert in the caller

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nevermind, not sure why I did that.

@@ -571,7 +572,7 @@ dtx_ilog_rec_release(struct umem_instance *umm, struct vos_container *cont,

ilog_close(loh);

if (rc != 0)
if (rc != 0 && rc != -DER_NONEXIST)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the DTX transaction still can't be committed nor aborted in this case, right?

Copy link
Contributor

@Nasf-Fan Nasf-Fan Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is good point. Considering the scenarios of generating the dangling DTX reference: the modification was made against some akey, but before its committed/aborted, related dkey was removed; or the modification was made against some dkey, but related object was removed. So under these scenarios, the subsequent modifications under related {d,a}key became meaningless, but we may cannot detect all dangling DTX records if there is no magic information (such as for SV) or "lid" maybe reused. So once we detect and discard the dangling DTX record for ilog, then all the subsequent modification(s) after such ilog DTX record should be discarded (until another valid ilog DTX record was found for distributed transaction case in the future).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it too late to check DER_NONEXIST here? It should have returned error on ilog_open() above, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why we want to avoid error messages on DER_NONEXIST errors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it too late to check DER_NONEXIST here? It should have returned error on ilog_open() above, right?

The caller - do_dtx_rec_release() will print the log message.

@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15747/12/testReport/

Signed-off-by: Jeff Olivier <[email protected]>
Allow-unstable-test: true

Signed-off-by: Jeff Olivier <[email protected]>
@jolivier23 jolivier23 requested a review from NiuYawei February 12, 2025 15:54
@jolivier23
Copy link
Contributor Author

@Nasf-Fan @janekmi can you have another look?

When commit or abort a DTX, we will check whether it is a valid
entry or not. For invalid case, we will discard it with warning
message and increase related metrics counter.

It may be not perfect solution, but it is efficient to help the
user to cleanup system efficiently.

Signed-off-by: Jeff Olivier <[email protected]>
Signed-off-by: Fan Yong <[email protected]>
Signed-off-by: Jeff Olivier <[email protected]>
Signed-off-by: Fan Yong <[email protected]>
@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15747/18/testReport/

@daosbuild1
Copy link
Collaborator

Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15747/20/display/redirect

@daosbuild1
Copy link
Collaborator

Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15747/20/display/redirect

@daosbuild1
Copy link
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15747/20/display/redirect

@daosbuild1
Copy link
Collaborator

Test stage Unit Test on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15747/20/display/redirect

@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15747/20/display/redirect

@daosbuild1
Copy link
Collaborator

Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15747/21/display/redirect

@daosbuild1
Copy link
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15747/21/display/redirect

@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15747/21/display/redirect

@Nasf-Fan
Copy link
Contributor

Ping reviewers, Thanks!

@Nasf-Fan
Copy link
Contributor

Ping @NiuYawei , thanks!

@@ -363,7 +366,8 @@ ilog_open(struct umem_instance *umm, struct ilog_df *root,
struct ilog_context *lctx;
int rc;

ILOG_ASSERT_VALID(root);
if (!ILOG_CHECK_VALID(root))
return -DER_NONEXIST;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to add an error message.

(opc == ILOG_OP_PERSIST || opc == ILOG_OP_ABORT)) {
if (rc == 0 && opc != ILOG_OP_UPDATE) {
if (version == ilog_mag2ver(lctx->ic_root->lr_magic)) {
D_WARN("ilog entry on %s doesn't exist\n", opc_str[opc]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

D_ERROR() instead of D_WARN().

@@ -571,7 +572,7 @@ dtx_ilog_rec_release(struct umem_instance *umm, struct vos_container *cont,

ilog_close(loh);

if (rc != 0)
if (rc != 0 && rc != -DER_NONEXIST)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why we want to avoid error messages on DER_NONEXIST errors.

@Nasf-Fan Nasf-Fan requested review from a team and Nasf-Fan and removed request for Nasf-Fan February 17, 2025 02:21
@jolivier23 jolivier23 merged commit e98c5ac into master Feb 19, 2025
56 of 57 checks passed
@jolivier23 jolivier23 deleted the jvolivie/corrupt branch February 19, 2025 00:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority Ticket has high priority (automatically managed)
Development

Successfully merging this pull request may close these issues.

7 participants