DAOS-14969 container: retry IV might cause deadlock #13632

Merged: 3 commits, Jan 21, 2024

Conversation

@wangdi1 wangdi1 commented Jan 18, 2024

OID IV entry lock might be required again for retry case.

Test-repeat: 10
Test-tag: test_daos_oid_allocator test_daos_management

Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate watchers.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is the master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

OID IV entry lock might be required again for retry
case.

Test-repeat: 10
Test-tag: test_daos_oid_allocator test_daos_management

Required-githooks: true

Signed-off-by: Di Wang <[email protected]>

github-actions bot commented Jan 18, 2024

Bug-tracker data:
Ticket title is 'daos_test/suite.py:DaosCoreTest.test_daos_oid_allocator - test timeout'
Status is 'In Review'
Labels: 'ci_impact,closed-master,pr_test'
https://daosio.atlassian.net/browse/DAOS-14969

liuxuezhao previously approved these changes Jan 19, 2024
-	ABT_mutex_lock(entry->lock);
+	rc = ABT_mutex_trylock(entry->lock);
+	/* For retry requests, from _iv_op(), the lock may not be released
+	 * in some cases.
Contributor

"Some cases": for example, oid_iv_ent_update() holds the lock and returns DER_IVCB_FORWARD, but expected oid_iv_ent_refresh() is not executed to release the lock? (for example timeout and retry, or pm ver/IV_tree changed).

It may be better not to hold the lock across network operations, if possible.

Contributor Author

> but expected oid_iv_ent_refresh() is not executed to release the lock? (for example timeout and retry, or pm ver/IV_tree changed).

If the RPC to its parent is sent, then refresh will be called in any case. But if the failure happens before the RPC is sent, for example if get_parent() fails with -1036, then refresh will not be called before the next retry. That is why it caused the deadlock here.

Yeah, removing this lock (serialization) might need some complicated changes. Yes, you could replace the lock with a flag, etc., but that is no different from changing this to trylock, since it is not a recursive lock anyway.
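The failure mode discussed above can be illustrated with a small sketch. It is not the code from this PR: it uses POSIX `pthread_mutex_trylock()` as a stand-in for `ABT_mutex_trylock()`, and the names `demo_entry`, `demo_entry_update()`, `demo_entry_refresh()`, and the `holder` field are hypothetical. It assumes callers are scheduled cooperatively, so `holder` is never read and written concurrently. The update path keeps the lock held on return; if the same request retries before refresh has released the lock, trylock reports "busy" instead of blocking forever on a non-recursive lock.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for an IV entry guarded by a non-recursive lock. */
struct demo_entry {
	pthread_mutex_t lock;
	const void     *holder; /* request that currently holds the lock, or NULL */
};

static void demo_entry_init(struct demo_entry *e)
{
	pthread_mutex_init(&e->lock, NULL);
	e->holder = NULL;
}

/*
 * Update path: acquire the lock and leave it held on return, to be
 * released later by demo_entry_refresh(). A plain lock call would block
 * forever if the same request retried while still holding the lock, so
 * trylock is used and the "already held by this request" case is accepted.
 * Assumption: cooperative scheduling, so reading holder here is not racy.
 * Returns 0 on success, -EBUSY if another request holds the lock.
 */
static int demo_entry_update(struct demo_entry *e, const void *req)
{
	int rc = pthread_mutex_trylock(&e->lock);

	if (rc == 0) {
		e->holder = req;   /* lock newly acquired */
		return 0;
	}
	if (rc == EBUSY && e->holder == req)
		return 0;          /* retry: lock is still held from the earlier attempt */
	return -rc;                /* someone else holds it; caller should back off */
}

/* Refresh path: release the lock taken by the update path. */
static void demo_entry_refresh(struct demo_entry *e)
{
	e->holder = NULL;
	pthread_mutex_unlock(&e->lock);
}
```

With this shape, a request that calls `demo_entry_update()` twice without an intervening `demo_entry_refresh()` gets 0 both times rather than hanging, while a different request gets `-EBUSY` until the refresh runs. As noted in the thread, a flag could serve the same purpose as the lock here.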

@daosbuild1
Collaborator

Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-13632/2/display/redirect

Fix debug information

Required-githooks: true

Signed-off-by: Di Wang <[email protected]>
@@ -1070,7 +1070,7 @@ _iv_op(struct ds_iv_ns *ns, struct ds_iv_key *key, d_sg_list_t *value,
  * but in-flight fetch request return IVCB_FORWARD, then queued RPC will
  * reply IVCB_FORWARD.
  */
-D_WARN("ns %u retry for class %d opc %d rank %u/%u: " DF_RC "\n", ns->iv_ns_id,
+D_INFO("ns %u retry for class %d opc %d rank %u/%u: " DF_RC "\n", ns->iv_ns_id,
Contributor

@mchaarawi mchaarawi Jan 19, 2024

INFO does not sound appropriate for this message?
From the object layer, I can see most of those are debug messages.

Contributor Author

Yeah, mostly this will help us figure out what happened, since debug is not always enabled when something happens.

Contributor

I understand that, but this will be in the production server log. So far, the retry messages that I could see are debug, so maybe just the test needs to be updated to enable debug logs for this component?
Anyway, I am not requesting a change for now, just asking for the reasoning behind this, which sounds like it is only for testing purposes?

Contributor Author

Oh, this change is only to satisfy the CI failure injection tests. If I do not change it to INFO, the PR will generate extra WARNINGs during the test, which will cause test failures.

On the other hand, this seems to make more sense as an INFO than as a warning, since it is about running status in some corner cases.

Contributor

Well, actually I got confused, sorry. This was originally WARN, so your change is actually making this better. I still think it should be debug, but that can be done later.

@mchaarawi mchaarawi merged commit 1946ef3 into master Jan 21, 2024
33 of 34 checks passed
@mchaarawi mchaarawi deleted the wangdi/daos_14969 branch January 21, 2024 16:21
daltonbohning added a commit that referenced this pull request Apr 24, 2024
This reverts commit 1946ef3.

Conflicts:
  src/engine/server_iv.c

Required-githooks: true