DAOS-14969 container: retry IV might cause deadlock #13632
OID IV entry lock might be required again for retry case.

Test-repeat: 10
Test-tag: test_daos_oid_allocator test_daos_management
Required-githooks: true

Signed-off-by: Di Wang <[email protected]>
Bug-tracker data: |
- ABT_mutex_lock(entry->lock);
+ rc = ABT_mutex_trylock(entry->lock);
+ /* For retry requests, from _iv_op(), the lock may not be released
+  * in some cases.
By "some cases", do you mean, for example, that oid_iv_ent_update() holds the lock and returns DER_IVCB_FORWARD, but the expected oid_iv_ent_refresh() is never executed to release it (for example on timeout and retry, or when the pm ver/IV_tree changed)?
It may be better not to hold the lock across a network operation, if possible.
> but expected oid_iv_ent_refresh() is not executed to release the lock? (for example timeout and retry, or pm ver/IV_tree changed).

If the RPC to its parent is sent, then refresh will be called in any case. But if the failure happens before the RPC is sent, for example when get_parent() fails with -1036, then refresh will not be called before the next retry. That is why it caused the deadlock here.
Yeah, removing this lock (serialization) might need some complicated changes. You could replace the lock with a flag, etc., but that is no different from changing this to trylock, since it is not a recursive lock anyway.
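To make the retry scenario above concrete, here is a minimal sketch of the pattern, not the DAOS code. It assumes a POSIX `pthread_mutex_t` as a stand-in for the Argobots mutex, and the names `demo_entry`, `demo_update_acquire`, and `demo_refresh_release` are hypothetical. The first update attempt takes the lock; if it fails before any RPC is sent, no refresh runs, so the retry finds the lock still held. A blocking lock would hang at that point, while trylock reports the lock as busy and lets the retry continue.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for an IV entry; not the DAOS structure. */
struct demo_entry {
	pthread_mutex_t lock;
	bool            held; /* bookkeeping for the demo only */
};

static int
demo_entry_init(struct demo_entry *e)
{
	e->held = false;
	return pthread_mutex_init(&e->lock, NULL);
}

/*
 * Update path. Returns 0 if the lock was newly acquired, 1 if it was
 * already held (assumed here to be left over from an earlier attempt
 * that never reached refresh), or a negative errno value on error.
 */
static int
demo_update_acquire(struct demo_entry *e)
{
	int rc = pthread_mutex_trylock(&e->lock);

	if (rc == 0) {
		e->held = true;
		return 0;
	}
	if (rc == EBUSY)
		return 1; /* a blocking lock would deadlock here */
	return -rc;
}

/* Refresh path: releases the lock taken by the update path. */
static void
demo_refresh_release(struct demo_entry *e)
{
	e->held = false;
	pthread_mutex_unlock(&e->lock);
}
```

Calling `demo_update_acquire()` twice in a row without an intervening `demo_refresh_release()` models the failed-before-RPC retry: the second call returns 1 instead of blocking. The sketch cannot tell a leftover lock from one held by a different caller, so it only holds under the assumption that requests on an entry are otherwise serialized.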
Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-13632/2/display/redirect

Fix debug information

Required-githooks: true

Signed-off-by: Di Wang <[email protected]>
Required-githooks: true
2313202 to 3c33102
@@ -1070,7 +1070,7 @@ _iv_op(struct ds_iv_ns *ns, struct ds_iv_key *key, d_sg_list_t *value,
  * but in-flight fetch request return IVCB_FORWARD, then queued RPC will
  * reply IVCB_FORWARD.
  */
- D_WARN("ns %u retry for class %d opc %d rank %u/%u: " DF_RC "\n", ns->iv_ns_id,
+ D_INFO("ns %u retry for class %d opc %d rank %u/%u: " DF_RC "\n", ns->iv_ns_id,
Info does not sound appropriate for this message?
In the object layer I can see that most of these are debug messages.
Yeah, mostly this will help us figure out what happened, since debug is not always enabled when something goes wrong.
I understand that, but this will be in the production server log. So far the retry messages that I could find are debug, so maybe just the test needs to be updated to enable debug logs for this component?
Anyway, I am not requesting a change for now, just asking for the reasoning behind this, which sounds like it is only for testing purposes?
Oh, this change is only to satisfy the CI failure injection tests. If I do not change it to INFO, the PR will generate an extra WARNING during the test, which will cause a test failure.
On the other hand, this seems to make more sense as an INFO than as a warning, since it reports running status in some corner cases.
Well, actually I got confused, sorry. This was originally a warning, so your change actually makes it better. I still think it should be debug, but that can be done later.
This reverts commit 1946ef3. Conflicts: src/engine/server_iv.c Required-githooks: true
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: