DAOS-16766 container: rebuild and container destruction race #15971
base: master
Conversation
daos_lru_ref_evict_wait() may yield, potentially creating a race condition with rebuild operations. During rebuild migration, the container could be reopened and restarted, which could result in EBUSY errors from subsequent vos_cont_destroy() calls.

To resolve this issue:
1. Avoid container eviction during the waiting period.
2. Guarantee that container lookups fail by checking the @sc_destroying flag before proceeding.

This design ensures consistency by preventing concurrent access to containers marked for destruction.

Test-tag: test_ec_single_target_rank_failure pr
Signed-off-by: Wang Shilong <[email protected]>
Ticket title is 'erasurecode/multiple_rank_failure.py:EcodOnlineMultiRankFail.test_ec_multiple_rank_failure - daos container destroy DER_TIMEDOUT'
 * to containers marked for destruction.
 */
daos_lru_ref_noevict_wait(tls->dt_cont_cache, &cont->sc_list);
daos_lru_ref_evict(tls->dt_cont_cache, &cont->sc_list);
Do we need to check "if (!llink->ll_evicted)" before calling daos_lru_ref_evict()
since we may yield for wait?
In theory this is possible, but currently there is no other caller that might evict the container, and calling daos_lru_ref_evict() again here is not harmful for now. I will add the extra check if the PR needs to be refreshed for any reason.
Add the check inside daos_lru_ref_evict().
Commit 15713dc:
Test-tag: test_ec_single_target_rank_failure pr
Signed-off-by: Wang Shilong <[email protected]>
Removing gatekeeper until there are 2 approvals