DAOS-16766 container: rebuid and container destruction race #15971

wangshilong · 2025-02-25T08:52:05Z

daos_lru_ref_evict_wait() may yield, potentially creating a race condition with rebuild operations. During rebuild migration, the container could be reopened and restarted, which could result in EBUSY errors from subsequent vos_cont_destroy() calls.

To resolve this issue:

1. We avoid container eviction during waiting periods.
2. Container lookup failures are guaranteed by checking the @sc_destroying flag before proceeding.

This design ensures consistency by preventing concurrent access to containers marked for destruction.
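
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of the "mark as destroying, then fail lookups" idea described above. It is not the DAOS code: `struct toy_cont`, its fields, and the `toy_cont_*` helpers are hypothetical names made up for illustration, and the errno-style return values (`-ENOENT`, `-EBUSY`) are an assumption rather than what the real lookup returns.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a cached container; all names here are hypothetical. */
struct toy_cont {
	int  tc_ref;        /* references currently held by users */
	bool tc_destroying; /* set as soon as destruction starts */
	bool tc_evicted;    /* set once the entry has left the cache */
};

/* Lookup refuses to hand out a container that is being destroyed. */
static int
toy_cont_lookup(struct toy_cont *cont, struct toy_cont **out)
{
	if (cont->tc_evicted || cont->tc_destroying)
		return -ENOENT; /* assumed error code, for illustration */
	cont->tc_ref++;
	*out = cont;
	return 0;
}

/* Drop one reference taken by toy_cont_lookup(). */
static void
toy_cont_put(struct toy_cont *cont)
{
	assert(cont->tc_ref > 0);
	cont->tc_ref--;
}

/*
 * Destroy path: set the flag first, so that a lookup running while the
 * destroyer waits fails instead of reopening the container. This sketch
 * returns -EBUSY where a real implementation would yield and wait for
 * the remaining references to drain.
 */
static int
toy_cont_destroy(struct toy_cont *cont)
{
	cont->tc_destroying = true;
	if (cont->tc_ref > 0)
		return -EBUSY;
	cont->tc_evicted = true;
	return 0;
}
```

In this sketch, a lookup that arrives between the first `toy_cont_destroy()` attempt and the final one is rejected, so the reference count can only fall while destruction is pending.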

Test-tag: test_ec_single_target_rank_failure pr

Before requesting gatekeeper:

Two review approvals and any prior change requests have been resolved.
Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Commit messages follow the guidelines outlined here.
Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

github-actions · 2025-02-25T08:52:23Z

Ticket title is 'erasurecode/multiple_rank_failure.py:EcodOnlineMultiRankFail.test_ec_multiple_rank_failure - daos container destroy DER_TIMEDOUT'
Status is 'In Review'
Labels: '2.6.3rc2,2.6.3rc3,2.7.101tb,md_on_ssd,weekly_test'
https://daosio.atlassian.net/browse/DAOS-16766

Nasf-Fan · 2025-02-27T09:37:45Z

src/container/srv_target.c

+	 * to containers marked for destruction.
+	 */
+	daos_lru_ref_noevict_wait(tls->dt_cont_cache, &cont->sc_list);
+	daos_lru_ref_evict(tls->dt_cont_cache, &cont->sc_list);


Do we need to check "if (!llink->ll_evicted)" before calling daos_lru_ref_evict() since we may yield for wait?

In theory this is possible, but currently no other caller can evict the container, and calling daos_lru_ref_evict() again here is harmless. I will add the extra check if the PR needs to be refreshed for any reason.

Add the check inside daos_lru_ref_evict()
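
A rough sketch of what an "already evicted" guard inside the evict call could look like, assuming a link structure with an `ll_evicted` flag as mentioned in the question above. `struct toy_llink`, `toy_lru_ref_evict()`, and the `ll_evict_cnt` counter are hypothetical and exist only to make the example self-contained; the actual change in the PR may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy LRU link; ll_evicted mirrors the flag named in the review. */
struct toy_llink {
	bool ll_evicted;   /* true once the entry has been evicted */
	int  ll_evict_cnt; /* hypothetical counter: times eviction work ran */
};

/*
 * Evict the entry unless it was already evicted, for example by another
 * path that ran while the caller yielded. Returns true only when this
 * call performed the eviction.
 */
static bool
toy_lru_ref_evict(struct toy_llink *llink)
{
	assert(llink != NULL);
	if (llink->ll_evicted)
		return false; /* nothing to do, safe to call again */
	llink->ll_evicted = true;
	llink->ll_evict_cnt++;
	return true;
}
```

With the guard inside the function, callers do not need their own `if (!llink->ll_evicted)` test after a wait that may yield.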

daltonbohning · 2025-02-27T19:32:19Z

Removing gatekeeper until there are 2 approvals

wangshilong marked this pull request as ready for review February 26, 2025 01:20

wangshilong requested review from a team as code owners February 26, 2025 01:20

wangshilong requested review from NiuYawei, gnailzenh and Nasf-Fan February 26, 2025 01:20

jolivier23 previously approved these changes Feb 26, 2025

NiuYawei approved these changes Feb 27, 2025

NiuYawei previously approved these changes Feb 27, 2025

wangshilong requested a review from a team February 27, 2025 02:34

Nasf-Fan reviewed Feb 27, 2025

Nasf-Fan self-requested a review February 27, 2025 09:38

wangshilong added 2 commits February 27, 2025 10:26

Merge branch 'master' of github.com:daos-stack/daos into shilongw/DAO…

375b80c

…S-16766

address comments

15713dc

Test-tag: test_ec_single_target_rank_failure pr Signed-off-by: Wang Shilong <[email protected]>

wangshilong dismissed stale reviews from NiuYawei and jolivier23 via 15713dc February 27, 2025 15:38

wangshilong requested review from NiuYawei and jolivier23 February 27, 2025 15:38

Nasf-Fan approved these changes Feb 27, 2025

daltonbohning removed the request for review from a team February 27, 2025 19:32
