DAOS-17001 rebuild: when self_heal is set to delay_rebuild, do not re… #15809

cdavis28 · 2025-01-29T06:14:45Z

…build on exclude

delay_rebuild mode should delay the rebuild in all scenarios, with no exception for target exclusion. Also downgraded the shard update failure message from an error to a warning: shard update failures are expected while a failure is being handled, and the message was logged too frequently.
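
To make the behaviour change concrete, here is a minimal sketch of the guard before and after the first revision of this patch. It assumes the guarded branch is the one that defers the rebuild and that `exclude_rank` is false for a target exclusion. The flag value and all `sketch_` names are placeholders for illustration, not taken from the DAOS source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit for illustration only; the real flag is defined in the DAOS headers. */
#define SKETCH_DELAY_REBUILD (1U << 2)

/* Guard before the patch: defer only when a whole rank is being excluded. */
bool sketch_defer_before(uint64_t self_heal, bool exclude_rank)
{
	return (self_heal & SKETCH_DELAY_REBUILD) != 0 && exclude_rank;
}

/* Guard in the first revision of the patch: defer whenever the flag is set. */
bool sketch_defer_after(uint64_t self_heal)
{
	return (self_heal & SKETCH_DELAY_REBUILD) != 0;
}

/* Usage check: a target exclusion (exclude_rank == false) slipped past the old guard. */
void sketch_check_defer(void)
{
	uint64_t self_heal = SKETCH_DELAY_REBUILD;

	assert(!sketch_defer_before(self_heal, false));
	assert(sketch_defer_after(self_heal));
}
```

With the old guard, only rank exclusion was deferred, so an administrative target exclusion would start a rebuild immediately even with delay_rebuild set.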

Testing:
`dmg pool exclude default-pool --rank 0 --target-idx 4` while a write/read workflow was running against a cluster

Before requesting gatekeeper:

Two review approvals and any prior change requests have been resolved.
Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Commit messages follow the guidelines outlined here.
Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

github-actions · 2025-01-29T06:15:03Z

Ticket title is 'Target exclusion with delay_rebuild set'
Status is 'In Progress'
Labels: 'google-cloud-daos'
https://daosio.atlassian.net/browse/DAOS-17001

src/pool/srv_pool.c

Features: rebuild Signed-off-by: Chris Davis <[email protected]>

jolivier23 · 2025-01-29T23:13:17Z

src/pool/srv_pool.c

@@ -7360,7 +7361,7 @@ pool_svc_update_map(struct pool_svc *svc, crt_opcode_t opc, bool exclude_rank,
 		D_GOTO(out, rc);
 	}

-	if ((entry->dpe_val & DAOS_SELF_HEAL_DELAY_REBUILD) && exclude_rank)
+	if (entry->dpe_val & DAOS_SELF_HEAL_DELAY_REBUILD)


Only question I have: do we need any extra check here? For operations like reintegrate, drain, or extend, we actually do want rebuild. But I will defer to Di, who is infinitely more familiar with the logic here.

Yeah, I am curious about this, too. It seems another check is needed.
So the goal of the patch is to ensure a delayed rebuild for any type of exclusion, whether initiated by SWIM detection of a lost engine (already handled) or by an administrator excluding targets (e.g., via the dmg pool exclude command).

Maybe logic such as the following would do what is needed here, to not impact reintegrate, drain, and extend?

if ((entry->dpe_val & DAOS_SELF_HEAL_DELAY_REBUILD) && (opc == MAP_EXCLUDE))
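
A minimal sketch of how that narrowed check would behave across operations. The enum values, the flag bit, and all `sketch_` names are hypothetical stand-ins chosen for illustration; only the shape of the condition comes from the line above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit for illustration only; not the real DAOS definition. */
#define SKETCH_DELAY_REBUILD (1U << 2)

/* Hypothetical stand-ins for the pool map opcodes discussed in this thread. */
enum sketch_map_opc {
	SKETCH_MAP_EXCLUDE = 1,
	SKETCH_MAP_DRAIN,
	SKETCH_MAP_REINT,
	SKETCH_MAP_EXTEND
};

/* Defer the rebuild only for exclusion; other operations still rebuild. */
bool sketch_should_defer(uint64_t self_heal, enum sketch_map_opc opc)
{
	return (self_heal & SKETCH_DELAY_REBUILD) != 0 && opc == SKETCH_MAP_EXCLUDE;
}

/* Usage check: with the flag set, only exclusion is deferred. */
void sketch_check_should_defer(void)
{
	assert(sketch_should_defer(SKETCH_DELAY_REBUILD, SKETCH_MAP_EXCLUDE));
	assert(!sketch_should_defer(SKETCH_DELAY_REBUILD, SKETCH_MAP_REINT));
}
```

Under these assumptions, reintegrate, drain, and extend are unaffected by delay_rebuild, which matches the concern raised earlier in the thread.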

Now that I look at this, it seems maybe there is a bug in the DAOS_REINT_MODE_NO_DATA_SYNC branch above when it tests opc to see if it is POOL_EXCLUDE or POOL_DRAIN. I think instead the test there should be for MAP_EXCLUDE or MAP_DRAIN, since the caller converts the RPC opcodes (POOL_) into pool map opcodes (MAP_) that have different values (!).
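
To illustrate the suspected mismatch, here is a small sketch with two opcode namespaces that use different numeric values. All names and values are made up for the example and are not the real DAOS definitions; the point is only that a test written against the RPC namespace does not match once the caller has converted to the pool map namespace.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical RPC opcodes (stand-ins for the POOL_ namespace). */
enum sketch_rpc_opc {
	SKETCH_POOL_EXCLUDE = 0x20,
	SKETCH_POOL_DRAIN   = 0x21,
	SKETCH_POOL_REINT   = 0x22
};

/* Hypothetical pool map opcodes (stand-ins for the MAP_ namespace). */
enum sketch_map_opc {
	SKETCH_MAP_EXCLUDE = 1,
	SKETCH_MAP_DRAIN   = 2,
	SKETCH_MAP_REINT   = 3
};

/* Caller-side conversion from an RPC opcode to a pool map opcode. */
int sketch_rpc_to_map(int rpc_opc)
{
	switch (rpc_opc) {
	case SKETCH_POOL_EXCLUDE: return SKETCH_MAP_EXCLUDE;
	case SKETCH_POOL_DRAIN:   return SKETCH_MAP_DRAIN;
	case SKETCH_POOL_REINT:   return SKETCH_MAP_REINT;
	default:                  return -1; /* unknown opcode */
	}
}

/* Suspected bug: compares an already-converted opcode against RPC values. */
bool sketch_is_exclude_or_drain_buggy(int map_opc)
{
	return map_opc == SKETCH_POOL_EXCLUDE || map_opc == SKETCH_POOL_DRAIN;
}

/* Suggested fix: compare against the pool map values instead. */
bool sketch_is_exclude_or_drain_fixed(int map_opc)
{
	return map_opc == SKETCH_MAP_EXCLUDE || map_opc == SKETCH_MAP_DRAIN;
}

/* Usage check: after conversion, only the fixed test recognises an exclude. */
void sketch_check_opcodes(void)
{
	int opc = sketch_rpc_to_map(SKETCH_POOL_EXCLUDE);

	assert(!sketch_is_exclude_or_drain_buggy(opc));
	assert(sketch_is_exclude_or_drain_fixed(opc));
}
```

If the real opcodes differ in the same way, the branch guarded by the POOL_ comparison would never be taken for a converted opcode, which is consistent with the later commit titled "Check for MAP_EXCLUDE/MAP_DRAIN instead of POOL_EXCLUDE".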

daosbuild1 · 2025-01-30T13:47:55Z

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15809/3/execution/node/1467/log

daosbuild1 · 2025-01-30T13:59:56Z

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15809/3/execution/node/1457/log

cdavis28 force-pushed the chrd/delay_rebuild branch from f5fb8d7 to cbddbb5 Compare January 29, 2025 06:18

cdavis28 marked this pull request as ready for review January 29, 2025 06:30

cdavis28 requested review from a team as code owners January 29, 2025 06:30

cdavis28 requested review from jolivier23 and wangdi1 January 29, 2025 06:30

wangdi1 reviewed Jan 29, 2025

apply review comments

401d508

Features: rebuild Signed-off-by: Chris Davis <[email protected]>

jolivier23 previously approved these changes Jan 29, 2025

jolivier23 reviewed Jan 29, 2025

wangdi1 previously approved these changes Jan 30, 2025

Check for MAP_EXCLUDE/MAP_DRAIN instead of POOL_EXCLUDE

adb8d21

Features: rebuild Signed-off-by: Chris Davis <[email protected]>

cdavis28 dismissed stale reviews from wangdi1 and jolivier23 via adb8d21 January 30, 2025 19:33

kccain Jan 30, 2025
