-
Notifications
You must be signed in to change notification settings - Fork 306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-17001 rebuild: when self_heal is set to delay_rebuild, do not re… #15809
base: master
Are you sure you want to change the base?
Conversation
Ticket title is 'Target exclusion with delay_rebuild set' |
…build on exclude delay_rebuild mode should delay the rebuild in all scenarios and not have an exception for target exclusion. Also changed an error message to warn on shard update failure. Shard update failure is normal during a failure, and the message was too frequent. Testing: `dmg pool exclude default-pool --rank 0 --target-idx 4` while write/read workflow was running against a cluster Signed-off-by: Chris Davis <[email protected]>
f5fb8d7
to
cbddbb5
Compare
Features: rebuild Signed-off-by: Chris Davis <[email protected]>
src/pool/srv_pool.c
Outdated
@@ -7360,7 +7361,7 @@ pool_svc_update_map(struct pool_svc *svc, crt_opcode_t opc, bool exclude_rank, | |||
D_GOTO(out, rc); | |||
} | |||
|
|||
if ((entry->dpe_val & DAOS_SELF_HEAL_DELAY_REBUILD) && exclude_rank) | |||
if (entry->dpe_val & DAOS_SELF_HEAL_DELAY_REBUILD) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only question I have....do we need any extra check here? For stuff like reintegrate or drain or extend, we actually do want rebuild. But I will defer to Di who is infinitely more familiar with the logic here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I am also curious about this, too. It seems another check is needed.
So the goal of the patch is to make sure to have a delayed rebuild for any type of exclusion whether initiated by SWIM detection of a lost engine (already handled), or administratively excluded targets (e.g., via dmg pool exclude command).
Maybe logic such as the following would do what is needed here, to not impact reintegrate, drain, and extend?
if ((entry->dpe_val & DAOS_SELF_HEAL_DELAY_REBUILD) && (opc == MAP_EXCLUDE))
Now that I look at this, it seems maybe there is a bug in the DAOS_REINT_MODE_NO_DATA_SYNC branch above when it tests opc to see if it is POOL_EXCLUDE or POOL_DRAIN. I think instead the test there should be for MAP_EXCLUDE or MAP_DRAIN, since the caller converts the RPC opcodes (POOL_) into pool map opcodes (MAP_) that have different values (!).
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15809/3/execution/node/1467/log |
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15809/3/execution/node/1457/log |
Features: rebuild Signed-off-by: Chris Davis <[email protected]>
…build on exclude
delay_rebuild mode should delay the rebuild in all scenarios and not have an exception for target exclusion. Also changed an error message to warn on shard update failure. Shard update failure is normal during a failure, and the message was too frequent.
Testing:
dmg pool exclude default-pool --rank 0 --target-idx 4
while write/read workflow was running against a clusterBefore requesting gatekeeper:
Features:
(orTest-tag*
) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.Gatekeeper: