DAOS-16111 rebuild: enhance leader update_and_warn_for_slow_engines() #15778
base: master
Ticket title is 'rebuild enhancement: uniform identifier in log messages'
6a8ff39 to fce3be5
Thanks for the enhancement; it basically looks fine to me. One small point: rebuild_iv_ent_update() could add an INFO or even ERR log if an engine reports failure (src_iv->riv_status != 0).
	{"REBUILD35: destroy container then reintegrate", rebuild_cont_destroy_and_reintegrate,
	 rebuild_sub_6nodes_rf1_setup, rebuild_sub_teardown},
	{"REBUILD36: single engine scan lengthy hang", rebuild_long_scan_hang, rebuild_sub_setup,
	 rebuild_sub_teardown},
I added another UT for incremental reintegration; this may need a rebase later if my PR lands first.
Which PR added a test to this file? Latest master shows REBUILD35 as the last test in the list.
#15782. I can rebase later if your PR lands first.
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15778/2/execution/node/1514/log
e23a290 to 95dbcc6
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15778/8/testReport/
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15778/8/execution/node/1149/log
With this change, the existing warning logic invoked by the PS leader engine during rebuild_leader_status_check() is enhanced to check for "almost done" and likely hung rebuilds. When 95% or more engines have completed their scan or pull phase, this logic will set a deadline of 2 minutes. If the relevant rebuild phase has not finished by the deadline, warning messages will be logged to indicate a potentially stuck rebuild, including a list of engines that have not completed.
The determination of "almost done" is such that, depending on the scale of the system, the number of remaining engines that are being waited for is in a reasonable range of 1-20 engines.
The daos_test -r rebuild tests include a new rebuild_long_scan_hang() to inject a single-engine scan hang with a > 2 minute delay, to exercise the new warning logic.
Features: rebuild
Signed-off-by: Kenneth Cain <[email protected]>

daos_sys_logscan.py - detect update_and_warn_for_slow_engines() WARN
daos_sys_logscan.py - detect "failed" rebuild, print rc, and fail_rank
Features: rebuild
Signed-off-by: Kenneth Cain <[email protected]>

Features: rebuild
Allow-unstable-test: true
Signed-off-by: Kenneth Cain <[email protected]>
95dbcc6 to bee9215
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15778/10/execution/node/928/log
ftest LGTM
Need code review on the engine logic changes. I have discussed this with @liuxuezhao, and just added @wangdi1 as an alternate reviewer in case Xuezhao is not available.
@knard38 FYI on the updates to daos_sys_logscan.py (since I know you are looking to make some updates to that tool in other areas).
With this change, the existing warning logic invoked by the PS leader engine during rebuild_leader_status_check() is enhanced to check for "almost done" and likely hung rebuilds. When 95% or more engines have completed their scan or pull phase, this logic will set a deadline of 2 minutes. If the relevant rebuild phase has not finished by the deadline, warning messages will be logged to indicate a potentially stuck rebuild, including a list of engines that have not completed.
The determination of "almost done" is such that, depending on the scale of the system, the number of remaining engines that are being waited for is in a reasonable range of 1-20 engines.
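The "almost done" check and deadline described above can be sketched in Python as follows. This is only an illustration, not the engine's C implementation: `SlowEngineTracker`, `wait_limit()`, and the way the 95% threshold is clamped into the 1-20 range are assumptions made for this sketch. Only the 95%, 2-minute, and 1-20 figures come from the description.

```python
from typing import Iterable, List, Optional, Set

ALMOST_DONE_PERCENT = 95        # "95% or more engines" (from the description)
DEADLINE_SECONDS = 120.0        # "deadline of 2 minutes" (from the description)
MIN_WAITED, MAX_WAITED = 1, 20  # "1-20 engines"; clamping this way is an assumption


def wait_limit(total_engines: int) -> int:
    """Largest number of unfinished engines still counted as "almost done"."""
    allowed = total_engines * (100 - ALMOST_DONE_PERCENT) // 100
    return max(MIN_WAITED, min(MAX_WAITED, allowed))


class SlowEngineTracker:
    """Hypothetical leader-side state for one rebuild phase (scan or pull)."""

    def __init__(self, all_ranks: Iterable[int]) -> None:
        self.all_ranks: Set[int] = set(all_ranks)
        self.deadline: Optional[float] = None

    def update(self, done_ranks: Iterable[int], now: float) -> List[int]:
        """Return the ranks to warn about; an empty list means no warning."""
        waiting = sorted(self.all_ranks - set(done_ranks))
        if not waiting:
            # Phase finished: clear any pending deadline.
            self.deadline = None
            return []
        if len(waiting) > wait_limit(len(self.all_ranks)):
            # Too many engines still working to call the phase "almost done".
            return []
        if self.deadline is None:
            # First time the phase looks almost done: start the clock.
            self.deadline = now + DEADLINE_SECONDS
            return []
        # Warn only once the deadline has passed.
        return waiting if now >= self.deadline else []
```

Under these assumptions a 6-engine system waits for at most 1 engine, a 100-engine system for at most 5, and a 1000-engine system for at most 20, which keeps the number of ranks listed in a warning small at any scale.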
Also, the PS leader logic in rebuild_global_status_update() adds a warning when a particular engine's IV update indicates a nonzero RC, to give the developer hints about where to look for additional detail on failed or stalled rebuilds.
The daos_test -r rebuild tests include a new rebuild_long_scan_hang() to inject a single-engine scan hang with a > 2 minute delay, to exercise the new warning logic.
And finally, the daos_sys_logscan.py utility is updated to look for the new warning in
update_and_warn_for_slow_engines(), and to report any failed rebuilds (along with the
corresponding nonzero rc value and the [first] failed_rank value).
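As a rough illustration of this kind of scan, the Python sketch below collects slow-engine warnings and failed-rebuild records from log lines. It is not the code in daos_sys_logscan.py: the `scan()` helper, the `FAILED_RE` pattern, and the message shapes it expects (`rc=` and `fail_rank=` fields, a `WARN` level token) are assumptions invented for this example, since the actual DAOS log text is not shown here.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical shape of a "failed rebuild" message; the real format may differ.
FAILED_RE = re.compile(
    r"rebuild failed.*?rc=(?P<rc>-?\d+).*?fail_rank=(?P<rank>\d+)"
)


def scan(lines: Iterable[str]) -> Dict[str, List]:
    """Collect slow-engine warnings and (rc, fail_rank) pairs from log lines."""
    report: Dict[str, List] = {"slow_warnings": [], "failed": []}
    for line in lines:
        # Assumes the function name and the WARN level both appear in the line.
        if "update_and_warn_for_slow_engines" in line and "WARN" in line:
            report["slow_warnings"].append(line.rstrip("\n"))
        match = FAILED_RE.search(line)
        if match:
            report["failed"].append(
                (int(match.group("rc")), int(match.group("rank")))
            )
    return report
```

A caller could pass an open log file directly, for example `scan(open(path))`, and fail a test run when either list in the report is non-empty.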
Features: rebuild
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: