DAOS-16111 rebuild: uniform identifier in logs #14726
Draft
kccain wants to merge 3 commits into release/2.6 from kccain/daos_16031_debug_rel2p6
To the extent possible in the rebuild code execution flow, when rebuild
emits log messages, include a uniform rebuild operation identifier in
those messages. This covers activities across all pool storage engines
(including the pool service leader), system and per-target
threads/xstreams, and dynamically spawned user-level threads.

The motivation is to enable some amount of automated searching through
logfiles for all (or specific) rebuilds that occurred during execution,
and speed up DAOS engineer analysis/interpretation of the logs.

The baseline format (defined in the DF_RB macro) is:
  "rb=" DF_UUID "/%u/%u/%s"
and corresponds to:
  <pool_uuid>/<rebuild_ver>/<rebuild_gen>/<opcode_string>

A verbose format (defined in the DF_RBF macro) adds the following
(for <leader_rank>/)
  " ld=%u/" DF_U64

Various DP_RB_* and DP_RBF_* macros are defined to specify the
arguments to go with the DF_RB and DF_RBF formats, given some
common rebuild implementation structures such as:
  struct rebuild_global_pool_tracker
  struct rebuild_tgt_pool_tracker
  struct rebuild_scan_in (REBUILD_OBJECTS_SCAN RPC input)
  struct migrate_query_arg

This initial patch covers the pool service leader execution in
functions (and those that they invoke) such as:
  rebuild_ults()
  rebuild_task_ult()
  rebuild_leader_start()
  rebuild_leader_status_check()
  rebuild_leader_status_notify()

This patch also covers "scan side" execution in all pool storage engines
(including the leader), in functions such as:
  rebuild_tgt_scan_handler()
  rebuild_tgt_status_check_ult()
  ds_migrate_query_status()
  migrate_check_one()
  dss_rebuild_check_one()
  rebuild_scan_leader()
  rebuild_scanner()
  rebuild_objects_send_ult()
  rebuild_scan_done()

Signed-off-by: Kenneth Cain <[email protected]>
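To illustrate how a format macro and a matching argument macro fit together, here is a minimal, self-contained sketch of the baseline identifier described above. It is not the DAOS implementation: every `SK_`/`sk_` name is hypothetical, the UUID stand-in prints only the first four bytes as hex (the real DF_UUID/DP_UUID macros may behave differently), and `struct sk_tracker` is a simplified stand-in for structures such as struct rebuild_global_pool_tracker.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumption: simplified stand-ins for DF_UUID/DP_UUID that print only the
 * first 4 bytes of the pool UUID as 8 hex digits. */
#define SK_DF_UUID "%02x%02x%02x%02x"
#define SK_DP_UUID(u) \
	(unsigned)(u)[0], (unsigned)(u)[1], (unsigned)(u)[2], (unsigned)(u)[3]

/* Baseline layout from the commit message:
 * rb=<pool_uuid>/<rebuild_ver>/<rebuild_gen>/<opcode_string> */
#define SK_DF_RB "rb=" SK_DF_UUID "/%u/%u/%s"

/* Hypothetical tracker; the real rebuild structures have other fields. */
struct sk_tracker {
	unsigned char pool_uuid[16];
	uint32_t      rebuild_ver;
	uint32_t      rebuild_gen;
	const char   *opstr;
};

/* Argument list matching SK_DF_RB, extracted from a tracker pointer. */
#define SK_DP_RB_TRACKER(t)                                       \
	SK_DP_UUID((t)->pool_uuid), (unsigned)(t)->rebuild_ver,   \
	(unsigned)(t)->rebuild_gen, (t)->opstr

/* Format one log line with the identifier as a prefix.
 * Returns the value returned by snprintf(). */
static int
sk_format_log(char *buf, size_t len, const struct sk_tracker *t, const char *msg)
{
	assert(buf != NULL && t != NULL && msg != NULL);
	return snprintf(buf, len, SK_DF_RB " %s", SK_DP_RB_TRACKER(t), msg);
}
```

Because every message carries the same `rb=` prefix, a plain text search for one prefix value (for example with grep) would collect all lines belonging to one rebuild operation.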
When rebuild emits log messages, include a uniform rebuild operation
identifier. This change adjusts existing logging for rebuild migrate
activities. A previous patch added the same operation identifier in
log messages by the PS leader and storage engine scan activities.

Reminder, the baseline format (defined in DF_RB) for the uniform
identifier is:
  "rb=" DF_UUID "/%u/%u/%s"
that corresponds to:
  <pool_uuid>/<rebuild_ver>/<rebuild_gen>/<opcode_string>

This change adds DP_RB_OMI, DP_RB_MPT, and DP_RB_MRO macros that accept
struct obj_migrate_in *omi, struct migrate_pool_tls *mpt, and
struct migrate_one *mro, respectively, to provide the values needed for
the DF_RB logging format. And the patch applies them throughout the
existing logging performed in migrate activities.

Also in this change is some logic added to rebuild_leader_status_check():
a new function, warn_for_slow_engine_updates(). This allows a PS leader
engine to emit warnings when an engine is not reporting its rebuild
progress (via IV) for a long amount of time, making it easier for an
engineer to identify what engine(s) may be causing a stuck rebuild.
The warning messages are throttled to avoid too many log file entries.

Skip-unit-tests: true
Skip-fault-injection-test: true
Skip-test: true

Signed-off-by: Kenneth Cain <[email protected]>
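The throttled slow-engine warning described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code of warn_for_slow_engine_updates(): the `sk_` names, the struct layout, and both thresholds (120 s of silence, 60 s between warnings) are hypothetical, and timestamps are plain seconds rather than whatever clock the rebuild code uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SK_SLOW_SECS     120 /* assumed: silence longer than this is "slow" */
#define SK_WARN_INTERVAL  60 /* assumed: minimum seconds between warnings */

/* Hypothetical per-engine bookkeeping kept by the leader. */
struct sk_engine {
	uint32_t rank;
	uint64_t last_update; /* time of last progress report, in seconds */
	uint64_t last_warn;   /* time of last warning, 0 means never */
};

/* Warn about engines that have been silent too long, at most once per
 * SK_WARN_INTERVAL for each engine. Returns the number of warnings emitted. */
static int
sk_warn_slow(struct sk_engine *e, size_t n, uint64_t now)
{
	int warned = 0;

	assert(e != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		/* Guard against a report stamped later than "now". */
		uint64_t idle = now > e[i].last_update ? now - e[i].last_update : 0;

		if (idle <= SK_SLOW_SECS)
			continue; /* engine reported recently enough */
		if (e[i].last_warn != 0 && now - e[i].last_warn < SK_WARN_INTERVAL)
			continue; /* throttled: warned about this engine recently */

		fprintf(stderr, "rank %u: no rebuild update for %llu s\n",
			(unsigned)e[i].rank, (unsigned long long)idle);
		e[i].last_warn = now;
		warned++;
	}
	return warned;
}
```

Keeping the throttle state per engine means one silent engine cannot suppress warnings about another, while repeated status checks still add only a bounded number of log entries.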
Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14726/1/execution/node/359/log
Ticket title is 'rebuild enhancement: uniform identifier in log messages'
cherry-pick of master commit for debug branch purposes

Disable warning for deprecated support for python version so
it doesn't fail the build.

Skip-unit-tests: true
Skip-fault-injection-test: true
Skip-test: true

Signed-off-by: Jeff Olivier <[email protected]>
Signed-off-by: Kenneth Cain <[email protected]>
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: