DAOS-16111 rebuild: uniform identifier in logs #14726

Draft
wants to merge 3 commits into
base: release/2.6


kccain
Contributor

@kccain kccain commented Jul 8, 2024

To the extent possible in the rebuild code execution flow, when rebuild
emits log messages, include a uniform rebuild operation identifier in
those messages. This covers activities across all pool storage engines
(including the pool service leader), system and per-target
threads/xstreams, and dynamically spawned user-level threads.

The motivation is to enable some amount of automated searching through
logfiles for all (or specific) rebuilds that occurred during execution,
and speed up DAOS engineer analysis/interpretation of the logs.

The baseline format (defined in the DF_RB macro) is:
"rb=" DF_UUID "/%u/%u/%s"
and corresponds to:
<pool_uuid>/<rebuild_ver>/<rebuild_gen>/<opcode_string>
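
As a rough illustration of the baseline format, the sketch below builds the identifier with snprintf(). Only the format string itself is taken from this description. The SK_ prefixed macros and sk_rb_id() are hypothetical stand-ins, SK_DF_UUID is assumed to be a plain "%s" rather than DAOS's real DF_UUID, and the trailing %s is assumed to carry an operation string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for DAOS's DF_UUID (assumption; the real macro is not reproduced). */
#define SK_DF_UUID "%s"

/* Baseline format as quoted in this description. */
#define SK_DF_RB "rb=" SK_DF_UUID "/%u/%u/%s"

/*
 * Hypothetical helper: write the identifier into buf and return what
 * snprintf() returns (the length the full string needs, excluding the
 * terminating NUL).
 */
int
sk_rb_id(char *buf, size_t len, const char *pool_uuid, unsigned int ver,
	 unsigned int gen, const char *op)
{
	assert(buf != NULL && len > 0);
	assert(pool_uuid != NULL && strlen(pool_uuid) > 0);
	assert(op != NULL);
	return snprintf(buf, len, SK_DF_RB, pool_uuid, ver, gen, op);
}
```

With the placeholder inputs "a1b2c3d4", 5, 2, and "Rebuild", this helper produces rb=a1b2c3d4/5/2/Rebuild, which a log search could match on the "rb=" prefix or on a specific pool and version.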

A verbose format (defined in the DF_RBF macro) adds the following
(for <leader_rank>/)
" ld=%u/" DF_U64

Various DP_RB_* and DP_RBF_* macros are defined to specify the
arguments to go with the DF_RB and DF_RBF formats, given some
common rebuild implementation structures such as:
struct rebuild_global_pool_tracker
struct rebuild_tgt_pool_tracker
struct rebuild_scan_in (REBUILD_OBJECTS_SCAN RPC input)
struct migrate_query_arg
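
The pairing of a format macro with an argument macro can be sketched as follows. This is an illustration under assumptions, not the code in this patch: struct sk_tracker and its field names are hypothetical, the SK_ macros only mimic the DF_RB/DP_RB_* and DF_RBF/DP_RBF_* pattern, and the 64-bit value after the leader rank is given the placeholder name leader_val because this description does not name it.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, trimmed-down tracker; field names are illustrative only. */
struct sk_tracker {
	char		 pool_uuid[37];
	unsigned int	 rebuild_ver;
	unsigned int	 rebuild_gen;
	const char	*op_str;
	unsigned int	 leader_rank;
	uint64_t	 leader_val;	/* placeholder name (assumption) */
};

/* Format macros: "%s" stands in for DF_UUID, PRIu64 for DF_U64. */
#define SK_DF_RB  "rb=%s/%u/%u/%s"
#define SK_DF_RBF SK_DF_RB " ld=%u/%" PRIu64

/* Argument macros: expand to the values the matching format expects. */
#define SK_DP_RB(t)  (t)->pool_uuid, (t)->rebuild_ver, (t)->rebuild_gen, (t)->op_str
#define SK_DP_RBF(t) SK_DP_RB(t), (t)->leader_rank, (t)->leader_val

/* Compose a message with the baseline identifier as its prefix. */
int
sk_log_rb(char *buf, size_t len, const struct sk_tracker *t, int rc)
{
	assert(buf != NULL && t != NULL);
	return snprintf(buf, len, SK_DF_RB " scan done, rc=%d", SK_DP_RB(t), rc);
}

/* Same message with the verbose identifier. */
int
sk_log_rbf(char *buf, size_t len, const struct sk_tracker *t, int rc)
{
	assert(buf != NULL && t != NULL);
	return snprintf(buf, len, SK_DF_RBF " scan done, rc=%d", SK_DP_RBF(t), rc);
}
```

The point of the pattern is that each call site names only the structure it already holds, so the identifier stays uniform even when the fields live in different structures.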

This initial patch covers the pool service leader execution in
functions (and those that they invoke) such as:
rebuild_ults()
rebuild_task_ult()
rebuild_leader_start()
rebuild_leader_status_check()
rebuild_leader_status_notify().

And this patch covers "scan side" execution in all pool storage engines
(including the leader), in functions such as:
rebuild_tgt_scan_handler()
rebuild_tgt_status_check_ult()
ds_migrate_query_status()
migrate_check_one()
dss_rebuild_check_one()
rebuild_scan_leader()
rebuild_scanner()
rebuild_objects_send_ult()
rebuild_scan_done()

Rebuild migrate activities are also modified to use the new format.
Macros DP_RB_OMI, DP_RB_MPT, and DP_RB_MRO accept
struct obj_migrate_in *omi, struct migrate_pool_tls *mpt, and
struct migrate_one *mro, respectively, to provide the values needed for
the DF_RB logging format.

This change also adds logic to rebuild_leader_status_check() in the form
of a new function, warn_for_slow_engine_updates(). It allows a PS leader
engine to emit warnings when an engine has not reported its rebuild
progress (via IV) for a long time, making it easier for an engineer to
identify which engine(s) may be causing a stuck rebuild.
The warning messages are throttled to avoid too many log file entries.
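
One way such a throttled check could work is sketched below. It is not the implementation of warn_for_slow_engine_updates(). The structure, the function name, and both time thresholds are assumptions chosen for illustration, and timestamps are plain seconds supplied by the caller.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SK_SLOW_THRESHOLD_SEC 120	/* assumed: silence before warning */
#define SK_WARN_INTERVAL_SEC  60	/* assumed: minimum gap between warnings */

/* Hypothetical per-engine bookkeeping kept by the leader. */
struct sk_engine_status {
	unsigned int	rank;
	uint64_t	last_update_sec;	/* time of last progress report */
	uint64_t	last_warn_sec;		/* 0 means never warned */
	bool		done;			/* engine has finished */
};

/* Warn about silent engines; return how many warnings were emitted. */
int
sk_warn_for_slow_engines(struct sk_engine_status *engines, size_t n,
			 uint64_t now_sec)
{
	int warned = 0;

	assert(engines != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		struct sk_engine_status *e = &engines[i];

		if (e->done || now_sec < e->last_update_sec)
			continue;
		/* Engine reported recently enough: nothing to say. */
		if (now_sec - e->last_update_sec < SK_SLOW_THRESHOLD_SEC)
			continue;
		/* Throttle: skip if this engine was warned about recently. */
		if (e->last_warn_sec != 0 &&
		    now_sec - e->last_warn_sec < SK_WARN_INTERVAL_SEC)
			continue;

		fprintf(stderr, "rank %u: no rebuild update for %" PRIu64 " s\n",
			e->rank, now_sec - e->last_update_sec);
		e->last_warn_sec = now_sec;
		warned++;
	}
	return warned;
}
```

Keeping the last warning time per engine, rather than one global timer, means a second slow engine is still reported promptly while repeat warnings for the first are suppressed.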

Skip-unit-tests: true
Skip-fault-injection-test: true
Skip-test: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

kccain added 2 commits July 8, 2024 16:21
To the extent possible in the rebuild code execution flow, when rebuild emits log messages, include a uniform rebuild operation identifier in those messages. This covers activities across all pool storage engines (including the pool service leader), system and per-target threads/xstreams, and dynamically spawned user-level threads.

The motivation is to enable some amount of automated searching through logfiles for all (or specific) rebuilds that occurred during execution, and speed up DAOS engineer analysis/interpretation of the logs.

The baseline format (defined in the DF_RB macro) is:
"rb=" DF_UUID "/%u/%u/%s"
and corresponds to:
<pool_uuid>/<rebuild_ver>/<rebuild_gen>/<opcode_string>

A verbose format (defined in the DF_RBF macro) adds the following (for <leader_rank>/)
" ld=%u/" DF_U64

Various DP_RB_* and DP_RBF_* macros are defined to specify the arguments to go with the DF_RB and DF_RBF formats, given some common rebuild implementation structures such as:
struct rebuild_global_pool_tracker
struct rebuild_tgt_pool_tracker
struct rebuild_scan_in (REBUILD_OBJECTS_SCAN RPC input)
struct migrate_query_arg

This initial patch covers the pool service leader execution in functions (and those that they invoke) such as:
rebuild_ults()
rebuild_task_ult()
rebuild_leader_start()
rebuild_leader_status_check()
rebuild_leader_status_notify().

And this patch covers "scan side" execution in all pool storage engines (including the leader), in functions such as:
rebuild_tgt_scan_handler()
rebuild_tgt_status_check_ult()
ds_migrate_query_status()
migrate_check_one()
dss_rebuild_check_one()
rebuild_scan_leader()
rebuild_scanner()
rebuild_objects_send_ult()
rebuild_scan_done()

Signed-off-by: Kenneth Cain <[email protected]>
When rebuild emits log messages, include a uniform rebuild operation
identifier. This change adjusts existing logging for rebuild migrate
activities. A previous patch added the same operation identifier
in log messages by the PS leader and storage engine scan activities.

Reminder, the baseline format (defined in DF_RB) for the uniform
identifier is:
"rb=" DF_UUID "/%u/%u/%s"
that corresponds to:
<pool_uuid>/<rebuild_ver>/<rebuild_gen>/<opcode_string>

This change adds DP_RB_OMI, DP_RB_MPT, and DP_RB_MRO macros that accept
struct obj_migrate_in *omi, struct migrate_pool_tls *mpt, and
struct migrate_one *mro, respectively, to provide the values needed for
the DF_RB logging format. And the patch applies them throughout the
existing logging performed in migrate activities.

This change also adds logic to rebuild_leader_status_check() in the form
of a new function, warn_for_slow_engine_updates(). It allows a PS leader
engine to emit warnings when an engine has not reported its rebuild
progress (via IV) for a long time, making it easier for an engineer to
identify which engine(s) may be causing a stuck rebuild.
The warning messages are throttled to avoid too many log file entries.

Skip-unit-tests: true
Skip-fault-injection-test: true
Skip-test: true

Signed-off-by: Kenneth Cain <[email protected]>

github-actions bot commented Jul 8, 2024

Ticket title is 'rebuild enhancement: uniform identifier in log messages'
Status is 'In Progress'
https://daosio.atlassian.net/browse/DAOS-16111

cherry-pick of master commit for debug branch purposes

Disable warning for deprecated support for python
version so it doesn't fail the build.

Skip-unit-tests: true
Skip-fault-injection-test: true
Skip-test: true

Signed-off-by: Jeff Olivier <[email protected]>
Signed-off-by: Kenneth Cain <[email protected]>
@daos-stack daos-stack deleted a comment from daosbuild1 Jul 8, 2024
3 participants