DAOS-16312 control: Always use --force for dmg system stop #15799

Open
wants to merge 3 commits into
base: master

Contributor

@tanabarr tanabarr commented Jan 27, 2025

Whenever stopping an engine process from within the control-plane, use
SIGKILL rather than asking nicely (SIGTERM). This has been requested
to try to avoid situations that could result in data loss.

This change preserves the behaviour where ds_mgmt_drpc_prep_shutdown()
and then ds_pool_disable_exclude() will be called during a controlled
shutdown where dmg system stop is called without other arguments.

Notable behavior changes with this PR:

  • Always performs SIGKILL on dmg system stop regardless of command
    options supplied.
  • Will attempt prepare shutdown to disable exclusions across cluster
    during “controlled” shutdown where dmg system stop is called without
    options.
  • It is now recommended to call dmg system stop without options if
    attempting to shut down an entire cluster without triggering rebuilds.
  • The force option can be used to skip the “disable exclusions” prepare
    shutdown dRPC to each rank during dmg system stop.
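
The behaviour changes listed above can be summarised as a small decision function. This is only a sketch under stated assumptions, not the code in this PR: `stopRequest`, `stopPlan`, `planStop` and `stopSignal` are hypothetical names, and "controlled" is taken to mean no force flag and no ranks specified.

```go
package main

import (
	"fmt"
	"syscall"
)

// stopSignal is the signal this sketch always uses to stop an engine.
const stopSignal = syscall.SIGKILL

// stopRequest is a hypothetical stand-in for the real stop request.
type stopRequest struct {
	Force bool     // force option supplied
	Ranks []uint32 // empty means the whole system
}

// stopPlan describes what the sketch would do for a given request.
type stopPlan struct {
	PrepShutdown bool           // send the prepare-shutdown dRPC first
	Signal       syscall.Signal // signal used to stop each engine
}

// planStop always selects SIGKILL and only requests the prepare-shutdown
// step for a controlled stop (assumed here: no force, no ranks given).
func planStop(req stopRequest) stopPlan {
	return stopPlan{
		PrepShutdown: !req.Force && len(req.Ranks) == 0,
		Signal:       stopSignal,
	}
}

func main() {
	reqs := []stopRequest{
		{},                      // controlled shutdown of the whole system
		{Force: true},           // skips the prepare-shutdown step
		{Ranks: []uint32{0, 1}}, // subset of ranks
	}
	for _, req := range reqs {
		p := planStop(req)
		fmt.Printf("force=%v ranks=%v -> prep=%v signal=%v\n",
			req.Force, req.Ranks, p.PrepShutdown, p.Signal)
	}
}
```

Keeping the decision in one pure function makes the three cases easy to cover in a unit test without starting any engine processes.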

Allow-unstable-test: true
Features: control

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Allow-unstable-test: true
Features: control
Signed-off-by: Tom Nabarro <[email protected]>
@tanabarr tanabarr requested review from a team as code owners January 27, 2025 23:08

Errors are Unable to load ticket data
https://daosio.atlassian.net/browse/DAOS-16312

@daosbuild1
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15799/1/testReport/

@@ -177,7 +177,7 @@ func (cmd *systemEraseCmd) Execute(_ []string) error {
 // systemStopCmd is the struct representing the command to shutdown DAOS system.
 type systemStopCmd struct {
 	baseRankListCmd
-	Force bool `long:"force" description:"Force stop DAOS system members"`
+	Force bool `long:"force" description:"Currently ignored"`
Contributor

IMO this changed line isn't helpful. It will have to be changed back, and it doesn't really tell the admin anything useful. "Oh, it's ignored? So then force stop doesn't work? Well then, how do I forcibly stop the system?"

You can see how this change may have the opposite effect to what you intended... I think the description should be the same and the flag should just be a no-op so that everyone doesn't have to change their scripts.

Contributor Author

Ok, I will revert the change

@@ -191,7 +191,8 @@ func (cmd *systemStopCmd) Execute(_ []string) (errOut error) {
 	if err := cmd.validateHostsRanks(); err != nil {
 		return err
 	}
-	req := &control.SystemStopReq{Force: cmd.Force}
+	// DAOS-16312: Always use force when stopping ranks.
+	req := &control.SystemStopReq{Force: true}
Contributor

This is not the best place to make this change. It means that only dmg users will benefit from it. Control API users will not. Better to just set it in the SystemStop RPC invoker. As an added benefit, changing it there will minimize the blast radius of this change, so that you don't have to modify the dmg tests.
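
To illustrate the placement suggested here, a rough sketch of applying the override once in an API-level invoker, so that every caller picks it up and not only dmg. The types below are simplified, hypothetical stand-ins: `SystemStopReq` only mirrors the `Force` field seen in the diff, and `sendFn` and `systemStop` are invented so the sketch runs without a server.

```go
package main

import "fmt"

// SystemStopReq is a simplified stand-in for control.SystemStopReq.
type SystemStopReq struct {
	Force bool
	Ranks []uint32
}

// sendFn is a hypothetical transport hook used in place of a real RPC.
type sendFn func(SystemStopReq) error

// systemStop sketches an invoker that forces the stop for every caller.
// The request is passed by value, so the caller's copy is left untouched.
func systemStop(req SystemStopReq, send sendFn) error {
	// DAOS-16312: always use force when stopping ranks.
	req.Force = true
	return send(req)
}

func main() {
	var sent SystemStopReq
	record := func(r SystemStopReq) error {
		sent = r
		return nil
	}

	// The caller did not ask for force; the invoker sets it anyway.
	if err := systemStop(SystemStopReq{Force: false}, record); err != nil {
		fmt.Println("stop failed:", err)
		return
	}
	fmt.Println("force sent to server:", sent.Force)
}
```

With the override in one place, command-line tests that check how the flag is parsed would not need to change, which is the reduced blast radius the comment refers to.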

Contributor Author

ok, done

@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15799/1/testReport/

Features: control
Signed-off-by: Tom Nabarro <[email protected]>
@tanabarr tanabarr requested a review from mjmac January 28, 2025 13:38
Comment on lines -170 to -173
-	signal := syscall.SIGINT
-	if req.Force {
-		signal = syscall.SIGKILL
-	}
Contributor

Why make all of these changes? You are (or someone else is) just going to have to change everything back later. This could have been a one-line change, maybe with some extra comments. What you could do is define a const, e.g. DefaultStopSignal, and then when things change back you only need to change it in one place.
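
A minimal sketch of the const-based alternative described in this comment, assuming the removed lines are kept and only the default changes. `DefaultStopSignal` is the name proposed above; `signalFor` is a hypothetical helper wrapping the original selection logic.

```go
package main

import (
	"fmt"
	"syscall"
)

// DefaultStopSignal is the signal used when force is not requested.
// Setting it back to syscall.SIGINT would restore the earlier behaviour
// with a one-line change.
const DefaultStopSignal = syscall.SIGKILL

// signalFor keeps the original shape of the selection logic while
// reading the non-forced default from a single constant.
func signalFor(force bool) syscall.Signal {
	if force {
		return syscall.SIGKILL
	}
	return DefaultStopSignal
}

func main() {
	fmt.Println("force=false ->", signalFor(false))
	fmt.Println("force=true  ->", signalFor(true))
}
```

This covers only the signal choice; it does not address whether the prepare-shutdown step still runs for a controlled stop.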

Contributor Author

Because I'm not convinced it should be that simple. If we simply force all the time then we also break the call to "ds_pool_disable_exclude()" which is required for controlled shutdown as discussed here. I'm waiting for response from those that initially requested that feature and in the meantime will push both solutions.

Contributor Author

The simple version that you suggest is: #15803

Contributor Author

Looks like this is the version @gnailzenh is in favour of where prep_shutdown/disable_exclude behaviour is preserved for the non-force and no-ranks-specified dmg system stop controlled shutdown case.

Contributor Author

@mjmac can we go with this one as an urgent fix please?

@tanabarr
Contributor Author

build 4 triggered at P2 with allow unstable pragma after build 3 failed NLT memcheck with unrelated issues

@daosbuild1
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15799/4/testReport/

@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15799/4/testReport/

knard38 previously approved these changes Jan 29, 2025
Contributor

@knard38 knard38 left a comment

LGTM

@tanabarr
Contributor Author

Not all hardware stages started in run 4, restart from stage hardware test -> run 5.

@tanabarr
Contributor Author

Gatekeeper please use PR title and description in commit message when landing, TIA.

@mjmac
Contributor

mjmac commented Jan 29, 2025

@tanabarr: It looks like run #4 had a ftest failure that needs to be addressed (can't copy/paste from the remote desktop session), in the FTEST_control.ControlLogEntry test.

Test-tag: vm,ControlLogEntry
Allow-unstable-test: true
Signed-off-by: Tom Nabarro <[email protected]>
Contributor

@daltonbohning daltonbohning left a comment

ftest LGTM

@tanabarr
Contributor Author

build 7 test only fix should be sufficient with build 5&6 results to get this landed, what say you @daltonbohning
@mjmac @kjacque @knard38 can I get reviews please? urgent priority issue

@daltonbohning
Contributor

> build 7 test only fix should be sufficient with build 5&6 results to get this landed, what say you @daltonbohning
> @mjmac @kjacque @knard38 can I get reviews please? urgent priority issue

  • Build 4
    • NLT valgrind failures - I don't know if related
    • Failed ControlLogEntry - being fixed + ran in Build 7
    • Failed HW Medium - Build 5 passed this
  • Build 5 - only ran HW Medium with pr and control - all pass
  • Build 6 - Aborted before build stages
  • Build 7 - Presumably will only run the ControlLogEntry test

Assuming Build 7 passes, that just leaves the NLT valgrind failure in Build 4:
https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-15799/4/pipeline
Which probably isn't related?

@tanabarr tanabarr self-assigned this Jan 29, 2025
@tanabarr tanabarr added control-plane work on the management infrastructure of the DAOS Control Plane usability Changes specific to user facing tools or behaviour. labels Jan 29, 2025
@tanabarr
Contributor Author

The PR should be nearly ready to land after build 7: Cedric approved, and there has been only one small test fix since he did. Mike has also reviewed. The NLT memcheck failure is definitely unrelated as no C code changed. @daltonbohning @phender thoughts?

@tanabarr tanabarr requested a review from a team January 30, 2025 10:23
@tanabarr tanabarr added the forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. label Jan 30, 2025
@tanabarr
Contributor Author

@phender @daltonbohning ControlLogEntry passed on build 7. The NLT failures all seem to be pre-existing and not caused by this PR: https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15799/7/NLT_server/
Can you see if this can be landed?
