DAOS-16312 control: Always use --force for dmg system stop #15799
base: master
Allow-unstable-test: true Features: control Signed-off-by: Tom Nabarro <[email protected]>
Errors are Unable to load ticket data
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15799/1/testReport/
src/control/cmd/dmg/system.go
@@ -177,7 +177,7 @@ func (cmd *systemEraseCmd) Execute(_ []string) error {
 // systemStopCmd is the struct representing the command to shutdown DAOS system.
 type systemStopCmd struct {
 	baseRankListCmd
-	Force bool `long:"force" description:"Force stop DAOS system members"`
+	Force bool `long:"force" description:"Currently ignored"`
IMO this changed line isn't helpful. It will have to be changed back, and it doesn't really tell the admin anything useful. "Oh, it's ignored? So then force stop doesn't work? Well then, how do I forcibly stop the system?"
You can see how this change may have the opposite effect to what you intended... I think the description should be the same and the flag should just be a no-op so that everyone doesn't have to change their scripts.
Ok, I will revert the change
src/control/cmd/dmg/system.go
@@ -191,7 +191,8 @@ func (cmd *systemStopCmd) Execute(_ []string) (errOut error) {
 	if err := cmd.validateHostsRanks(); err != nil {
 		return err
 	}
-	req := &control.SystemStopReq{Force: cmd.Force}
+	// DAOS-16312: Always use force when stopping ranks.
+	req := &control.SystemStopReq{Force: true}
This is not the best place to make this change. It means that only dmg users will benefit from it. Control API users will not. Better to just set it in the SystemStop RPC invoker. As an added benefit, changing it there will minimize the blast radius of this change, so that you don't have to modify the dmg tests.
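A minimal sketch of what this suggestion could look like, assuming simplified stand-in types: `SystemStopReq` here carries only the `Force` field seen in the diff, and `invokeSystemStop` and its `send` callback are hypothetical names, not the real control API. The point is only that overriding the flag at the invoker covers every caller, so the dmg command and its tests can keep passing `cmd.Force` through unchanged.

```go
package main

import (
	"errors"
	"fmt"
)

// SystemStopReq is a simplified stand-in for the control API request.
// Only the Force field is taken from the diff; the rest is assumed.
type SystemStopReq struct {
	Force bool
}

// invokeSystemStop is a hypothetical stand-in for the SystemStop RPC invoker.
// It overrides Force in one place so that dmg and other control API callers
// all get the same behaviour.
func invokeSystemStop(req *SystemStopReq, send func(SystemStopReq) error) error {
	if req == nil {
		return errors.New("nil request")
	}
	out := *req      // copy, so the caller's request is not mutated
	out.Force = true // DAOS-16312: always use force when stopping ranks
	return send(out)
}

func main() {
	req := &SystemStopReq{Force: false}
	err := invokeSystemStop(req, func(r SystemStopReq) error {
		fmt.Printf("sending stop request, force=%v\n", r.Force)
		return nil
	})
	fmt.Println("caller's flag unchanged:", !req.Force, "err:", err)
}
```

Copying the request before overriding the flag keeps the caller-visible value intact, which is why existing dmg tests that check the request they built would not need to change in this sketch.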
ok, done
Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15799/1/testReport/
Features: control Signed-off-by: Tom Nabarro <[email protected]>
signal := syscall.SIGINT
if req.Force {
	signal = syscall.SIGKILL
}
Why make all of these changes? You are (or someone else is) just going to have to change everything back later. This could have been a one-line change, maybe with some extra comments. What you could do is define a const, e.g. DefaultStopSignal, and then when things change back you only need to change it in one place.
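For illustration, the const approach might look roughly like the sketch below. `DefaultStopSignal` is the name proposed in the comment; the helper `stopSignal` is hypothetical, and the snippet assumes a Unix-like platform where `syscall.SIGKILL` is defined.

```go
package main

import (
	"fmt"
	"syscall"
)

// DefaultStopSignal is the signal used for a non-forced stop. While
// DAOS-16312 is in effect it is SIGKILL; reverting later would mean
// changing only this line (for example back to syscall.SIGINT).
const DefaultStopSignal = syscall.SIGKILL

// stopSignal is a hypothetical helper that picks the signal for a stop request.
func stopSignal(force bool) syscall.Signal {
	if force {
		return syscall.SIGKILL
	}
	return DefaultStopSignal
}

func main() {
	fmt.Println("non-forced stop uses:", stopSignal(false))
	fmt.Println("forced stop uses:", stopSignal(true))
}
```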
Because I'm not convinced it should be that simple. If we simply force all the time then we also break the call to "ds_pool_disable_exclude()" which is required for controlled shutdown as discussed here. I'm waiting for response from those that initially requested that feature and in the meantime will push both solutions.
The simple version that you suggest is: #15803
Looks like this is the version @gnailzenh is in favour of, where the prep_shutdown/disable_exclude behaviour is preserved for the controlled shutdown case, i.e. dmg system stop without --force and with no ranks specified.
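As a rough sketch of the behaviour described in this thread (not the actual implementation): the prep-shutdown step is kept only for the controlled shutdown case, while the engine is always stopped with SIGKILL. The `stopPlan` type, the `planStop` helper and the `[]uint32` rank list are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"syscall"
)

// stopPlan is a hypothetical summary of what a stop request would do.
type stopPlan struct {
	PrepShutdown bool           // send the prep-shutdown dRPC first
	Signal       syscall.Signal // signal used to stop the engine process
}

// planStop sketches the decision discussed in the review: prep shutdown is
// preserved only when neither force nor specific ranks were requested, and
// the engine process is always stopped with SIGKILL.
func planStop(force bool, ranks []uint32) stopPlan {
	return stopPlan{
		PrepShutdown: !force && len(ranks) == 0,
		Signal:       syscall.SIGKILL,
	}
}

func main() {
	fmt.Printf("controlled shutdown: %+v\n", planStop(false, nil))
	fmt.Printf("forced stop:         %+v\n", planStop(true, nil))
	fmt.Printf("stop of two ranks:   %+v\n", planStop(false, []uint32{0, 1}))
}
```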
@mjmac can we go with this one as an urgent fix please?
Build 4 triggered at P2 with allow unstable pragma after build 3 failed NLT memcheck with unrelated issues.
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15799/4/testReport/
Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15799/4/testReport/
LGTM
Not all hardware stages started in run 4, restart from stage hardware test -> run 5.
Gatekeeper please use PR title and description in commit message when landing, TIA.
Test-tag: vm,ControlLogEntry Allow-unstable-test: true Signed-off-by: Tom Nabarro <[email protected]>
ftest LGTM
Build 7 is a test-only fix and, together with the results from builds 5 and 6, should be sufficient to get this landed. What say you, @daltonbohning?
Assuming Build 7 passes, that just leaves the NLT valgrind failure in Build 4:
PR should be nearly ready to land after build 7: Cedric approved, and there has been only one small test fix since he did. Mike has also reviewed. The NLT memcheck failure is definitely unrelated, as no C code changed. @daltonbohning @phender thoughts?
@phender @daltonbohning ControlLogEntry passed on build 7. The NLT failures all seem to be pre-existing and not caused by this PR: https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15799/7/NLT_server/
Whenever stopping an engine process from within the control-plane, use
SIGKILL rather than asking nicely (SIGTERM). This has been requested
to try to avoid situations that could result in data loss.
This change preserves the behaviour where ds_mgmt_drpc_prep_shutdown()
and then ds_pool_disable_exclude() will be called during a controlled
shutdown where dmg system stop is called without other arguments.
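To make the "SIGKILL rather than asking nicely" part concrete, here is a small standalone Go sketch (Unix-only, and not DAOS code) that starts a child process, stops it with SIGKILL, and confirms from the wait status that the signal is what terminated it. The helper name `runAndKill` and the use of `sleep` as the child are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"syscall"
)

// killSignal is the signal used to stop the child; SIGKILL cannot be caught
// or ignored, so the child gets no chance to run shutdown handlers.
const killSignal = syscall.SIGKILL

// runAndKill is a hypothetical helper: it starts the named command, sends it
// killSignal, reaps it, and returns the signal that terminated it.
func runAndKill(name string, args ...string) (syscall.Signal, error) {
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return 0, err
	}
	if err := cmd.Process.Signal(killSignal); err != nil {
		return 0, err
	}
	err := cmd.Wait() // reap the child; a killed child yields an ExitError
	var exitErr *exec.ExitError
	if !errors.As(err, &exitErr) {
		return 0, fmt.Errorf("expected the child to exit abnormally, got: %v", err)
	}
	ws, ok := exitErr.Sys().(syscall.WaitStatus)
	if !ok || !ws.Signaled() {
		return 0, errors.New("child was not terminated by a signal")
	}
	return ws.Signal(), nil
}

func main() {
	sig, err := runAndKill("sleep", "60")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("child terminated by SIGKILL:", sig == killSignal)
}
```

Because the child cannot intercept SIGKILL, any orderly shutdown work has to happen before the signal is sent, which is why the description keeps the prep-shutdown call for the controlled shutdown case.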
Notable behavior changes with this PR:
options supplied.
during “controlled” shutdown where dmg system stop is called without
options.
attempting to shutdown an entire cluster without triggering rebuilds.
shutdown dRPC to each rank during dmg system stop
Allow-unstable-test: true
Features: control
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: