DAOS-14225 control: Ignore duplicate call to SetRank #13169
LGTM. No errors found by checkpatch.
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13169/2/testReport/
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13169/3/testReport/
afe722e to 7f9d5c6
LGTM. No errors found by checkpatch.
Test stage NLT on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13169/4/execution/node/839/log
38721ba to 484562b
Squashed @knard-intel's #13308 into this PR so that CI can pass PoolCreateCapacityTests with the combined changes. GATEKEEPER: Please use the PR title and description as the commit message when merging with master. Thanks in advance.
src/control/server/instance.go (outdated)
	if ei.IsReady() {
		ei.log.Errorf("SetupRank called on an already set-up instance %d", ei.Index())
		return nil
	}
Does this PR address the race condition, or do we just log an error on duplicate calls?
Duplicate SetRank() calls are ignored and logged as an error to enable further debugging.
Should it be an error or a warning? I'm just thinking from the perspective of an admin wondering why they see this error if it's not really an issue.
I will change it to debug. @daltonbohning, would you mind if I moved the fix for this to #13385 so as not to hold up CI progress? It's been running for a while.
Sure, not an issue from me
I've updated the log level for this entry to Debug in this PR.
LGTM. No errors found by checkpatch.
Go and doc changes LGTM
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13169/5/execution/node/1420/log
@phender please could you review the functional test part of this PR.
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13169/6/execution/node/354/log
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13169/6/execution/node/446/log
Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13169/15/testReport/
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13169/16/execution/node/686/log
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13169/16/execution/node/732/log
My last commits were lost with your force push (from what I understand).
Integrate reviewers' comments:
- Add message in exception
- Fix use of min()
- Fix spelling issue
Features: pool
Required-githooks: true
Signed-off-by: Cedric Koch-Hofer <[email protected]>

Increase test timeout with the time needed to destroy all the pools.
Features: pool
Required-githooks: true
Signed-off-by: Cedric Koch-Hofer <[email protected]>

Integrate reviewers' comments:
- Strengthen and simplify the pools creation step
- Check minimal quantity of pools
Skip-func-hw-medium-md-on-ssd: false
Skip-func-hw-medium-verbs-provider-md-on-ssd: false
Skip-func-hw-large-md-on-ssd: false
Features: control pool
Required-githooks: true
Signed-off-by: Cedric Koch-Hofer <[email protected]>

Fix spelling and coding standard issue.
Skip-func-hw-medium-md-on-ssd: false
Skip-func-hw-medium-verbs-provider-md-on-ssd: false
Skip-func-hw-large-md-on-ssd: false
Features: control pool
Required-githooks: true
Signed-off-by: Cedric Koch-Hofer <[email protected]>
I'm very sorry about that, I'm not sure how that happened.
No problem, hopefully the CI will report fewer errors ;)
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13169/17/execution/node/1479/log
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13169/17/execution/node/1417/log
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13169/17/testReport/
CI results for run no. 17 with control and pool features failed for the following known issues:
Requesting forced landing.
GO part LGTM
…trank-twice-fail-simple Required-githooks: true
Integrating reviewers' comments:
- Remove useless custom tearDown()
- Prefer local variable over class attributes
- Code refactoring of the pools creation loop
Skip-func-hw-medium-md-on-ssd: false
Skip-func-hw-medium-verbs-provider-md-on-ssd: false
Skip-func-hw-large-md-on-ssd: false
Features: control pool
Required-githooks: true
Signed-off-by: Cedric Koch-Hofer <[email protected]>

Fix pylint and spell checks.
Skip-func-hw-medium-md-on-ssd: false
Skip-func-hw-medium-verbs-provider-md-on-ssd: false
Skip-func-hw-large-md-on-ssd: false
Features: control pool
Required-githooks: true
Signed-off-by: Cedric Koch-Hofer <[email protected]>

Removing useless constant.
Skip-func-hw-medium-md-on-ssd: false
Skip-func-hw-medium-verbs-provider-md-on-ssd: false
Skip-func-hw-large-md-on-ssd: false
Features: control pool
Required-githooks: true
Signed-off-by: Cedric Koch-Hofer <[email protected]>
Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13169/20/testReport/
Tests occasionally fail during server start-up because of duplicate
SetRank calls in the control plane. The offending call comes from a
race on rank 0, which bootstraps the MS and manually triggers SetRank
rather than waiting for the NotifyReady callback mechanism triggered
by a dRPC from the engine.
A previous attempt to remove the special case for rank 0 resulted in
the failure that is presumed to be the reason for implementing the
special case in the first place. This fix instead returns early from
the SetRank function if a rank has already been set on the engine
instance, while retaining the special-case rank-setting process for
the bootstrapping rank.
DAOS-14528 test: Fix PoolCreateCapacityTests for md-on-ssd (#13308)
Miscellaneous fixes allowing to run the test for md-on-scm or md-on-ssd.
Co-authored-by: Cedric Koch-Hofer [email protected]
Signed-off-by: Tom Nabarro [email protected]
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: