Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-14225 control: Ignore duplicate call to SetRank #13169

Merged
merged 12 commits into from
Jan 24, 2024

Conversation

tanabarr
Copy link
Contributor

@tanabarr tanabarr commented Oct 12, 2023

Tests are occasionally failing during server start-up due to duplicate
SetRank calls in the control-plane. The reason for the offending call
is related to a race existing on rank 0 which bootstraps the MS and
manually triggers SetRank rather than waiting for the NotifyReady
callback mechanism triggered by a dRPC from the engine.

The previous attempt to remove the special case for rank 0 resulted in
the failure that is presumed to be the reason for implementing the
special case in the first place. This subsequent fix attempt simply
returns early from SetRank function if a rank has already been set on
the engine instance but retains the special case rank setting process
for the bootstrapping rank.

DAOS-14528 test: Fix PoolCreateCapacityTests for md-on-ssd (#13308)

Miscellaneous fixes allowing to run the test for md-on-scm or md-on-ssd.

Co-authored-by: Cedric Koch-Hofer [email protected]
Signed-off-by: Tom Nabarro [email protected]

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follows the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate watchers.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

@tanabarr tanabarr requested review from a team as code owners October 12, 2023 11:09
@tanabarr tanabarr requested review from mjmac and removed request for a team October 12, 2023 11:09
@tanabarr tanabarr self-assigned this Oct 12, 2023
@github-actions
Copy link

github-actions bot commented Oct 12, 2023

Bug-tracker data:
Ticket title is 'pool/create_capacity.py: engine rank 0 killed by local daos_server unexpectly'
Status is 'Resolved'
Labels: 'triaged'
https://daosio.atlassian.net/browse/DAOS-14225

@tanabarr tanabarr added bug control-plane work on the management infrastructure of the DAOS Control Plane labels Oct 12, 2023
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13169/2/testReport/

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13169/3/testReport/

@tanabarr tanabarr force-pushed the tanabarr/control-setrank-twice-fail-simple branch from afe722e to 7f9d5c6 Compare November 29, 2023 13:29
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@daosbuild1
Copy link
Collaborator

@tanabarr tanabarr requested a review from a team as a code owner November 30, 2023 11:52
@tanabarr tanabarr force-pushed the tanabarr/control-setrank-twice-fail-simple branch from 38721ba to 484562b Compare November 30, 2023 12:03
@tanabarr tanabarr changed the title DAOS-14225 control: Ignore duplicate call to SetRank DAOS-14225 control,test: Ignore duplicate call to SetRank Nov 30, 2023
@tanabarr tanabarr requested review from kjacque and knard38 November 30, 2023 12:04
@tanabarr
Copy link
Contributor Author

Squashed @knard-intel 's #13308 into this so we can get CI PoolCreateCapacityTests passing with the combined changes.
Have pushed with features pool to ensure coverage of the SetRank change as well as PoolCreateCapacityTests.

GATEKEEPER: Please use the PR title and description as the commit message when merging with master, thanks in advance.

Comment on lines 277 to 281
if ei.IsReady() {
ei.log.Errorf("SetupRank called on an already set-up instance %d", ei.Index())
return nil
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this PR address the race condition, or do we just log an error on duplicate calls?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate SetRank() calls get ignored and logged as an error to enable further debug.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be an error or warning? I'm just thinking from the perspective of an Admin wondering why they see this error if it's not really an issue

Copy link
Contributor Author

@tanabarr tanabarr Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will change to debug, @daltonbohning would you mind if I moved the fix for this to #13385 so as to not hold up CI progress, it's been running for a while.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, not an issue from me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the log level for this entry to Debug in this PR

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@tanabarr tanabarr changed the title DAOS-14225 control,test: Ignore duplicate call to SetRank DAOS-14225 control: Ignore duplicate call to SetRank Dec 4, 2023
kjacque
kjacque previously approved these changes Dec 5, 2023
Copy link
Contributor

@kjacque kjacque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go and doc changes LGTM

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13169/5/execution/node/1420/log

@knard38 knard38 requested a review from phender December 7, 2023 15:11
@knard38
Copy link
Contributor

knard38 commented Dec 7, 2023

@phender please could you review the functional test part of this PR.
From what I understood, you historically significantly contribute to the fixed test.
Thanks in advance.

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13169/6/execution/node/354/log

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13169/6/execution/node/446/log

@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13169/15/testReport/

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13169/16/execution/node/686/log

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13169/16/execution/node/732/log

@knard38
Copy link
Contributor

knard38 commented Jan 15, 2024

@phender @knard-intel I've rebased this PR on #13594 to attempt to fix the pool create failures which I believe are due to the NVMe storage query key lookup issue. Also running with MD-on-SSD coverage and pool features.

My last commits were lost with your force push (from what I understand).
I am going to add them and restart the CI.

Integrate reviewers comments:
- Add message in exception
- fix use of min()
- Fix spelling issue

Features: pool
Required-githooks: true
Signed-off-by: Cedric Koch-Hofer <[email protected]>
Increase test timeout with the time needed to destory all the pools.

Features: pool
Required-githooks: true
Signed-off-by: Cedric Koch-Hofer <[email protected]>
Integrate reviewers comments:
- Strengthen and simplify the pools creation step
- Check minimal quantity of pools

Skip-func-hw-medium-md-on-ssd: false
Skip-func-hw-medium-verbs-provider-md-on-ssd: false
Skip-func-hw-large-md-on-ssd: false
Features: control pool
Required-githooks: true
Signed-off-by: Cedric Koch-Hofer <[email protected]>
Fix spelling and coding standard issue.

Skip-func-hw-medium-md-on-ssd: false
Skip-func-hw-medium-verbs-provider-md-on-ssd: false
Skip-func-hw-large-md-on-ssd: false
Features: control pool
Required-githooks: true
Signed-off-by: Cedric Koch-Hofer <[email protected]>
@tanabarr
Copy link
Contributor Author

@phender @knard-intel I've rebased this PR on #13594 to attempt to fix the pool create failures which I believe are due to the NVMe storage query key lookup issue. Also running with MD-on-SSD coverage and pool features.

My last commits were lost with your force push (from what I understand). I am going to add them and restart the CI.

I'm very sorry about that, I'm not sure how that happened.

@knard38
Copy link
Contributor

knard38 commented Jan 15, 2024

@phender @knard-intel I've rebased this PR on #13594 to attempt to fix the pool create failures which I believe are due to the NVMe storage query key lookup issue. Also running with MD-on-SSD coverage and pool features.

My last commits were lost with your force push (from what I understand). I am going to add them and restart the CI.

I'm very sorry about that, I'm not sure how that happened.

No problem, hopefully the CI will report less errors ;)

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13169/17/execution/node/1479/log

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13169/17/execution/node/1417/log

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13169/17/testReport/

@tanabarr
Copy link
Contributor Author

CI results for run no. 17 with control and pool features failed for the following known issues:

Requesting forced landing

@tanabarr tanabarr added the forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. label Jan 16, 2024
knard38
knard38 previously approved these changes Jan 19, 2024
Copy link
Contributor

@knard38 knard38 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GO part LGTM

…trank-twice-fail-simple

Required-githooks: true
Integrating reviewers comments:
- Remove useless custom tearDown()
- Prefer local variable over class attributes
- Code refactoring of the pools creation loop

Skip-func-hw-medium-md-on-ssd: false
Skip-func-hw-medium-verbs-provider-md-on-ssd: false
Skip-func-hw-large-md-on-ssd: false
Features: control pool
Required-githooks: true
Signed-off-by: Cedric Koch-Hofer <[email protected]>
Fix pylint and speel checks.

Skip-func-hw-medium-md-on-ssd: false
Skip-func-hw-medium-verbs-provider-md-on-ssd: false
Skip-func-hw-large-md-on-ssd: false
Features: control pool
Required-githooks: true
Signed-off-by: Cedric Koch-Hofer <[email protected]>
Removing useless constant.

Skip-func-hw-medium-md-on-ssd: false
Skip-func-hw-medium-verbs-provider-md-on-ssd: false
Skip-func-hw-large-md-on-ssd: false
Features: control pool
Required-githooks: true
Signed-off-by: Cedric Koch-Hofer <[email protected]>
@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13169/20/testReport/

@tanabarr tanabarr removed the forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. label Jan 24, 2024
@mjmac mjmac merged commit 54c54af into master Jan 24, 2024
50 of 51 checks passed
@mjmac mjmac deleted the tanabarr/control-setrank-twice-fail-simple branch January 24, 2024 21:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug control-plane work on the management infrastructure of the DAOS Control Plane
Development

Successfully merging this pull request may close these issues.

8 participants