DAOS-14654 test: simplify ior_per_rank.py #13346

Merged
merged 13 commits into from
Jan 10, 2024

daltonbohning
Contributor

Test-tag: test_ior_per_rank
Skip-unit-tests: true
Skip-fault-injection-test: true

  • Only run with transfer size 1M
  • Reduce stonewall to 15s

Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate watchers.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

@daltonbohning daltonbohning self-assigned this Nov 15, 2023

Bug-tracker data:
Ticket title is 'ftest: simplify ior_per_rank.py'
Status is 'In Progress'
Labels: 'tds'
https://daosio.atlassian.net/browse/DAOS-14654

Collaborator

@daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daosbuild1
Collaborator

Test stage Build on Leap 15.4 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13346/1/execution/node/384/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13346/1/execution/node/340/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13346/1/execution/node/335/log

@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.4 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13346/1/execution/node/332/log

@daosbuild1
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13346/1/execution/node/408/log

Test-tag: test_ior_per_rank
Skip-unit-tests: true
Skip-fault-injection-test: true

- Only run with transfer size 1M
- Reduce stonewall to 15s

Required-githooks: true

Signed-off-by: Dalton Bohning <[email protected]>

Collaborator

@daosbuild1 left a comment

LGTM. No errors found by checkpatch.

Test-tag: test_ior_per_rank
Skip-unit-tests: true
Skip-fault-injection-test: true

Required-githooks: true

Signed-off-by: Dalton Bohning <[email protected]>

Collaborator

@daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daltonbohning daltonbohning marked this pull request as ready for review November 20, 2023 23:07
@daltonbohning daltonbohning requested a review from a team as a code owner November 20, 2023 23:07
shimizukko previously approved these changes Nov 20, 2023
- transfer_sizes:
-   - 1M
-   - 256B
+ transfer_size: 1M

Contributor

Please update the wiki page: https://daosio.atlassian.net/wiki/spaces/DAOS/pages/11136040981/Running+Rack+Group+Level+Tests
It has a reference to 256B.

Contributor Author

Will do. I will update the wiki after this merges.

    good_node = self.server_managers[0].get_host(rank)
    if ((good_node not in self.good_nodes)
            and (good_node not in self.failed_nodes)):
        self.good_nodes.append(good_node)

Contributor

Can we print the failed nodes' performance? The output lists the failed nodes but not their performance, so someone currently has to go through the job.log to find the write/read performance.

Contributor Author

Yeah, I might as well while I'm here. Will push an update

Contributor Author

Added better reporting, but I'll need to verify it works

@@ -33,52 +33,46 @@ def execute_ior_per_rank(self, rank):
    # create the pool on specified rank.
    self.add_pool(connect=False, target_list=[rank])

Contributor

@rpadma2 Nov 20, 2023

I noticed that creating the container outside of IOR avoids the pool destroy failures. I think I tried something like this:

    self.add_pool(connect=False, target_list=[rank])
    self.add_container(self.pool)

... and on the IOR commands create_cont is always set to False:

    try:
        self.ior_cmd.transfer_size.update(transfer_size)
        self.ior_cmd.flags.update(self.write_flags)
        dfs_out = self.run_ior_with_pool(create_cont=False, fail_on_warning=self.log.info)
        dfs_perf_write = IorCommand.get_ior_metrics(dfs_out)
        self.ior_cmd.flags.update(self.read_flags)
        dfs_out = self.run_ior_with_pool(create_cont=False, fail_on_warning=self.log.info)

With this approach, I did not see pool destroy problems at large scale. It looks like creating a container inside IOR for each rank, and then destroying the containers and the pool, has some issues in large-scale testing.

Contributor Author

Ah! Seems another issue with pydaos handling. If we let run_ior_with_pool create the container, it calls pool.connect(). I'll have the test create it here, which I think is a better workflow anyway. Thanks for finding that!
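
The ordering discussed in this thread can be sketched in isolation. The snippet below is only an illustration: `FakePool`, `run_ior`, and `ior_per_rank` are hypothetical stand-ins that record calls in a list, not the DAOS test framework API, and the `create_cont=True` branch simply mirrors the author's description that letting the IOR helper create the container leads to a `pool.connect()` call.

```python
class FakePool:
    """Hypothetical stand-in for a test pool; it only records calls."""

    def __init__(self, rank, calls):
        self.calls = calls
        self.calls.append(f"pool.create(rank={rank})")

    def destroy(self):
        self.calls.append("pool.destroy()")


def run_ior(calls, flags, create_cont):
    """Stand-in for an IOR run; a real test would launch IOR here."""
    if create_cont:
        # Assumed from the thread: this path ends up connecting to the pool.
        calls.append("pool.connect()")
        calls.append("container.create()")
    calls.append(f"ior({flags})")


def ior_per_rank(rank):
    """Create the container once in the test, then run write and read."""
    calls = []
    pool = FakePool(rank, calls)
    calls.append("container.create()")  # created by the test, not by IOR
    run_ior(calls, "write", create_cont=False)
    run_ior(calls, "read", create_cont=False)
    pool.destroy()
    return calls


calls = ior_per_rank(3)
print(calls)
```

With the container created up front, neither IOR run takes the `create_cont` branch, so no `pool.connect()` appears in the recorded sequence.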

Test-tag: test_ior_per_rank
Test-repeat: 2
Skip-unit-tests: true
Skip-fault-injection-test: true

Required-githooks: true

Signed-off-by: Dalton Bohning <[email protected]>

Collaborator

@daosbuild1 left a comment

LGTM. No errors found by checkpatch.

Test-tag: test_ior_per_rank
Test-repeat: 2
Skip-unit-tests: true
Skip-fault-injection-test: true

Required-githooks: true

Signed-off-by: Dalton Bohning <[email protected]>

Collaborator

@daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daosbuild1
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13346/5/testReport/

@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13346/5/testReport/

@daltonbohning
Contributor Author

@rpadma2 This is an example of what failures look like now. It gives the host, rank, bandwidth, and percent diff
https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-13346/6/artifact/Functional%20Hardware%20Medium/deployment/ior_per_rank.py/repeat001/job.log

2023-11-29 09:31:02,415 ior_per_rank     L0137 INFO | List of failed nodes with corresponding ranks
2023-11-29 09:31:02,415 ior_per_rank     L0140 INFO | wolf-138: rank 1 low write perf. BW: 2105.71/2250.00; percent diff: 0.68/0.50
2023-11-29 09:31:02,415 ior_per_rank     L0140 INFO | wolf-139: oops. test
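
For readers who want to reproduce the shape of that failure line, here is a minimal sketch of a formatter. The function name `format_low_perf` and its parameters are hypothetical, not taken from ior_per_rank.py; the values are passed in as-is rather than computed, since the log excerpt does not show how the percent diff is derived.

```python
def format_low_perf(host, rank, op, measured_bw, expected_bw, pct_diff, pct_threshold):
    """Build one failure line in the style of the job.log excerpt above."""
    return (
        f"{host}: rank {rank} low {op} perf. "
        f"BW: {measured_bw:.2f}/{expected_bw:.2f}; "
        f"percent diff: {pct_diff:.2f}/{pct_threshold:.2f}"
    )


# Values copied from the wolf-138 line in the log excerpt.
line = format_low_perf("wolf-138", 1, "write", 2105.71, 2250.0, 0.68, 0.5)
print(line)
```

Keeping the measured and reference values side by side in one line is what lets a reviewer read the result without opening the full job.log.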

Test-tag: test_ior_per_rank
Test-repeat: 2
Skip-unit-tests: true
Skip-fault-injection-test: true

Required-githooks: true

Signed-off-by: Dalton Bohning <[email protected]>

Collaborator

@daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daosbuild1
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13346/7/testReport/

@daltonbohning
Contributor Author

Test passed, except for the MD on SSD stage which doesn't have any coverage outside of PRs.
https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-13346/7/tests

@daltonbohning daltonbohning marked this pull request as ready for review December 1, 2023 14:38
@daltonbohning
Contributor Author

@rpadma2 Do the current changes look okay to you?

@daltonbohning daltonbohning requested a review from a team December 6, 2023 23:33
@daltonbohning daltonbohning added the forced-landing label ("The PR has known failures or has intentionally reduced testing, but should still be landed.") Dec 6, 2023
Test-tag: test_ior_per_rank
Test-repeat: 2
Skip-unit-tests: true
Skip-fault-injection-test: true

Required-githooks: true
@daltonbohning daltonbohning requested review from rpadma2 and shimizukko and removed request for a team January 3, 2024 18:04
@daltonbohning daltonbohning requested a review from a team January 4, 2024 15:10
Test-tag: test_ior_per_rank
Test-repeat: 2
Skip-unit-tests: true
Skip-fault-injection-test: true

Signed-off-by: Dalton Bohning <[email protected]>
Test-tag: test_ior_per_rank
Test-repeat: 2
Skip-unit-tests: true
Skip-fault-injection-test: true

Required-githooks: true

Signed-off-by: Dalton Bohning <[email protected]>
Test-tag: test_ior_per_rank
Test-repeat: 2
Skip-unit-tests: true
Skip-fault-injection-test: true

Required-githooks: true

Signed-off-by: Dalton Bohning <[email protected]>
@phender phender merged commit 4bbfefb into master Jan 10, 2024
30 checks passed
@phender phender deleted the dbohning/daos-14654 branch January 10, 2024 20:57
daltonbohning added a commit that referenced this pull request Jan 23, 2024
- Only run with transfer size 1M
- Reduce stonewall to 15s

Required-githooks: true

Signed-off-by: Dalton Bohning <[email protected]>
phender pushed a commit that referenced this pull request Jan 24, 2024
- Only run with transfer size 1M
- Reduce stonewall to 15s

Signed-off-by: Dalton Bohning <[email protected]>