Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-15066 test: Update dfuse/bash and more to vms. #13631

Merged
merged 9 commits into from
Feb 2, 2024

Conversation

ashleypittman
Copy link
Contributor

Test-tag: test_bashcmd test_bashcmd_ioil test_bashcmd_pil4dfs
Test-repeat: 5

Required-githooks: true

Signed-off-by: Ashley Pittman [email protected]

Test-tag: test_bashcmd test_bashcmd_ioil test_bashcmd_pil4dfs
Test-repeat: 5

Required-githooks: true

Signed-off-by: Ashley Pittman <[email protected]>
Copy link

github-actions bot commented Jan 18, 2024

Bug-tracker data:
Ticket title is 'Overhaul dfuse/bash test and move to VMs.'
Status is 'Open'
https://daosio.atlassian.net/browse/DAOS-15066

@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13631/1/execution/node/1180/log

Copy link
Contributor

@daltonbohning daltonbohning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To work on VMs, you'd want to update the yaml similar to

server_config:
name: daos_server
engines_per_host: 1
engines:
0:
targets: 4
nr_xs_helpers: 0
storage:
0:
class: ram
scm_mount: /mnt/daos
system_ram_reserved: 1

Skip-unit-tests: true

Test-tag: test_bashcmd test_bashcmd_ioil test_bashcmd_pil4dfs
Test-repeat: 5
Required-githooks: true
Signed-off-by: Ashley Pittman <[email protected]>
Required-githooks: true

Signed-off-by: Ashley Pittman <[email protected]>
@ashleypittman ashleypittman marked this pull request as ready for review January 19, 2024 10:00
@ashleypittman ashleypittman requested a review from a team as a code owner January 19, 2024 10:00
@ashleypittman
Copy link
Contributor Author

The only "slow" version of this test was the non-il version which takes about 10 minutes due to caching being disabled. The two il versions take about a minute each.

Moving from hardware to vms doesn't affect run-time and there are more slots available in CI for it. I left the ioil test enabled for PRs but moved the slow test from prs to daily, as well as moving it from hw to vms.

kccain
kccain previously approved these changes Jan 19, 2024
@@ -207,8 +207,8 @@ def test_bashcmd_pil4dfs(self):
commands.

:avocado: tags=all
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question only (no change suggested): is the idea to (at a future point) specify a pr or daily_regression or full_regression tag to this pil4dfs test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. This test should probably be weekly I'd assume, there's good coverage here, it also only takes a minute so we could easily make it daily as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wiliamhuang Was this an oversight in #13257? Currently this test doesn't run in pr, daily, or weekly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend daily to be inline with the other cases

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daltonbohning Thank you very much! I will add "daily" in my other PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to update this PR anyway so I've added the tag.

@daltonbohning
Copy link
Contributor

The only "slow" version of this test was the non-il version which takes about 10 minutes due to caching being disabled. The two il versions take about a minute each.

Moving from hardware to vms doesn't affect run-time and there are more slots available in CI for it. I left the ioil test enabled for PRs but moved the slow test from prs to daily, as well as moving it from hw to vms.

One thing to keep in mind is that servers aren't restarted between each test case. So part of the test time for the first test includes server start/stop. (Maybe a minute?) If one of the other test cases were ran in isolation, they'd probably take another minute. This is mostly a minor consideration when splitting tests in the same file amongst pr, daily, and weekly

daltonbohning
daltonbohning previously approved these changes Jan 19, 2024
@ashleypittman ashleypittman requested a review from a team January 19, 2024 18:00
jolivier23
jolivier23 previously approved these changes Jan 19, 2024
@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13631/3/testReport/

@daltonbohning
Copy link
Contributor

@ashleypittman Can we get a run with all three modified tests, please? The changes in recent commits looks sane, but it is possible to break the running of a test if the tags are malformed.
Since this only touches test code, we could run with

Test-tag: dfuse,Cmd
Skip-unit-tests: true
Skip-fault-injection-test: true

Test-tag: dfuse,Cmd
Skip-unit-tests: true
Skip-fault-injection-test: true
@ashleypittman
Copy link
Contributor Author

@ashleypittman Can we get a run with all three modified tests, please? The changes in recent commits looks sane, but it is possible to break the running of a test if the tags are malformed. Since this only touches test code, we could run with

Test-tag: dfuse,Cmd
Skip-unit-tests: true
Skip-fault-injection-test: true

I thought build #3 had that already but I must have missed the tag, re-pushed.

@daltonbohning
Copy link
Contributor

@ashleypittman Can we get a run with all three modified tests, please? The changes in recent commits looks sane, but it is possible to break the running of a test if the tags are malformed. Since this only touches test code, we could run with

Test-tag: dfuse,Cmd
Skip-unit-tests: true
Skip-fault-injection-test: true

I thought build #3 had that already but I must have missed the tag, re-pushed.

A previous commit did, but after e29092b there weren't tags specified.

@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13631/5/testReport/

@daltonbohning
Copy link
Contributor

Looks the fio command timed out
https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-13631/5/artifact/Functional%20on%20EL%208.8/dfuse/bash.py/job.log

2024-01-22 19:10:11,200 general_utils    L0503 INFO | Command: fio --readwrite=randwrite --name=test --size="2M" --directory /tmp/7ABB47A6-72CD-4AE3-A52D-C7C03D6A75F8_daos_dfuse0/ --bs=1M --numjobs="4" --ioengine=psync --group_reporting --exitall_on_error --continue_on_error=none
Results:
  wolf-104vm1: exit_status=None, interrupted=True:    test: (g=0): rw=randwrite, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
    ...
    fio-3.20
    Starting 4 processes
    test: Laying out IO file (1 file / 2MiB)
    test: Laying out IO file (1 file / 2MiB)
    test: Laying out IO file (1 file / 2MiB)
    test: Laying out IO file (1 file / 2MiB)

2024-01-22 19:10:11,200 bash             L0160 ERROR| BashCmd Test Failed: Error running 'fio --readwrite=randwrite --name=test --size="2M" --directory /tmp/7ABB47A6-72CD-4AE3-A52D-C7C03D6A75F8_daos_dfuse0/ --bs=1M --numjobs="4" --ioengine=psync --group_reporting --exitall_on_error --continue_on_error=none' on the following hosts: wolf-104vm1
2024-01-22 19:10:11,201 test             L1362 INFO | Test has failed, dumping ULT stacks

The interrupted=True part meaning it timed out.
IIRC from past experience, the network on VMs can often drop to a trickle of KBs. So 2M very well could take a long time. That might be part of why this test was on HW

@ashleypittman
Copy link
Contributor Author

@ashleypittman Can we get a run with all three modified tests, please? The changes in recent commits looks sane, but it is possible to break the running of a test if the tags are malformed. Since this only touches test code, we could run with

Test-tag: dfuse,Cmd
Skip-unit-tests: true
Skip-fault-injection-test: true

I thought build #3 had that already but I must have missed the tag, re-pushed.

A previous commit did, but after e29092b there weren't tags specified.

It was build #2 that ran and passed the tests, although it then got killed as I re-pushed before other testing had finished.

The latest build failed, it looks to be in less than a minute but no stderr or stack trace info so I'm not sure the cause.

@ashleypittman
Copy link
Contributor Author

The interrupted=True part meaning it timed out. IIRC from past experience, the network on VMs can often drop to a trickle of KBs. So 2M very well could take a long time. That might be part of why this test was on HW

When it runs then it seems to pass in 4-5 seconds and the timeout is set to 30.

If the network on the VMs is that bad then we should take them out of service until we can identify the issue.

Looking at this test it appears to loop for 5 pools and 5 containers, but not if testing ioil, I'm not sure why it would do that but it explains why the dfuse variant is so much slower.

@daltonbohning
Copy link
Contributor

If the network on the VMs is that bad then we should take them out of service until we can identify the issue

It's been a while, so I don't have pointers to past discussion, but I encountered this with some of the datamover tests and ultimately the only "solution" we had was to move to HW.

Maybe we could just reduce the blocksize for that fio command? Since the intention of the test appears to be just testing bash commands over dfuse.

Bump command timeout from 30 to 120 seconds.

Reformat code as removing loop changed the indentation level anyway.

Test-tag: dfuse,Cmd
Skip-unit-tests: true
Skip-fault-injection-test: true
Required-githooks: true

Signed-off-by: Ashley Pittman <[email protected]>
@ashleypittman ashleypittman changed the title DAOS-623 test: Move daos_build test vms. DAOS-623 test: Move dfuse/bash test to vms. Jan 23, 2024
@ashleypittman ashleypittman changed the title DAOS-623 test: Move dfuse/bash test to vms. DAOS-15066 test: Update dfuse/bash and more to vms. Jan 23, 2024
@ashleypittman
Copy link
Contributor Author

Test updated and now passes with all three variants. The total non-il test takes 1m6s vs 8m55s on latest daily on top of moving from hw to vm.

kccain
kccain previously approved these changes Jan 23, 2024
daltonbohning
daltonbohning previously approved these changes Jan 23, 2024
self.pool.destroy()
self.add_pool(connect=False)
self.add_container(self.pool)
mount_dir = f"/tmp/{self.pool.uuid}_daos_dfuse"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-existing so not a blocker, but we prefer to use identifier so the label is used

Suggested change
mount_dir = f"/tmp/{self.pool.uuid}_daos_dfuse"
mount_dir = f"/tmp/{self.pool.identifier}_daos_dfuse"

Comment on lines 137 to 150
ret_code = general_utils.pcmd(self.hostlist_clients, env_str + cmd, timeout=120)
if 0 not in ret_code:
error_hosts = NodeSet(
",".join(
[
str(node_set)
for code, node_set in list(ret_code.items())
if code != 0
]
)
)
raise CommandFailure(
f"Error running '{cmd}' on the following hosts: {error_hosts}"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-existing, but FYI we have newer, better function called run_remote that works like

result = run_remote(self.log, self.hostlist_clients, cmd)
if not result.passed:
    self.fail(f"... failed on {result.failed_hosts}")

Which eliminates this nasty logic to get a nodeset

Test-tag: dfuse,Cmd
Skip-unit-tests: true
Skip-fault-injection-test: true
Required-githooks: true
Signed-off-by: Ashley Pittman <[email protected]>
@ashleypittman ashleypittman dismissed stale reviews from daltonbohning and kccain via b354e2b January 23, 2024 18:01
Test-tag: dfuse,Cmd
Skip-unit-tests: true
Skip-fault-injection-test: true
Required-githooks: true
Signed-off-by: Ashley Pittman <[email protected]>
Copy link
Contributor

@daltonbohning daltonbohning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Comment on lines +59 to +66
mount_dir = f"/tmp/{self.pool.identifier}_daos_dfuse"
self.start_dfuse(self.hostlist_clients, self.pool, self.container, mount_dir=mount_dir)
if il_lib is not None:
# unmount dfuse and mount again with caching disabled
self.dfuse.unmount(tries=1)
self.dfuse.update_params(disable_caching=True)
self.dfuse.update_params(disable_wb_cache=True)
self.dfuse.run()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, pre-existing, so not a blocker. But if we only use one pool now, this can be simplified like this. I don't think there was any real need to unmount dfuse and remount with different params. We could just disable caching to begin with.

Suggested change
mount_dir = f"/tmp/{self.pool.identifier}_daos_dfuse"
self.start_dfuse(self.hostlist_clients, self.pool, self.container, mount_dir=mount_dir)
if il_lib is not None:
# unmount dfuse and mount again with caching disabled
self.dfuse.unmount(tries=1)
self.dfuse.update_params(disable_caching=True)
self.dfuse.update_params(disable_wb_cache=True)
self.dfuse.run()
if il_lib is None:
self.start_dfuse(self.hostlist_clients, self.pool, self.container)
else:
self.start_dfuse(self.hostlist_clients, self.pool, self.container, disable_caching=True, disable_wb_cache=True)

@ashleypittman ashleypittman requested a review from a team January 29, 2024 19:15
@daltonbohning daltonbohning merged commit 132b00d into master Feb 2, 2024
40 checks passed
@daltonbohning daltonbohning deleted the amd/bash-move branch February 2, 2024 17:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

6 participants