DAOS-15066 test: Update dfuse/bash and more to vms. #13631

ashleypittman · 2024-01-18T19:29:52Z

Test-tag: test_bashcmd test_bashcmd_ioil test_bashcmd_pil4dfs
Test-repeat: 5

Required-githooks: true

Signed-off-by: Ashley Pittman [email protected]

Test-tag: test_bashcmd test_bashcmd_ioil test_bashcmd_pil4dfs Test-repeat: 5 Required-githooks: true Signed-off-by: Ashley Pittman <[email protected]>

github-actions · 2024-01-18T19:30:13Z

Bug-tracker data:
Ticket title is 'Overhaul dfuse/bash test and move to VMs.'
Status is 'Open'
https://daosio.atlassian.net/browse/DAOS-15066

daosbuild1 · 2024-01-18T21:01:05Z

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13631/1/execution/node/1180/log

daltonbohning

To work on VMs, you'd want to update the yaml similar to

daos/src/tests/ftest/dfuse/mu_perms.yaml

Lines 5 to 16 in cf10b98

    
           server_config: 
        
             name: daos_server 
        
             engines_per_host: 1 
        
             engines: 
        
               0: 
        
                 targets: 4 
        
                 nr_xs_helpers: 0 
        
                 storage: 
        
                   0: 
        
                     class: ram 
        
                     scm_mount: /mnt/daos 
        
             system_ram_reserved: 1

Skip-unit-tests: true Test-tag: test_bashcmd test_bashcmd_ioil test_bashcmd_pil4dfs Test-repeat: 5 Required-githooks: true Signed-off-by: Ashley Pittman <[email protected]>

Required-githooks: true Signed-off-by: Ashley Pittman <[email protected]>

ashleypittman · 2024-01-19T10:02:31Z

The only "slow" version of this test was the non-il version which takes about 10 minutes due to caching being disabled. The two il versions take about a minute each.

Moving from hardware to vms doesn't affect run-time and there are more slots available in CI for it. I left the ioil test enabled for PRs but moved the slow test from prs to daily, as well as moving it from hw to vms.

kccain · 2024-01-19T13:57:45Z

src/tests/ftest/dfuse/bash.py

@@ -207,8 +207,8 @@ def test_bashcmd_pil4dfs(self):
            commands.

        :avocado: tags=all


question only (no change suggested): is the idea to (at a future point) specify a pr or daily_regression or full_regression tag to this pil4dfs test?

Good question. This test should probably be weekly I'd assume, there's good coverage here, it also only takes a minute so we could easily make it daily as well.

@wiliamhuang Was this an oversight in #13257? Currently this test doesn't run in pr, daily, or weekly

I'd recommend daily to be inline with the other cases

@daltonbohning Thank you very much! I will add "daily" in my other PR.

I had to update this PR anyway so I've added the tag.

daltonbohning · 2024-01-19T16:52:57Z

The only "slow" version of this test was the non-il version which takes about 10 minutes due to caching being disabled. The two il versions take about a minute each.

Moving from hardware to vms doesn't affect run-time and there are more slots available in CI for it. I left the ioil test enabled for PRs but moved the slow test from prs to daily, as well as moving it from hw to vms.

One thing to keep in mind is that servers aren't restarted between each test case. So part of the test time for the first test includes server start/stop. (Maybe a minute?) If one of the other test cases were ran in isolation, they'd probably take another minute. This is mostly a minor consideration when splitting tests in the same file amongst pr, daily, and weekly

daosbuild1 · 2024-01-21T01:42:06Z

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13631/3/testReport/

Required-githooks: true

daltonbohning · 2024-01-22T16:35:14Z

@ashleypittman Can we get a run with all three modified tests, please? The changes in recent commits looks sane, but it is possible to break the running of a test if the tags are malformed.
Since this only touches test code, we could run with

Test-tag: dfuse,Cmd
Skip-unit-tests: true
Skip-fault-injection-test: true

Test-tag: dfuse,Cmd Skip-unit-tests: true Skip-fault-injection-test: true

ashleypittman · 2024-01-22T17:03:47Z

@ashleypittman Can we get a run with all three modified tests, please? The changes in recent commits looks sane, but it is possible to break the running of a test if the tags are malformed. Since this only touches test code, we could run with
Test-tag: dfuse,Cmd
Skip-unit-tests: true
Skip-fault-injection-test: true

I thought build #3 had that already but I must have missed the tag, re-pushed.

daltonbohning · 2024-01-22T17:19:38Z

@ashleypittman Can we get a run with all three modified tests, please? The changes in recent commits looks sane, but it is possible to break the running of a test if the tags are malformed. Since this only touches test code, we could run with
Test-tag: dfuse,Cmd
Skip-unit-tests: true
Skip-fault-injection-test: true
I thought build #3 had that already but I must have missed the tag, re-pushed.

A previous commit did, but after e29092b there weren't tags specified.

daosbuild1 · 2024-01-22T19:14:53Z

Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13631/5/testReport/

daltonbohning · 2024-01-22T19:50:07Z

Looks the fio command timed out
https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-13631/5/artifact/Functional%20on%20EL%208.8/dfuse/bash.py/job.log

2024-01-22 19:10:11,200 general_utils    L0503 INFO | Command: fio --readwrite=randwrite --name=test --size="2M" --directory /tmp/7ABB47A6-72CD-4AE3-A52D-C7C03D6A75F8_daos_dfuse0/ --bs=1M --numjobs="4" --ioengine=psync --group_reporting --exitall_on_error --continue_on_error=none
Results:
  wolf-104vm1: exit_status=None, interrupted=True:    test: (g=0): rw=randwrite, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
    ...
    fio-3.20
    Starting 4 processes
    test: Laying out IO file (1 file / 2MiB)
    test: Laying out IO file (1 file / 2MiB)
    test: Laying out IO file (1 file / 2MiB)
    test: Laying out IO file (1 file / 2MiB)

2024-01-22 19:10:11,200 bash             L0160 ERROR| BashCmd Test Failed: Error running 'fio --readwrite=randwrite --name=test --size="2M" --directory /tmp/7ABB47A6-72CD-4AE3-A52D-C7C03D6A75F8_daos_dfuse0/ --bs=1M --numjobs="4" --ioengine=psync --group_reporting --exitall_on_error --continue_on_error=none' on the following hosts: wolf-104vm1
2024-01-22 19:10:11,201 test             L1362 INFO | Test has failed, dumping ULT stacks

The interrupted=True part meaning it timed out.
IIRC from past experience, the network on VMs can often drop to a trickle of KBs. So 2M very well could take a long time. That might be part of why this test was on HW

ashleypittman · 2024-01-22T19:50:17Z

@ashleypittman Can we get a run with all three modified tests, please? The changes in recent commits looks sane, but it is possible to break the running of a test if the tags are malformed. Since this only touches test code, we could run with
Test-tag: dfuse,Cmd
Skip-unit-tests: true
Skip-fault-injection-test: true
I thought build #3 had that already but I must have missed the tag, re-pushed.
A previous commit did, but after e29092b there weren't tags specified.

It was build #2 that ran and passed the tests, although it then got killed as I re-pushed before other testing had finished.

The latest build failed, it looks to be in less than a minute but no stderr or stack trace info so I'm not sure the cause.

ashleypittman · 2024-01-22T20:03:58Z

The interrupted=True part meaning it timed out. IIRC from past experience, the network on VMs can often drop to a trickle of KBs. So 2M very well could take a long time. That might be part of why this test was on HW

When it runs then it seems to pass in 4-5 seconds and the timeout is set to 30.

If the network on the VMs is that bad then we should take them out of service until we can identify the issue.

Looking at this test it appears to loop for 5 pools and 5 containers, but not if testing ioil, I'm not sure why it would do that but it explains why the dfuse variant is so much slower.

daltonbohning · 2024-01-22T20:19:37Z

If the network on the VMs is that bad then we should take them out of service until we can identify the issue

It's been a while, so I don't have pointers to past discussion, but I encountered this with some of the datamover tests and ultimately the only "solution" we had was to move to HW.

Maybe we could just reduce the blocksize for that fio command? Since the intention of the test appears to be just testing bash commands over dfuse.

Bump command timeout from 30 to 120 seconds. Reformat code as removing loop changed the indentation level anyway. Test-tag: dfuse,Cmd Skip-unit-tests: true Skip-fault-injection-test: true Required-githooks: true Signed-off-by: Ashley Pittman <[email protected]>

ashleypittman · 2024-01-23T14:06:46Z

Test updated and now passes with all three variants. The total non-il test takes 1m6s vs 8m55s on latest daily on top of moving from hw to vm.

src/tests/ftest/dfuse/bash.py

daltonbohning · 2024-01-23T17:05:37Z

src/tests/ftest/dfuse/bash.py

-            self.pool.destroy()
+        self.add_pool(connect=False)
+        self.add_container(self.pool)
+        mount_dir = f"/tmp/{self.pool.uuid}_daos_dfuse"


Pre-existing so not a blocker, but we prefer to use identifier so the label is used

Suggested change

mount_dir = f"/tmp/{self.pool.uuid}_daos_dfuse"

mount_dir = f"/tmp/{self.pool.identifier}_daos_dfuse"

daltonbohning · 2024-01-23T17:08:41Z

src/tests/ftest/dfuse/bash.py

+                ret_code = general_utils.pcmd(self.hostlist_clients, env_str + cmd, timeout=120)
+                if 0 not in ret_code:
+                    error_hosts = NodeSet(
+                        ",".join(
+                            [
+                                str(node_set)
+                                for code, node_set in list(ret_code.items())
+                                if code != 0
+                            ]
+                        )
+                    )
+                    raise CommandFailure(
+                        f"Error running '{cmd}' on the following hosts: {error_hosts}"
+                    )


Pre-existing, but FYI we have newer, better function called run_remote that works like

result = run_remote(self.log, self.hostlist_clients, cmd) if not result.passed: self.fail(f"... failed on {result.failed_hosts}")

Which eliminates this nasty logic to get a nodeset

Test-tag: dfuse,Cmd Skip-unit-tests: true Skip-fault-injection-test: true Required-githooks: true Signed-off-by: Ashley Pittman <[email protected]>

daltonbohning

LGTM.

daltonbohning · 2024-01-29T16:58:20Z

src/tests/ftest/dfuse/bash.py

+        mount_dir = f"/tmp/{self.pool.identifier}_daos_dfuse"
+        self.start_dfuse(self.hostlist_clients, self.pool, self.container, mount_dir=mount_dir)
+        if il_lib is not None:
+            # unmount dfuse and mount again with caching disabled
+            self.dfuse.unmount(tries=1)
+            self.dfuse.update_params(disable_caching=True)
+            self.dfuse.update_params(disable_wb_cache=True)
+            self.dfuse.run()


Again, pre-existing, so not a blocker. But if we only use one pool now, this can be simplified like this. I don't think there was any real need to unmount dfuse and remount with different params. We could just disable caching to begin with.

Suggested change

mount_dir = f"/tmp/{self.pool.identifier}_daos_dfuse"

self.start_dfuse(self.hostlist_clients, self.pool, self.container, mount_dir=mount_dir)

if il_lib is not None:

# unmount dfuse and mount again with caching disabled

self.dfuse.unmount(tries=1)

self.dfuse.update_params(disable_caching=True)

self.dfuse.update_params(disable_wb_cache=True)

self.dfuse.run()

if il_lib is None:

self.start_dfuse(self.hostlist_clients, self.pool, self.container)

else:

self.start_dfuse(self.hostlist_clients, self.pool, self.container, disable_caching=True, disable_wb_cache=True)

DAOS-623 test: Move daos_build test vms.

431e737

Test-tag: test_bashcmd test_bashcmd_ioil test_bashcmd_pil4dfs Test-repeat: 5 Required-githooks: true Signed-off-by: Ashley Pittman <[email protected]>

daltonbohning reviewed Jan 18, 2024

View reviewed changes

ashleypittman added 3 commits January 19, 2024 08:19

Update test yaml

492d94a

Skip-unit-tests: true Test-tag: test_bashcmd test_bashcmd_ioil test_bashcmd_pil4dfs Test-repeat: 5 Required-githooks: true Signed-off-by: Ashley Pittman <[email protected]>

Merge branch 'master' into amd/bash-move

a169457

Add dfs tag and move slow test to daily.

e29092b

Required-githooks: true Signed-off-by: Ashley Pittman <[email protected]>

ashleypittman marked this pull request as ready for review January 19, 2024 10:00

ashleypittman requested a review from a team as a code owner January 19, 2024 10:00

ashleypittman requested review from phender and kccain January 19, 2024 10:00

kccain previously approved these changes Jan 19, 2024

View reviewed changes

daltonbohning previously approved these changes Jan 19, 2024

View reviewed changes

ashleypittman requested a review from a team January 19, 2024 18:00

jolivier23 previously approved these changes Jan 19, 2024

View reviewed changes

Merge branch 'master' into amd/bash-move

8117f64

Required-githooks: true

Merge branch 'master' into amd/bash-move

2f5c619

Test-tag: dfuse,Cmd Skip-unit-tests: true Skip-fault-injection-test: true

ashleypittman dismissed kccain’s stale review via bf86733 January 23, 2024 12:25

ashleypittman dismissed stale reviews from daltonbohning and jolivier23 via bf86733 January 23, 2024 12:25

ashleypittman changed the title ~~DAOS-623 test: Move daos_build test vms.~~ DAOS-623 test: Move dfuse/bash test to vms. Jan 23, 2024

ashleypittman changed the title ~~DAOS-623 test: Move dfuse/bash test to vms.~~ DAOS-15066 test: Update dfuse/bash and more to vms. Jan 23, 2024

kccain previously approved these changes Jan 23, 2024

View reviewed changes

daltonbohning previously approved these changes Jan 23, 2024

View reviewed changes

Apply review feedback.

b354e2b

Test-tag: dfuse,Cmd Skip-unit-tests: true Skip-fault-injection-test: true Required-githooks: true Signed-off-by: Ashley Pittman <[email protected]>

ashleypittman dismissed stale reviews from daltonbohning and kccain via b354e2b January 23, 2024 18:01

Change the import path.

05cbd48

Test-tag: dfuse,Cmd Skip-unit-tests: true Skip-fault-injection-test: true Required-githooks: true Signed-off-by: Ashley Pittman <[email protected]>

ashleypittman requested review from daltonbohning, kccain, jolivier23 and wiliamhuang January 29, 2024 10:00

kccain approved these changes Jan 29, 2024

View reviewed changes

daltonbohning approved these changes Jan 29, 2024

View reviewed changes

ashleypittman requested a review from a team January 29, 2024 19:15

daltonbohning merged commit 132b00d into master Feb 2, 2024
40 checks passed

daltonbohning deleted the amd/bash-move branch February 2, 2024 17:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DAOS-15066 test: Update dfuse/bash and more to vms. #13631

DAOS-15066 test: Update dfuse/bash and more to vms. #13631

ashleypittman commented Jan 18, 2024

github-actions bot commented Jan 18, 2024 •

edited

Loading

daosbuild1 commented Jan 18, 2024

daltonbohning left a comment

ashleypittman commented Jan 19, 2024

kccain Jan 19, 2024

ashleypittman Jan 19, 2024

daltonbohning Jan 19, 2024

daltonbohning Jan 19, 2024

wiliamhuang Jan 19, 2024

ashleypittman Jan 23, 2024

daltonbohning commented Jan 19, 2024

daosbuild1 commented Jan 21, 2024

daltonbohning commented Jan 22, 2024

ashleypittman commented Jan 22, 2024

daltonbohning commented Jan 22, 2024

daosbuild1 commented Jan 22, 2024

daltonbohning commented Jan 22, 2024

ashleypittman commented Jan 22, 2024

ashleypittman commented Jan 22, 2024

daltonbohning commented Jan 22, 2024

ashleypittman commented Jan 23, 2024

daltonbohning Jan 23, 2024

daltonbohning Jan 23, 2024

daltonbohning left a comment

daltonbohning Jan 29, 2024

	server_config:
	name: daos_server
	engines_per_host: 1
	engines:
	0:
	targets: 4
	nr_xs_helpers: 0
	storage:
	0:
	class: ram
	scm_mount: /mnt/daos
	system_ram_reserved: 1

		@@ -207,8 +207,8 @@ def test_bashcmd_pil4dfs(self):
		commands.

		:avocado: tags=all

	mount_dir = f"/tmp/{self.pool.uuid}_daos_dfuse"
	mount_dir = f"/tmp/{self.pool.identifier}_daos_dfuse"

DAOS-15066 test: Update dfuse/bash and more to vms. #13631

DAOS-15066 test: Update dfuse/bash and more to vms. #13631

Conversation

ashleypittman commented Jan 18, 2024

github-actions bot commented Jan 18, 2024 • edited Loading

daosbuild1 commented Jan 18, 2024

daltonbohning left a comment

Choose a reason for hiding this comment

ashleypittman commented Jan 19, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

daltonbohning commented Jan 19, 2024

daosbuild1 commented Jan 21, 2024

daltonbohning commented Jan 22, 2024

ashleypittman commented Jan 22, 2024

daltonbohning commented Jan 22, 2024

daosbuild1 commented Jan 22, 2024

daltonbohning commented Jan 22, 2024

ashleypittman commented Jan 22, 2024

ashleypittman commented Jan 22, 2024

daltonbohning commented Jan 22, 2024

ashleypittman commented Jan 23, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

daltonbohning left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot commented Jan 18, 2024 •

edited

Loading