DAOS-15066 test: Update dfuse/bash and more to vms. #13631
Test-tag: test_bashcmd test_bashcmd_ioil test_bashcmd_pil4dfs
Test-repeat: 5
Required-githooks: true
Signed-off-by: Ashley Pittman <[email protected]>
Bug-tracker data:
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13631/1/execution/node/1180/log
To work on VMs, you'd want to update the yaml so that it is similar to
daos/src/tests/ftest/dfuse/mu_perms.yaml
Lines 5 to 16 in cf10b98
server_config:
  name: daos_server
  engines_per_host: 1
  engines:
    0:
      targets: 4
      nr_xs_helpers: 0
      storage:
        0:
          class: ram
          scm_mount: /mnt/daos
          system_ram_reserved: 1
Skip-unit-tests: true
Test-tag: test_bashcmd test_bashcmd_ioil test_bashcmd_pil4dfs
Test-repeat: 5
Required-githooks: true
Signed-off-by: Ashley Pittman <[email protected]>
Required-githooks: true
Signed-off-by: Ashley Pittman <[email protected]>
The only "slow" version of this test was the non-il version, which takes about 10 minutes because caching is disabled. The two il versions take about a minute each. Moving from hardware to VMs doesn't affect run time, and there are more CI slots available for VMs. I left the ioil test enabled for PRs but moved the slow test from PRs to daily, as well as moving it from HW to VMs.
src/tests/ftest/dfuse/bash.py
@@ -207,8 +207,8 @@ def test_bashcmd_pil4dfs(self):
         commands.

         :avocado: tags=all
Question only (no change suggested): is the idea to add a pr, daily_regression, or full_regression tag to this pil4dfs test at some future point?
Good question. I'd assume this test should probably be weekly, since it gives good coverage. It also only takes a minute, so we could easily make it daily as well.
@wiliamhuang Was this an oversight in #13257? Currently this test doesn't run in pr, daily, or weekly
I'd recommend daily, to be in line with the other cases.
@daltonbohning Thank you very much! I will add "daily" in my other PR.
I had to update this PR anyway so I've added the tag.
One thing to keep in mind is that servers aren't restarted between test cases, so the test time for the first test case includes server start/stop (maybe a minute?). If one of the other test cases were run in isolation, it would probably take another minute. This is mostly a minor consideration when splitting tests in the same file amongst pr, daily, and weekly.
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13631/3/testReport/
Required-githooks: true
@ashleypittman Can we get a run with all three modified tests, please? The changes in recent commits look sane, but malformed tags can break the running of a test.
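A quick local check along these lines might catch malformed tags before a CI run is spent on them. It is only a sketch: `avocado_tags` is a hypothetical helper, not part of DAOS or avocado, and the rule that a tag contains only letters, digits, underscores, and hyphens is an assumption that may be stricter than what avocado itself accepts.

```python
import re

# Hypothetical helper, not part of DAOS or avocado.
# Assumption: a tag line looks like ":avocado: tags=a,b,c" and each tag uses
# only letters, digits, underscores, and hyphens.
TAG_LINE = re.compile(r"^\s*:avocado:\s+tags=(\S*)\s*$")
VALID_TAG = re.compile(r"^[A-Za-z0-9_\-]+$")


def avocado_tags(docstring):
    """Return (tags, problems) for every ':avocado:' line in a docstring."""
    tags, problems = set(), []
    for line in docstring.splitlines():
        if ":avocado:" not in line:
            continue  # ordinary docstring text
        match = TAG_LINE.match(line)
        if match is None:
            problems.append(f"unparseable line: {line.strip()!r}")
            continue
        for tag in match.group(1).split(","):
            if VALID_TAG.match(tag):
                tags.add(tag)
            else:
                problems.append(f"bad tag {tag!r} in: {line.strip()!r}")
    return tags, problems
```

Running it over each test method's docstring and failing when `problems` is non-empty would flag things like an empty tag from a doubled comma or a stray space around `=`.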
Test-tag: dfuse,Cmd
Skip-unit-tests: true
Skip-fault-injection-test: true
I thought build #3 had that already, but I must have missed the tag. Re-pushed.
A previous commit did, but after e29092b there weren't any tags specified.
Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13631/5/testReport/
Looks like the fio command timed out.
The |
It was build #2 that ran and passed the tests, although it then got killed because I re-pushed before other testing had finished. The latest build failed, apparently in less than a minute, but there is no stderr or stack trace info, so I'm not sure of the cause.
When it runs, it seems to pass in 4-5 seconds, and the timeout is set to 30. If the network on the VMs is that bad, then we should take them out of service until we can identify the issue. Looking at this test, it appears to loop over 5 pools and 5 containers, but not when testing ioil. I'm not sure why it would do that, but it explains why the dfuse variant is so much slower.
It's been a while, so I don't have pointers to past discussion, but I encountered this with some of the datamover tests, and ultimately the only "solution" we had was to move to HW. Maybe we could just reduce the blocksize for that fio command, since the intention of the test appears to be just testing bash commands over dfuse?
Bump command timeout from 30 to 120 seconds. Reformat code as removing loop changed the indentation level anyway.

Test-tag: dfuse,Cmd
Skip-unit-tests: true
Skip-fault-injection-test: true
Required-githooks: true
Signed-off-by: Ashley Pittman <[email protected]>
bf86733
Test updated and now passes with all three variants. The total non-il test now takes 1m6s vs. 8m55s on the latest daily run, on top of moving from HW to VMs.
src/tests/ftest/dfuse/bash.py
self.pool.destroy()
self.add_pool(connect=False)
self.add_container(self.pool)
mount_dir = f"/tmp/{self.pool.uuid}_daos_dfuse"
Pre-existing, so not a blocker, but we prefer to use identifier so that the label is used:
-mount_dir = f"/tmp/{self.pool.uuid}_daos_dfuse"
+mount_dir = f"/tmp/{self.pool.identifier}_daos_dfuse"
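To illustrate the difference, here is a minimal sketch. `PoolStub` and `dfuse_mount_dir` are hypothetical stand-ins, not the DAOS test classes, and the rule that `identifier` returns the label when one is set and the UUID otherwise is an assumption based on the comment above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PoolStub:
    """Hypothetical stand-in for a test pool object; not the DAOS class."""

    uuid: str
    label: Optional[str] = None

    @property
    def identifier(self):
        # Assumed behaviour: prefer the label when one is set, else the UUID.
        return self.label if self.label else self.uuid


def dfuse_mount_dir(pool):
    """Build the mount directory name from the pool identifier."""
    return f"/tmp/{pool.identifier}_daos_dfuse"
```

With a labelled pool the directory name is readable in logs, and an unlabelled pool still yields a valid, unique path from its UUID.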
src/tests/ftest/dfuse/bash.py
ret_code = general_utils.pcmd(self.hostlist_clients, env_str + cmd, timeout=120)
if 0 not in ret_code:
    error_hosts = NodeSet(
        ",".join(
            [
                str(node_set)
                for code, node_set in list(ret_code.items())
                if code != 0
            ]
        )
    )
    raise CommandFailure(
        f"Error running '{cmd}' on the following hosts: {error_hosts}"
    )
Pre-existing, but FYI we have a newer, better function called run_remote that works like this:
result = run_remote(self.log, self.hostlist_clients, cmd)
if not result.passed:
    self.fail(f"... failed on {result.failed_hosts}")
This eliminates the nasty logic needed to build a NodeSet.
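The two styles can be compared side by side in a self-contained sketch. Everything here is a stand-in: `failed_hosts_from_codes` mimics the dictionary handling in the quoted code using plain lists instead of NodeSet, and `ResultStub` is a hypothetical object exposing only the `passed` and `failed_hosts` attributes mentioned above, not the real return type of run_remote. Note that, if the dictionary maps exit codes to hosts as the comprehension suggests, the quoted `0 not in ret_code` check would only trigger when no host succeeded.

```python
from dataclasses import dataclass, field


def failed_hosts_from_codes(ret_code):
    """Old style: ret_code is assumed to map exit code -> hosts returning it."""
    return sorted(
        host
        for code, hosts in ret_code.items()
        if code != 0
        for host in hosts
    )


@dataclass
class ResultStub:
    """Hypothetical stand-in for the result of a run_remote-style helper."""

    codes: dict = field(default_factory=dict)  # host -> exit code

    @property
    def failed_hosts(self):
        return sorted(host for host, code in self.codes.items() if code != 0)

    @property
    def passed(self):
        # Passed only when every host returned zero.
        return not self.failed_hosts
```

The result-object style keeps the per-host bookkeeping in one place, so the caller only needs the two-line check shown in the comment.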
Test-tag: dfuse,Cmd
Skip-unit-tests: true
Skip-fault-injection-test: true
Required-githooks: true
Signed-off-by: Ashley Pittman <[email protected]>
b354e2b
Test-tag: dfuse,Cmd
Skip-unit-tests: true
Skip-fault-injection-test: true
Required-githooks: true
Signed-off-by: Ashley Pittman <[email protected]>
LGTM.
mount_dir = f"/tmp/{self.pool.identifier}_daos_dfuse"
self.start_dfuse(self.hostlist_clients, self.pool, self.container, mount_dir=mount_dir)
if il_lib is not None:
    # unmount dfuse and mount again with caching disabled
    self.dfuse.unmount(tries=1)
    self.dfuse.update_params(disable_caching=True)
    self.dfuse.update_params(disable_wb_cache=True)
    self.dfuse.run()
Again, pre-existing, so not a blocker. But if we only use one pool now, this can be simplified like this. I don't think there was any real need to unmount dfuse and remount with different params. We could just disable caching to begin with.
-mount_dir = f"/tmp/{self.pool.identifier}_daos_dfuse"
-self.start_dfuse(self.hostlist_clients, self.pool, self.container, mount_dir=mount_dir)
-if il_lib is not None:
-    # unmount dfuse and mount again with caching disabled
-    self.dfuse.unmount(tries=1)
-    self.dfuse.update_params(disable_caching=True)
-    self.dfuse.update_params(disable_wb_cache=True)
-    self.dfuse.run()
+if il_lib is None:
+    self.start_dfuse(self.hostlist_clients, self.pool, self.container)
+else:
+    self.start_dfuse(self.hostlist_clients, self.pool, self.container, disable_caching=True, disable_wb_cache=True)
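One way to avoid repeating the call in both branches is to compute the optional parameters first. This is a sketch only: `dfuse_start_kwargs` and `start_dfuse_stub` are hypothetical names, and it assumes the real `start_dfuse` accepts `disable_caching` and `disable_wb_cache` as keyword arguments, as the suggestion above implies.

```python
def dfuse_start_kwargs(il_lib):
    """Return the extra keyword arguments for the single dfuse start call."""
    if il_lib is None:
        return {}
    # With an interception library, start with caching disabled up front
    # instead of unmounting and remounting.
    return {"disable_caching": True, "disable_wb_cache": True}


def start_dfuse_stub(hosts, pool, container, **params):
    """Hypothetical stand-in that records how dfuse would be started."""
    return {"hosts": hosts, "pool": pool, "container": container, "params": params}
```

In the test this would reduce to a single call of the form `self.start_dfuse(self.hostlist_clients, self.pool, self.container, **dfuse_start_kwargs(il_lib))`.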
Test-tag: test_bashcmd test_bashcmd_ioil test_bashcmd_pil4dfs
Test-repeat: 5
Required-githooks: true
Signed-off-by: Ashley Pittman [email protected]