DAOS-14725 client: force cleanup event query when test teardown #13509

Nasf-Fan · 2023-12-15T06:05:16Z

For the test cases that are driven by CMOCKA, the test_case_teardown()
may be triggered by some signal that is captured by CMOCKA registered
signal handler. Under such case, related logic has long jumped out of
original DAOS lower level (object or cart) context. Related tasks may
be still in RUNNING status, but they are not runnable any longer even
if re-scheduled. It means that we may lost some resources attached to
related tasks, that is not fatal. They can be reclaimed automatically
when current test process exits.

What we can do for test cleanup is that try to reset "ev_thpriv" that
will be reused for subsequent test cases. Otherwise, one test failure
may block all subsequent test cases, that is too bad.

For such purpose we introduce "force" parameter for DAOS internal API
daos_event_priv_reset() to cleanup the shared event queue by force.

The patch also contains some other test environment cleanup.

Required-githooks: true

Before requesting gatekeeper:

Two review approvals and any prior change requests have been resolved.
Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Commit messages follows the guidelines outlined here.
Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

github-actions · 2023-12-15T06:15:34Z

Bug-tracker data:
Ticket title is 'Retry for object sync fails in cli_obj.c'
Status is 'In Review'
Labels: 'triaged'
https://daosio.atlassian.net/browse/DAOS-14725

daosbuild1 · 2023-12-17T18:27:09Z

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13509/3/execution/node/1319/log

daosbuild1 · 2023-12-17T20:28:19Z

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13509/3/execution/node/1365/log

mchaarawi

This change seems inappropriate to me as it's changing public API for some internal test case situation.

mchaarawi · 2023-12-21T15:35:43Z

src/client/api/event.c

+			 * Wait and retry at most 60 seconds if related task is in busy status.
+			 * Currently, it is used by test logic to cleanup task with loop retry.
+			 */
+			sleep(1);


this is unacceptable imo. sleep should always be avoided in user space code. this can introduce many issues for apps.

daos_event_priv_reset() is not application visible API, instead, it is only used by DAOS internal tests.

i did not say that. my concern is that you are adding a new field in the public event structure (that users do not know about and will cause confusion if they try to use) and a sleep in the code just for testing purposes. this solution is not acceptable IMO.

I re-checked the logs and related cmocka logic, and found that the test process has already long jumped out of related task context. Under such case, aborting task become meaningless. I will remake the patch to handle that.

mchaarawi · 2023-12-21T15:37:53Z

src/include/daos_types.h

-	uint64_t		ev_debug;
+	uint32_t		ev_debug;
+	/** Abort the task(s) related with the event queue. */
+	uint32_t		ev_abort:1;


this is a public API change. given also this is an internal change, does not seem to apply to user events, this should not be here.

There is a space hole after "ev_error" in original "struct daos_event" for 64-bits alignment.

typedef struct daos_event {
/** return code of non-blocking operation /
int ev_error;
/* Internal use - 152 + 8 bytes pad for pthread_mutex_t size difference on aarch64 /
struct {
uint64_t space[20];
} ev_private;
/* Used for debugging */
uint64_t ev_debug;
} daos_event_t;

I think I can add the new field into such hole to avoid compatibility trouble caused by the layout changes for "struct daos_event".

mchaarawi · 2023-12-21T15:40:59Z

src/tests/suite/daos_test.h

@@ -306,7 +306,7 @@ async_overlap(void **state)
 static inline int
 test_case_teardown(void **state)
 {
-	assert_rc_equal(daos_event_priv_reset(), 0);
+	assert_rc_equal(daos_event_priv_reset(true), 0);


can we just remove this call here and just avoid this issue?

I do not think so. If we do not call daos_event_priv_reset(), there may be task leak when test completed. That will hide some potential issues.

On the other hand, daos_event_priv_reset() is not public API, instead, it is in src/client/api/client_internal.h. That means regular application cannot directly use such API. Then changing it will not affect the application, right?

well IMO that's really a bug in the test itself. doing a force reset in case of all testing might hide bugs IMO and we should not do that here.

No, the test itself does not has bug, instead, it is the CMOCKA framework to make the long jump out of the object level logic directly without recovering the shared event queue. Related test has already failed out, it did not hide the failure. What we do here is to cleanup the environment to make the subsequent test can go ahead.

What is your suggestion for this issue? Just removing the daos_event_priv_reset()? That only hides the "warning message" by skipping cleanup, but it will cause subsequent tests to be failed. So it does not resolve the trouble.

sorry i miss how cmocka is related to the object level logic.
If you have a test that short circuits in the object level internal code, you should make sure that this test cleans (forcefully) but only that test. making a force cleanup for all tests is what im against in this case.

i understand that and my concern again is for tests. consider this:
a PR introduces a bug in an existing test where it accidentally leaves an event in running state. with this change, this part of the code is not executed anymore:
https://github.com/daos-stack/daos/pull/13509/files#diff-11ada6ab4871be42589c1fed68b76148bb02a9d0001e3b027a2672e919bbb5e8L1084
and we just hide this failure, no?

I do not know whether we have efficient way to distinguish the two cases: test corruption or test logic bug. From the perspective of cleanup logic, it may be not the cleanup logic needs to care about. Sorry, I do not know how to give a suitable example for that right now.

If your concern case is more important, then just ignore the reset failure and go ahead?

sorry but im not talking about a test logic issue. im talking about an actual bug that could be introduced.
opening an object and not closing it is not something that can be detected in this case anyway. this about exiting an operation without finalizing the internal event. if we introduce such a bug unrelated to the cmocka framework, for example in pool_disconnect, this would not be detected. so we cannot ignore the reset failure unfortunately.

could you perhaps in your case where you think the fault is OK/expected but cmocka calling test_teardown() needs to be adjusted is simply registering a different teardown handler when setting up the test instead of changing the default one that does this force cleanup?

The fault in the case is not expected, but the long jump from cmocka is not controllable by DAOS. That is the direct reason for the dirty event queue when test_teardown(). It is not special for this test, instead, it is general issue for all test cases with test_teardown called via cmocka() long jump. So we cannot simple replace test_teardown one by one. On some degree, if your concern case has to be considered, then ignore such dirty event queue (and error message) when test_teardown() may be the best choice: mean fail the subsequent tests to draw the attention. How do you think?

I found the solution to distinguish whether the logic is from signal handling or regular test exit. Please check the new commit. Thanks!

liuxuezhao · 2023-12-22T12:55:46Z

src/object/cli_obj.c

+	int		 rc;
+
+	if (unlikely(ev != NULL && ev->ev_abort))
+		D_GOTO(err, rc = -DER_CANCELED);


if it is in dead loop in retry of object IO here, then the API task seems should not be finished and should not get to user's daos_event_priv_reset() call.
In the ticket pasted log is not detailed enough so not very sure if really related with this.

Related task is not completed, but the test logic long jumped out of the context, that is related with cmocka logic for handling some signal. I will remake the patch to handle that.

For the test cases that are driven by CMOCKA, the test_case_teardown() may be triggered by some signal that is captured by CMOCKA registered signal handler. Under such case, related logic has long jumped out of original DAOS lower level (object or cart) context. Related tasks may be still in RUNNING status, but they are not runnable any longer even if re-scheduled. It means that we may lost some resources attached to related tasks, that is not fatal. They can be reclaimed automatically when current test process exits. What we can do for test cleanup is that try to reset "ev_thpriv" that will be reused for subsequent test cases. Otherwise, one test failure may block all subsequent test cases, that is too bad. For such purpose we introduce "force" parameter for DAOS internal API daos_event_priv_reset() to cleanup the shared event queue by force. The patch also contains some other test environment cleanup. Required-githooks: true Signed-off-by: Fan Yong <[email protected]>

Nasf-Fan · 2024-01-07T13:47:15Z

Ping reviewers! Thanks!

mchaarawi

i still think we are making some generic code / test changes for just a single use case. this change might hide other bugs IMO as it removes the error condition in case something is leaked in all daos_test.

mchaarawi · 2024-01-08T14:46:28Z

src/tests/suite/daos_test.h

@@ -306,7 +306,7 @@ async_overlap(void **state)
 static inline int
 test_case_teardown(void **state)
 {
-	assert_rc_equal(daos_event_priv_reset(), 0);
+	assert_rc_equal(daos_event_priv_reset(true), 0);


well IMO that's really a bug in the test itself. doing a force reset in case of all testing might hide bugs IMO and we should not do that here.

mchaarawi · 2024-01-08T15:19:00Z

src/tests/suite/daos_base_tx.c

@@ -704,6 +704,9 @@ dtx_resend_delay(test_arg_t *arg, daos_oclass_id_t oclass)
 	daos_fail_loc_set(0);
 	dtx_set_fail_loc(arg, 0);

+	/* Wait for the former delayed RPC before destroying the container to avoid DER_BUSY. */
+	sleep(2);


can you please explain why 2 sec sleep is needed here? why not 1 or 10?
does it vary if we change some CI hardware?

We injected some fail_loc to make related IO handler to delay process the IO request about 1 second later after client RPC timeout. Here, we sleep additional 1 second to guarantee that related IO handler has already responded.

Cleanup test environment only when logic is longjump from cmocka for handling signal. Signed-off-by: Fan Yong <[email protected]>

liuxuezhao · 2024-01-09T09:30:17Z

src/tests/suite/daos_test.h

+			    sigismember(&sigset, SIGSEGV) ||
+			    sigismember(&sigset, SIGBUS) ||
+			    sigismember(&sigset, SIGSYS))) {
+		force = true;


it is good if it can check if it is longjump from cmocka internally, in that case it mostly like there are mem corruption already, maybe not worth to do the follow "cleanup".
How about just print fatal err log here first, to see if it will really hit this case in following test.
The change in event.c is not bad, but probably not must to do that now to avoid affect event internal handling. If will hit this failure later in CI tests, we can consider further later?

It's good idea to print the error message when hit such trouble. As for whether force cleanup or not, I tend to do that. Because:

If force cleanup works well, then the subsequent tests can go ahead.

We may have not verified related cleanup logic under some fatal failure cases. We do that now, it may help to expose some potential bug(s).

Since the test has already corrupted, the cleanup logic will not make the situation to be worse.

Help to distinguish the issue. Signed-off-by: Fan Yong <[email protected]>

Signed-off-by: Fan Yong <[email protected]>

mchaarawi

the change now seems harmless and im fine to approve it. although it would be good to look at the test results / log to see if this actually does get triggered to justify it (testing for daos_test has not started yet).

daosbuild1 · 2024-01-11T02:59:19Z

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13509/10/testReport/

Trigger SIGFPE signal. Signed-off-by: Fan Yong <[email protected]>

daosbuild1 · 2024-01-11T16:52:47Z

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13509/11/testReport/

daosbuild1 · 2024-01-11T19:09:38Z

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13509/11/execution/node/1434/log

Nasf-Fan · 2024-01-12T14:01:49Z

src/tests/suite/daos_test.h

+	 */
+	if (sigprocmask(0, NULL, &sigset) < 0) {
+		print_message("sigprocmask failure\n");
+	} else {


@liuxuezhao @mchaarawi , sorry, I made some mistake in former comment. We cannot distinguish whether the test_teardown() is call from normal test complete routine or from cmocka signal handler longjmp. Because after signal handler longjmp, its signal mask has already been recovered, means that the signal that triggered the exception has been dropped when come here. So we cannot use the signal mask to distinguish the context unless before calling siglongjmp().

By registering my own signal handler, I can simulate the exception and trigger teardown() by force. That proves the teardown() by force can work (and subsequent test can go ahead without failure). But because we cannot distinguish the context without our own signal hander, then whether still keep the teardown() by force? What's your suggestion?

Ping @liuxuezhao && @mchaarawi , what's your suggestion about above comment? Thanks!

you mean in this PR it actually cannot distinguish whether or not should force daos_event_priv_reset()? if so not very sure what's the benefit to do so?

We cannot distinguish whether the caller is from signal handler caused long jump or from regular test cleanup logic. So if consider @mchaarawi 's concern about potentially hiding bug when daos_event_priv_reset() by force, then we may have to drop the changes for daos_event_priv_reset() with "force" parameter. Then without such fixes, we have very few things to do for this ticket. Here is the simplified patch in another PR:
#13596.

I think the other PR looks much cleaner to me for this issue.

@liuxuezhao , are you fine with another PR? if yes, I will replace this one with it.

I am fine to go with the other PR first, thanks.

@Nasf-Fan "By registering my own signal handler, I can simulate the exception and trigger teardown() by force", if register signal handler in test_setup, can it get notification if got those abnormal signal? maybe it can be useful for daos_test to catch the error. just FYI.

I mean that by registering my own signal RPC handler, the daos_event_priv_reset() can distinguish whether the the caller is from signal handler or normal exit via some hack (but not suggested) flag, then I can verify the force cleanup logic works.

But because the corruption can happen during any test case, then we need to register related signal handlers to replace CMOCKA registered ones for almost all of our test cases when setup, that will cause a lot of code lines changes. Not sure whether it is worth or not.

On the other hand, which signal will be handled by CMOCKA is blind for DAOS, consider CMOCKA potential upgrade, then we need to trace CMOCKA changes (for signal handler) via new configuration. In theory, we can do that, but whether worth or not only for test cleanup.

ok, that seems quite complex and hack, maybe not worth to do that. thanks for the explanation.

Nasf-Fan · 2024-01-23T14:29:49Z

Replaced by #13596

Nasf-Fan force-pushed the Nasf-Fan/DAOS-14725 branch 2 times, most recently from 3763c23 to 93e1f48 Compare December 16, 2023 01:51

Nasf-Fan force-pushed the Nasf-Fan/DAOS-14725 branch from 93e1f48 to f628067 Compare December 18, 2023 05:42

Nasf-Fan marked this pull request as ready for review December 20, 2023 17:20

Nasf-Fan requested a review from a team as a code owner December 20, 2023 17:20

Nasf-Fan requested review from liuxuezhao, mchaarawi and wangdi1 December 20, 2023 17:21

mchaarawi requested changes Dec 21, 2023

View reviewed changes

Nasf-Fan requested a review from mchaarawi December 22, 2023 08:17

liuxuezhao reviewed Dec 22, 2023

View reviewed changes

Nasf-Fan changed the title ~~DAOS-14725 client: force abort task with loop retry~~ DAOS-14725 client: force cleanup event query when test teardown Dec 26, 2023

Nasf-Fan force-pushed the Nasf-Fan/DAOS-14725 branch from cfc92e0 to f4488bd Compare December 26, 2023 15:26

Nasf-Fan requested review from liuxuezhao and gnailzenh December 26, 2023 15:27

mchaarawi requested changes Jan 8, 2024

View reviewed changes

mchaarawi reviewed Jan 8, 2024

View reviewed changes

Nasf-Fan added 2 commits January 9, 2024 16:33

DAOS-14725 tests: force clenup when longjump from from cmocka

e9a5508

Cleanup test environment only when logic is longjump from cmocka for handling signal. Signed-off-by: Fan Yong <[email protected]>

Merge branch 'master' into Nasf-Fan/DAOS-14725

017b02c

Nasf-Fan requested a review from mchaarawi January 9, 2024 08:40

liuxuezhao reviewed Jan 9, 2024

View reviewed changes

Nasf-Fan added 3 commits January 9, 2024 18:35

DAOS-14725 tests: error message when hit test corruption

cd8e719

Help to distinguish the issue. Signed-off-by: Fan Yong <[email protected]>

DAOS-14725 tests: fix typo

0073bae

Signed-off-by: Fan Yong <[email protected]>

Merge branch 'master' into Nasf-Fan/DAOS-14725

160455b

mchaarawi reviewed Jan 10, 2024

View reviewed changes

DAOS-14725 tests: test case for cleanup after corruption

3ea6c8f

Trigger SIGFPE signal. Signed-off-by: Fan Yong <[email protected]>

Nasf-Fan commented Jan 12, 2024

View reviewed changes

Nasf-Fan requested review from mchaarawi and liuxuezhao January 22, 2024 12:43

Nasf-Fan closed this Jan 23, 2024

Nasf-Fan deleted the Nasf-Fan/DAOS-14725 branch January 23, 2024 14:30

DAOS-14725 client: force cleanup event query when test teardown #13509

DAOS-14725 client: force cleanup event query when test teardown #13509

Conversation

Nasf-Fan commented Dec 15, 2023 • edited Loading

Before requesting gatekeeper:

Gatekeeper:

github-actions bot commented Dec 15, 2023 • edited Loading

daosbuild1 commented Dec 17, 2023

daosbuild1 commented Dec 17, 2023

mchaarawi left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Nasf-Fan Dec 22, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Nasf-Fan Jan 8, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Nasf-Fan commented Jan 7, 2024

mchaarawi left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mchaarawi left a comment • edited Loading

Choose a reason for hiding this comment

daosbuild1 commented Jan 11, 2024

daosbuild1 commented Jan 11, 2024

daosbuild1 commented Jan 11, 2024

Nasf-Fan Jan 12, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Nasf-Fan commented Jan 23, 2024

Nasf-Fan commented Dec 15, 2023 •

edited

Loading

github-actions bot commented Dec 15, 2023 •

edited

Loading

Nasf-Fan Dec 22, 2023 •

edited

Loading

Nasf-Fan Jan 8, 2024 •

edited

Loading

mchaarawi left a comment •

edited

Loading

Nasf-Fan Jan 12, 2024 •

edited

Loading