-
Notifications
You must be signed in to change notification settings - Fork 309
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-14725 client: force cleanup event query when test teardown #13509
Conversation
Bug-tracker data: |
3763c23
to
93e1f48
Compare
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13509/3/execution/node/1319/log |
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13509/3/execution/node/1365/log |
93e1f48
to
f628067
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change seems inappropriate to me as it's changing public API for some internal test case situation.
src/client/api/event.c
Outdated
* Wait and retry at most 60 seconds if related task is in busy status. | ||
* Currently, it is used by test logic to cleanup task with loop retry. | ||
*/ | ||
sleep(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is unacceptable imo. sleep should always be avoided in user space code. this can introduce many issues for apps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
daos_event_priv_reset() is not application visible API, instead, it is only used by DAOS internal tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i did not say that. my concern is that you are adding a new field in the public event structure (that users do not know about and will cause confusion if they try to use) and a sleep in the code just for testing purposes. this solution is not acceptable IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I re-checked the logs and related cmocka logic, and found that the test process has already long jumped out of related task context. Under such case, aborting task become meaningless. I will remake the patch to handle that.
src/include/daos_types.h
Outdated
uint64_t ev_debug; | ||
uint32_t ev_debug; | ||
/** Abort the task(s) related with the event queue. */ | ||
uint32_t ev_abort:1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a public API change. given also this is an internal change, does not seem to apply to user events, this should not be here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a space hole after "ev_error" in original "struct daos_event" for 64-bits alignment.
typedef struct daos_event {
/** return code of non-blocking operation /
int ev_error;
/* Internal use - 152 + 8 bytes pad for pthread_mutex_t size difference on aarch64 /
struct {
uint64_t space[20];
} ev_private;
/* Used for debugging */
uint64_t ev_debug;
} daos_event_t;
I think I can add the new field into such hole to avoid compatibility trouble caused by the layout changes for "struct daos_event".
src/tests/suite/daos_test.h
Outdated
@@ -306,7 +306,7 @@ async_overlap(void **state) | |||
static inline int | |||
test_case_teardown(void **state) | |||
{ | |||
assert_rc_equal(daos_event_priv_reset(), 0); | |||
assert_rc_equal(daos_event_priv_reset(true), 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we just remove this call here and just avoid this issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think so. If we do not call daos_event_priv_reset(), there may be task leak when test completed. That will hide some potential issues.
On the other hand, daos_event_priv_reset() is not public API, instead, it is in src/client/api/client_internal.h. That means regular application cannot directly use such API. Then changing it will not affect the application, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well IMO that's really a bug in the test itself. doing a force reset in case of all testing might hide bugs IMO and we should not do that here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the test itself does not has bug, instead, it is the CMOCKA framework to make the long jump out of the object level logic directly without recovering the shared event queue. Related test has already failed out, it did not hide the failure. What we do here is to cleanup the environment to make the subsequent test can go ahead.
What is your suggestion for this issue? Just removing the daos_event_priv_reset()? That only hides the "warning message" by skipping cleanup, but it will cause subsequent tests to be failed. So it does not resolve the trouble.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry i miss how cmocka is related to the object level logic.
If you have a test that short circuits in the object level internal code, you should make sure that this test cleans (forcefully) but only that test. making a force cleanup for all tests is what im against in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i understand that and my concern again is for tests. consider this:
a PR introduces a bug in an existing test where it accidentally leaves an event in running state. with this change, this part of the code is not executed anymore:
https://github.com/daos-stack/daos/pull/13509/files#diff-11ada6ab4871be42589c1fed68b76148bb02a9d0001e3b027a2672e919bbb5e8L1084
and we just hide this failure, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not know whether we have efficient way to distinguish the two cases: test corruption or test logic bug. From the perspective of cleanup logic, it may be not the cleanup logic needs to care about. Sorry, I do not know how to give a suitable example for that right now.
If your concern case is more important, then just ignore the reset failure and go ahead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry but im not talking about a test logic issue. im talking about an actual bug that could be introduced.
opening an object and not closing it is not something that can be detected in this case anyway. this about exiting an operation without finalizing the internal event. if we introduce such a bug unrelated to the cmocka framework, for example in pool_disconnect, this would not be detected. so we cannot ignore the reset failure unfortunately.
could you perhaps in your case where you think the fault is OK/expected but cmocka calling test_teardown() needs to be adjusted is simply registering a different teardown handler when setting up the test instead of changing the default one that does this force cleanup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fault in the case is not expected, but the long jump from cmocka is not controllable by DAOS. That is the direct reason for the dirty event queue when test_teardown(). It is not special for this test, instead, it is general issue for all test cases with test_teardown called via cmocka() long jump. So we cannot simple replace test_teardown one by one. On some degree, if your concern case has to be considered, then ignore such dirty event queue (and error message) when test_teardown() may be the best choice: mean fail the subsequent tests to draw the attention. How do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found the solution to distinguish whether the logic is from signal handling or regular test exit. Please check the new commit. Thanks!
src/object/cli_obj.c
Outdated
int rc; | ||
|
||
if (unlikely(ev != NULL && ev->ev_abort)) | ||
D_GOTO(err, rc = -DER_CANCELED); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it is in dead loop in retry of object IO here, then the API task seems should not be finished and should not get to user's daos_event_priv_reset() call.
In the ticket pasted log is not detailed enough so not very sure if really related with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related task is not completed, but the test logic long jumped out of the context, that is related with cmocka logic for handling some signal. I will remake the patch to handle that.
For the test cases that are driven by CMOCKA, the test_case_teardown() may be triggered by some signal that is captured by CMOCKA registered signal handler. Under such case, related logic has long jumped out of original DAOS lower level (object or cart) context. Related tasks may be still in RUNNING status, but they are not runnable any longer even if re-scheduled. It means that we may lost some resources attached to related tasks, that is not fatal. They can be reclaimed automatically when current test process exits. What we can do for test cleanup is that try to reset "ev_thpriv" that will be reused for subsequent test cases. Otherwise, one test failure may block all subsequent test cases, that is too bad. For such purpose we introduce "force" parameter for DAOS internal API daos_event_priv_reset() to cleanup the shared event queue by force. The patch also contains some other test environment cleanup. Required-githooks: true Signed-off-by: Fan Yong <[email protected]>
cfc92e0
to
f4488bd
Compare
Ping reviewers! Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i still think we are making some generic code / test changes for just a single use case. this change might hide other bugs IMO as it removes the error condition in case something is leaked in all daos_test.
src/tests/suite/daos_test.h
Outdated
@@ -306,7 +306,7 @@ async_overlap(void **state) | |||
static inline int | |||
test_case_teardown(void **state) | |||
{ | |||
assert_rc_equal(daos_event_priv_reset(), 0); | |||
assert_rc_equal(daos_event_priv_reset(true), 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well IMO that's really a bug in the test itself. doing a force reset in case of all testing might hide bugs IMO and we should not do that here.
@@ -704,6 +704,9 @@ dtx_resend_delay(test_arg_t *arg, daos_oclass_id_t oclass) | |||
daos_fail_loc_set(0); | |||
dtx_set_fail_loc(arg, 0); | |||
|
|||
/* Wait for the former delayed RPC before destroying the container to avoid DER_BUSY. */ | |||
sleep(2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please explain why 2 sec sleep is needed here? why not 1 or 10?
does it vary if we change some CI hardware?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We injected some fail_loc to make related IO handler to delay process the IO request about 1 second later after client RPC timeout. Here, we sleep additional 1 second to guarantee that related IO handler has already responded.
Cleanup test environment only when logic is longjump from cmocka for handling signal. Signed-off-by: Fan Yong <[email protected]>
src/tests/suite/daos_test.h
Outdated
sigismember(&sigset, SIGSEGV) || | ||
sigismember(&sigset, SIGBUS) || | ||
sigismember(&sigset, SIGSYS))) { | ||
force = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is good if it can check if it is longjump from cmocka internally, in that case it mostly like there are mem corruption already, maybe not worth to do the follow "cleanup".
How about just print fatal err log here first, to see if it will really hit this case in following test.
The change in event.c is not bad, but probably not must to do that now to avoid affect event internal handling. If will hit this failure later in CI tests, we can consider further later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's good idea to print the error message when hit such trouble. As for whether force cleanup or not, I tend to do that. Because:
- If force cleanup works well, then the subsequent tests can go ahead.
- We may have not verified related cleanup logic under some fatal failure cases. We do that now, it may help to expose some potential bug(s).
Since the test has already corrupted, the cleanup logic will not make the situation to be worse.
Help to distinguish the issue. Signed-off-by: Fan Yong <[email protected]>
Signed-off-by: Fan Yong <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the change now seems harmless and im fine to approve it. although it would be good to look at the test results / log to see if this actually does get triggered to justify it (testing for daos_test has not started yet).
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13509/10/testReport/ |
Trigger SIGFPE signal. Signed-off-by: Fan Yong <[email protected]>
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13509/11/testReport/ |
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13509/11/execution/node/1434/log |
*/ | ||
if (sigprocmask(0, NULL, &sigset) < 0) { | ||
print_message("sigprocmask failure\n"); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liuxuezhao @mchaarawi , sorry, I made some mistake in former comment. We cannot distinguish whether the test_teardown() is call from normal test complete routine or from cmocka signal handler longjmp. Because after signal handler longjmp, its signal mask has already been recovered, means that the signal that triggered the exception has been dropped when come here. So we cannot use the signal mask to distinguish the context unless before calling siglongjmp().
By registering my own signal handler, I can simulate the exception and trigger teardown() by force. That proves the teardown() by force can work (and subsequent test can go ahead without failure). But because we cannot distinguish the context without our own signal hander, then whether still keep the teardown() by force? What's your suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ping @liuxuezhao && @mchaarawi , what's your suggestion about above comment? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean in this PR it actually cannot distinguish whether or not should force daos_event_priv_reset()? if so not very sure what's the benefit to do so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot distinguish whether the caller is from signal handler caused long jump or from regular test cleanup logic. So if consider @mchaarawi 's concern about potentially hiding bug when daos_event_priv_reset() by force, then we may have to drop the changes for daos_event_priv_reset() with "force" parameter. Then without such fixes, we have very few things to do for this ticket. Here is the simplified patch in another PR:
#13596.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the other PR looks much cleaner to me for this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liuxuezhao , are you fine with another PR? if yes, I will replace this one with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine to go with the other PR first, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Nasf-Fan "By registering my own signal handler, I can simulate the exception and trigger teardown() by force", if register signal handler in test_setup, can it get notification if got those abnormal signal? maybe it can be useful for daos_test to catch the error. just FYI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean that by registering my own signal RPC handler, the daos_event_priv_reset() can distinguish whether the the caller is from signal handler or normal exit via some hack (but not suggested) flag, then I can verify the force cleanup logic works.
But because the corruption can happen during any test case, then we need to register related signal handlers to replace CMOCKA registered ones for almost all of our test cases when setup, that will cause a lot of code lines changes. Not sure whether it is worth or not.
On the other hand, which signal will be handled by CMOCKA is blind for DAOS, consider CMOCKA potential upgrade, then we need to trace CMOCKA changes (for signal handler) via new configuration. In theory, we can do that, but whether worth or not only for test cleanup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, that seems quite complex and hack, maybe not worth to do that. thanks for the explanation.
Replaced by #13596 |
For the test cases that are driven by CMOCKA, the test_case_teardown()
may be triggered by some signal that is captured by CMOCKA registered
signal handler. Under such case, related logic has long jumped out of
original DAOS lower level (object or cart) context. Related tasks may
be still in RUNNING status, but they are not runnable any longer even
if re-scheduled. It means that we may lost some resources attached to
related tasks, that is not fatal. They can be reclaimed automatically
when current test process exits.
What we can do for test cleanup is that try to reset "ev_thpriv" that
will be reused for subsequent test cases. Otherwise, one test failure
may block all subsequent test cases, that is too bad.
For such purpose we introduce "force" parameter for DAOS internal API
daos_event_priv_reset() to cleanup the shared event queue by force.
The patch also contains some other test environment cleanup.
Required-githooks: true
Before requesting gatekeeper:
Features:
(orTest-tag*
) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.Gatekeeper: