Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-14725 client: force cleanup event query when test teardown #13509

Closed
wants to merge 7 commits into from

Conversation

Nasf-Fan
Copy link
Contributor

@Nasf-Fan Nasf-Fan commented Dec 15, 2023

For the test cases that are driven by CMOCKA, the test_case_teardown()
may be triggered by some signal that is captured by CMOCKA registered
signal handler. Under such case, related logic has long jumped out of
original DAOS lower level (object or cart) context. Related tasks may
be still in RUNNING status, but they are not runnable any longer even
if re-scheduled. It means that we may lost some resources attached to
related tasks, that is not fatal. They can be reclaimed automatically
when current test process exits.

What we can do for test cleanup is that try to reset "ev_thpriv" that
will be reused for subsequent test cases. Otherwise, one test failure
may block all subsequent test cases, that is too bad.

For such purpose we introduce "force" parameter for DAOS internal API
daos_event_priv_reset() to cleanup the shared event queue by force.

The patch also contains some other test environment cleanup.

Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follows the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate watchers.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Copy link

github-actions bot commented Dec 15, 2023

Bug-tracker data:
Ticket title is 'Retry for object sync fails in cli_obj.c'
Status is 'In Review'
Labels: 'triaged'
https://daosio.atlassian.net/browse/DAOS-14725

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-14725 branch 2 times, most recently from 3763c23 to 93e1f48 Compare December 16, 2023 01:51
@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13509/3/execution/node/1319/log

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13509/3/execution/node/1365/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-14725 branch from 93e1f48 to f628067 Compare December 18, 2023 05:42
@Nasf-Fan Nasf-Fan marked this pull request as ready for review December 20, 2023 17:20
@Nasf-Fan Nasf-Fan requested a review from a team as a code owner December 20, 2023 17:20
Copy link
Contributor

@mchaarawi mchaarawi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change seems inappropriate to me as it's changing public API for some internal test case situation.

* Wait and retry at most 60 seconds if related task is in busy status.
* Currently, it is used by test logic to cleanup task with loop retry.
*/
sleep(1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is unacceptable imo. sleep should always be avoided in user space code. this can introduce many issues for apps.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

daos_event_priv_reset() is not application visible API, instead, it is only used by DAOS internal tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i did not say that. my concern is that you are adding a new field in the public event structure (that users do not know about and will cause confusion if they try to use) and a sleep in the code just for testing purposes. this solution is not acceptable IMO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-checked the logs and related cmocka logic, and found that the test process has already long jumped out of related task context. Under such case, aborting task become meaningless. I will remake the patch to handle that.

uint64_t ev_debug;
uint32_t ev_debug;
/** Abort the task(s) related with the event queue. */
uint32_t ev_abort:1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a public API change. given also this is an internal change, does not seem to apply to user events, this should not be here.

Copy link
Contributor Author

@Nasf-Fan Nasf-Fan Dec 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a space hole after "ev_error" in original "struct daos_event" for 64-bits alignment.

typedef struct daos_event {
/** return code of non-blocking operation /
int ev_error;
/
* Internal use - 152 + 8 bytes pad for pthread_mutex_t size difference on aarch64 /
struct {
uint64_t space[20];
} ev_private;
/
* Used for debugging */
uint64_t ev_debug;
} daos_event_t;

I think I can add the new field into such hole to avoid compatibility trouble caused by the layout changes for "struct daos_event".

@@ -306,7 +306,7 @@ async_overlap(void **state)
static inline int
test_case_teardown(void **state)
{
assert_rc_equal(daos_event_priv_reset(), 0);
assert_rc_equal(daos_event_priv_reset(true), 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just remove this call here and just avoid this issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think so. If we do not call daos_event_priv_reset(), there may be task leak when test completed. That will hide some potential issues.

On the other hand, daos_event_priv_reset() is not public API, instead, it is in src/client/api/client_internal.h. That means regular application cannot directly use such API. Then changing it will not affect the application, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well IMO that's really a bug in the test itself. doing a force reset in case of all testing might hide bugs IMO and we should not do that here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the test itself does not has bug, instead, it is the CMOCKA framework to make the long jump out of the object level logic directly without recovering the shared event queue. Related test has already failed out, it did not hide the failure. What we do here is to cleanup the environment to make the subsequent test can go ahead.

What is your suggestion for this issue? Just removing the daos_event_priv_reset()? That only hides the "warning message" by skipping cleanup, but it will cause subsequent tests to be failed. So it does not resolve the trouble.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry i miss how cmocka is related to the object level logic.
If you have a test that short circuits in the object level internal code, you should make sure that this test cleans (forcefully) but only that test. making a force cleanup for all tests is what im against in this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i understand that and my concern again is for tests. consider this:
a PR introduces a bug in an existing test where it accidentally leaves an event in running state. with this change, this part of the code is not executed anymore:
https://github.com/daos-stack/daos/pull/13509/files#diff-11ada6ab4871be42589c1fed68b76148bb02a9d0001e3b027a2672e919bbb5e8L1084
and we just hide this failure, no?

Copy link
Contributor Author

@Nasf-Fan Nasf-Fan Jan 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not know whether we have efficient way to distinguish the two cases: test corruption or test logic bug. From the perspective of cleanup logic, it may be not the cleanup logic needs to care about. Sorry, I do not know how to give a suitable example for that right now.

If your concern case is more important, then just ignore the reset failure and go ahead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry but im not talking about a test logic issue. im talking about an actual bug that could be introduced.
opening an object and not closing it is not something that can be detected in this case anyway. this about exiting an operation without finalizing the internal event. if we introduce such a bug unrelated to the cmocka framework, for example in pool_disconnect, this would not be detected. so we cannot ignore the reset failure unfortunately.

could you perhaps in your case where you think the fault is OK/expected but cmocka calling test_teardown() needs to be adjusted is simply registering a different teardown handler when setting up the test instead of changing the default one that does this force cleanup?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fault in the case is not expected, but the long jump from cmocka is not controllable by DAOS. That is the direct reason for the dirty event queue when test_teardown(). It is not special for this test, instead, it is general issue for all test cases with test_teardown called via cmocka() long jump. So we cannot simple replace test_teardown one by one. On some degree, if your concern case has to be considered, then ignore such dirty event queue (and error message) when test_teardown() may be the best choice: mean fail the subsequent tests to draw the attention. How do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the solution to distinguish whether the logic is from signal handling or regular test exit. Please check the new commit. Thanks!

@Nasf-Fan Nasf-Fan requested a review from mchaarawi December 22, 2023 08:17
int rc;

if (unlikely(ev != NULL && ev->ev_abort))
D_GOTO(err, rc = -DER_CANCELED);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it is in dead loop in retry of object IO here, then the API task seems should not be finished and should not get to user's daos_event_priv_reset() call.
In the ticket pasted log is not detailed enough so not very sure if really related with this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related task is not completed, but the test logic long jumped out of the context, that is related with cmocka logic for handling some signal. I will remake the patch to handle that.

For the test cases that are driven by CMOCKA, the test_case_teardown()
may be triggered by some signal that is captured by CMOCKA registered
signal handler. Under such case, related logic has long jumped out of
original DAOS lower level (object or cart) context. Related tasks may
be still in RUNNING status, but they are not runnable any longer even
if re-scheduled. It means that we may lost some resources attached to
related tasks, that is not fatal. They can be reclaimed automatically
when current test process exits.

What we can do for test cleanup is that try to reset "ev_thpriv" that
will be reused for subsequent test cases. Otherwise, one test failure
may block all subsequent test cases, that is too bad.

For such purpose we introduce "force" parameter for DAOS internal API
daos_event_priv_reset() to cleanup the shared event queue by force.

The patch also contains some other test environment cleanup.

Required-githooks: true

Signed-off-by: Fan Yong <[email protected]>
@Nasf-Fan Nasf-Fan changed the title DAOS-14725 client: force abort task with loop retry DAOS-14725 client: force cleanup event query when test teardown Dec 26, 2023
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-14725 branch from cfc92e0 to f4488bd Compare December 26, 2023 15:26
@Nasf-Fan
Copy link
Contributor Author

Nasf-Fan commented Jan 7, 2024

Ping reviewers! Thanks!

Copy link
Contributor

@mchaarawi mchaarawi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i still think we are making some generic code / test changes for just a single use case. this change might hide other bugs IMO as it removes the error condition in case something is leaked in all daos_test.

@@ -306,7 +306,7 @@ async_overlap(void **state)
static inline int
test_case_teardown(void **state)
{
assert_rc_equal(daos_event_priv_reset(), 0);
assert_rc_equal(daos_event_priv_reset(true), 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well IMO that's really a bug in the test itself. doing a force reset in case of all testing might hide bugs IMO and we should not do that here.

@@ -704,6 +704,9 @@ dtx_resend_delay(test_arg_t *arg, daos_oclass_id_t oclass)
daos_fail_loc_set(0);
dtx_set_fail_loc(arg, 0);

/* Wait for the former delayed RPC before destroying the container to avoid DER_BUSY. */
sleep(2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please explain why 2 sec sleep is needed here? why not 1 or 10?
does it vary if we change some CI hardware?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We injected some fail_loc to make related IO handler to delay process the IO request about 1 second later after client RPC timeout. Here, we sleep additional 1 second to guarantee that related IO handler has already responded.

Cleanup test environment only when logic is longjump from cmocka
for handling signal.

Signed-off-by: Fan Yong <[email protected]>
@Nasf-Fan Nasf-Fan requested a review from mchaarawi January 9, 2024 08:40
sigismember(&sigset, SIGSEGV) ||
sigismember(&sigset, SIGBUS) ||
sigismember(&sigset, SIGSYS))) {
force = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is good if it can check if it is longjump from cmocka internally, in that case it mostly like there are mem corruption already, maybe not worth to do the follow "cleanup".
How about just print fatal err log here first, to see if it will really hit this case in following test.
The change in event.c is not bad, but probably not must to do that now to avoid affect event internal handling. If will hit this failure later in CI tests, we can consider further later?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's good idea to print the error message when hit such trouble. As for whether force cleanup or not, I tend to do that. Because:

  1. If force cleanup works well, then the subsequent tests can go ahead.
  2. We may have not verified related cleanup logic under some fatal failure cases. We do that now, it may help to expose some potential bug(s).

Since the test has already corrupted, the cleanup logic will not make the situation to be worse.

Copy link
Contributor

@mchaarawi mchaarawi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the change now seems harmless and im fine to approve it. although it would be good to look at the test results / log to see if this actually does get triggered to justify it (testing for daos_test has not started yet).

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13509/10/testReport/

Trigger SIGFPE signal.

Signed-off-by: Fan Yong <[email protected]>
@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13509/11/testReport/

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13509/11/execution/node/1434/log

*/
if (sigprocmask(0, NULL, &sigset) < 0) {
print_message("sigprocmask failure\n");
} else {
Copy link
Contributor Author

@Nasf-Fan Nasf-Fan Jan 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liuxuezhao @mchaarawi , sorry, I made some mistake in former comment. We cannot distinguish whether the test_teardown() is call from normal test complete routine or from cmocka signal handler longjmp. Because after signal handler longjmp, its signal mask has already been recovered, means that the signal that triggered the exception has been dropped when come here. So we cannot use the signal mask to distinguish the context unless before calling siglongjmp().

By registering my own signal handler, I can simulate the exception and trigger teardown() by force. That proves the teardown() by force can work (and subsequent test can go ahead without failure). But because we cannot distinguish the context without our own signal hander, then whether still keep the teardown() by force? What's your suggestion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping @liuxuezhao && @mchaarawi , what's your suggestion about above comment? Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean in this PR it actually cannot distinguish whether or not should force daos_event_priv_reset()? if so not very sure what's the benefit to do so?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot distinguish whether the caller is from signal handler caused long jump or from regular test cleanup logic. So if consider @mchaarawi 's concern about potentially hiding bug when daos_event_priv_reset() by force, then we may have to drop the changes for daos_event_priv_reset() with "force" parameter. Then without such fixes, we have very few things to do for this ticket. Here is the simplified patch in another PR:
#13596.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the other PR looks much cleaner to me for this issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liuxuezhao , are you fine with another PR? if yes, I will replace this one with it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine to go with the other PR first, thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Nasf-Fan "By registering my own signal handler, I can simulate the exception and trigger teardown() by force", if register signal handler in test_setup, can it get notification if got those abnormal signal? maybe it can be useful for daos_test to catch the error. just FYI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean that by registering my own signal RPC handler, the daos_event_priv_reset() can distinguish whether the the caller is from signal handler or normal exit via some hack (but not suggested) flag, then I can verify the force cleanup logic works.

But because the corruption can happen during any test case, then we need to register related signal handlers to replace CMOCKA registered ones for almost all of our test cases when setup, that will cause a lot of code lines changes. Not sure whether it is worth or not.

On the other hand, which signal will be handled by CMOCKA is blind for DAOS, consider CMOCKA potential upgrade, then we need to trace CMOCKA changes (for signal handler) via new configuration. In theory, we can do that, but whether worth or not only for test cleanup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, that seems quite complex and hack, maybe not worth to do that. thanks for the explanation.

@Nasf-Fan
Copy link
Contributor Author

Replaced by #13596

@Nasf-Fan Nasf-Fan closed this Jan 23, 2024
@Nasf-Fan Nasf-Fan deleted the Nasf-Fan/DAOS-14725 branch January 23, 2024 14:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants