DAOS-14725 client: force cleanup event query when test teardown #13509
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -20,6 +20,7 @@ | |
#include <linux/limits.h> | ||
#include <sys/stat.h> | ||
#include <dirent.h> | ||
#include <signal.h> | ||
|
||
#include <cmocka.h> | ||
|
||
|
@@ -66,7 +67,7 @@ extern char *test_io_dir; | |
/* the IO conf file*/ | ||
extern const char *test_io_conf; | ||
|
||
extern int daos_event_priv_reset(void); | ||
extern int daos_event_priv_reset(bool force); | ||
#define TEST_RANKS_MAX_NUM (13) | ||
#define DAOS_SERVER_CONF "/etc/daos/daos_server.yml" | ||
#define DAOS_SERVER_CONF_LENGTH 512 | ||
|
@@ -306,7 +307,35 @@ async_overlap(void **state) | |
static inline int | ||
test_case_teardown(void **state) | ||
{ | ||
assert_rc_equal(daos_event_priv_reset(), 0); | ||
char *str = NULL; | ||
sigset_t sigset; | ||
bool force = false; | ||
|
||
/* | ||
* If one of SIGFPE/SIGILL/SIGSEGV/SIGBUS/SIGSYS is in the signal mask, then the logic is | ||
* longjump from cmocka for handling the signal, then need force cleanup test environment. | ||
*/ | ||
if (sigprocmask(0, NULL, &sigset) < 0) { | ||
print_message("sigprocmask failure\n"); | ||
} else { | ||
@liuxuezhao @mchaarawi, sorry, I made a mistake in my former comment. We cannot distinguish whether test_teardown() is called from the normal test completion routine or from the longjmp in the cmocka signal handler. After the signal handler's longjmp, the signal mask has already been restored, which means the signal that triggered the exception is no longer in the mask by the time we get here. So we cannot use the signal mask to distinguish the context, except before calling siglongjmp(). By registering my own signal handler, I can simulate the exception and trigger teardown() by force. That proves that the forced teardown() works (and subsequent tests can go ahead without failure). But since we cannot distinguish the context without our own signal handler, should we still keep the forced teardown()? What's your suggestion?

Ping @liuxuezhao and @mchaarawi, what's your suggestion about the above comment? Thanks!

You mean that this PR actually cannot distinguish whether or not it should force daos_event_priv_reset()? If so, I am not sure what the benefit of doing so is.

We cannot distinguish whether the call comes from the long jump caused by the signal handler or from the regular test cleanup logic. So, considering @mchaarawi's concern that forcing daos_event_priv_reset() could potentially hide bugs, we may have to drop the changes that add the "force" parameter to daos_event_priv_reset(). Without such fixes, there is very little left to do for this ticket. Here is the simplified patch in another PR:

I think the other PR looks much cleaner to me for this issue.

@liuxuezhao, are you fine with the other PR? If yes, I will replace this one with it.

I am fine to go with the other PR first, thanks.

@Nasf-Fan "By registering my own signal handler, I can simulate the exception and trigger teardown() by force": if we register a signal handler in test_setup, can it get notified when one of those abnormal signals arrives? Maybe it could be useful for daos_test to catch the error. Just FYI.

I mean that by registering my own signal handler, daos_event_priv_reset() can distinguish whether the caller is the signal handler or the normal exit path via some hack flag (not recommended), and then I can verify that the force cleanup logic works. But because the corruption can happen during any test case, we would need to register the related signal handlers at setup, replacing the ones registered by CMOCKA, for almost all of our test cases, which would change a lot of lines of code. I am not sure whether it is worth it. On the other hand, which signals CMOCKA handles is opaque to DAOS; considering potential CMOCKA upgrades, we would need to track CMOCKA changes (to its signal handlers) via new configuration. In theory we can do that, but is it worth it only for test cleanup?

OK, that seems quite complex and hacky, maybe not worth doing. Thanks for the explanation.
||
if (unlikely(sigismember(&sigset, SIGFPE))) | ||
str = "SIGFPE"; | ||
else if (unlikely(sigismember(&sigset, SIGILL))) | ||
str = "SIGILL"; | ||
else if (unlikely(sigismember(&sigset, SIGSEGV))) | ||
str = "SIGSEGV"; | ||
else if (unlikely(sigismember(&sigset, SIGBUS))) | ||
str = "SIGBUS"; | ||
else if (unlikely(sigismember(&sigset, SIGSYS))) | ||
str = "SIGSYS"; | ||
|
||
if (str != NULL) { | ||
print_message("Hit corruption (%s), cleanup by force\n", str); | ||
force = true; | ||
} | ||
} | ||
|
||
assert_rc_equal(daos_event_priv_reset(force), 0); | ||
return 0; | ||
} | ||
|
||
|
Can you please explain why a 2-second sleep is needed here? Why not 1 or 10?
Does it vary if we change some CI hardware?
We injected a fail_loc that makes the related IO handler delay processing the IO request until about 1 second after the client RPC times out. Here, we sleep 1 additional second to guarantee that the IO handler has already responded.
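The timing argument in the reply above can be written out as simple arithmetic. The sketch below is only an illustration: the macro and function names are hypothetical, and the two 1-second values are the approximate figures quoted in the reply, not constants taken from the DAOS source.

```c
#include <unistd.h>

/* Hypothetical names; both values are the approximate ones from the reply. */
#define HANDLER_DELAY_AFTER_TIMEOUT_SEC 1U /* injected delay past the client RPC timeout */
#define SAFETY_MARGIN_SEC               1U /* extra wait so the handler has replied */

/* Total time to wait after the client RPC has timed out. */
static unsigned int
post_timeout_wait_sec(unsigned int delay_sec, unsigned int margin_sec)
{
	return delay_sec + margin_sec;
}

static void
wait_for_delayed_handler(void)
{
	unsigned int left = post_timeout_wait_sec(HANDLER_DELAY_AFTER_TIMEOUT_SEC,
						  SAFETY_MARGIN_SEC);

	/* sleep() returns the unslept seconds if a signal interrupts it, so retry. */
	while (left > 0)
		left = sleep(left);
}
```

With these figures the wait comes to 2 seconds. Since it is derived from the injected delay rather than from hardware speed, a shorter sleep could race with the delayed handler, while a longer one would only slow the test down.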