add wait_for_shutdown logic to check delayed shutdown of DUT #16805
…imeout error on check_critical_processes
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@augusdn thanks for the change, could you elaborate on the issue you have seen?
This test case was intended to cover an unhappy case in which the linecards of a chassis are rebooted nearly synchronously
(I call it a sync reboot if they all reboot within 30 seconds of each other),
e.g. LC01 reboots at timestamp 0, LC02 reboots at timestamp 5, LC03 reboots at timestamp 29.
In this unhappy case, we want to make sure nothing crashes.
With your change, I'm afraid the LCs will wait for each other to complete the shutdown, which is not the case I initially wanted to cover.
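To make the "sync reboot" definition concrete, here is a minimal sketch of the check it implies. The function name, the constant, and the inclusive treatment of the 30-second boundary are illustrative assumptions, not code from the test.

```python
from typing import Sequence

# Window quoted in the comment above; treating the boundary as inclusive is an assumption.
SYNC_WINDOW_SECONDS = 30.0


def is_sync_reboot(reboot_times: Sequence[float],
                   window: float = SYNC_WINDOW_SECONDS) -> bool:
    """Return True if all linecards started rebooting within `window` seconds of each other."""
    if not reboot_times:
        return False
    # The spread between the earliest and latest reboot must fit in the window.
    return max(reboot_times) - min(reboot_times) <= window


# Timestamps from the example: LC01 at 0, LC02 at 5, LC03 at 29.
print(is_sync_reboot([0, 5, 29]))
```

If a change made each linecard wait for the previous one to finish shutting down, the spread between reboot timestamps would grow and a check like this would eventually fail, which is the concern raised here.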
@wenyiz2021 Here is a snippet of the log from the failed case.
Syslog from lc1:
Meanwhile, the test logged the following output:
We can see that the critical process check is failing due to pmon, but we get a host unreachable error right after this.
This continues until time elapsed = 309s.
At this point, we see that other processes have started, but their uptime is only 26s, which shows that the LC has rebooted recently.
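The arithmetic behind that conclusion can be sketched as follows. The helper name is hypothetical, and the sketch assumes the elapsed time and the process uptime were sampled at roughly the same moment.

```python
def seconds_until_processes_started(elapsed_s: float, uptime_s: float) -> float:
    """Estimate how long after the timer started the processes came up.

    Assumes `elapsed_s` and `uptime_s` were sampled at about the same time.
    """
    return elapsed_s - uptime_s


# Figures quoted above: 309s elapsed, process uptime of 26s.
print(seconds_until_processes_started(309, 26))  # 283
```

Under that assumption, the processes came up roughly 283s into the wait, so most of the elapsed time was spent before the linecard had finished rebooting rather than on slow process startup.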
@arlakshm for viz
I have no questions as long as a delayed reboot of LC01 does not delay the start of the reboot of LC02.
I will approve if it passes on Arista and Nokia T2.
Successful test run and confirmed synchronized reboot on 7804 device:
06/02/2025 01:42:05 test_chassis_reboot.execute_reboot_comma L0030 INFO | Run cold reboot on
platform_tests/test_chassis_reboot.py::test_parallel_reboot PASSED [100%]
Successful test run and confirmed synchronized reboot on 7250 device:
platform_tests/test_chassis_reboot.py::test_parallel_reboot PASSED [100%]
Hi @arlakshm, could you help me review and merge the PR please? Thank you!
…imeout error on check_critical_processes (sonic-net#16805)
Description of PR
Summary: During test_chassis_reboot run, it is possible that DUT(s) can experience delay in executing reboot command. This delay in reboot command execution can cause timeout error on check_critical_processes, as the DUT(s) will not be able to complete the reboot and restart critical processes within the expected time frame. Therefore, add wait_for_shutdown logic to check delayed shutdown of DUT causing timeout error on check_critical_processes
Approach
What is the motivation for this PR? Failure during nightly test
How did you do it? by adding wait_for_shutdown logic to check delayed shutdown of DUT
How did you verify/test it? tested on T2 8800 device
co-authorized by: [email protected]
Cherry-pick PR to 202411: #16833
Cherry-pick PR to msft-202405: Azure/sonic-mgmt.msft#65
Description of PR
Summary:
During a test_chassis_reboot run, the DUT(s) can be slow to execute the reboot command. This delay can cause a timeout error in check_critical_processes, because the DUT(s) cannot complete the reboot and restart their critical processes within the expected time frame.
Therefore, add wait_for_shutdown logic to check for a delayed DUT shutdown, which is what causes the timeout error in check_critical_processes.
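As an illustration of the idea only, a wait-for-shutdown step can be sketched as a polling loop that returns once the DUT stops responding. This is a standalone sketch, not the code added by this PR: the signature, the `is_reachable` callback, and the default timeout and interval are all assumptions.

```python
import time
from typing import Callable


def wait_for_shutdown(
    is_reachable: Callable[[], bool],   # hypothetical probe, e.g. an SSH or ping check
    timeout: float = 300.0,             # assumed default, not taken from the PR
    interval: float = 5.0,              # assumed default, not taken from the PR
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the device stops responding or the timeout expires.

    Returns True if the device became unreachable (shutdown observed),
    False if it was still reachable when the timeout expired.
    """
    deadline = clock() + timeout
    while True:
        if not is_reachable():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Simulated DUT that answers three probes and then goes down.
probes = iter([True, True, True, False])
print(wait_for_shutdown(lambda: next(probes), timeout=60, interval=0))
```

Confirming that the DUT has actually gone down before the critical process check starts its own timer keeps a late reboot from consuming that check's time budget.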
Type of change
Back port request
Approach
What is the motivation for this PR?
Failure during nightly test
How did you do it?
By adding wait_for_shutdown logic to check for a delayed shutdown of the DUT.
How did you verify/test it?
Tested on a T2 8800 device.
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation