Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add wait_for_shutdown logic to check delayed shutdown of DUT #16805

Merged
merged 1 commit into from
Feb 6, 2025

Conversation

augusdn
Copy link
Contributor

@augusdn augusdn commented Feb 5, 2025

Description of PR

Summary:
During test_chassis_reboot run, it is possible that DUT(s) can experience delay in executing reboot command. This delay in reboot command execution can cause timeout error on check_critical_processes, as the DUT(s) will not be able to complete the reboot and restart critical processes within the expected time frame.
Therefore, add wait_for_shutdown logic to check delayed shutdown of DUT causing timeout error on check_critical_processes

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405
  • 202411

Approach

What is the motivation for this PR?

Failure during nightly test

How did you do it?

by adding wait_for_shutdown logic to check delayed shutdown of DUT

How did you verify/test it?

tested on T2 8800 device

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@augusdn augusdn requested a review from prgeor as a code owner February 5, 2025 16:38
@mssonicbld
Copy link
Collaborator

/azp run

@augusdn augusdn requested a review from wenyiz2021 February 5, 2025 16:38
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

@wenyiz2021 wenyiz2021 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@augusdn thanks for the change, could you elaborate on the issue you have seen?
This testcase was intended to cover an unhappy case that when linecards of a chassis got reboot nearly synchronized(I call it a sync reboot if they reboot within 30sec).
e.g. LC01 reboot at timestamp 0, LC02 reboot at timestamp 5, LC03 reboot at timestamp 29.

In this unhappy case, we want to make sure nothing crashed.
with your change, I'm afraid the LCs will wait for each other to complete the shutdown, which is not the case I want to cover initially.

@augusdn
Copy link
Contributor Author

augusdn commented Feb 5, 2025

@wenyiz2021 Here is the snippet of log from failed case.
Test log:
26/01/2025 02:18:51 base._run L0071 DEBUG | /var/src/sonic-mgmt/tests/common/devices/multi_asic.py::_run_on_asics#136: [lc3-1] AnsibleModule::command, args=["reboot"], kwargs={}
26/01/2025 02:18:53 reboot.wait_for_startup L0196 INFO | waiting for ssh to startup on lc1-1

Syslog from lc1:
2025 Jan 26 02:18:46.645999 lc1-1 NOTICE root: User requested rebooting device ...
2025 Jan 26 02:19:01.271662 lc1-1 NOTICE root: Failed killing container pmon RC 1 .
2025 Jan 26 02:19:09.661568 lc1-1 INFO CCmisApi: reboot: Initiate shutdown command
2025 Jan 26 02:21:49.548658 lc1-1 INFO cisco_platform.py[1193]: boot: cold boot

Meanwhile test logged following output:
26/01/2025 02:19:16 processes_utils.get_critical_processes_s L0035 INFO | The status of checking process in container 'pmon' is: False
26/01/2025 02:19:16 processes_utils.get_critical_processes_s L0037 INFO | The processes not running in container 'pmon' are: '[]'
26/01/2025 02:19:16 utilities.wait_until L0153 DEBUG | _all_critical_processes_healthy is False, wait 20 seconds and check again
26/01/2025 02:19:36 transport._log L1873 DEBUG | EOF in transport thread
26/01/2025 02:19:36 utilities.wait_until L0135 DEBUG | Time elapsed: 32.826792 seconds

We can see critical process check is failing due to pmon, but we get host unreachable error right after this
pytest_ansible.errors.AnsibleConnectionFailure: Host unreachable in the inventory
, error:Host unreachable in the inventory
26/01/2025 02:20:55 utilities.wait_until L0153 DEBUG | _all_critical_processes_healthy is False, wait 20 seconds and check again
26/01/2025 02:21:15 utilities.wait_until L0135 DEBUG | Time elapsed: 131.649778 seconds

This continues until time elapsed = 309s
26/01/2025 02:23:53 processes_utils.get_critical_processes_s L0035 INFO | The status of checking process in container 'pmon' is: False
26/01/2025 02:23:53 processes_utils.get_critical_processes_s L0037 INFO | The processes not running in container 'pmon' are: '[]'
26/01/2025 02:23:53 utilities.wait_until L0153 DEBUG | _all_critical_processes_healthy is False, wait 20 seconds and check again
26/01/2025 02:24:13 utilities.wait_until L0135 DEBUG | Time elapsed: 309.133999 seconds

At this point, we see other processes have started, but their uptime is only 26s. Which shows LC has rebooted recently
"rsyslogd RUNNING pid 48, uptime 0:00:25", "supervisor-proc-exit-listener RUNNING pid 38, uptime 0:00:26"

@wenyiz2021 wenyiz2021 requested a review from arlakshm February 5, 2025 22:55
@wenyiz2021
Copy link
Contributor

@arlakshm for viz

Copy link
Contributor

@wenyiz2021 wenyiz2021 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no questions if delayed reboot of LC01 does not delay the start of reboot of LC02.
will approve if it pass on Arista and Nokia T2

@augusdn
Copy link
Contributor Author

augusdn commented Feb 6, 2025

Successful test run and confirmed synchronized reboot on 7804 device

06/02/2025 01:42:05 test_chassis_reboot.execute_reboot_comma L0030 INFO | Run cold reboot on
06/02/2025 01:42:08 test_chassis_reboot.execute_reboot_comma L0030 INFO | Run cold reboot on
06/02/2025 01:42:40 test_chassis_reboot.execute_reboot_comma L0030 INFO | Run cold reboot on
06/02/2025 01:43:10 test_chassis_reboot.test_parallel_reboot L0092 INFO | DEBUGGING - Wait for all reboots to complete
06/02/2025 01:43:10 test_chassis_reboot.test_parallel_reboot L0096 INFO | Reboot and shutdown result: shutdown success on lc5-1
06/02/2025 01:43:10 test_chassis_reboot.test_parallel_reboot L0096 INFO | Reboot and shutdown result: shutdown success on lc3-1
06/02/2025 01:43:10 test_chassis_reboot.test_parallel_reboot L0096 INFO | Reboot and shutdown result: shutdown success on lc7-1

platform_tests/test_chassis_reboot.py::test_parallel_reboot PASSED [100%]

@augusdn
Copy link
Contributor Author

augusdn commented Feb 6, 2025

Successful test run and confirmed synchronized reboot on 7250 device
06/02/2025 01:27:53 test_chassis_reboot.execute_reboot_comma L0030 INFO | Run cold reboot on
06/02/2025 01:28:08 test_chassis_reboot.execute_reboot_comma L0030 INFO | Run cold reboot on
06/02/2025 01:28:37 test_chassis_reboot.execute_reboot_comma L0030 INFO | Run cold reboot on
06/02/2025 01:28:47 test_chassis_reboot.test_parallel_reboot L0092 INFO | DEBUGGING - Wait for all reboots to complete
06/02/2025 01:28:47 test_chassis_reboot.test_parallel_reboot L0096 INFO | Reboot and shutdown result: shutdown success on lc1-1
06/02/2025 01:28:47 test_chassis_reboot.test_parallel_reboot L0096 INFO | Reboot and shutdown result: shutdown success on lc2-1
06/02/2025 01:28:59 test_chassis_reboot.test_parallel_reboot L0096 INFO | Reboot and shutdown result: shutdown success on lc3-1

platform_tests/test_chassis_reboot.py::test_parallel_reboot PASSED [100%]

@augusdn
Copy link
Contributor Author

augusdn commented Feb 6, 2025

Hi @arlakshm, could you help me review and merge the PR please? Thank you!

@arlakshm arlakshm requested a review from Javier-Tan February 6, 2025 06:04
@yejianquan yejianquan merged commit be40d55 into sonic-net:master Feb 6, 2025
13 checks passed
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Feb 6, 2025
…imeout error on check_critical_processes (sonic-net#16805)

Description of PR
Summary:
During test_chassis_reboot run, it is possible that DUT(s) can experience delay in executing reboot command. This delay in reboot command execution can cause timeout error on check_critical_processes, as the DUT(s) will not be able to complete the reboot and restart critical processes within the expected time frame.
Therefore, add wait_for_shutdown logic to check delayed shutdown of DUT causing timeout error on check_critical_processes

Approach
What is the motivation for this PR?
Failure during nightly test

How did you do it?
by adding wait_for_shutdown logic to check delayed shutdown of DUT

How did you verify/test it?
tested on T2 8800 device

co-authorized by: [email protected]
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202411: #16833

@mssonicbld
Copy link
Collaborator

Cherry-pick PR to msft-202405: Azure/sonic-mgmt.msft#65

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants