[BUG] Frequent kernel panics occurring during operation #1342
Comments
I'm seeing the same in #1340 |
@Jellyfrog Did you see the system reboot, or is it just that Harvester cannot boot up? |
@janeczku Need more information to narrow down the issue:
|
The message in the summary is a warning, not a panic. If the system reboots later due to a panic it might be related, but there's no telling from this message. Can you enable crash dumps on this system? (install kdump and yast2-kdump and then run yast2 kdump to enable, then reboot) That will, at minimum, capture the log at the point of failure and should also capture a system kernel memory dump for further diagnosis. |
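For anyone who wants to confirm after the reboot that kdump is actually armed, here is a minimal Python sketch. It assumes the standard Linux interfaces `/proc/cmdline` (which should carry a `crashkernel=` reservation) and `/sys/kernel/kexec_crash_loaded` (which reads `1` once a crash kernel is loaded). The function names are my own and not part of any kdump tooling.

```python
from pathlib import Path


def has_crashkernel(cmdline: str) -> bool:
    """Return True if the kernel command line reserves memory for a crash kernel."""
    return any(tok.startswith("crashkernel=") for tok in cmdline.split())


def kdump_armed(cmdline_path: str = "/proc/cmdline",
                loaded_path: str = "/sys/kernel/kexec_crash_loaded") -> bool:
    """Best-effort check: crashkernel= is set and a crash kernel is loaded."""
    try:
        cmdline = Path(cmdline_path).read_text()
        loaded = Path(loaded_path).read_text().strip()
    except OSError:
        # Missing or unreadable files (non-Linux host, restricted container)
        # are treated as "not armed".
        return False
    return has_crashkernel(cmdline) and loaded == "1"


if __name__ == "__main__":
    print("kdump armed:", kdump_armed())
```

If this prints `False` on a host where kdump was enabled, the crash kernel was probably not loaded and a panic would not produce a dump.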
@yasker I don't think we can install additional packages on Harvester, can we? |
I'm not actually sure. I left it unattended overnight and didn't check. |
Without a capture of how the host is actually failing, there's not much the kernel folks can do to debug it. |
@janeczku Can you help to fill in more information from #1342 (comment) ? |
And yes, unfortunately, we cannot install additional packages to the OS. Filed rancher/elemental-toolkit#751 in cOS-toolkit to see what we can do to help with the kernel debugging. |
@Jellyfrog Can you help to check the |
@yasker We have regular failures here. We've already set up SOL to capture any dumps. I understand the nature of cOS, but would appreciate some flexibility regarding setting kernel parameters or facilities for collecting dumps. |
@alexdepalex we're working on that now. The enhanced kernel debugging ability will definitely be ready for v1.0 GA (if it misses v0.3). |
Another report of kernel |
Please include more output from netconsole/serial console. Also, is this using Optane? This server runs a very old BIOS version that has an update available fixing machine check exceptions (which panic the machine immediately) when using Intel Optane. |
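In case it helps with collecting netconsole output: netconsole ships kernel log lines as plain UDP datagrams, so any UDP listener on a second machine can record them. Below is a minimal Python sketch of such a receiver. It assumes the sender targets port 6666 (netconsole's default remote port); the helper names are hypothetical, and tools like netcat or a syslog daemon would do the same job.

```python
import socket
from typing import List


def open_listener(host: str = "0.0.0.0", port: int = 6666) -> socket.socket:
    """Bind a UDP socket; 6666 is netconsole's default target port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_lines(sock: socket.socket, count: int, timeout: float = 5.0) -> List[str]:
    """Collect up to `count` datagrams as text, stopping early on timeout."""
    sock.settimeout(timeout)
    lines: List[str] = []
    try:
        while len(lines) < count:
            data, _addr = sock.recvfrom(65535)
            # Kernel messages are normally ASCII; replace anything undecodable.
            lines.append(data.decode("utf-8", errors="replace").rstrip("\n"))
    except socket.timeout:
        pass
    return lines
```

Calling `receive_lines(open_listener(), count=1000, timeout=3600)` on the receiving host and writing the result to a file would preserve whatever the crashing node manages to send before it reboots.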
@dirkmueller We already applied the latest fw updates. Will share our config later, but @janeczku already has it. I also checked with Rado already, but he couldn't find any occurrences of this type of issue in his resources. |
This is the same warning as in the summary. |
Thanks @jeffmahoney , corrected. |
Okay great, that's good news. I was just going from the original bug report and searching for firmware-related changelogs. It is missing this update: https://www.dell.com/support/home/de-de/drivers/driversdetails?driverid=4crd2&oscode=wst14&productcode=poweredge-r740xd&src=o I have no indication that it has anything to do with the issue, so feel free to postpone this. |
I've filed a bug report to address the warning, at least: https://bugzilla.suse.com/show_bug.cgi?id=1191238 |
@dirkmueller I checked, and unfortunately, we're running an ancient version of the BIOS (2.8.2). Not sure if I can use a more recent version. Our lab setup consists of the following hardware:
|
Since I can't make these kernel boot parameters persistent, I just caught one trace. Crash
|
This one is from the same host, but different panic.
|
@jeffmahoney @dirkmueller @janeczku see the panic trace provided by Alex above. |
The first Oops is trimmed such that we see a hard lockup report from a previous boot and are missing the address being dereferenced.
Given that we've already dereferenced The second Oops points to kernel/sched/fair.c:3636
The oops points to a dereferencing of
It's unclear to me if the patch @davidlohr posted will fix a use-after-free, but Michal Koutny commented on IRC that he suspects it could (with the caveat that it needs more investigation to confirm.) I'll build a kernel with that patch applied for testing. |
Please ignore the analysis of the first oops. I was reading the wrong register dump. It's not actually an oops and is probably another lockup report. The lines at the top of that report are from an Oops but it's truncated. |
@alexdepalex Is there any log above the line |
No. |
Built packages for the kernel are here: https://download.opensuse.org/repositories/home:/jeff_mahoney:/branches:/SUSE:/SLE-15-SP3:/Update/standard/x86_64/ You should only need kernel-default and maybe kernel-default-optional. |
@bk201 Can you help to build a Harvester iso today with the kernel from #1342 (comment) ? Also, since it's a kernel debug build, it's better if we can include kdump into the build as well. Ref: #1342 (comment) |
Some more information on how to debug in the OS: rancher/elemental-toolkit#751 (comment) |
Got another crash with the hotfix kernel. Uploaded the dump
|
Additional dumps on the ftp server.
|
@jeffmahoney this doesn't seem to be an easy fix at the moment. Is it possible to have a kernel package built without the commit that introduced the regression? Or is there any other way we can move forward, since we're releasing v0.3.0 next Monday? |
cc @davidlohr @dirkmueller ^^ |
I'm also encountering very frequent kernel panics in 0.3.0-rc1 in a different setup. Hardware: Dell PowerEdge R630 The panics reliably happen only a few minutes after boot without creating any additional VMs. We already did boot the system in I'll also try to provide crash dumps as soon as possible. |
We no longer observed kernel panics on
The most recent SLES 15 SP3 kernel package not containing the two problematic commits is This is probably the kernel package that should be used in lieu of |
I'm getting the same random crashes with the official 0.3.0 release... |
Facing this problem while trying to update the kernel... |
I believe that I'm hitting the same kernel panic on 0.3.0.
|
@silug the call trace looks different. Can you file another issue? Please also include the environment you're running on and other details, e.g. how often you see the issue. |
I'm encountering the same problem - can anyone comment if kernel-default-5.3.18-59.19.1.x86_64.rpm fixed the problem? This is what I caught from my serial console.
|
My kernel crashes are gone after installing kernel-default-5.3.18-59.19.1. |
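To check which kernel a node is actually running against the package version mentioned above, something like the following Python sketch could work. It assumes that `uname -r` on SLES reports the package version without its last component and with a flavor suffix (for example `5.3.18-59.19-default` for `kernel-default-5.3.18-59.19.1`); that mapping is my assumption and worth verifying on your own host. The constant and function names are hypothetical.

```python
import platform

# Package version one commenter in this thread reported as crash-free.
REPORTED_GOOD = "5.3.18-59.19.1"


def base_release(release: str) -> str:
    """Strip a trailing flavor such as "-default" from a uname -r style string."""
    base, sep, flavor = release.rpartition("-")
    if sep and flavor and not flavor[0].isdigit():
        return base
    return release


def matches_package(release: str, package_version: str = REPORTED_GOOD) -> bool:
    """True if the running release corresponds to the given package version."""
    base = base_release(release)
    # Assumption: the package version is the release plus an extra ".N" suffix.
    return package_version == base or package_version.startswith(base + ".")


if __name__ == "__main__":
    running = platform.release()
    print(running, "matches reported-good kernel:", matches_package(running))
```

This only compares version strings; it says nothing about whether a given kernel contains or lacks the problematic commits.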
Same issue here, random reboots on v0.3.0
|
The master build ISO has the kernel updated, which should address this issue. |
This will be resolved in the |
With Harvester 0.3.0-rc1, nodes are randomly rebooting/crashing.
The following trace can be found in the kernel logs shortly before the automatic reboot (due to panic=10) occurs:
Same or similar issue has been reported for CoreOS (Fedora Kernel).
HW: Dell PowerEdge R740xd
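As a side note on the `panic=10` setting mentioned above: a positive value makes the kernel reboot that many seconds after a panic, `0` makes it wait indefinitely, and a negative value reboots immediately. The Python sketch below extracts the setting from a kernel command line string; the helper names are hypothetical, and the `kernel.panic` sysctl can change the effective value at runtime, which this does not cover.

```python
from typing import Optional


def panic_timeout(cmdline: str) -> Optional[int]:
    """Return the value of the last well-formed panic=N token, or None if absent."""
    value = None
    for tok in cmdline.split():
        if tok.startswith("panic="):
            try:
                value = int(tok[len("panic="):])
            except ValueError:
                continue  # ignore malformed values
    return value


def describe(timeout: Optional[int]) -> str:
    """Explain what a given panic= value means for the host."""
    if timeout is None:
        return "panic= is not set on the command line"
    if timeout > 0:
        return f"host reboots {timeout} seconds after a panic"
    if timeout == 0:
        return "host waits indefinitely after a panic"
    return "host reboots immediately after a panic"


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as fh:
            print(describe(panic_timeout(fh.read())))
    except OSError:
        print("could not read /proc/cmdline")
```

With `panic=10` a node is back up within seconds, which is why an unattended host can look merely "rebooted" unless a serial console, netconsole, or kdump captured the trace.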