[BUG] Frequent kernel panics occurring during operation #1342

Closed
janeczku opened this issue Oct 1, 2021 · 49 comments
Labels: kind/bug, priority/0, require/release-note


@janeczku
Contributor

janeczku commented Oct 1, 2021

With Harvester 0.3.0-rc1, nodes are randomly rebooting/crashing.

The following trace can be found in the kernel logs shortly before the automatic reboot (due to panic=10) occurs:

[ 8258.424256] ------------[ cut here ]------------
[ 8258.424258] rq->tmp_alone_branch != &rq->leaf_cfs_rq_list
[ 8258.424281] WARNING: CPU: 33 PID: 0 at ../kernel/sched/fair.c:378 enqueue_task_fair+0x353/0x610
[ 8258.424283] Modules linked in: binfmt_misc rdma_ucm ib_uverbs rdma_cm iw_cm ib_cm ib_core ebt_ip ebtable_broute ebtables vhost_net vhost tun tap ipt_rpfilter xt_set xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net ip_set vxlan ip6_udp_tunnel udp_tunnel veth nf_conntrack_netlink nfnetlink xt_addrtype xt_recent xt_statistic xt_nat ipt_REJECT xt_tcpudp iptable_mangle ip6table_mangle ip6table_filter ip6table_nat ip6_tables xt_MASQUERADE xt_conntrack xt_comment iptable_filter xt_mark bpfilter iptable_nat ip_tables nf_nat x_tables nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter bridge stp llc af_packet bonding iscsi_ibft rfkill intel_rapl_msr intel_rapl_common ipmi_ssif isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel iTCO_wdt mgag200 intel_pmc_bxt iTCO_vendor_support kvm drm_kms_helper dell_smbios dcdbas(X) mei_me cec rc_core ipmi_si syscopyarea sysfillrect sysimgblt irqbypass pcspkr dell_wmi_descriptor wmi_bmof joydev mei
[ 8258.424350]  i2c_i801 lpc_ich fb_sys_fops ipmi_devintf ipmi_msghandler button drm fuse configfs overlay loop hid_generic usbhid ext4 crc16 mbcache jbd2 sd_mod crc32_pclmul crc32c_intel ghash_clmulni_intel xhci_pci xhci_hcd aesni_intel i40e crypto_simd usbcore cryptd ahci glue_helper libahci nvme igb libata nvme_core megaraid_sas t10_pi i2c_algo_bit dca wmi sunrpc dm_mirror dm_region_hash dm_log be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod
[ 8258.424401] Supported: Yes, External
[ 8258.424406] CPU: 33 PID: 0 Comm: swapper/33 Tainted: G          I    X    5.3.18-59.24-default #1 SLE15-SP3
[ 8258.424408] Hardware name: Dell Inc. PowerEdge R740xd/06WXJT, BIOS 2.8.2 08/27/2020
[ 8258.424416] RIP: 0010:enqueue_task_fair+0x353/0x610
[ 8258.424420] Code: 60 09 00 00 0f 84 cc fd ff ff 80 3d 9a b8 51 01 00 0f 85 bf fd ff ff 48 c7 c7 98 56 f3 ad c6 05 86 b8 51 01 01 e8 3d 0c fc ff <0f> 0b e9 a5 fd ff ff 49 63 95 48 0a 00 00 48 c7 c0 40 94 01 00 48
[ 8258.424423] RSP: 0018:ffffb8174037be40 EFLAGS: 00010086
[ 8258.424425] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 8258.424428] RDX: 000000000000002d RSI: ffffffffaeaecd6d RDI: 0000000000000046
[ 8258.424430] RBP: ffffa1677f62cd00 R08: ffffffffaeaecd40 R09: 000000000002c500
[ 8258.424432] R10: ffffb8174037bdc0 R11: 0000000000000000 R12: 0000000000000000
[ 8258.424433] R13: ffffa1677f62cc80 R14: 0000000000000001 R15: 0000000000000000
[ 8258.424436] FS:  0000000000000000(0000) GS:ffffa1677f600000(0000) knlGS:0000000000000000
[ 8258.424438] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8258.424440] CR2: 000000c000feb000 CR3: 000000be081c6002 CR4: 00000000007706e0
[ 8258.424442] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8258.424444] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8258.424446] PKRU: 55555554
[ 8258.424447] Call Trace:
[ 8258.424459]  ttwu_do_activate+0x72/0x170
[ 8258.424464]  sched_ttwu_pending+0xa5/0x110
[ 8258.424470]  do_idle+0x166/0x270
[ 8258.424476]  cpu_startup_entry+0x19/0x20
[ 8258.424481]  start_secondary+0x155/0x1a0
[ 8258.424489]  secondary_startup_64_no_verify+0xc2/0xd0
[ 8258.424494] ---[ end trace 5cd94bd1dde862f3 ]---

The same or a similar issue has been reported for CoreOS (Fedora kernel).

HW: Dell PowerEdge R740xd

@janeczku janeczku added the kind/bug Issues that are defects reported by users or that we know have reached a real release label Oct 1, 2021
@Jellyfrog

I'm seeing the same in #1340

@yasker yasker added the priority/0 Must be fixed in this release label Oct 1, 2021
@yasker yasker added this to the v0.3.0 milestone Oct 1, 2021
@yasker
Member

yasker commented Oct 1, 2021

@Jellyfrog Did you see the system reboot, or does Harvester just fail to boot up?

@yasker
Member

yasker commented Oct 1, 2021

@janeczku Need more information to narrow down the issue:

  1. Does this always happen on one machine, or on different machines each time? Are they the same model or different models?
  2. Have you observed any error messages on the IPMI console? If there is a hardware issue, the IPMI console can detect it.
  3. Do we have the full serial port output? (By the way, we should have this built in for the formal release; filed [FEATURE] Improve Harvester kernel debug-ability. #1343 to track it.)

@janeczku janeczku closed this as completed Oct 1, 2021
@janeczku janeczku reopened this Oct 1, 2021
@jeffmahoney

The message in the summary is a warning, not a panic. If the system reboots later due to a panic it might be related, but there's no telling from this message. Can you enable crash dumps on this system? (install kdump and yast2-kdump and then run yast2 kdump to enable, then reboot) That will, at minimum, capture the log at the point of failure and should also capture a system kernel memory dump for further diagnosis.
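
To illustrate the distinction between a warning and a panic, here is a minimal sketch that scans captured console output and labels the lines that matter. The marker strings are taken from the traces in this thread; the severity labels and function names are assumptions of this sketch, not a kernel-defined taxonomy.

```python
import re

# Marker substrings copied from the traces in this thread. The severity
# labels are this sketch's own naming (an assumption), checked in order.
MARKERS = [
    ("panic", "Kernel panic - not syncing"),
    ("oops", "BUG: kernel NULL pointer dereference"),
    ("oops", "general protection fault"),
    ("lockup", "Watchdog detected hard LOCKUP"),
    ("warning", "WARNING: CPU:"),
]

# Matches the "[ 8258.424281] message" prefix used by the kernel log.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def classify_kernel_log(text):
    """Return (timestamp, severity, message) for each recognised line."""
    events = []
    for line in text.splitlines():
        match = TIMESTAMP.match(line.strip())
        if not match:
            continue  # wrapped or truncated lines carry no timestamp
        stamp, message = float(match.group(1)), match.group(2)
        for severity, marker in MARKERS:
            if marker in message:
                events.append((stamp, severity, message))
                break
    return events


def has_panic(text):
    """True only if the log contains an actual panic line."""
    return any(severity == "panic" for _, severity, _ in classify_kernel_log(text))
```

Run against the trace in the summary, `has_panic` returns False, because that trace contains only a `WARNING:` line; the later captures in this thread do contain `Kernel panic - not syncing` lines.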

@janeczku
Contributor Author

janeczku commented Oct 1, 2021

@yasker I don't think we can install additional packages on Harvester, can we?

@Jellyfrog

@Jellyfrog Did you see the system reboot, or does Harvester just fail to boot up?

I'm not actually sure. I left it unattended overnight and didn't check.

@janeczku
Contributor Author

janeczku commented Oct 1, 2021

IPMI

Screenshot 2021-10-01 at 17 12 24

@jeffmahoney

Without a capture of how the host is actually failing, there's not much the kernel folks can do to debug it.

@yasker
Member

yasker commented Oct 1, 2021

@janeczku Can you help fill in the information requested in #1342 (comment)?

@yasker
Member

yasker commented Oct 1, 2021

And yes, unfortunately, we cannot install additional packages to the OS.

Filed rancher/elemental-toolkit#751 in cOS-toolkit to see what we can do to help with the kernel debugging.

@yasker
Member

yasker commented Oct 1, 2021

@Jellyfrog Can you check the uptime of your system? If it didn't reboot overnight, your issue might be a different one.

@alexdepalex

@yasker We have regular failures here. We've already set up SOL (serial over LAN) to capture any dumps.

I understand the nature of cOS, but would appreciate some flexibility regarding setting kernel parameters, or facilities for collecting dumps.

@yasker
Member

yasker commented Oct 1, 2021

@alexdepalex we're working on that now. The enhanced kernel debugging ability will definitely be ready for v1.0 GA (if it misses v0.3).

@yasker
Member

yasker commented Oct 1, 2021

Another report of the kernel warning for v0.3.0-rc1 is in https://rancher-users.slack.com/archives/C01GKHKAG0K/p1633011073251800 (it appears to be the same as in the summary).

@dirkmueller

Please include more output from netconsole/serial console. Also, is this using Optane? This server runs a very old BIOS version, and an update is available that fixes machine check exceptions (which panic the machine immediately) when using Intel Optane.

@alexdepalex

@dirkmueller We already applied the latest firmware updates. I will share our config later, but @janeczku already has it. I also checked with Rado, but he couldn't find any occurrences of this type of issue in his resources.

@jeffmahoney

Another report of kernel panic for v0.3.0-rc1 is in https://rancher-users.slack.com/archives/C01GKHKAG0K/p1633011073251800

This is the same warning as in the summary.

@yasker
Member

yasker commented Oct 1, 2021

Thanks @jeffmahoney , corrected.

@dirkmueller

@dirkmueller We already applied the latest firmware updates. I will share our config later, but @janeczku already has it. I also checked with Rado, but he couldn't find any occurrences of this type of issue in his resources.

Okay great, that's good news. I was just going from the original bug report and searching the firmware changelog; it is missing this update: https://www.dell.com/support/home/de-de/drivers/driversdetails?driverid=4crd2&oscode=wst14&productcode=poweredge-r740xd&src=o

I have no indication to say it has anything to do with the issue, so feel free to postpone this.

@jeffmahoney

I've filed a bug report to address the warning, at least: https://bugzilla.suse.com/show_bug.cgi?id=1191238

@alexdepalex

alexdepalex commented Oct 1, 2021

@dirkmueller I checked, and unfortunately we're running an ancient version of the BIOS (2.8.2). I'm not sure if I can use a more recent version.

Our lab setup consists of the following hardware:

  • Dell PowerEdge R740xd
  • 2x Xeon Gold 5218R 20 cores
  • 768GB
  • 4 port integrated nic i350-t 10Gb
  • 2 x 4 port Intel X710-T network adapter 10Gb
  • 6 x 6TB NVMe Dell P4610
  • 1 x 900GB boot disk mirrored

@alexdepalex

alexdepalex commented Oct 2, 2021

Since I can't make these kernel boot parameters persistent, I just caught one trace.

Crash
[  137.985537] general protection fault: 0000 [#1] SMP NOPTI
[  161.438247] NMI watchdog: Watchdog detected hard LOCKUP on cpu 12
[  161.438248] Modules linked in: vhost_net vhost tun tap ebt_ip ebtable_broute ebtables rdma_ucm ib_uverbs rdma_cm iw_cm ib_cm ib_core ipt_rpfilter xt_set xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net ip_set veth vxlan ip6_udp_tunnel udp_tunnel nf_conntrack_netlink nfnetlink xt_recent xt_nat xt_statistic xt_addrtype ipt_REJECT xt_tcpudp iptable_mangle ip6table_mangle ip6table_filter ip6table_nat ip6_tables xt_MASQUERADE xt_conntrack xt_comment iptable_filter xt_mark bpfilter iptable_nat ip_tables nf_nat x_tables nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter bridge stp llc af_packet bonding iscsi_ibft rfkill intel_rapl_msr intel_rapl_common ipmi_ssif isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt intel_pmc_bxt iTCO_vendor_support dell_smbios kvm_intel mgag200 drm_kms_helper dcdbas(X) kvm mei_me ipmi_si cec rc_core syscopyarea sysfillrect ipmi_devintf dell_wmi_descriptor wmi_bmof pcspkr sysimgblt irqbypass mei joydev
[  161.438266]  i2c_i801 lpc_ich fb_sys_fops ipmi_msghandler button fuse drm configfs overlay loop hid_generic usbhid ext4 crc16 mbcache jbd2 sd_mod crc32_pclmul crc32c_intel xhci_pci ghash_clmulni_intel xhci_hcd aesni_intel ahci libahci nvme crypto_simd cryptd igb i40e nvme_core glue_helper usbcore libata i2c_algo_bit megaraid_sas t10_pi dca wmi sunrpc dm_mirror dm_region_hash dm_log be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod
[  161.438279] Supported: Yes, External
[  161.438279] CPU: 12 PID: 22964 Comm: longhorn Tainted: G          I    X    5.3.18-59.24-default #1 SLE15-SP3
[  161.438280] Hardware name: Dell Inc. PowerEdge R740xd/06WXJT, BIOS 2.8.2 08/27/2020
[    2.629678] .... node  #1, CPUs:   #17
[    2.633677] .... node  #0, CPUs:   #18
...skipping...
[  161.438280] RIP: 0010:native_queued_spin_lock_slowpath+0x136/0x1e0
[  161.438281] Code: ff ff 41 83 c0 01 c1 e1 10 41 c1 e0 12 44 09 c1 89 c8 c1 e8 10 66 87 47 02 89 c6 c1 e6 10 85 f6 75 39 31 f6 eb 02 f3 90 8b 07 <66> 85 c0 75 f7 41 89 c0 66 45 31 c0 44 39 c1 74 78 48 85 f6 c6 07
[  161.438281] RSP: 0018:ffffb75d357afd08 EFLAGS: 00000002
[  161.438282] RAX: 0000000000d80101 RBX: ffff9895e9864000 RCX: 0000000000340000
[  161.438282] RDX: ffff9836407ad900 RSI: 0000000000000000 RDI: ffff98364082cc80
[  161.438283] RBP: ffff98364082cc80 R08: 0000000000340000 R09: 000000000000b8c7
[  161.438283] R10: ffffb75d357afcb8 R11: 00000000004d33fd R12: 0000000000000004
[  161.438284] R13: ffff9895e9864b84 R14: 0000000000000206 R15: 0000000000000010
[  161.438284] FS:  00007f9a937fe700(0000) GS:ffff983640780000(0000) knlGS:0000000000000000
[  161.438284] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  161.438285] CR2: 00007f626c0f3000 CR3: 000000be6c7f8002 CR4: 00000000007706e0
[  161.438285] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  161.438286] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  161.438286] PKRU: 55555554
[  161.438286] Call Trace:
[  161.438286]  _raw_spin_lock+0x1b/0x20
[  161.438287]  try_to_wake_up+0x3e9/0x500
[  161.438287]  ? __switch_to_asm+0x34/0x70
[  161.438287]  wake_up_q+0x64/0xa0
[  161.438287]  futex_wake+0x13e/0x160
[  161.438288]  do_futex+0xcb/0xac0
[  161.438288]  ? __x2apic_send_IPI_dest+0x2e/0x40
[  161.438288]  ? kick_process+0x3d/0x40
[  161.438289]  ? __send_signal+0x28a/0x3f0
[  161.438289]  ? do_send_sig_info+0x5c/0x90
[  161.438289]  __x64_sys_futex+0x5e/0x1d0
[  161.438289]  do_syscall_64+0x5b/0x1e0
[  161.438290]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  161.438290] RIP: 0033:0x4681a3
[  161.438291] Code: 24 20 c3 cc cc cc cc 48 8b 7c 24 08 8b 74 24 10 8b 54 24 14 4c 8b 54 24 18 4c 8b 44 24 20 44 8b 4c 24 28 b8 ca 00 00 00 0f 05 <89> 44 24 30 c3 cc cc cc cc cc cc cc cc 8b 7c 24 08 48 8b 74 24 10
[  161.438291] RSP: 002b:00007f9a937fdbc8 EFLAGS: 00000202 ORIG_RAX: 00000000000000ca
[  161.438292] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004681a3
[  161.438292] RDX: 0000000000000001 RSI: 0000000000000081 RDI: 000000c000600848
[  161.438292] RBP: 00007f9a937fdc18 R08: 0000000000000000 R09: 0000000000000000
[  161.438293] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003
[  161.438293] R13: 000000c000432480 R14: 0000000001062590 R15: 0000000000000000
[  161.438293] Kernel panic - not syncing: Hard LOCKUP
[  162.787553] Shutting down cpus with NMI
[  162.787554] Kernel Offset: 0x29a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  162.787555] NMI watchdog: Watchdog detected hard LOCKUP on cpu 16
[  162.787555] Modules linked in: vhost_net vhost tun tap ebt_ip ebtable_broute ebtables rdma_ucm ib_uverbs rdma_cm iw_cm ib_cm ib_core ipt_rpfilter xt_set xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net ip_set veth vxlan ip6_udp_tunnel udp_tunnel nf_conntrack_netlink nfnetlink xt_recent xt_nat xt_statistic xt_addrtype ipt_REJECT xt_tcpudp iptable_mangle ip6table_mangle ip6table_filter ip6table_nat ip6_tables xt_MASQUERADE xt_conntrack xt_comment iptable_filter xrv
[  162.787567]  i2c_i801 lpc_ich fb_sys_fops ipmi_msghandler button fuse drm configfs overlay loop hid_generic usbhid ext4 crc16 mbcache jbd2 sd_mod crc32_pclmul crc32c_intel xhci_pci ghash_clmulni_intel xhci_hcd aesni_intel ahci libahci nvme crypto_simd cryptd igb i40e nvme_core glue_helper usbcore libata i2c_algo_bit megaraid_sas t10_pi dca wmi sunrpc dm_mirror dm_region_hash dm_log be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx isc_d
[  162.787576] Supported: Yes, External
[  162.787576] CPU: 16 PID: 0 Comm: swapper/16 Tainted: G          I    X    5.3.18-59.24-default #1 SLE15-SP3
[  162.787577] Hardware name: Dell Inc. PowerEdge R740xd/06WXJT, BIOS 2.8.2 08/27/2020
[  162.787577] RIP: 0010:native_queued_spin_lock_slowpath+0x64/0x1e0
[  162.787578] Code: 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1e 85 c0 75 0b b8 01 00 00 00 66 89 07 c3 f3 90 8b 07 <84> c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47 01 00
[  162.787578] RSP: 0018:ffffb75d1932ca68 EFLAGS: 00000002
[  162.787579] RAX: 0000000000d80101 RBX: ffff97d88966c000 RCX: ffff983640800000
[  162.787579] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff98364082cc80
[  162.787580] RBP: ffff98364082cc80 R08: ffff983640800000 R09: ffff97d887805d88
[  162.787580] R10: 0000000000000000 R11: ffffffffabe639d8 R12: 0000000000000000
[  162.787580] R13: ffff97d88966cb84 R14: 0000000000000087 R15: 0000000000000010
[  162.787580] FS:  0000000000000000(0000) GS:ffff983640800000(0000) knlGS:0000000000000000
[  162.787581] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  162.787581] CR2: 000056260c4c5748 CR3: 0000005df48d8001 CR4: 00000000007706e0
[  162.787581] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  162.787582] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  162.787582] PKRU: 55555554
[  162.787582] Call Trace:
[  162.787582]  <IRQ>
[  162.787582]  _raw_spin_lock+0x1b/0x20
[  162.787583]  try_to_wake_up+0x3e9/0x500
[  162.787583]  __queue_work+0x13e/0x400
[  162.787583]  queue_work_on+0x34/0x40
[  162.787583]  bit_putcs+0x2d0/0x4e0
[  162.787583]  ? bit_clear+0x110/0x110
[  162.787584]  fbcon_putcs+0xeb/0x100
[  162.787584]  vt_console_print+0x2f2/0x3d0
[  162.787584]  console_unlock+0x3b2/0x4e0
[  162.787584]  vprintk_emit+0x109/0x200
[  162.787584]  printk+0x52/0x6e
[  162.787585]  __die+0x8b/0xe0
[  162.787585]  die+0x2a/0x50
[  162.787585]  general_protection+0x32/0x40
[  162.787585] RIP: 0010:update_blocked_averages+0x2b6/0x530
[  162.787586] Code: 48 89 de 48 89 d7 e8 99 7e 01 00 09 e8 48 8b 83 50 01 00 00 0f 85 26 01 00 00 48 8b 80 f0 00 00 00 4a 8b 34 38 48 85 f6 74 36 <48> 83 be a0 01 00 00 00 75 1b 48 83 be b0 01 00 00 00 75 11 48 8b
[  162.787586] RSP: 0018:ffffb75d1932cef0 EFLAGS: 00010006
[  162.787587] RAX: fffff2d3f75ba7c0 RBX: ffff983582436e00 RCX: 0000000000000000
[  162.787587] RDX: 0000000000000001 RSI: 0017ffffc000001e RDI: 0000000000000000
[  162.787587] RBP: 0000000000000000 R08: 0000000000000277 R09: 0000000000000000
[  162.787587] R10: ffffb75d1932cef0 R11: 0000000000000000 R12: ffff983582436f40
[  162.787588] R13: 0000000000000000 R14: ffff98355bc50e00 R15: 0000000000000080
[  162.787588]  ? enqueue_hrtimer+0x39/0x90
[  162.787588]  run_rebalance_domains+0x71/0xa0
[  162.787588]  __do_softirq+0xe3/0x2d6
[  162.787588]  irq_exit+0xd5/0xe0
[  162.787589]  smp_apic_timer_interrupt+0x74/0x130
[  162.787589]  apic_timer_interrupt+0xf/0x20
[  162.787589]  </IRQ>
[  162.787589] RIP: 0010:cpuidle_enter_state+0xab/0x430
[  162.787590] Code: db 5f f1 54 e8 b6 6f 9d ff 49 89 c5 0f 1f 44 00 00 31 ff e8 a7 7f 9d ff 80 7c 24 0b 00 0f 85 db 01 00 00 fb 66 0f 1f 44 00 00 <45> 85 e4 0f 89 f4 01 00 00 c7 43 14 00 00 00 00 48 83 c4 10 44 89
[  162.787590] RSP: 0018:ffffb75d002f3e80 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[  162.787591] RAX: ffff98364082cc80 RBX: ffffd6fd00a05d40 RCX: 000000000000001f
[  162.787591] RDX: 000000202093f61f RSI: 000000003d1877c2 RDI: 0000000000000000
[  162.787591] RBP: ffffffffabf5f100 R08: 0000000000000002 R09: 000000000002c500
[  162.787592] R10: ffffb75d002f3e60 R11: 0000000000000078 R12: 0000000000000002
[  162.787592] R13: 000000202093f61f R14: 0000000000000002 R15: 0000000000000000
[  162.787592]  cpuidle_enter+0x29/0x40
[  162.787592]  do_idle+0x1f7/0x270
[  162.787593]  cpu_startup_entry+0x19/0x20
[  162.787593]  start_secondary+0x155/0x1a0
[  162.787593]  secondary_startup_64_no_verify+0xc2/0xd0
[  162.787606] NMI watchdog: Watchdog detected hard LOCKUP on cpu 53
riptor wmi_bmof pcspkr sysimgblt irqbypass mei joydev
_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod
[  162.787628] Supported: Yes, External
[  162.787628] CPU: 53 PID: 0 Comm: swapper/53 Tainted: G          I    X    5.3.18-59.24-default #1 SLE15-SP3
[  162.787628] Hardware name: Dell Inc. PowerEdge R740xd/06WXJT, BIOS 2.8.2 08/27/2020
[  162.787629] RIP: 0010:native_queued_spin_lock_slowpath+0x191/0x1e0
[  162.787629] Code: c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6 48 05 00 d9 02 00 48 03 04 f5 a0 09 bf ab 48 89 10 8b 42 08 85 c0 75 09 f3 90 <8b> 42 08 85 c0 74 f7 48 8b 32 48 85 f6 74 94 0f 0d 0e eb 8f 8b 07
[  162.787630] RSP: 0018:ffffb75d19ab0e88 EFLAGS: 00000046
[  162.787630] RAX: 0000000000000000 RBX: 0000000000000012 RCX: 0000000000d80000
[  162.787630] RDX: ffff98963f8ad900 RSI: 000000000000000c RDI: ffff98364082cc80
[  162.787631] RBP: ffff98355bc50e00 R08: 0000000000d80000 R09: 0000000000000001
[  162.787631] R10: ffffb75d19ab0df0 R11: 0000000000f97c68 R12: ffff98963617ca10
[  162.787631] R13: ffff98963617cac0 R14: 0000000000000001 R15: 00000000001e847f
[  162.787632] FS:  0000000000000000(0000) GS:ffff98963f880000(0000) knlGS:0000000000000000
[  162.787632] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  162.787632] CR2: 00007fd4327fbe78 CR3: 000000bd814b2006 CR4: 00000000007706e0
[  162.787633] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  162.787633] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  162.787633] PKRU: 55555554
[  162.787633] Call Trace:
[  162.787633]  <IRQ>
[  162.787634]  _raw_spin_lock_irqsave+0x30/0x40
[  162.787634]  distribute_cfs_runtime+0x45/0x140
[  162.787634]  sched_cfs_period_timer+0xc1/0x210
[  162.787634]  ? sched_cfs_slack_timer+0xc0/0xc0
[  162.787635]  __hrtimer_run_queues+0x108/0x280
[  162.787635]  hrtimer_interrupt+0xe5/0x240
[  162.787635]  smp_apic_timer_interrupt+0x6a/0x130
[  162.787635]  apic_timer_interrupt+0xf/0x20
[  162.787635]  </IRQ>
[  162.787636] RIP: 0010:cpuidle_enter_state+0xab/0x430
[  162.787636] Code: db 5f f1 54 e8 b6 6f 9d ff 49 89 c5 0f 1f 44 00 00 31 ff e8 a7 7f 9d ff 80 7c 24 0b 00 0f 85 db 01 00 00 fb 66 0f 1f 44 00 00 <45> 85 e4 0f 89 f4 01 00 00 c7 43 14 00 00 00 00 48 83 c4 10 44 89
[  162.787637] RSP: 0018:ffffb75d18f0fe80 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[  162.787637] RAX: ffff98963f8acc80 RBX: ffffd75cffa85d40 RCX: 000000000000001f
[  162.787638] RDX: 0000002022c748ea RSI: 000000003d1877c2 RDI: 0000000000000000
[  162.787638] RBP: ffffffffabf5f100 R08: 0000000000000002 R09: 000000000002c500
[  162.787638] R10: ffffb75d18f0fe60 R11: 0000000000000093 R12: 0000000000000002
[  162.787639] R13: 0000002022c748ea R14: 0000000000000002 R15: 0000000000000000
[  162.787639]  cpuidle_enter+0x29/0x40
[  162.787639]  do_idle+0x1f7/0x270
[  162.787640]  cpu_startup_entry+0x19/0x20
[  162.787640]  start_secondary+0x155/0x1a0
[  162.787640]  secondary_startup_64_no_verify+0xc2/0xd0

@alexdepalex

This one is from the same host, but different panic.
[  533.825927] BUG: kernel NULL pointer dereference, address: 0000000000000109
[  533.832911] #PF: supervisor read access in kernel mode
[  533.838061] #PF: error_code(0x0000) - not-present page
[  533.843208] PGD 5e48e44067 P4D 5e48e44067 PUD 5e48e43067 PMD 0
[  533.849140] Oops: 0000 [#1] SMP NOPTI
[  533.852810] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W I    X    5.3.18-59.24-default #1 SLE15-SP3
[  533.862390] Hardware name: Dell Inc. PowerEdge R740xd/06WXJT, BIOS 2.8.2 08/27/2020
[  533.870062] RIP: 0010:update_blocked_averages+0x2d1/0x530
[  533.875467] Code: 8b 80 f0 00 00 00 4a 8b 34 38 48 85 f6 74 36 48 83 be a0 01 00 00 00 75 1b 48 83 be b0 01 00 00 00 75 11 48 8b 86 58 01 00 00 <48> 83 b8 08 01 00 00 00 74 11 48 8b be 50 01 00 00 ba 01 00 00 00
[  533.894252] RSP: 0018:ffffb74440003eb0 EFLAGS: 00010046
[  533.899481] RAX: 0000000000000001 RBX: ffff9887b70c1c00 RCX: 0000000000000000
[  533.906627] RDX: 0000000000000001 RSI: ffff9887b70c1c00 RDI: 0000000000000000
[  533.913769] RBP: 0000000000000000 R08: 0000000000000304 R09: 0000000000000000
[  533.920910] R10: ffffb74440003eb0 R11: 0000000000000000 R12: ffff9887b70c1d40
[  533.928054] R13: 0000000000000001 R14: ffff9887be57da00 R15: 0000000000000050
[  533.935204] FS:  0000000000000000(0000) GS:ffff9887c0600000(0000) knlGS:0000000000000000
[  533.943308] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  533.949066] CR2: 0000000000000109 CR3: 0000005e48e06002 CR4: 00000000007706f0
[  533.956211] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  533.963360] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  533.970501] PKRU: 55555554
[  533.973213] Call Trace:
[  533.975666]  <IRQ>
[  533.977688]  update_nohz_stats+0x42/0x60
[  533.981625]  _nohz_idle_balance+0xd1/0x200
[  533.985736]  __do_softirq+0xe3/0x2d6
[  533.989322]  irq_exit+0xd5/0xe0
[  533.992474]  reschedule_interrupt+0xf/0x20
[  533.996583]  </IRQ>
[  533.998695] RIP: 0010:cpuidle_enter_state+0xab/0x430
[  534.004177] Code: db 5f 71 7d e8 b6 6f 9d ff 49 89 c5 0f 1f 44 00 00 31 ff e8 a7 7f 9d ff 80 7c 24 0b 00 0f 85 db 01 00 00 fb 66 0f 1f 44 00 00 <45> 85 e4 0f 89 f4 01 00 00 c7 43 14 00 00 00 00 48 83 c4 10 44 89
[  534.023942] RSP: 0018:ffffffff83603e68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff02
[  534.032018] RAX: ffff9887c062cc80 RBX: ffffd6e440805d28 RCX: 000000000000001f
[  534.039655] RDX: 0000007c4a80e0f9 RSI: 000000003d1877c2 RDI: 0000000000000000
[  534.047296] RBP: ffffffff8375f100 R08: 0000000000000002 R09: 000000000002c500
[  534.054932] R10: ffffffff83603e48 R11: 000000000000bafa R12: 0000000000000002
[  534.062562] R13: 0000007c4a80e0f9 R14: 0000000000000002 R15: 0000000000000000
[  534.070193]  ? cpuidle_enter_state+0x99/0x430
[  534.075048]  cpuidle_enter+0x29/0x40
[  534.079125]  do_idle+0x1f7/0x270
[  534.082861]  cpu_startup_entry+0x19/0x20
[  534.087294]  start_kernel+0x559/0x57a
[  534.091474]  secondary_startup_64_no_verify+0xc2/0xd0
[  534.097040] Modules linked in: ebt_ip ebtable_broute ebtables rdma_ucm ib_uverbs rdma_cm iw_cm ib_cm ib_core vhost_net vhost tun tap xt_set ipt_rpfilter iptable_raw ip_set_hash_ip ip_set_hash_net ip_set vxlan ip6_udp_tunnel udp_tunnel xt_multiport veth nf_conntrack_netlink nfnetlink xt_addrtype xt_recent xt_statistic xt_nat ipt_REJECT xt_tcpudp iptable_mangle ip6table_mangle ip6table_filter ip6table_nat ip6_tables xt_MASQUERADE xt_conntrack xt_comment iptable_filter xt_mark bpfilter iptable_nat ip_tables nf_nat x_tables nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter bridge stp llc af_packet bonding iscsi_ibft rfkill intel_rapl_msr intel_rapl_common ipmi_ssif isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel mgag200 iTCO_wdt drm_kms_helper intel_pmc_bxt iTCO_vendor_support kvm cec dell_smbios rc_core ipmi_si mei_me dcdbas(X) syscopyarea sysfillrect sysimgblt irqbypass dell_wmi_descriptor wmi_bmof ipmi_devintf pcspkr mei joydev
[  534.097075]  i2c_i801 lpc_ich fb_sys_fops ipmi_msghandler button drm fuse configfs overlay loop hid_generic usbhid ext4 crc16 mbcache jbd2 sd_mod crc32_pclmul crc32c_intel xhci_pci ghash_clmulni_intel ahci xhci_hcd aesni_intel nvme crypto_simd igb libahci cryptd i40e nvme_core usbcore glue_helper i2c_algo_bit libata megaraid_sas t10_pi dca wmi sunrpc dm_mirror dm_region_hash dm_log be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod
[  534.245630] Supported: Yes, External
[  534.249849] CR2: 0000000000000109
[  534.253841] ---[ end trace 9ed666102614ae81 ]---
[  534.267245] RIP: 0010:update_blocked_averages+0x2d1/0x530
[  534.273297] Code: 8b 80 f0 00 00 00 4a 8b 34 38 48 85 f6 74 36 48 83 be a0 01 00 00 00 75 1b 48 83 be b0 01 00 00 00 75 11 48 8b 86 58 01 00 00 <48> 83 b8 08 01 00 00 00 74 11 48 8b be 50 01 00 00 ba 01 00 00 00
[  534.293384] RSP: 0018:ffffb74440003eb0 EFLAGS: 00010046
[  534.299281] RAX: 0000000000000001 RBX: ffff9887b70c1c00 RCX: 0000000000000000
[  534.307086] RDX: 0000000000000001 RSI: ffff9887b70c1c00 RDI: 0000000000000000
[  534.314887] RBP: 0000000000000000 R08: 0000000000000304 R09: 0000000000000000
[  534.322692] R10: ffffb74440003eb0 R11: 0000000000000000 R12: ffff9887b70c1d40
[  534.330496] R13: 0000000000000001 R14: ffff9887be57da00 R15: 0000000000000050
[  534.338303] FS:  0000000000000000(0000) GS:ffff9887c0600000(0000) knlGS:0000000000000000
[  534.347073] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  534.353512] CR2: 0000000000000109 CR3: 0000005e48e06002 CR4: 00000000007706f0
[  534.361340] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  534.369171] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  534.377007] PKRU: 55555554
[  534.380416] Kernel panic - not syncing: Fatal exception in interrupt
[  535.427286] Shutting down cpus with NMI
[  535.741424] Kernel Offset: 0x1200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  535.760879] Rebooting in 10 seconds..
[  545.709021] ACPI MEMORY or I/O RESET_REG.
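
The `Kernel Offset:` line in each panic gives the KASLR displacement for that boot. Subtracting it from a runtime kernel-image address yields the link-time address that gdb reports against vmlinux, which makes addresses from different boots comparable. A minimal sketch follows; the function names are hypothetical, and the sample values are the `RBP` values from the `cpuidle_enter_state` register dumps of the two captures above.

```python
import re

# Matches e.g. "Kernel Offset: 0x1200000 from 0xffffffff81000000 (...)".
KERNEL_OFFSET = re.compile(r"Kernel Offset: (0x[0-9a-fA-F]+) from (0x[0-9a-fA-F]+)")


def parse_kernel_offset(log_text):
    """Return the KASLR offset printed at panic time, or None if absent."""
    match = KERNEL_OFFSET.search(log_text)
    if match is None:
        return None
    return int(match.group(1), 16)


def to_link_time_address(runtime_address, kaslr_offset):
    """Undo the KASLR displacement for an address inside the kernel image."""
    return runtime_address - kaslr_offset


first_boot = "[  162.787554] Kernel Offset: 0x29a00000 from 0xffffffff81000000"
second_boot = "[  535.741424] Kernel Offset: 0x1200000 from 0xffffffff81000000"

# RBP in the cpuidle_enter_state dumps of the first and second capture.
a = to_link_time_address(0xFFFFFFFFABF5F100, parse_kernel_offset(first_boot))
b = to_link_time_address(0xFFFFFFFF8375F100, parse_kernel_offset(second_boot))
print(hex(a), hex(b))
```

Both values reduce to the same link-time address, which is consistent with the two boots pointing at the same kernel object. The subtraction only applies to addresses inside the kernel image; heap and stack addresses are randomized separately.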

@yasker
Member

yasker commented Oct 2, 2021

@jeffmahoney @dirkmueller @janeczku see the panic trace provided by Alex above.

@jeffmahoney

jeffmahoney commented Oct 4, 2021

The first Oops is trimmed such that we see a hard lockup report from a previous boot and are missing the address being dereferenced.

(gdb) list *native_queued_spin_lock_slowpath+0x136
0xffffffff810fa836 is in native_queued_spin_lock_slowpath (../kernel/locking/qspinlock.c:510).
452         /*                                                                       
453          * Publish the updated tail.                                             
454          * We have already touched the queueing cacheline; don't bother with     
455          * pending stuff.                                                        
456          *                                                                       
457          * p,*,* -> n,*,*                                                        
458          */                                                                      
459         old = xchg_tail(lock, tail);                                             
460         next = NULL;                                                             
461                                                                                  
462         /*                                                                       
[...]
507         if ((val = pv_wait_head_or_lock(lock, node)))
508                 goto locked;
509 
510         val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
511 
512 locked:
(gdb) disass native_queued_spin_lock_slowpath
[...]
   0xffffffff810fa834 <+308>:	mov    (%rdi),%eax
   0xffffffff810fa836 <+310>:	test   %ax,%ax

%rdi = 0 -> Oops.

Given that we've already dereferenced lock at line 459 (and possibly line 507), lock should still be valid at line 510 so something has corrupted the variable.

The second Oops points to kernel/sched/fair.c:3636

(gdb) list *update_blocked_averages+0x2d1
0xffffffff810d8671 is in update_blocked_averages (../kernel/sched/fair.c:3636).
3631	in ../kernel/sched/fair.c
3622 {
3623         struct cfs_rq *gcfs_rq = group_cfs_rq(se);
3624
3625         /*
3626          * If sched_entity still have not zero load or utilization, we have to
3627          * decay it:
3628          */
3629         if (se->avg.load_avg || se->avg.util_avg)
3630                 return false;
3631
3632         /*
3633          * If there is a pending propagation, we have to update the load and
3634          * the utilization of the sched_entity:
3635          */
3636         if (gcfs_rq->propagate)
3637                 return false;
3638
3639         /*
3640          * Otherwise, the load and the utilization of the sched_entity is
3641          * already zero and there is no pending propagation, so it will be a
3642          * waste of time to try to decay it:
3643          */
3644         return true;
3645 }
(gdb) disass update_blocked_averages
[...]
   0xffffffff810d866a <+714>:	mov    0x158(%rsi),%rax
   0xffffffff810d8671 <+721>:	cmpq   $0x0,0x108(%rax)

The Oops reports a dereference of address 0x109, which matches 0x108(%rax) with %rax = 1, as seen in the register dump of the Oops.

(gdb) print &((struct cfs_rq *)0)->propagate
$2 = (long *) 0x108

%rax of 1 means that group_cfs_rq(se) returned 0x1, which in turn means that se->my_q == 0x1, which suggests memory corruption.
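
The address arithmetic above can be double-checked with a small sketch. It uses only the two values quoted in this comment (the 0x108 offset of `propagate` reported by gdb and the %rax value of 1). The helper names are hypothetical and exist only for illustration; they are not part of any kernel tooling.

```python
# Illustrative only: reproduce the faulting-address arithmetic from the Oops.
MASK_64 = 0xFFFFFFFFFFFFFFFF

# offsetof(struct cfs_rq, propagate), taken from the gdb output above.
PROPAGATE_OFFSET = 0x108


def effective_address(base: int, displacement: int) -> int:
    """Address accessed by an operand of the form disp(%reg)."""
    return (base + displacement) & MASK_64


def base_from_fault(fault_address: int, displacement: int) -> int:
    """Recover the base register value from a reported fault address."""
    return (fault_address - displacement) & MASK_64


# cmpq $0x0,0x108(%rax) with %rax == 1 touches this address.
print(hex(effective_address(0x1, PROPAGATE_OFFSET)))
# Working backwards from the fault address gives the bogus cfs_rq pointer.
print(hex(base_from_fault(0x109, PROPAGATE_OFFSET)))
```

The same subtraction can be applied to other traces: take the reported fault address, subtract the displacement of the faulting instruction, and compare the result with the base register in the dump.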

It's unclear to me whether the patch @davidlohr posted will fix a use-after-free, but Michal Koutny commented on IRC that he suspects it could (with the caveat that it needs more investigation to confirm). I'll build a kernel with that patch applied for testing.

@jeffmahoney

Please ignore the analysis of the first oops. I was reading the wrong register dump. It's not actually an oops and is probably another lockup report. The lines at the top of that report are from an Oops but it's truncated.

@yasker
Member

yasker commented Oct 4, 2021

@alexdepalex Is there any log above the line [ 137.985537] general protection fault: 0000 [#1] SMP NOPTI?

@alexdepalex

No.

@jeffmahoney

Built packages for the kernel are here: https://download.opensuse.org/repositories/home:/jeff_mahoney:/branches:/SUSE:/SLE-15-SP3:/Update/standard/x86_64/

You should only need kernel-default and maybe kernel-default-optional.

@yasker
Member

yasker commented Oct 4, 2021

@bk201 Can you help build a Harvester ISO today with the kernel from #1342 (comment)?

Also, since it's a kernel debug build, it would be better to include kdump in the build as well. Ref: #1342 (comment)

@yasker
Member

yasker commented Oct 5, 2021

Some more information on how to debug in the OS: rancher/elemental-toolkit#751 (comment)

@alexdepalex

Got another crash with the hotfix kernel. Uploaded the dump harvester-vmcore-1342.tar to the SUSE upload server.

[ 8995.095798] BUG: kernel NULL pointer dereference, address: 0000000000000080
[ 9016.281685] NMI watchdog: Watchdog detected hard LOCKUP on cpu 21
[ 9016.281686] Modules linked in: ebt_ip ebtable_broute ebtables vhost_net vhost tun tap xt_statistic rdma_ucm ib_uverbs rdma_cm iw_cm ib_cm ib_core xt_recent nf_conntrack_netlink vxlan ip6_udp_tunnel udp_tunnel ipt_rpfilter xt_set xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net ip_set nfnetlink veth xt_addrtype xt_nat ipt_REJECT xt_tcpudp iptable_mangle ip6table_mangle ip6table_filter ip6table_nat ip6_tables xt_MASQUERADE xt_conntrack xt_comment iptable_filter xt_mark bpfilter iptable_nat ip_tables nf_nat x_tables nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter bridge stp llc af_packet bonding iscsi_ibft rfkill intel_rapl_msr intel_rapl_common isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel iTCO_wdt intel_pmc_bxt ipmi_ssif iTCO_vendor_support kvm mgag200 drm_kms_helper cec dell_smbios rc_core mei_me dcdbas(X) syscopyarea sysfillrect sysimgblt dell_wmi_descriptor irqbypass wmi_bmof pcspkr mei i2c_i801 fb_sys_fops
[ 9016.281704]  lpc_ich ipmi_si ipmi_devintf ipmi_msghandler button drm fuse configfs overlay loop ext4 crc16 mbcache jbd2 sd_mod crc32_pclmul crc32c_intel ghash_clmulni_intel xhci_pci xhci_hcd aesni_intel i40e crypto_simd cryptd ahci glue_helper nvme libahci igb usbcore nvme_core libata megaraid_sas t10_pi i2c_algo_bit dca wmi sunrpc dm_mirror dm_region_hash dm_log be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod
[ 9016.281715] Supported: No, Unreleased kernel
[ 9016.281716] CPU: 21 PID: 49287 Comm: systemd-udevd Kdump: loaded Tainted: G        W I    X    5.3.18-59.27-default #1 SLE15-SP3 (unreleased)
[ 9016.281717] Hardware name: Dell Inc. PowerEdge R740xd/06WXJT, BIOS 2.8.2 08/27/2020
[ 9016.281717] RIP: 0010:native_queued_spin_lock_slowpath+0x191/0x1e0
[ 9016.281718] Code: c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6 48 05 00 d9 02 00 48 03 04 f5 a0 f9 1b bb 48 89 10 8b 42 08 85 c0 75 09 f3 90 <8b> 42 08 85 c0 74 f7 48 8b 32 48 85 f6 74 94 0f 0d 0e eb 8f 8b 07
[ 9016.281718] RSP: 0018:ffffa8508af8f998 EFLAGS: 00000046
[ 9016.281718] RAX: 0000000000000000 RBX: 0000000000000082 RCX: 0000000000580000
[ 9016.281719] RDX: ffff8aeeac8ad900 RSI: 0000000000000005 RDI: ffff8a8ec082cc80
[ 9016.281719] RBP: ffff8a8ec082cc80 R08: 0000000000580000 R09: ffffffffffffffff
[ 9016.281720] R10: 0000000000000008 R11: 0000000000000000 R12: ffffa8508af8fc20
[ 9016.281720] R13: ffff8a8ec082cc80 R14: 0000000000000010 R15: 0000000000000000
[ 9016.281721] FS:  00007f64c4385980(0000) GS:ffff8aeeac880000(0000) knlGS:0000000000000000
[ 9016.281721] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9016.281721] CR2: 00007f64c3deb590 CR3: 000000be5c62c003 CR4: 00000000007706e0
[ 9016.281722] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9016.281722] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 9016.281723] PKRU: 55555554
[ 9016.281723] Call Trace:
[ 9016.281723]  _raw_spin_lock_irqsave+0x30/0x40
[ 9016.281723]  update_blocked_averages+0x2d/0x530
[ 9016.281724]  update_nohz_stats+0x42/0x60
[ 9016.281724]  update_sd_lb_stats.constprop.122+0x274/0x890
[ 9016.281724]  find_busiest_group+0x41/0x380
[ 9016.281725]  load_balance+0x15a/0xc60
[ 9016.281725]  newidle_balance+0x2a5/0x3b0
[ 9016.281725]  pick_next_task_fair+0x3e/0x3a0
[ 9016.281725]  __schedule+0x18d/0x760
[ 9016.281726]  schedule+0x2f/0xa0
[ 9016.281726]  schedule_hrtimeout_range_clock+0xee/0x100
[ 9016.281726]  ? sock_write_iter+0x97/0x100
[ 9016.281727]  ? __seccomp_filter+0x7a/0x690
[ 9016.281727]  ep_poll+0x3d4/0x4d0
[ 9016.281727]  ? wait_woken+0x80/0x80
[ 9016.281727]  do_epoll_wait+0xab/0xc0
[ 9016.281728]  __x64_sys_epoll_wait+0x1a/0x20
[ 9016.281728]  do_syscall_64+0x5b/0x1e0
[ 9016.281728]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 9016.281729] RIP: 0033:0x7f64c318aff6
[ 9016.281729] Code: 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 e8 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5a f3 c3 41 55 41 54 41 89 cd 55 53 41 89 d4
[ 9016.281730] RSP: 002b:00007ffcf896e958 EFLAGS: 00000246 ORIG_RAX: 00000000000000e8
[ 9016.281730] RAX: ffffffffffffffda RBX: 000055e713942940 RCX: 00007f64c318aff6
[ 9016.281731] RDX: 0000000000000002 RSI: 000055e7138f06b0 RDI: 0000000000000003
[ 9016.281731] RBP: 0000000000000001 R08: 00007f64c34589e0 R09: 0000000000000000
[ 9016.281731] R10: 00000000ffffffff R11: 0000000000000246 R12: ffffffffffffffff
[ 9016.281732] R13: 0000000000000002 R14: 000055e7139fc720 R15: 000055e713035298
[ 9016.281732] Kernel panic - not syncing: Hard LOCKUP

@alexdepalex

alexdepalex commented Oct 7, 2021

Additional dumps on the ftp server.

lpedge01002-2021-10-07-0245.tar
lpedge01005-2021-10-07-0345.tar
lpedge01006-2021-10-07-0613.tar

@yasker
Member

yasker commented Oct 7, 2021

@jeffmahoney this does not look like an easy fix at the moment. Is it possible to get a kernel package built without the commit that introduced the regression? Or is there any other way we can move forward, since we're releasing v0.3.0 next Monday?

@yasker
Member

yasker commented Oct 7, 2021

cc @davidlohr @dirkmueller ^^

@bashofmann

I'm also encountering very frequent kernel panics in 0.3.0-rc1 in a different setup.

Hardware: Dell PowerEdge R630

The panics reliably happen only a few minutes after boot without creating any additional VMs.

We already booted the system in debugrw mode, installed the patched kernel RPM mentioned in #1342 (comment), and rebooted, but this did not fix the panics.

I'll also try to provide crash dumps as soon as possible.

@janeczku
Contributor Author

We no longer observe kernel panics on harvester-0.3.0-rc1-hotfix2, which uses a test kernel build (5.3.18-59.24.1.23030.1.TEST.1191238) that reverts two commits suspected of introducing the bug:

* Mon Aug 30 2021 [email protected]
- sched/fair: Correctly insert cfs_rq's to list on unthrottle (git-fixes)
- commit 1732b9b

* Wed Sep 08 2021 [email protected]
- sched/fair: Ensure that the CFS parent is added after unthrottling (git-fixes).
- commit f3a38fb

The most recent SLES 15 SP3 kernel package not containing the two problematic commits is kernel-default-5.3.18-59.19.1.x86_64.rpm (dated 2021-08-03).

This is probably the kernel package that should be used in lieu of 5.3.18-59.24-default until a bug fix version is available upstream.
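
As a rough aid for checking nodes, the sketch below compares a kernel release string against the 5.3.18-59.19 build named above. It is a minimal illustration under stated assumptions: the function names are hypothetical, the parser only handles release strings in the form quoted in this thread, and a `True` result only means "newer than the last build reported good here", not "affected", since fixed or reverted builds will also carry higher numbers.

```python
import re

# Last build reported in this thread as not containing the two commits.
LAST_KNOWN_GOOD = (5, 3, 18, 59, 19)


def parse_release(release: str) -> tuple:
    """Turn a string such as '5.3.18-59.24-default' into (5, 3, 18, 59, 24)."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)-(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def newer_than_known_good(release: str) -> bool:
    """True if the release is newer than the last build reported good."""
    return parse_release(release) > LAST_KNOWN_GOOD


# Release strings quoted in this thread.
for sample in ("5.3.18-59.19.1", "5.3.18-59.24-default", "5.3.18-59.27-default"):
    print(sample, newer_than_known_good(sample))
```

On a running node, the string returned by `platform.release()` (the same value `uname -r` prints) could be passed to `newer_than_known_good`.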

@marcuspocus1

I'm getting the same random crashes with the official 0.3.0 release.
Running kernel: 5.3.18-59.24-default
Hardware: Dell PowerEdge R740xd
I'll try to update the kernel later to see if it fixes my issue.

@marcuspocus1

I ran into this problem while trying to update the kernel:
#1388
It would be great if someone could regenerate an ISO with a stable kernel ;)

@silug

silug commented Oct 27, 2021

I believe that I'm hitting the same kernel panic on 0.3.0.

[ 9743.593270] BUG: unable to handle page fault for address: 000000000c000e6a
[ 9762.977582] NMI watchdog: Watchdog detected hard LOCKUP on cpu 12
[ 9762.977583] Modules linked in: rdma_ucm ib_uverbs rdma_cm iw_cm ib_cm ib_core vhost_net vhost tun tap ebt_ip ebtable_broute ebtables xt_set ipt_rpfilter iptable_raw ip_set_hash_ip ip_set_hash_net ip_set xt_multiport vxlan ip6_udp_tunnel udp_tunnel veth nf_conntrack_netlink nfnetlink xt_addrtype xt_recent xt_statistic xt_nat ipt_REJECT xt_tcpudp iptable_mangle ip6table_mangle ip6table_filter ip6table_nat ip6_tables xt_MASQUERADE xt_conntrack xt_comment iptable_filter xt_mark bpfilter iptable_nat ip_tables nf_nat x_tables nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter bridge stp llc af_packet bonding iscsi_ibft rfkill intel_rapl_msr iTCO_wdt intel_pmc_bxt iTCO_vendor_support intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass pcspkr mgag200 drm_kms_helper lpc_ich cec rc_core syscopyarea sysfillrect sysimgblt hpilo hpwdt fb_sys_fops ipmi_ssif joydev ses ioatdma enclosure thermal ipmi_si ipmi_devintf ipmi_msghandler button drm
[ 9762.977611]  fuse configfs overlay loop ext4 crc16 mbcache jbd2 hid_generic usbhid sd_mod t10_pi uhci_hcd crc32_pclmul crc32c_intel ghash_clmulni_intel ehci_pci ehci_hcd aesni_intel ahci libahci mpt3sas crypto_simd cryptd glue_helper libata igb usbcore serio_raw raid_class scsi_transport_sas i2c_algo_bit dca sunrpc dm_mirror dm_region_hash dm_log be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod
[ 9762.977625] Supported: Yes
[ 9762.977625] CPU: 12 PID: 0 Comm: swapper/12 Kdump: loaded Tainted: G        W I         5.3.18-59.24-default #1 SLE15-SP3
[ 9762.977626] Hardware name: HP ProLiant DL380e Gen8, BIOS P73 05/24/2019
[ 9762.977626] RIP: 0010:native_queued_spin_lock_slowpath+0x62/0x1e0
[ 9762.977627] Code: ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1e 85 c0 75 0b b8 01 00 00 00 66 89 07 c3 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47
[ 9762.977627] RSP: 0018:ffffc0f286748a30 EFLAGS: 00000002
[ 9762.977628] RAX: 0000000000080101 RBX: ffffa087043c0000 RCX: ffffa0870f780000
[ 9762.977629] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa0870f7acc80
[ 9762.977629] RBP: ffffa0870f7acc80 R08: ffffa0870f780000 R09: ffffa07bc7803d98
[ 9762.977630] R10: 0000000000000000 R11: ffffffff8f8639d8 R12: 0000000000000000
[ 9762.977630] R13: ffffa087043c0b84 R14: 0000000000000006 R15: 000000000000000c
[ 9762.977630] FS:  0000000000000000(0000) GS:ffffa0870f780000(0000) knlGS:0000000000000000
[ 9762.977631] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9762.977631] CR2: 000000000c000e6a CR3: 00000017be5f8003 CR4: 00000000000606e0
[ 9762.977632] Call Trace:
[ 9762.977632]  <IRQ>
[ 9762.977632]  _raw_spin_lock+0x1b/0x20
[ 9762.977632]  try_to_wake_up+0x3e9/0x500
[ 9762.977633]  __queue_work+0x13e/0x400
[ 9762.977633]  queue_work_on+0x34/0x40
[ 9762.977633]  bit_putcs+0x2d0/0x4e0
[ 9762.977634]  ? bit_clear+0x110/0x110
[ 9762.977634]  fbcon_putcs+0xeb/0x100
[ 9762.977634]  vt_console_print+0x2f2/0x3d0
[ 9762.977635]  console_unlock+0x3b2/0x4e0
[ 9762.977635]  vprintk_emit+0x109/0x200
[ 9762.977635]  printk+0x52/0x6e
[ 9762.977636]  ? update_blocked_averages+0x2c5/0x530
[ 9762.977636]  no_context+0x28b/0x580
[ 9762.977636]  ? update_load_avg+0x5d3/0x5f0
[ 9762.977637]  do_page_fault+0x30/0x110
[ 9762.977637]  page_fault+0x3e/0x50
[ 9762.977637] RIP: 0010:update_blocked_averages+0x2b6/0x530
[ 9762.977638] Code: 48 89 de 48 89 d7 e8 99 7e 01 00 09 e8 48 8b 83 50 01 00 00 0f 85 26 01 00 00 48 8b 80 f0 00 00 00 4a 8b 34 38 48 85 f6 74 36 <48> 83 be a0 01 00 00 00 75 1b 48 83 be b0 01 00 00 00 75 11 48 8b
[ 9762.977639] RSP: 0018:ffffc0f286748ef0 EFLAGS: 00010006
[ 9762.977639] RAX: ffffefb16cd2a000 RBX: ffffa0870849c800 RCX: 0000000000000000
[ 9762.977640] RDX: 0000000000000001 RSI: 000000000c000cca RDI: 0000000000000000
[ 9762.977640] RBP: 0000000000000000 R08: 00000000000003e0 R09: 0000000000000000
[ 9762.977641] R10: ffffc0f286748ef0 R11: 0000000000000000 R12: ffffa0870849c940
[ 9762.977641] R13: 0000000000000000 R14: ffffa08614f56600 R15: 0000000000000060
[ 9762.977641]  ? enqueue_hrtimer+0x39/0x90
[ 9762.977642]  run_rebalance_domains+0x71/0xa0
[ 9762.977642]  __do_softirq+0xe3/0x2d6
[ 9762.977642]  irq_exit+0xd5/0xe0
[ 9762.977643]  smp_apic_timer_interrupt+0x74/0x130
[ 9762.977643]  apic_timer_interrupt+0xf/0x20
[ 9762.977643]  </IRQ>
[ 9762.977644] RIP: 0010:cpuidle_enter_state+0xa8/0x430
[ 9762.977645] Code: 65 8b 3d db 5f 51 71 e8 b6 6f 9d ff 49 89 c5 66 66 66 66 90 31 ff e8 a7 7f 9d ff 80 7c 24 0b 00 0f 85 db 01 00 00 fb 66 66 90 <66> 66 90 45 85 e4 0f 89 f4 01 00 00 c7 43 14 00 00 00 00 48 83 c4
[ 9762.977645] RSP: 0018:ffffc0f28632fe80 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[ 9762.977646] RAX: ffffa0870f7acc80 RBX: ffffe0e68f580b00 RCX: 000000000000001f
[ 9762.977647] RDX: 000008dc9b6a1b81 RSI: 0000000040277d83 RDI: 0000000000000000
[ 9762.977647] RBP: ffffffff8f95f100 R08: 0000000000000002 R09: 000000000002c500
[ 9762.977648] R10: ffffc0f28632fe60 R11: 0000000000000076 R12: 0000000000000002
[ 9762.977648] R13: 000008dc9b6a1b81 R14: 0000000000000002 R15: 0000000000000000
[ 9762.977649]  cpuidle_enter+0x29/0x40
[ 9762.977649]  do_idle+0x1f7/0x270
[ 9762.977649]  cpu_startup_entry+0x19/0x20
[ 9762.977650]  start_secondary+0x155/0x1a0
[ 9762.977650]  secondary_startup_64_no_verify+0xc2/0xd0
[ 9762.977651] Kernel panic - not syncing: Hard LOCKUP

@yasker
Member

yasker commented Oct 28, 2021

@silug the call trace looks different. Can you file a separate issue? Please also include the environment you're running on and other details, e.g. how often you see the issue.

@abonillabeeche

abonillabeeche commented Nov 10, 2021

I'm encountering the same problem. Can anyone confirm whether kernel-default-5.3.18-59.19.1.x86_64.rpm fixed the problem?

This is what I caught from my serial console.

r620-2 login: [ 2264.710891] BUG: kernel NULL pointer dereference, address: 0000000000000999
[ 2283.049334] NMI watchdog: Watchdog detected hard LOCKUP on cpu 17
[ 2283.049334] Modules linked in: rdma_ucm ib_uverbs rdma_cm iw_cm ib_cm ib_core vhost_net vhost tun tap ebt_ip ebtable_broute ebtables ipt_rpfilter xt_set xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net ip_set veth vxlan ip6_udp_tunnel udp_tunnel nf_conntrack_netlink nfnetlink xt_addrtype xt_recent xt_nat xt_statistic ipt_REJECT xt_tcpudp iptable_mangle ip6table_mangle ip6table_filter ip6table_nat ip6_tables xt_MASQUERADE xt_conntrack xt_comment iptable_filter xt_mark bpfilter iptable_nat ip_tables nf_nat x_tables nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter bridge stp llc af_packet bonding iscsi_ibft rfkill iTCO_wdt intel_pmc_bxt intel_rapl_msr iTCO_vendor_support dcdbas(X) ipmi_ssif intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp mgag200 coretemp i2c_algo_bit drm_kms_helper kvm_intel cec kvm rc_core ipmi_si mei_me syscopyarea sysfillrect sysimgblt irqbypass ipmi_devintf pcspkr joydev mei fb_sys_fops lpc_ich ipmi_msghandler button fuse drm
[ 2283.049364]  configfs overlay hid_generic usbhid loop ext4 crc16 mbcache jbd2 sd_mod t10_pi crc32_pclmul crc32c_intel ehci_pci ghash_clmulni_intel ehci_hcd ixgbe ahci aesni_intel libahci crypto_simd usbcore xfrm_algo libata tg3 cryptd dca glue_helper megaraid_sas libphy wmi sunrpc dm_mirror dm_region_hash dm_log be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod
[ 2283.049382] Supported: Yes, External
[ 2283.049383] CPU: 17 PID: 41870 Comm: runc Tainted: G        W      X    5.3.18-59.24-default #1 SLE15-SP3
[ 2283.049384] Hardware name: Dell Inc. PowerEdge R620/0KCKR5, BIOS 2.8.0 06/26/2019
[ 2283.049384] RIP: 0010:native_queued_spin_lock_slowpath+0x64/0x1e0
[ 2283.049385] Code: 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1e 85 c0 75 0b b8 01 00 00 00 66 89 07 c3 f3 90 8b 07 <84> c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47 01 00
[ 2283.049386] RSP: 0018:ffffb32ca5a1fc50 EFLAGS: 00000002
[ 2283.049387] RAX: 0000000000700101 RBX: 000000000002cc80 RCX: ffff94f319198c98
[ 2283.049388] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff94f31f96cc80
[ 2283.049388] RBP: ffffb32ca5a1fc98 R08: ffff94db1b9acd58 R09: 0000000000000000
[ 2283.049389] R10: ffff94db1b9adbf8 R11: ffff94db1b9adb00 R12: ffffffffa81f09a0
[ 2283.049389] R13: ffff94da366a3204 R14: ffff94da366a2680 R15: ffff94f31f96cc80
[ 2283.049390] FS:  00007f16c444bb38(0000) GS:ffff94f31f800000(0000) knlGS:0000000000000000
[ 2283.049390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2283.049391] CR2: 00007f16c4325740 CR3: 0000001736466001 CR4: 00000000000606e0
[ 2283.049391] Call Trace:
[ 2283.049392]  _raw_spin_lock+0x1b/0x20
[ 2283.049392]  task_rq_lock+0x49/0xb0
[ 2283.049393]  dl_add_task_root_domain+0x2e/0x120
[ 2283.049393]  update_tasks_root_domain+0x33/0x70
[ 2283.049394]  rebuild_sched_domains_locked+0x5b5/0x7a0
[ 2283.049394]  ? update_cpumasks_hier+0x1e2/0x510
[ 2283.049395]  cpuset_write_resmask+0x86e/0x9f0
[ 2283.049395]  cgroup_file_write+0x89/0x150
[ 2283.049396]  kernfs_fop_write+0x113/0x1a0
[ 2283.049396]  vfs_write+0xad/0x1b0
[ 2283.049396]  ksys_write+0xa1/0xe0
[ 2283.049397]  do_syscall_64+0x5b/0x1e0
[ 2283.049397]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 2283.049398] RIP: 0033:0x4bec3b
[ 2283.049399] Code: fb ff eb bd e8 06 e5 fa ff e9 61 ff ff ff cc e8 5b b5 fa ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
[ 2283.049399] RSP: 002b:000000c0002441b0 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 2283.049400] RAX: ffffffffffffffda RBX: 000000c000035000 RCX: 00000000004bec3b
[ 2283.049401] RDX: 0000000000000005 RSI: 000000c000244380 RDI: 000000000000000a
[ 2283.049401] RBP: 000000c000244200 R08: 0000000000000005 R09: 0000000000000004
[ 2283.049402] R10: 0000000000000020 R11: 0000000000000202 R12: 00000000000000f2
[ 2283.049402] R13: 0000000000000000 R14: 0000000000b714ee R15: 0000000000000000
[ 2283.049403] Kernel panic - not syncing: Hard LOCKUP
[ 2284.159838] Shutting down cpus with NMI
[ 2284.159839] Kernel Offset: 0x26000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
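
The "Kernel Offset" line above is the KASLR displacement of the running kernel. A minimal sketch of how a runtime address in the kernel image range could be mapped back to the link-time address that gdb expects when used against the unrelocated vmlinux (as in the `list *` commands earlier in this thread) is shown here. The function name is hypothetical, and the translation only applies to kernel image addresses, not to heap or stack pointers.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF

# Taken from the "Kernel Offset" line in the log above.
KASLR_OFFSET = 0x26000000


def to_link_time(runtime_address: int, offset: int = KASLR_OFFSET) -> int:
    """Subtract the KASLR offset to get the address as linked in vmlinux."""
    return (runtime_address - offset) & MASK_64


# R12 from the register dump above lies in the kernel image range.
print(hex(to_link_time(0xFFFFFFFFA81F09A0)))
```

Offsets written as symbol+0x... in the call trace are already relative to the symbol and need no adjustment.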

@abonillabeeche

My kernel crashes are gone after installing kernel-default-5.3.18-59.19.1.

@aledbf

aledbf commented Nov 15, 2021

Same issue here: random reboots on v0.3.0.

Linux harvester 5.3.18-59.24-default #1 SMP Mon Sep 13 15:06:42 UTC 2021 (2f872ea) x86_64 x86_64 x86_64 GNU/Linux

Nov 15 00:05:24 harvester kernel: ------------[ cut here ]------------
Nov 15 00:05:24 harvester kernel: rq->tmp_alone_branch != &rq->leaf_cfs_rq_list
Nov 15 00:05:24 harvester kernel: WARNING: CPU: 33 PID: 0 at ../kernel/sched/fair.c:378 enqueue_task_fair+0x353/0x610
Nov 15 00:05:24 harvester kernel: Modules linked in: xt_set ipt_rpfilter xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net ip_set vxlan ip6_udp_tunnel udp_tunnel veth nf_conntrack_netlink nfnetlink xt_recent xt_statistic xt_nat xt_addrtype ipt_REJECT xt_tcpudp iptable_mangle i>
Nov 15 00:05:24 harvester kernel:  sr_mod cdrom uas usb_storage hid_generic usbhid ext4 crc32_pclmul crc32c_intel crc16 mbcache jbd2 ghash_clmulni_intel ixgbe ahci aesni_intel libahci nvme xfrm_algo xhci_pci dca crypto_simd xhci_hcd cryptd libata nvme_core libphy glue_helper usbcore>
Nov 15 00:05:24 harvester kernel: Supported: Yes
Nov 15 00:05:24 harvester kernel: CPU: 33 PID: 0 Comm: swapper/33 Not tainted 5.3.18-59.24-default #1 SLE15-SP3
Nov 15 00:05:24 harvester kernel: Hardware name: Supermicro SYS-2029BT-HNR/X11DPT-B, BIOS 3.3V1 07/22/2020
Nov 15 00:05:24 harvester kernel: RIP: 0010:enqueue_task_fair+0x353/0x610
Nov 15 00:05:24 harvester kernel: Code: 60 09 00 00 0f 84 cc fd ff ff 80 3d 9a b8 51 01 00 0f 85 bf fd ff ff 48 c7 c7 98 56 93 b5 c6 05 86 b8 51 01 01 e8 3d 0c fc ff <0f> 0b e9 a5 fd ff ff 49 63 95 48 0a 00 00 48 c7 c0 40 94 01 00 48
Nov 15 00:05:24 harvester kernel: RSP: 0018:ffffb3e58d10ce50 EFLAGS: 00010086
Nov 15 00:05:24 harvester kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Nov 15 00:05:24 harvester kernel: RDX: 000000000000002d RSI: ffffffffb64ecd6d RDI: 0000000000000046
Nov 15 00:05:24 harvester kernel: RBP: ffff9ff2bf96cd00 R08: ffffffffb64ecd40 R09: 000000000002c500
Nov 15 00:05:24 harvester kernel: R10: ffffb3e58d10cdd0 R11: 0000000080000021 R12: 0000000000000000
Nov 15 00:05:24 harvester kernel: R13: ffff9ff2bf96cc80 R14: 0000000000000001 R15: 0000000000000021
Nov 15 00:05:24 harvester kernel: FS:  0000000000000000(0000) GS:ffff9ff2bf940000(0000) knlGS:0000000000000000
Nov 15 00:05:24 harvester kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 15 00:05:24 harvester kernel: CR2: 0000000001d16d64 CR3: 0000003f7297c006 CR4: 00000000007706e0
Nov 15 00:05:24 harvester kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 15 00:05:24 harvester kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 15 00:05:24 harvester kernel: PKRU: 55555554
Nov 15 00:05:24 harvester kernel: Call Trace:
Nov 15 00:05:24 harvester kernel:  <IRQ>
Nov 15 00:05:24 harvester kernel:  ttwu_do_activate+0x72/0x170
Nov 15 00:05:24 harvester kernel:  try_to_wake_up+0x413/0x500
Nov 15 00:05:24 harvester kernel:  ? update_process_times+0x48/0x60
Nov 15 00:05:24 harvester kernel:  ? hrtimer_init_sleeper+0x90/0x90
Nov 15 00:05:24 harvester kernel:  hrtimer_wakeup+0x1e/0x30
Nov 15 00:05:24 harvester kernel:  __hrtimer_run_queues+0x108/0x280
Nov 15 00:05:24 harvester kernel:  hrtimer_interrupt+0xe5/0x240
Nov 15 00:05:24 harvester kernel:  ? sched_clock_local+0x12/0x80
Nov 15 00:05:24 harvester kernel:  smp_apic_timer_interrupt+0x6a/0x130
Nov 15 00:05:24 harvester kernel:  apic_timer_interrupt+0xf/0x20
Nov 15 00:05:24 harvester kernel:  </IRQ>
Nov 15 00:05:24 harvester kernel: RIP: 0010:cpuidle_enter_state+0xab/0x430
Nov 15 00:05:24 harvester kernel: Code: db 5f 11 4b e8 b6 6f 9d ff 49 89 c5 0f 1f 44 00 00 31 ff e8 a7 7f 9d ff 80 7c 24 0b 00 0f 85 db 01 00 00 fb 66 0f 1f 44 00 00 <45> 85 e4 0f 89 f4 01 00 00 c7 43 14 00 00 00 00 48 83 c4 10 44 89
Nov 15 00:05:24 harvester kernel: RSP: 0018:ffffb3e58ca1fe80 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
Nov 15 00:05:24 harvester kernel: RAX: ffff9ff2bf96cc80 RBX: ffffd3e57fb45f78 RCX: 000000108421fbea
Nov 15 00:05:24 harvester kernel: RDX: 000000108421fcd0 RSI: 000000108421fcd0 RDI: 0000000000000000
Nov 15 00:05:24 harvester kernel: RBP: ffffffffb5d5f100 R08: 0000000000000002 R09: 000000000002c500
Nov 15 00:05:24 harvester kernel: R10: ffffb3e58ca1fe40 R11: 000000000000003a R12: 0000000000000002
Nov 15 00:05:24 harvester kernel: R13: 000000108421fcd0 R14: 0000000000000002 R15: 0000000000000000
Nov 15 00:05:24 harvester kernel:  cpuidle_enter+0x29/0x40
Nov 15 00:05:24 harvester kernel:  do_idle+0x1f7/0x270
Nov 15 00:05:24 harvester kernel:  cpu_startup_entry+0x19/0x20
Nov 15 00:05:24 harvester kernel:  start_secondary+0x155/0x1a0
Nov 15 00:05:24 harvester kernel:  secondary_startup_64_no_verify+0xc2/0xd0
Nov 15 00:05:24 harvester kernel: ---[ end trace 12b883a5d65a0327 ]---

@bk201
Member

bk201 commented Nov 26, 2021

The master build ISO has the kernel updated, which should address this issue.

@guangbochen
Contributor

This will be resolved in the v1.0.0 GA version. For now, it can be verified using https://github.com/harvester/harvester/releases/tag/v1.0.0-rc1. Please feel free to re-open this issue if you encounter it with any 1.0+ version. Thanks.
