Hi all,
I've recently set up a ClusterHat v2.5 system with an RPi4 host, 2 PiZeros and 2 PiZero 2s, with all nodes using usbboot to boot bookworm (the current 2024-07-04 images).
On the first try, everything worked fine: the Zeros booted and were accessible via SSH.
Since the second try, however, I can't get _any_ of the nodes to USB boot anymore.
I've noticed multiple issues:
- Serial output on the modified Zero2 shows a kernel panic:
[ 32.881810] skbuff: skb_over_panic: text:ffffffe9127cbe7c len:-522752 put:-522752 head:ffffff8002e70c00 data:ffffff8002e70c40 tail:0xfff80640 end:0x640 dev:usb0
[ 32.896552] kernel BUG at net/core/skbuff.c:192!
[ 32.901258] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[ 32.908166] Modules linked in: cmac algif_hash aes_arm64 aes_generic algif_skcipher af_alg bnep brcmfmac_wcc vc4 snd_soc_hdmi_codec drm_display_helper cec drm_dma_helper drm_kms_helper brcmfmac brcmutil snd_soc_core cfg80211 hci_uart snd_compress btbcm snd_pcm_dmaengine bluetootho
[ 32.990421] CPU: 2 PID: 615 Comm: dhclient-script Tainted: G C 6.6.31+rpt-rpi-v8 #1 Debian 1:6.6.31-1+rpt1
[ 33.001747] Hardware name: Raspberry Pi Zero 2 W Rev 1.0 (DT)
[ 33.007590] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 33.014675] pc : skb_panic+0x54/0x60
[ 33.018329] lr : skb_panic+0x54/0x60
[ 33.021971] sp : ffffffc080013cf0
[ 33.025343] x29: ffffffc080013d00 x28: ffffff8002589080 x27: 0000000000000000
[ 33.032619] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000801000
[ 33.039893] x23: ffffff8003466990 x22: ffffff8002c42c80 x21: ffffff8003466940
[ 33.047167] x20: ffffff8002940e00 x19: 0000000000000000 x18: 0000000000000006
[ 33.054442] x17: ffffff96f2fbe000 x16: ffffffe927ccebe8 x15: ffffffc080013760
[ 33.061715] x14: 0000000000000003 x13: 666666663a747865 x12: 74203a63696e6170
[ 33.068989] x11: 7265766f5f626b73 x10: ffffffe9288a3710 x9 : ffffffe9273ff768
[ 33.076262] x8 : 00000000ffffefff x7 : ffffffe9288a3710 x6 : 80000000fffff000
[ 33.083537] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[ 33.090809] x2 : 0000000000000000 x1 : ffffff8003068000 x0 : 0000000000000094
[ 33.098083] Call trace:
[ 33.100573] skb_panic+0x54/0x60
[ 33.103864] skb_put+0x74/0x80
[ 33.106979] rx_complete+0xec/0x270 [u_ether]
[ 33.111446] usb_gadget_giveback_request+0x34/0xe8
[ 33.116328] dwc2_hsotg_complete_request+0x88/0x178 [dwc2]
[ 33.121962] dwc2_hsotg_handle_outdone+0xc4/0x1d8 [dwc2]
[ 33.127415] dwc2_hsotg_epint+0x9ac/0xe90 [dwc2]
[ 33.132160] dwc2_hsotg_irq+0x8f0/0xea8 [dwc2]
[ 33.136729] __handle_irq_event_percpu+0x60/0x230
[ 33.141529] handle_irq_event+0x54/0xc0
[ 33.145442] handle_level_irq+0xc8/0x1b0
[ 33.149442] generic_handle_domain_irq+0x34/0x58
[ 33.154149] bcm2836_chained_handle_irq+0x30/0x58
[ 33.158948] generic_handle_domain_irq+0x34/0x58
[ 33.163655] bcm2836_arm_irqchip_handle_irq+0x64/0x80
[ 33.168800] call_on_irq_stack+0x24/0x58
[ 33.172800] do_interrupt_handler+0x88/0x98
[ 33.177065] el1_interrupt+0x34/0x68
[ 33.180713] el1h_64_irq_handler+0x18/0x28
[ 33.184890] el1h_64_irq+0x64/0x68
[ 33.188355] finish_task_switch.isra.0+0x7c/0x258
[ 33.193151] __schedule+0x380/0xd60
[ 33.196705] schedule+0x64/0x108
[ 33.199993] do_wait+0x15c/0x2f8
[ 33.203283] kernel_wait4+0xa8/0x198
[ 33.206925] __do_sys_wait4+0xe8/0x108
[ 33.210744] __arm64_sys_wait4+0x2c/0x40
[ 33.214739] invoke_syscall+0x50/0x128
[ 33.218564] el0_svc_common.constprop.0+0x48/0xf0
[ 33.223359] do_el0_svc+0x24/0x38
[ 33.226740] el0_svc+0x40/0xe8
[ 33.229856] el0t_64_sync_handler+0x100/0x130
[ 33.234297] el0t_64_sync+0x190/0x198
[ 33.238035] Code: 29572107 a90027e8 91346000 97d8dd96 (d4210000)
[ 33.244244] ---[ end trace 0000000000000000 ]---
[ 33.248954] Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt
[ 33.256479] SMP: stopping secondary CPUs
[ 33.260478] Kernel Offset: 0x28a7200000 from 0xffffffc080000000
[ 33.266498] PHYS_OFFSET: 0x0
[ 33.269427] CPU features: 0x0,0000000d,00020000,0000421b
[ 33.274832] Memory Limit: none
[ 33.277945] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt ]---
- Occasionally, a netdev watchdog is triggered on the host:
Jan 10 18:24:27 cbridge kernel: ------------[ cut here ]------------
Jan 10 18:24:27 cbridge kernel: NETDEV WATCHDOG: ethupi2 (rndis_host): transmit queue 0 timed out 5572 ms
Jan 10 18:24:27 cbridge kernel: WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x2a8/0x2b8
Jan 10 18:24:27 cbridge kernel: Modules linked in: 8021q garp rndis_wlan rndis_host cdc_ether cdc_acm bridge stp llc cmac algif_hash aes_arm64 aes_generic algif_skcipher af_alg bnep nft_chain_nat xt_MASQUERADE vc4 nf_nat xt_conntrack brcmfmac_wcc nf_conntrack hci_uart snd_soc_hdmi_c>
Jan 10 18:24:27 cbridge kernel: CPU: 2 PID: 0 Comm: swapper/2 Tainted: G C 6.6.31+rpt-rpi-v8 #1 Debian 1:6.6.31-1+rpt1
Jan 10 18:24:27 cbridge kernel: Hardware name: Raspberry Pi 4 Model B Rev 1.5 (DT)
Jan 10 18:24:27 cbridge kernel: pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
Jan 10 18:24:27 cbridge kernel: pc : dev_watchdog+0x2a8/0x2b8
Jan 10 18:24:27 cbridge kernel: lr : dev_watchdog+0x2a8/0x2b8
Jan 10 18:24:27 cbridge kernel: sp : ffffffc080013db0
Jan 10 18:24:27 cbridge kernel: x29: ffffffc080013db0 x28: ffffffec5c758b18 x27: ffffffc080013ee0
Jan 10 18:24:27 cbridge kernel: x26: ffffffec5ce94008 x25: 00000000000015c4 x24: ffffffec5d226000
Jan 10 18:24:27 cbridge kernel: x23: 0000000000000000 x22: ffffff80444483dc x21: ffffff8044448000
Jan 10 18:24:27 cbridge kernel: x20: ffffff8043977400 x19: ffffff8044448488 x18: ffffffffffffffff
Jan 10 18:24:27 cbridge kernel: x17: 756f2064656d6974 x16: 2030206575657571 x15: 2074696d736e6172
Jan 10 18:24:27 cbridge kernel: x14: 74203a2974736f68 x13: 736d203237353520 x12: 74756f2064656d69
Jan 10 18:24:27 cbridge kernel: x11: 7420302065756575 x10: ffffffec5d2a3710 x9 : ffffffec5bd1da8c
Jan 10 18:24:27 cbridge kernel: x8 : 00000000ffffefff x7 : ffffffec5d2a3710 x6 : 80000000fffff000
Jan 10 18:24:27 cbridge kernel: x5 : 0000000000000000 x4 : 0000000000000040 x3 : 0000000000000004
Jan 10 18:24:27 cbridge kernel: x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffff80402bdc40
Jan 10 18:24:27 cbridge kernel: Call trace:
Jan 10 18:24:27 cbridge kernel: dev_watchdog+0x2a8/0x2b8
Jan 10 18:24:27 cbridge kernel: call_timer_fn+0x3c/0x1c8
Jan 10 18:24:27 cbridge kernel: __run_timers+0x25c/0x330
Jan 10 18:24:27 cbridge kernel: run_timer_softirq+0x28/0x50
Jan 10 18:24:27 cbridge kernel: __do_softirq+0x118/0x384
Jan 10 18:24:27 cbridge kernel: ____do_softirq+0x18/0x30
Jan 10 18:24:27 cbridge kernel: call_on_irq_stack+0x24/0x58
Jan 10 18:24:27 cbridge kernel: do_softirq_own_stack+0x24/0x38
Jan 10 18:24:27 cbridge kernel: irq_exit_rcu+0x8c/0xd0
Jan 10 18:24:27 cbridge kernel: el1_interrupt+0x38/0x68
Jan 10 18:24:27 cbridge kernel: el1h_64_irq_handler+0x18/0x28
Jan 10 18:24:27 cbridge kernel: el1h_64_irq+0x64/0x68
Jan 10 18:24:27 cbridge kernel: default_idle_call+0x5c/0x170
Jan 10 18:24:27 cbridge kernel: do_idle+0x204/0x238
Jan 10 18:24:27 cbridge kernel: cpu_startup_entry+0x3c/0x50
Jan 10 18:24:27 cbridge kernel: secondary_start_kernel+0x128/0x150
Jan 10 18:24:27 cbridge kernel: __secondary_switched+0xb8/0xc0
Jan 10 18:24:27 cbridge kernel: ---[ end trace 0000000000000000 ]---
- The USB serial connection gives no output on any of the nodes, maybe because it only comes up immediately before a potential kernel panic?
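A side note on the skb_over_panic line in the first trace: the numbers look self-consistent if the skb tail is kept as a 32-bit offset from head (my assumption, which I believe holds for 64-bit kernels) and a negative length was added to it. This is only a rough sketch of that arithmetic using the values from the log; the variable names are mine, not kernel identifiers:

```python
# Values copied from the skb_over_panic line in the Zero 2 serial log.
head = 0xFFFFFF8002E70C00
data = 0xFFFFFF8002E70C40
put = -522752          # length passed to skb_put, as logged
end = 0x640            # logged end offset
logged_tail = 0xFFF80640

# Assumption: tail is a 32-bit unsigned offset from head, starting at data - head.
tail_before = data - head
tail_after = (tail_before + put) & 0xFFFFFFFF  # wrap like an unsigned 32-bit value

print(hex(tail_before))           # headroom before the put
print(hex(tail_after))            # wrapped tail offset after the put
print(tail_after == logged_tail)  # does it match the logged tail?
print(tail_after > end)           # tail past end is what triggers the BUG
```

If that reading is right, rx_complete in u_ether was handed a negative (bogus) received length for usb0, which would point at the USB transfer itself rather than at anything in the root filesystem.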
As standalone devices, the zeros seem to boot just fine.
Does anyone have an idea what I could have messed up?
Thanks,
Sebastian