ClusterHat controller stops communicating after apt update/upgrade


Tony Brack

Sep 24, 2021, 11:23:04 PM
to ClusterHAT
Greetings All,

I posted something in May and can't seem to add to it. I never really got an answer, but I have been doing a lot of troubleshooting since then and want to report a consistent error I am having. I am surprised I was unable to find any reference to it here.

My ClusterHat is v2.4, and I have 3 Pi ZEROs and 1 Pi ZERO W on a 2GB Pi 4.

The problem is reproducible on both the latest 32 bit image and the 64 bit beta.


Details:

1. load P1-P4 with the latest images
2020-12-02-1-ClusterCTRL-armhf-lite-pX.zip

- touch ssh
- update
- upgrade

2. load the controller image (the version seems unimportant, so I use minimum & CNAT)

2020-12-02-1-ClusterCTRL-armhf-lite-CNAT.zip
2020-08-20-8-ClusterCTRL-arm64-lite-CNAT.zip
2020-08-20-8-ClusterCTRL-arm64-lite-CBRIDGE.zip


- touch ssh

At this point one can validate that everything works just fine. I can ssh into the slave nodes and clusterctrl brings them up and down ... I have not messed around with the serial devices, but all devices seem present (FYI: working on a conserver build)

sudo apt update
sudo apt install cockpit


... things still seem to work - BUT:

sudo apt install rcs cvs groff python3-pip apache2 php net-tools
sudo apt install bind9 bind9utils bind9-doc dnsutils
sudo apt install samba python3-pip mariadb-server
sudo apt autoremove
(sudo apt upgrade) - doesn't seem to matter


... and now I can't get a route to any of the slaves anymore. The devices are still present, and neither a warm nor a cold boot makes any difference. Clusterctrl reports the correct on/off status. See my prior post, or I can gather additional data.

I popped the SD card out and reloaded a fresh image without all of my software. The world is back ... but I am still a year back-rev on updates. If I attempt to install software without first performing the 'apt update', I find corrupt or incomplete repositories, so the 'apt update' is mandatory to have a consistent environment.

Regards & Thanks,
Tony

Tony Brack

Sep 24, 2021, 11:36:56 PM
to ClusterHAT
... oh, I forgot to mention that USB boots fail as well.

This was only tested with the latest 64 bit cbridge after 'apt update/upgrade', so this is consistent and in line with my other observations. It was not a serious attempt at building a configuration ...

Tony

Tony Brack

Sep 25, 2021, 1:07:47 AM
to ClusterHAT
I just loaded the 64 bit cbridge image, validated functionality and added a few packages. Note that other than supplied dependencies, not much has been loaded. I have seen these dumps before ...

This is after:

pi@cbridge:~ $ history | grep sudo
    1  sudo apt update
   12  sudo apt install cockpit
   14  sudo systemctl enable cockpit.socket
   15  sudo apt install cvs rcs dump
   17  sudo clusterctrl off p1
   18  sudo clusterctrl on p1
   23  sudo clusterctrl off p1
   24  sudo reboot
   25  sudo clusterctrl on p1


Message of interest:

[  103.183434] device ethpi1 entered promiscuous mode
[  103.204808] br0: port 2(ethpi1) entered blocking state
[  103.204819] br0: port 2(ethpi1) entered forwarding state
[  119.013948] cdc_acm 1-1.4.4:1.4: failed to set dtr/rts
[  123.877701] ------------[ cut here ]------------
[  123.877733] NETDEV WATCHDOG: ethpi1 (rndis_host): transmit queue 0 timed out
[  123.877807] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:448 dev_watchdog+0x394/0x3a0
[  123.877821] Modules linked in: rndis_wlan rndis_host cdc_acm cdc_ether aes_neon_blk crypto_simd cryptd bnep hci_uart btbcm bluetooth ecdh_generic ecc bridge 8021q garp stp llc nft_chain_nat xt_MASQUERADE nf_nat nft_counter xt_conntrack nf_conntrack nf_defrag_ipv4 nft_compat nf_tables nfnetlink brcmfmac brcmutil evdev vc4 sha256_generic libsha256 cec cfg80211 drm_kms_helper v3d gpu_sched drm drm_panel_orientation_quirks raspberrypi_hwmon rfkill bcm2835_isp(C) bcm2835_codec(C) snd_soc_core bcm2835_v4l2(C) v4l2_mem2mem videobuf2_dma_contig videobuf2_vmalloc bcm2835_mmal_vchiq(C) snd_bcm2835(C) videobuf2_memops snd_compress videobuf2_v4l2 snd_pcm_dmaengine videobuf2_common snd_pcm i2c_bcm2835 videodev snd_timer syscopyarea snd mc sysfillrect sysimgblt vc_sm_cma(C) fb_sys_fops rpivid_mem uio_pdrv_genirq uio nfsd i2c_dev ip_tables x_tables ipv6 nf_defrag_ipv6
[  123.877982] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G         C        5.4.51-v8+ #1333
[  123.877988] Hardware name: Raspberry Pi 4 Model B Rev 1.2 (DT)
[  123.877996] pstate: 80000005 (Nzcv daif -PAN -UAO)
[  123.878006] pc : dev_watchdog+0x394/0x3a0
[  123.878014] lr : dev_watchdog+0x394/0x3a0
[  123.878019] sp : ffffffc01001bd50
[  123.878024] x29: ffffffc01001bd50 x28: ffffff806faada80
[  123.878035] x27: 0000000000000004 x26: 0000000000000140
[  123.878044] x25: 00000000ffffffff x24: 0000000000000003
[  123.878053] x23: ffffff807617145c x22: ffffff8076171000
[  123.878061] x21: ffffff8076171480 x20: ffffffc010e36000
[  123.878069] x19: 0000000000000000 x18: ffffffc010e38888
[  123.878077] x17: 0000000000000000 x16: 0000000000000000
[  123.878085] x15: ffffffc010f71670 x14: ffffffffffffffff
[  123.878099] x13: ffffffc010f712c8 x12: 0000000000009c40
[  123.878107] x11: 0000000000000000 x10: 0000000000000189
[  123.878115] x9 : 0000000000000003 x8 : 0000000000000189
[  123.878123] x7 : 0000000000000000 x6 : ffffff807fbc7158
[  123.878131] x5 : 0000000000000000 x4 : fffffffffffffff0
[  123.878139] x3 : 0000000000000000 x2 : 0000000000000100
[  123.878146] x1 : a59d8b5a0fa32600 x0 : 0000000000000000
[  123.878155] Call trace:
[  123.878164]  dev_watchdog+0x394/0x3a0
[  123.878176]  call_timer_fn+0x3c/0x1e0
[  123.878185]  run_timer_softirq+0x268/0x520
[  123.878194]  __do_softirq+0x184/0x404
[  123.878204]  irq_exit+0xf4/0xf8
[  123.878215]  __handle_domain_irq+0x90/0x100
[  123.878222]  gic_handle_irq+0x68/0xc0
[  123.878228]  el1_irq+0xbc/0x180
[  123.878238]  arch_cpu_idle+0x38/0x218
[  123.878247]  default_idle_call+0x24/0x48
[  123.878257]  do_idle+0x21c/0x260
[  123.878265]  cpu_startup_entry+0x28/0x48
[  123.878275]  secondary_start_kernel+0x1b0/0x208
[  123.878286] ---[ end trace 0ac15a792756e3de ]---
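
For anyone comparing captures across boots, the watchdog warning above can be pulled out of a saved dmesg dump with a few lines of Python. This is only a sketch, assuming the log format shown in this post; the function name `find_watchdog_timeouts` is made up for illustration and is not part of any ClusterCTRL tooling.

```python
import re

# Matches kernel lines such as:
# [  123.877733] NETDEV WATCHDOG: ethpi1 (rndis_host): transmit queue 0 timed out
WATCHDOG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NETDEV WATCHDOG: (?P<iface>\S+) "
    r"\((?P<driver>[^)]+)\): transmit queue (?P<queue>\d+) timed out"
)


def find_watchdog_timeouts(dmesg_text):
    """Return a list of (timestamp, interface, driver, queue) tuples."""
    hits = []
    for line in dmesg_text.splitlines():
        m = WATCHDOG_RE.search(line)
        if m:
            hits.append(
                (float(m["ts"]), m["iface"], m["driver"], int(m["queue"]))
            )
    return hits


# Two lines copied from the dump above.
sample = """\
[  119.013948] cdc_acm 1-1.4.4:1.4: failed to set dtr/rts
[  123.877733] NETDEV WATCHDOG: ethpi1 (rndis_host): transmit queue 0 timed out
"""

for ts, iface, driver, queue in find_watchdog_timeouts(sample):
    print(f"{ts:.6f}s: {iface} ({driver}) queue {queue} timed out")
```

Feeding it the text of `dmesg` saved after each power cycle would show whether the warning recurs and on which ethpiN interface.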


Here is as much as I can conveniently send you:

pi@cbridge:~ $ hostnamectl
   Static hostname: cbridge
         Icon name: computer
        Machine ID: 1d1920b42b084cd98f93d94631da560a
           Boot ID: 5ece33b4adea458b8803a255d5bff24d
  Operating System: Debian GNU/Linux 10 (buster)
            Kernel: Linux 5.4.51-v8+
      Architecture: arm64

pi@cbridge:~ $ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

pi@cbridge:~ $ ls -al /dev | grep ttypi
lrwxrwxrwx  1 root root           7 Sep 25 05:47 ttypi1 -> ttyACM0
lrwxrwxrwx  1 root root           7 Sep 25 05:47 ttypi1a -> ttyACM1

pi@cbridge:~ $ ifconfig -a
br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.40  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::8ea2:5cd1:cb20:e13c  prefixlen 64  scopeid 0x20<link>
        ether dc:a6:32:ad:7d:78  txqueuelen 1000  (Ethernet)
        RX packets 3819  bytes 713676 (696.9 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 471  bytes 108204 (105.6 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

brint: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.19.180.254  netmask 255.255.255.0  broadcast 172.19.180.255
        inet6 fe80::6439:2aff:fe1c:e5de  prefixlen 64  scopeid 0x20<link>
        ether 66:39:2a:1c:e5:de  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 40  bytes 4930 (4.8 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether dc:a6:32:ad:7d:78  txqueuelen 1000  (Ethernet)
        RX packets 3475  bytes 750566 (732.9 KiB)
        RX errors 0  dropped 3  overruns 0  frame 0
        TX packets 827  bytes 127441 (124.4 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ethpi1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::222:82ff:feff:fe01  prefixlen 64  scopeid 0x20<link>
        ether 00:22:82:ff:fe:01  txqueuelen 1000  (Ethernet)
        RX packets 356  bytes 14253 (13.9 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 7  bytes 1161 (1.1 KiB)
        TX errors 2610  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 2  bytes 78 (78.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2  bytes 78 (78.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

wlan0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 56:7a:a2:c6:5b:80  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
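
The one number that stands out in the listing above is the TX error count on ethpi1 (2610, against 0 everywhere else). A rough sketch for picking that out of saved `ifconfig -a` output, assuming the net-tools layout shown here; `tx_errors_by_interface` is a hypothetical helper name.

```python
import re


def tx_errors_by_interface(ifconfig_text):
    """Map interface name -> TX error count from net-tools `ifconfig` output."""
    errors = {}
    current = None
    for line in ifconfig_text.splitlines():
        # An unindented "name: flags=..." line starts a new interface block.
        header = re.match(r"^(\S+): flags=", line)
        if header:
            current = header.group(1)
            continue
        m = re.search(r"TX errors (\d+)", line)
        if m and current is not None:
            errors[current] = int(m.group(1))
    return errors


# Trimmed copy of two interface blocks from the listing above.
sample = """\
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        TX packets 827  bytes 127441 (124.4 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ethpi1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        TX packets 7  bytes 1161 (1.1 KiB)
        TX errors 2610  dropped 0 overruns 0  carrier 0  collisions 0
"""

for iface, count in tx_errors_by_interface(sample).items():
    if count:
        print(f"{iface}: {count} TX errors")
```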


Please let me know if this helps or if there is anything else that can be done, other than staying back-rev?

/Tony

Chris Burton

Sep 30, 2021, 1:58:40 AM
to ClusterHAT
Hi,

> [  123.877733] NETDEV WATCHDOG: ethpi1 (rndis_host): transmit queue 0 timed out

I've seen these types of error before (as have many others, judging by Google results) but I'm not sure what causes them.

Are you seeing them continuously, or do they go away after power cycling?

Chris.

Tony Brack

Oct 1, 2021, 8:28:18 PM
to ClusterHAT
Hi Chris,

From what I can tell, it seems to be completely reproducible on 3 different clusters that I have (cbridge or cnat on 2 Pi4s and a Pi3B+, both 32 and 64 bit). Just load the latest 64 bit image, verify functionality, and do an apt update followed by apt upgrade or install some packages. It seems that just installing cockpit was enough to cause it to fail. All 3 are v2.4 clusterhats. After the 'apt upgrade', communications to the Pi ZEROs fail. Trying again with 'apt full-upgrade'.

The slave (Pi ZERO) images do not seem to matter.

I only see this message when I boot a slave Pi ZERO.

Tony

Tony Brack

Oct 1, 2021, 8:53:18 PM
to ClusterHAT
Hi Chris,

To verify this, I just tried it again on a Pi4b. I put ssh on the boot partition of the latest 64 bit image, then booted it. The following is a complete record of everything I did. Everything works fine with the un-updated image.

pi@cnat:~ $ hostnamectl
   Static hostname: cnat
         Icon name: computer
        Machine ID: 0f3c5680c02a4cfcaa21f0f44703dcbb
           Boot ID: 38e40f0d7821403b8d93b9b9ec20dc64
  Operating System: Debian GNU/Linux 10 (buster)
            Kernel: Linux 5.10.63-v8+
      Architecture: arm64

pi@cnat:~ $ history
    1  sudo apt update
    2  sudo apt full-upgrade
    3  sudo apt install cockpit
    4  sudo systemctrl enable cockpit.socket
    5  sudo systemctl enable cockpit.socket
    6  sudo reboot
    7  history
pi@cnat:~ $ sudo clusterctrl on p1
pi@cnat:~ $ sudo clusterctrl status
clusterhat:1
clusterctrl:False
maxpi:4
throttled:0x0
hat_version:2.4
hat_version_major:2
hat_version_minor:4
hat_size:4
hat_uuid:de91a4ce-ac7f-11e9-a2a3-2a2ae2dbcce4
hat_vendor:8086 Consultancy
hat_product_id:0x0004
hat_alert:0
hat_hub:1
hat_wp:1
hat_led:1
hat_wplink:0
hat_xra1200p:True
p1:1
p2:0
p3:0
p4:0
pi@cnat:~ $
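
The status output above is plain `key:value` lines, so it is easy to turn into a dictionary and check which nodes the controller believes are powered. A minimal sketch, assuming (from the output above, where only p1 was switched on) that `pN:1` means on and `pN:0` means off; the helper names are hypothetical.

```python
import re


def parse_status(text):
    """Parse `clusterctrl status` style key:value lines into a dict."""
    status = {}
    for line in text.splitlines():
        # Split on the first colon only; values may contain spaces.
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status


def powered_nodes(status):
    """Return the pN keys whose value is '1' (assumed to mean powered on)."""
    return sorted(
        key for key, value in status.items()
        if re.fullmatch(r"p\d+", key) and value == "1"
    )


# A few lines copied from the status output above.
sample = """\
maxpi:4
hat_version:2.4
hat_vendor:8086 Consultancy
p1:1
p2:0
p3:0
p4:0
"""

status = parse_status(sample)
print("HAT version:", status["hat_version"])
print("Powered nodes:", powered_nodes(status))
```

Comparing this list against which ethpiN interfaces actually pass traffic would separate a power-control problem from a USB networking one.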

And then took a look at the console. This took a few minutes to pop up:

[  226.830668] usb 1-1.4.4: new high-speed USB device number 5 using xhci_hcd
[  231.982924] usb 1-1.4.4: device descriptor read/64, error -110
[  238.954996] usb 1-1.4.4: device descriptor read/64, error -71
[  239.142793] usb 1-1.4.4: new high-speed USB device number 6 using xhci_hcd
[  239.243863] usb 1-1.4.4: New USB device found, idVendor=3171, idProduct=0020, bcdDevice= 1.00
[  239.243881] usb 1-1.4.4: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[  239.243895] usb 1-1.4.4: Product: ClusterCTRL
[  239.243907] usb 1-1.4.4: Manufacturer: 8086 Consultancy
[  239.243919] usb 1-1.4.4: SerialNumber: 1
[  239.302208] usbcore: registered new interface driver cdc_ether
[  239.311834] cdc_acm 1-1.4.4:1.2: ttyACM0: USB ACM device
[  239.314303] rndis_host 1-1.4.4:1.0 eth1: register 'rndis_host' at usb-0000:01:00.0-1.4.4, RNDIS device, 00:22:82:ff:fe:01
[  239.314508] usbcore: registered new interface driver rndis_host
[  239.324086] cdc_acm 1-1.4.4:1.4: ttyACM1: USB ACM device
[  239.324748] usbcore: registered new interface driver cdc_acm
[  239.324760] cdc_acm: USB Abstract Control Model driver for USB modems and ISDN adapters
[  239.324775] usbcore: registered new interface driver rndis_wlan
[  239.381231] rndis_host 1-1.4.4:1.0 ethpi1: renamed from eth1
[  239.558633] br0: port 1(ethpi1) entered blocking state
[  239.558645] br0: port 1(ethpi1) entered disabled state
[  239.559027] device ethpi1 entered promiscuous mode
[  239.581067] br0: port 1(ethpi1) entered blocking state
[  239.581079] br0: port 1(ethpi1) entered forwarding state
[  302.047870] ------------[ cut here ]------------
[  302.047912] NETDEV WATCHDOG: ethpi1 (rndis_host): transmit queue 0 timed out
[  302.047991] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:468 dev_watchdog+0x3a0/0x3a8
[  302.048000] Modules linked in: rndis_wlan rndis_host cdc_acm cdc_ether bridge bnep hci_uart btbcm bluetooth ecdh_generic ecc 8021q garp stp llc nft_chain_nat xt_MASQUERADE nf_nat nft_counter xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables nfnetlink brcmfmac brcmutil sg cfg80211 bcm2835_codec(C) rfkill bcm2835_v4l2(C) bcm2835_isp(C) bcm2835_mmal_vchiq(C) raspberrypi_hwmon v4l2_mem2mem videobuf2_vmalloc videobuf2_dma_contig videobuf2_memops videobuf2_v4l2 vc4 cec i2c_bcm2835 videobuf2_common drm_kms_helper videodev mc snd_soc_core vc_sm_cma(C) snd_compress snd_bcm2835(C) v3d snd_pcm_dmaengine snd_pcm gpu_sched snd_timer snd drm syscopyarea drm_panel_orientation_quirks rpivid_mem sysfillrect sysimgblt fb_sys_fops backlight uio_pdrv_genirq uio nvmem_rmem i2c_dev aes_neon_bs sha256_generic aes_neon_blk crypto_simd cryptd nfsd ip_tables x_tables ipv6
[  302.048439] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G         C        5.10.63-v8+ #1457
[  302.048445] Hardware name: Raspberry Pi 4 Model B Rev 1.1 (DT)
[  302.048456] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
[  302.048465] pc : dev_watchdog+0x3a0/0x3a8
[  302.048473] lr : dev_watchdog+0x3a0/0x3a8
[  302.048479] sp : ffffffc0115cbd10
[  302.048486] x29: ffffffc0115cbd10 x28: ffffff80449c9080
[  302.048503] x27: 0000000000000004 x26: 0000000000000140
[  302.048518] x25: 00000000ffffffff x24: 0000000000000002
[  302.048533] x23: ffffffc011286000 x22: ffffff8046f473dc
[  302.048548] x21: ffffff8046f47000 x20: ffffff8046f47480
[  302.048563] x19: 0000000000000000 x18: 0000000000000000
[  302.048577] x17: 0000000000000000 x16: 0000000000000000
[  302.048592] x15: ffffffffffffffff x14: ffffffc011288948
[  302.048607] x13: ffffffc011471c10 x12: ffffffc0113154b8
[  302.048622] x11: 0000000000000003 x10: ffffffc0112fd478
[  302.048637] x9 : ffffffc0100e62b8 x8 : 0000000000017fe8
[  302.048652] x7 : c0000000ffffefff x6 : 0000000000000003
[  302.048666] x5 : 0000000000000000 x4 : 0000000000000000
[  302.048681] x3 : 0000000000000103 x2 : 0000000000000102
[  302.048703] x1 : cd860106a7ebd700 x0 : 0000000000000000
[  302.048718] Call trace:
[  302.048728]  dev_watchdog+0x3a0/0x3a8
[  302.048742]  call_timer_fn+0x38/0x200
[  302.048752]  run_timer_softirq+0x298/0x548
[  302.048761]  __do_softirq+0x1a8/0x510
[  302.048771]  irq_exit+0xe8/0x108
[  302.048781]  __handle_domain_irq+0xa0/0x110
[  302.048788]  gic_handle_irq+0xb0/0xf0
[  302.048796]  el1_irq+0xcc/0x180
[  302.048810]  arch_cpu_idle+0x18/0x28
[  302.048820]  default_idle_call+0x58/0x1d4
[  302.048831]  do_idle+0x25c/0x270
[  302.048841]  cpu_startup_entry+0x2c/0x70
[  302.048852]  secondary_start_kernel+0x168/0x178
[  302.048859] ---[ end trace 98ab9a56382b0271 ]---
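
The two "device descriptor read/64" failures near the top of that log carry negated Linux errno values: -110 is ETIMEDOUT and -71 is EPROTO. A small sketch for decoding such lines from a saved capture; the lookup table only covers the two codes seen in this thread, and `decode_usb_errors` is a made-up name.

```python
import re

# Linux errno values for the two codes in the log above;
# the kernel prints them negated.
LINUX_ERRNO = {71: "EPROTO", 110: "ETIMEDOUT"}

# Matches e.g. "usb 1-1.4.4: device descriptor read/64, error -110"
ERR_RE = re.compile(r"usb (?P<port>[\d.\-]+): .*error (?P<code>-\d+)")


def decode_usb_errors(dmesg_text):
    """Return (usb_port, raw_code, errno_name) for each USB error line."""
    decoded = []
    for line in dmesg_text.splitlines():
        m = ERR_RE.search(line)
        if m:
            code = int(m["code"])
            decoded.append(
                (m["port"], code, LINUX_ERRNO.get(-code, "unknown"))
            )
    return decoded


# Lines copied from the console log above.
sample = """\
[  226.830668] usb 1-1.4.4: new high-speed USB device number 5 using xhci_hcd
[  231.982924] usb 1-1.4.4: device descriptor read/64, error -110
[  238.954996] usb 1-1.4.4: device descriptor read/64, error -71
"""

for port, code, name in decode_usb_errors(sample):
    print(f"usb {port}: error {code} ({name})")
```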

I don't know if this helps or hurts. If you would like a copy of the MicroSD, it should be relatively small once compressed. I can supply it if you give me a place. Literally, I have not even run raspi-config.

Regards,
Tony

Tony Brack

Nov 19, 2021, 12:37:07 PM
to ClusterHAT
The new "testing" versions seem to have provided relief and are working. Ironically they seem stable, while the "stable" releases do not. Having said this, there is an issue with the site ...

2021-05-07-2-ClusterCTRL-armhf-full-usbboot.tar.xz

has incorrect permissions set and cannot be downloaded. The 64 bit PX image(s) will boot a PiZ2W but will not establish a stable WiFi connection. Serial communications work to a PC, but not with a cbridge image running. Still looking into that, but the PiZ and PiZW (original) cluster clients work just fine on 64 bit for both cnat and cbridge controllers as far as I can tell as of this writing.

Thanks & Regards,
Tony

Chris Burton

Nov 20, 2021, 3:36:55 AM
to ClusterHAT
Hi,
> The new "testing" versions seem to have provided relief and are working. Ironically they seem stable, while the "stable" releases do not. Having said this, there is an issue with the site ...
>
> 2021-05-07-2-ClusterCTRL-armhf-full-usbboot.tar.xz
>
> has incorrect permissions set and cannot be downloaded. The 64 bit PX image(s) will boot a PiZ2W but will not establish a stable WiFi connection. Serial communications work to a PC, but not with a cbridge image running. Still looking into that, but the PiZ and PiZW (original) cluster clients work just fine on 64 bit for both cnat and cbridge controllers as far as I can tell as of this writing.

I'm not sure how I managed that on just one file but the perms are fixed now, thanks for letting me know.

I'm testing the bullseye images atm, but once I've done those I'll try and take a look at the buster wifi on pizero2w.

Chris.