ClusterCTRL Pi Zeros unreachable after a while

pimovietc

May 24, 2021, 11:57:27 AM

to ClusterHAT

I've set up a Pi 4 with a ClusterCTRL HAT and a single Pi Zero (for now; I'll populate the rest once things are stable). I am using the CNAT image on the controller, which is connected to my WiFi.

Initially things work fine: I can ssh into my Pi Zero (172.19.181.1), and the Pi Zero has internet access through the controller. I have a PiCam connected to the Pi Zero, and my camera script is working fine (a near-identical script also runs on the controller).

I've added an iptables rule to forward a connection to the Pi Zero so I can access it from my LAN, and this also works wonderfully.

So everything appears to be working fine. I leave it running, and after a while the Pi Zero becomes unreachable. When I ssh to it, the connection simply times out. I can't ping it either.

As my own camera script grabs a picture every 5 minutes, I can tell that it has suddenly stopped working (losing the network alone should not stop it). When I inspected kern.log and syslog, I could see that there are no entries after the 'crash' (until I reboot).
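One way to pin down when the Pi Zero stopped is to look for gaps in the capture times. The following is a minimal sketch, assuming the capture timestamps can be collected into a list (for example from file modification times); the function name `find_gaps`, the one-minute tolerance, and the sample times are made up for illustration.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, expected=timedelta(minutes=5), tolerance=timedelta(minutes=1)):
    """Return (last_seen, next_seen) pairs where two consecutive captures
    are further apart than expected + tolerance."""
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > expected + tolerance:
            gaps.append((earlier, later))
    return gaps

# Made-up capture times: one capture every 5 minutes, with a 30-minute hole.
captures = [
    datetime(2021, 5, 28, 10, 0),
    datetime(2021, 5, 28, 10, 5),
    datetime(2021, 5, 28, 10, 10),
    datetime(2021, 5, 28, 10, 40),
    datetime(2021, 5, 28, 10, 45),
]

for last_seen, next_seen in find_gaps(captures):
    print(f"no captures between {last_seen} and {next_seen}")
```

The start of each reported gap narrows down which part of kern.log and syslog to read first.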

I can 'fix' the problem by toggling clusterctrl off/on.

Any pointers how to debug?

Chris Burton

May 25, 2021, 4:39:43 PM

to ClusterHAT

Hi,

> Any pointers how to debug?

I'd advise checking the Pi Zero is pushed onto the ClusterHAT firmly - see the picture in step 6 of the troubleshooting guide which shows how far on the Pi Zero should be.

What does the output of "clusterctrl status" show on the controller after it's gone wrong?

I'd also try with an alternate SD card in the Pi Zero to rule out that being an issue.

Chris.

pimovietc

May 25, 2021, 4:56:39 PM

to ClusterHAT

Hi Chris,

Thank you for your quick reply. You may well be right that it wasn't plugged in fully: I disconnected it, re-plugged it, and could feel a "click". I'll test it tomorrow and see whether it still crashes.

I remember running "clusterctrl status" earlier today when it had crashed, and I believe it looked normal. I'll post the output if it crashes again tomorrow. At the very least, I'm 100% sure it said "clusterhat:1".

At the moment I don't have a spare SD card available, but I'll source one and try swapping it if the problem persists.

Pim

pimovietc

May 26, 2021, 6:03:42 AM

to ClusterHAT

pi1 has become unresponsive again, so as promised here is the output of clusterctrl status.

I also ran the other commands suggested in the troubleshooting guide (as I can't make it fail on demand).

I'm using the CNAT image and connect over WiFi. I also set up the following iptables rule to forward my app on pi1 to my LAN.

iptables -t nat -A PREROUTING -p tcp --dport 9301 -i wlan0 -j DNAT --to 172.19.181.1:9300
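To tell a forwarding problem apart from a crashed Pi Zero, it can help to test both ends of that rule. The sketch below assumes the addresses and ports shown in this thread (the controller's wlan0 address 192.168.68.135 with port 9301, and pi1 at 172.19.181.1 with port 9300); the helper name `port_open` is made up for illustration.

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False

if __name__ == "__main__":
    # Addresses and ports are taken from this thread; adjust for your own setup.
    # From a LAN machine: the controller's wlan0 address and the forwarded port.
    print("via controller:", port_open("192.168.68.135", 9301))
    # From the controller itself: the Pi Zero's address and the app's real port.
    print("direct to pi1:", port_open("172.19.181.1", 9300))
```

If the direct check from the controller fails as well, the DNAT rule is probably not the culprit.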

clusterctrl status
clusterhat:1
clusterctrl:False
maxpi:4
throttled:0x0
hat_version:2.5
hat_version_major:2
hat_version_minor:5
hat_size:4
hat_uuid:16aeb902-9d28-11ea-bb37-0242ac130002
hat_vendor:8086 Consultancy
hat_product_id:0x0004
hat_alert:0
hat_hub:1
hat_wp:1
hat_led:1
hat_wplink:0
hat_xra1200p:True
p1:1
p2:1
p3:1
p4:1
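For repeated checks, the status output can be turned into a dictionary. This is a minimal sketch that assumes the output stays in the `key:value` form shown above and that a `pN` value of 1 means the slot is powered on; the function names are made up, and the sample is an excerpt of the output above.

```python
def parse_status(text):
    """Split 'key:value' lines from `clusterctrl status` into a dict of strings."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            status[key] = value
    return status

def powered_nodes(status):
    """Return the pN keys whose value is '1' (assumed to mean powered on)."""
    nodes = [k for k, v in status.items()
             if k.startswith("p") and k[1:].isdigit() and v == "1"]
    return sorted(nodes, key=lambda name: int(name[1:]))

# Excerpt of the output posted above.
sample_output = """\
clusterhat:1
clusterctrl:False
maxpi:4
throttled:0x0
hat_version:2.5
hat_vendor:8086 Consultancy
p1:1
p2:1
p3:1
p4:1
"""

status = parse_status(sample_output)
print(powered_nodes(status))
```

Note that p1 still reports 1 here even though pi1 was unresponsive, so this check can only rule out the slot having been switched off.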

lsusb -t
/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 5000M
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/1p, 480M
    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/4p, 480M
        |__ Port 1: Dev 4, If 0, Class=Hub, Driver=hub/4p, 480M
            |__ Port 4: Dev 6, If 0, Class=Communications, Driver=rndis_host, 480M
            |__ Port 4: Dev 6, If 1, Class=CDC Data, Driver=rndis_host, 480M
            |__ Port 4: Dev 6, If 2, Class=Communications, Driver=cdc_acm, 480M
            |__ Port 4: Dev 6, If 3, Class=CDC Data, Driver=cdc_acm, 480M
            |__ Port 4: Dev 6, If 4, Class=Communications, Driver=cdc_acm, 480M
            |__ Port 4: Dev 6, If 5, Class=CDC Data, Driver=cdc_acm, 480M

ls -l /dev/ttyACM* /dev/ttypi*
crw-rw---- 1 root dialout 166, 0 May 26 09:51 /dev/ttyACM0
crw-rw---- 1 root dialout 166, 1 May 26 09:51 /dev/ttyACM1
lrwxrwxrwx 1 root root 7 May 26 09:51 /dev/ttypi1 -> ttyACM0
lrwxrwxrwx 1 root root 7 May 26 09:51 /dev/ttypi1a -> ttyACM1

ifconfig
br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 172.19.181.254 netmask 255.255.255.0 broadcast 172.19.181.255
        inet6 fe80::ca6:e4f:a956:70cb prefixlen 64 scopeid 0x20<link>
        ether e4:5f:01:1c:ff:f6 txqueuelen 1000 (Ethernet)
        RX packets 2259257 bytes 3354436608 (3.1 GiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 1245967 bytes 71317268 (68.0 MiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

brint: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 172.19.180.254 netmask 255.255.255.0 broadcast 172.19.180.255
        inet6 fe80::6c8c:63ff:fe34:e2c9 prefixlen 64 scopeid 0x20<link>
        ether 6e:8c:63:34:e2:c9 txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 82 bytes 8660 (8.4 KiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

eth0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
        ether e4:5f:01:1c:ff:f6 txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

ethpi1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet6 fe80::222:82ff:feff:fe01 prefixlen 64 scopeid 0x20<link>
        ether 00:22:82:ff:fe:01 txqueuelen 1000 (Ethernet)
        RX packets 2259257 bytes 3354436608 (3.1 GiB)
        RX errors 43 dropped 0 overruns 0 frame 43
        TX packets 1245997 bytes 126144252 (120.3 MiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
        inet 127.0.0.1 netmask 255.0.0.0
        inet6 ::1 prefixlen 128 scopeid 0x10<host>
        loop txqueuelen 1000 (Local Loopback)
        RX packets 23 bytes 1930 (1.8 KiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 23 bytes 1930 (1.8 KiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

wlan0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.68.135 netmask 255.255.255.0 broadcast 192.168.68.255
        inet6 fe80::ad66:17e3:795c:dc06 prefixlen 64 scopeid 0x20<link>
        ether e4:5f:01:1c:ff:f7 txqueuelen 1000 (Ethernet)
        RX packets 6438949 bytes 311433737 (297.0 MiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 11555147 bytes 425776110 (406.0 MiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
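In the output above, ethpi1 is the only interface reporting RX errors (43, all frame errors). A small sketch like the following could pull those counters out so they can be compared between runs; it assumes net-tools style ifconfig output as posted here, and the function name `rx_errors` is made up. The sample is an excerpt of the output above.

```python
import re

def rx_errors(ifconfig_text):
    """Map interface name -> RX error count from net-tools style ifconfig output."""
    errors = {}
    current = None
    for line in ifconfig_text.splitlines():
        # An interface block starts with a line such as "ethpi1: flags=4163<...>".
        header = re.match(r"^(\S+): flags=", line)
        if header:
            current = header.group(1)
            continue
        counter = re.search(r"RX errors\s+(\d+)", line)
        if counter and current is not None:
            errors[current] = int(counter.group(1))
    return errors

# Excerpt of the output posted above.
ifconfig_sample = """\
ethpi1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        ether 00:22:82:ff:fe:01 txqueuelen 1000 (Ethernet)
        RX packets 2259257 bytes 3354436608 (3.1 GiB)
        RX errors 43 dropped 0 overruns 0 frame 43
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

wlan0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        RX packets 6438949 bytes 311433737 (297.0 MiB)
        RX errors 0 dropped 0 overruns 0 frame 0
"""

noisy = {name: count for name, count in rx_errors(ifconfig_sample).items() if count}
print(noisy)
```

Whether 43 errors in over two million packets matters here is not established by this thread; the point is only to watch whether the count climbs before a crash.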

pimovietc

May 27, 2021, 3:31:59 PM

to ClusterHAT

Hi,

I flashed a fresh image onto a new SD card, but to no avail.

p1 still becomes unreachable. Any ideas?

Peter Cross

May 27, 2021, 3:47:58 PM

to clust...@googlegroups.com

Crazy thought but did you try changing the pi zero hw?

Cheers!

Peter J. Cross
San Antonio, TX

"Experience has taught mankind the necessity of auxiliary precautions"
-James Madison, Federalist Paper No. 51

pimovietc

May 28, 2021, 4:58:05 AM

to ClusterHAT

Hi,

> Crazy thought but did you try changing the pi zero hw?

I have not, mainly because I do not have a second Pi Zero yet. One is on its way over.

I actually managed to see it crash (by pure coincidence).

I power cycled it with "clusterctrl off p1" / "clusterctrl on p1" in order to access the logs.

It crashed around 10:16, and I had already rebooted it by 10:17:18 (according to the logs).

I grabbed the log files and tried to attach them here (that didn't work, so I've added pastebin links instead).

I see some time jumps in the syslog (e.g. entries at 10:17:01, followed by 10:19:51, followed by 10:17:18). I'm not sure whether that is an issue (although these appear to be after rebooting).
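Backward jumps like these can be located automatically. The sketch below assumes the traditional syslog prefix ("May 28 10:17:01 host ...") and takes the year as a parameter because that prefix does not carry one; the function name and the sample lines are made up, using only the three times mentioned above. The Pi Zero has no battery-backed clock, so its time may be corrected shortly after boot, which could explain jumps that appear right after a reboot.

```python
from datetime import datetime

def backward_jumps(lines, year=2021):
    """Return (previous, current) timestamp pairs where a syslog line is
    older than the line before it."""
    jumps = []
    previous = None
    for line in lines:
        try:
            # The first 15 characters hold the "Mon DD HH:MM:SS" prefix.
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # Not a timestamped line (e.g. a continuation line).
        if previous is not None and stamp < previous:
            jumps.append((previous, stamp))
        previous = stamp
    return jumps

# Made-up lines; only the times come from the post above.
syslog_sample = [
    "May 28 10:17:01 p1 example: placeholder message",
    "May 28 10:19:51 p1 example: placeholder message",
    "May 28 10:17:18 p1 example: placeholder message",
]

for before, after in backward_jumps(syslog_sample):
    print(f"clock went back from {before.time()} to {after.time()}")
```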

There are a couple of errors/traces, but Google tells me nothing about them, e.g. "Internal error: Oops: 37 [#1] ARM".

kern => https://pastebin.com/ArivY9J9

messages => https://pastebin.com/u1CyULu5

syslog => https://pastebin.com/rbNGYrz3

Any ideas?

Chris Burton

May 29, 2021, 5:28:40 AM

to ClusterHAT

Hi,

> Any ideas?

This isn't something I remember seeing previously; it looks like a strange one.

Might be worth taking the Cluster HAT out of the equation.

With everything powered off, plug the P1 Pi Zero (using the USB port, not the PWR port) into the controller Pi using a standard micro USB cable (a different one to the one used with the Cluster HAT, if you have one, to rule that out too) and then power on the controller Pi. You should still see P1 connected as it was on the HAT.

Do you still see the stability problem with it plugged in that way?

Chris.

pimovietc

May 29, 2021, 10:14:29 AM

to ClusterHAT

I believe I've solved the issue and it appears not to be related to the Cluster HAT.

I discovered by accident that the Pi Zero would not crash when idling. I was running the same application that I'm running on a Pi 4, so I expected that the software was not the issue. It turns out to be a hardware (or maybe firmware?) issue with Pi Zeros and streaming camera data. I found a thread on GitHub describing similar issues on a Pi Zero WH. Their solution? Downclock the CPU on the Pi Zero and no more crashes. I figured it was worth a shot, and it has been running stable for 6 hours now. Before, it would crash after anywhere from 15 minutes to a couple of hours.

Solution:

Modify /boot/config.txt (the change takes effect after a reboot):

arm_freq=600

arm_freq_max=700

arm_freq_min=500
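To confirm that the settings were picked up, the values in config.txt can be compared with the running clock. This is a rough sketch only: `parse_config` is a simplified reader that ignores section filters such as `[pi0]`, and `current_arm_mhz` assumes the standard Linux cpufreq sysfs file is present on the Pi Zero. Both function names are made up for illustration.

```python
def parse_config(text):
    """Collect key=value settings from config.txt-style text,
    skipping comments, blank lines and lines without '='."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings

def current_arm_mhz(path="/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq"):
    """Read the current CPU frequency in MHz (the sysfs value is in kHz)."""
    with open(path) as handle:
        return int(handle.read().strip()) / 1000

# The three lines from the solution above.
config_sample = """\
arm_freq=600
arm_freq_max=700
arm_freq_min=500
"""

settings = parse_config(config_sample)
print(settings["arm_freq"])
```

On the Pi Zero itself, calling `current_arm_mhz()` after the reboot and comparing the result with `settings["arm_freq"]` would show whether the lower clock is actually in use.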
