ClusterCTRL Pi Zeros unreachable after a while

pimovietc

May 24, 2021, 11:57:27 AM
to ClusterHAT
I've set up a Pi 4 with a ClusterCTRL HAT and a single Pi Zero (for now; I will populate the rest once things are stable). I am using the CNAT image on the controller, which is connected to my WiFi.

Initially things work fine: I can ssh into my Pi Zero (172.19.181.1) and the Pi Zero has internet access through the controller. I have a PiCam connected to the Pi Zero and my camera script is working fine (a near-identical script also runs on the controller).

I've added an iptables rule to forward a connection to the Pi Zero so I can access it from my LAN, and this also works wonderfully.

So everything appears to be working fine. I leave it running and after a while the Pi Zero becomes unreachable: ssh simply times out, and I can't ping it either.

As my own camera script grabs a picture every 5 minutes, I can tell that it has suddenly stopped working (a lost network connection alone should not stop it). When I inspected kern.log and syslog, I could see that there are no entries after the 'crash' (until I reboot).

I can 'fix' the problem by toggling clusterctrl off/on.

Any pointers how to debug?
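
One way to narrow down when the node drops off is to poll it from the controller and record each change of state. The following is only a minimal sketch, not something from the thread: the function names `ping_once` and `watch` are made up for illustration, and it assumes the Linux `ping` command is available on the controller.

```python
import subprocess
import time
from datetime import datetime


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Return True if a single ICMP echo to host gets a reply (Linux ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def watch(host, probe=ping_once, interval_s=10.0, max_checks=None, sleep=time.sleep):
    """Poll host and return a list of (timestamp, state) transitions.

    probe, max_checks and sleep are injectable so the loop can be
    exercised without a real network; they are illustrative names.
    """
    transitions = []
    last_state = None
    checks = 0
    while max_checks is None or checks < max_checks:
        state = "up" if probe(host) else "down"
        if state != last_state:
            # Only log changes, so the output shows when the node went away.
            stamp = datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} {host} is {state}")
            transitions.append((stamp, state))
            last_state = state
        checks += 1
        sleep(interval_s)
    return transitions
```

Calling `watch("172.19.181.1")` on the controller runs until interrupted and prints a timestamp each time the Pi Zero changes between reachable and unreachable, which can then be compared with the last entries in the Pi Zero's own logs.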

Chris Burton

May 25, 2021, 4:39:43 PM
to ClusterHAT
Hi,

> Any pointers how to debug?

I'd advise checking the Pi Zero is pushed onto the ClusterHAT firmly - see the picture in step 6 of the troubleshooting guide which shows how far on the Pi Zero should be.

What does the output of "clusterctrl status" show on the controller after it's gone wrong?

I'd also try with an alternate SD card in the Pi Zero to rule out that being an issue.

Chris.

pimovietc

May 25, 2021, 4:56:39 PM
to ClusterHAT
Hi Chris,

Thank you for your quick reply. You may well be right that it was not plugged in fully: I disconnected it and re-plugged it, and I could feel a "click". I'll test it tomorrow and see whether it still crashes.
I remember running "clusterctrl status" earlier today when it had crashed, and I believe it looked normal. I'll post the output if it crashes again tomorrow. At the very least, I'm 100% sure it said "clusterhat:1".

At the moment I don't have a spare SD card available, but I'll source one and try swapping it if the problem persists.

Pim

pimovietc

May 26, 2021, 6:03:42 AM
to ClusterHAT
pi1 has become unresponsive again. As promised, here is the output of clusterctrl status.
I also ran the other commands suggested in the troubleshooting guide (as I can't make it fail on demand).

I'm using the CNAT image and connect over WiFi. I also set up the following iptables rule to forward my app on pi1 to my LAN:
iptables -t nat -A PREROUTING -p tcp --dport 9301 -i wlan0 -j DNAT --to 172.19.181.1:9300
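
To tell a forwarding problem apart from a dead node, it can help to test the TCP port both through the DNAT rule (the controller's wlan0 address, port 9301, from a LAN machine) and directly (172.19.181.1, port 9300, from the controller). The helper below is a sketch using only the Python standard library; the name `port_is_open` is made up for illustration.

```python
import socket


def port_is_open(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

If `port_is_open("172.19.181.1", 9300)` fails on the controller as well, the forwarding rule is probably not the culprit.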

clusterctrl status
clusterhat:1
clusterctrl:False
maxpi:4
throttled:0x0
hat_version:2.5
hat_version_major:2
hat_version_minor:5
hat_size:4
hat_uuid:16aeb902-9d28-11ea-bb37-0242ac130002
hat_vendor:8086 Consultancy
hat_product_id:0x0004
hat_alert:0
hat_hub:1
hat_wp:1
hat_led:1
hat_wplink:0
hat_xra1200p:True
p1:1
p2:1
p3:1
p4:1
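
The status output above is a flat list of key:value lines, so it is easy to capture and compare between a working and a failed state. The sketch below parses such text into a dictionary; the function names are made up for illustration, and reading a pN value of 1 as "powered on" is an assumption about the output format. Note that this would only reflect the power state of each slot, not whether the OS on the node is still running.

```python
def parse_status(text: str) -> dict:
    """Parse 'key:value' lines (as pasted above) into a dict of strings."""
    status = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain spaces.
        key, sep, value = line.strip().partition(":")
        if sep:
            status[key] = value
    return status


def powered_nodes(status: dict) -> list:
    """Return the pN keys whose value is '1' (assumed to mean powered on)."""
    max_pi = int(status.get("maxpi", "0"))
    return [f"p{n}" for n in range(1, max_pi + 1) if status.get(f"p{n}") == "1"]
```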

lsusb -t
/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 5000M
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/1p, 480M
    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/4p, 480M
        |__ Port 1: Dev 4, If 0, Class=Hub, Driver=hub/4p, 480M
            |__ Port 4: Dev 6, If 0, Class=Communications, Driver=rndis_host, 480M
            |__ Port 4: Dev 6, If 1, Class=CDC Data, Driver=rndis_host, 480M
            |__ Port 4: Dev 6, If 2, Class=Communications, Driver=cdc_acm, 480M
            |__ Port 4: Dev 6, If 3, Class=CDC Data, Driver=cdc_acm, 480M
            |__ Port 4: Dev 6, If 4, Class=Communications, Driver=cdc_acm, 480M
            |__ Port 4: Dev 6, If 5, Class=CDC Data, Driver=cdc_acm, 480M

ls -l /dev/ttyACM* /dev/ttypi*
crw-rw---- 1 root dialout 166, 0 May 26 09:51 /dev/ttyACM0
crw-rw---- 1 root dialout 166, 1 May 26 09:51 /dev/ttyACM1
lrwxrwxrwx 1 root root         7 May 26 09:51 /dev/ttypi1 -> ttyACM0
lrwxrwxrwx 1 root root         7 May 26 09:51 /dev/ttypi1a -> ttyACM1

ifconfig 
br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.19.181.254  netmask 255.255.255.0  broadcast 172.19.181.255
        inet6 fe80::ca6:e4f:a956:70cb  prefixlen 64  scopeid 0x20<link>
        ether e4:5f:01:1c:ff:f6  txqueuelen 1000  (Ethernet)
        RX packets 2259257  bytes 3354436608 (3.1 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1245967  bytes 71317268 (68.0 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

brint: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.19.180.254  netmask 255.255.255.0  broadcast 172.19.180.255
        inet6 fe80::6c8c:63ff:fe34:e2c9  prefixlen 64  scopeid 0x20<link>
        ether 6e:8c:63:34:e2:c9  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 82  bytes 8660 (8.4 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether e4:5f:01:1c:ff:f6  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ethpi1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::222:82ff:feff:fe01  prefixlen 64  scopeid 0x20<link>
        ether 00:22:82:ff:fe:01  txqueuelen 1000  (Ethernet)
        RX packets 2259257  bytes 3354436608 (3.1 GiB)
        RX errors 43  dropped 0  overruns 0  frame 43
        TX packets 1245997  bytes 126144252 (120.3 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 23  bytes 1930 (1.8 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 23  bytes 1930 (1.8 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

wlan0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.68.135  netmask 255.255.255.0  broadcast 192.168.68.255
        inet6 fe80::ad66:17e3:795c:dc06  prefixlen 64  scopeid 0x20<link>
        ether e4:5f:01:1c:ff:f7  txqueuelen 1000  (Ethernet)
        RX packets 6438949  bytes 311433737 (297.0 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 11555147  bytes 425776110 (406.0 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

pimovietc

May 27, 2021, 3:31:59 PM
to ClusterHAT
Hi,

I flashed a new image to a new SD, but to no avail.
p1 still becomes unreachable. Any ideas?

Peter Cross

May 27, 2021, 3:47:58 PM
to clust...@googlegroups.com
Crazy thought but did you try changing the pi zero hw?

Cheers!

Peter J. Cross
San Antonio, TX

"Experience has taught mankind the necessity of auxiliary precautions"
-James Madison, Federalist Paper No. 51

pimovietc

May 28, 2021, 4:58:05 AM
to ClusterHAT
Hi,

> Crazy thought but did you try changing the pi zero hw?
I have not, mainly because I do not have a second pi zero yet. One is on its way over.

I actually managed to see it crash (by pure coincidence). 
I power cycled it with clusterctrl off p1 / clusterctrl on p1 in order to access the logs.
It crashed around 10:16, and I had already rebooted it by 10:17:18 (according to the logs).
I grabbed the log files and tried to attach them here (that didn't work, so I added pastebin links).

I see some time jumps in the syslog (e.g. entries at 10:17:01 followed by 10:19:51 followed by 10:17:18). I'm not sure whether that is an issue, although it appears to happen after rebooting.
There are a couple of errors/traces, but Google turns up nothing for them, e.g. "Internal error: Oops: 37 [#1] ARM"


Any ideas?

Chris Burton

May 29, 2021, 5:28:40 AM
to ClusterHAT
Hi,

> Any ideas?

This isn't something I remember seeing previously; it looks like a strange one.

It might be worth taking the Cluster HAT out of the equation.

With everything powered off, plug the P1 Pi Zero (using its USB port, not the PWR port) into the controller Pi using a standard micro-USB cable (a different one from the one used with the Cluster HAT, if you have one, to rule that out too) and then power on the controller Pi. You should still see P1 connected as it was on the HAT.

Do you still see the stability problem with it plugged in that way?

Chris. 

pimovietc

May 29, 2021, 10:14:29 AM
to ClusterHAT
I believe I've solved the issue, and it appears not to be related to the Cluster HAT.
I discovered by accident that the Pi Zero would not crash when idling. I was running the same application that I run on a Pi 4, so I assumed the software was not the issue. It turns out to be a hardware (or maybe firmware?) issue with Pi Zeros and streaming camera data. I found a thread on GitHub describing similar issues on a Pi Zero WH. Their solution? Downclock the CPU on the Pi Zero and no more crashes. I figured it was worth a shot, and it has now been running stable for 6 hours. Before, it would crash within anywhere from 15 minutes to a couple of hours.

Solution: set the following in /boot/config.txt:
arm_freq=600
arm_freq_max=700
arm_freq_min=500
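
After editing the file and rebooting, it may be worth confirming that the settings are actually present as intended. The following sketch reads key=value lines from the text of a config.txt file; the function name `read_config_values` is made up for illustration, and it deliberately ignores conditional sections such as `[pi0]`, so treat it as a rough check only.

```python
def read_config_values(text: str, keys=("arm_freq", "arm_freq_max", "arm_freq_min")) -> dict:
    """Return the integer values of the given keys found in config.txt text."""
    values = {}
    for raw in text.splitlines():
        # Drop comments and surrounding whitespace before splitting.
        line = raw.split("#", 1)[0].strip()
        key, sep, value = line.partition("=")
        if sep and key.strip() in keys:
            values[key.strip()] = int(value.strip())
    return values
```

For example, `read_config_values(open("/boot/config.txt").read())` on the Pi Zero should return the three values listed above if the edit was saved correctly.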