Moving this to a new thread, I will add some info.
I have had many connection losses over the past several months and have been trying to capture the cause, with no luck. This is in connection with trying to track down the cause of excessive but usually random sequence errors in piHPSDR on a stable, controlled network. That is another topic.
For me this happens with SparkSDR, Quisk, piHPSDR and Thetis running on a PC or RPi4B. It happens on 2 different networks and (rarely) on a direct link. I have changed to smart GigE switches and found no errors reported there.
As for the connection loss, I have determined that it is a loss of connection at the UDP layer. There is zero Rx packet loss at the CPU ends or the switch. Cables can be old or new CAT7, no difference. It can be minutes or days between connection loss events.
When the connection drops, there are no errors in the apps' standard log files, and I have found no Linux or Windows network statistics that point to any problem. Not even the UDP Rx packet/buffer error counters change at the moment the connection is lost.
Any other app can take control of the HL2 at this point, or if there is a restart button like in the DL1YCF build of piHPSDR, the app picks up where it left off. In fact we (DL1YCF and I) have found some interesting interplay in this scenario where the original app can take back the connection from the 2nd app (any SDR app) since a busy condition from the HL2 in this state does not seem to be sent, asked for or received.
Since I have not observed this connection loss on 1 of my Pi controllers and the newest HL2 that I can recall, I plan to do a long term test to verify this is true or not by pairing them up and then swapping them. I have tried this before with no conclusions because as soon as I think I see the problem follow a piece of gear/app, it stops and/or shows up elsewhere. I just changed my QTH for a few months and need to finish getting things installed and stable to begin such tests again.
I have not yet found any measurement that tells me a test was successful other than waiting several weeks, which is very difficult to do since I am frequently modifying and testing app features. I need a way to capture the transaction that causes the HL2 to drop the connection. I assume the drop is on the HL2 side, since 5 or more separate SDR apps on multiple Linux and Windows hosts, and on multiple networks, have the same problem with that particular HL2 (so far). That is where the new 2nd HL2 may help sort things out. But then why is this never observed (I think) on my 2nd Pi controller with the same HL2? Tough one.
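One way to catch a rare event without storing days of traffic is to keep only the most recent packets in a ring buffer and snapshot them when the stream from the HL2 stalls. The sketch below is only an illustration of that idea: the class name, the direction labels and the thresholds are all hypothetical, and the packet records would have to be fed in from whatever capture tool is actually in use.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional


@dataclass
class PacketRecord:
    timestamp: float   # seconds, from whatever clock the capture uses
    direction: str     # "to_hl2" or "from_hl2" (illustrative labels)
    length: int        # payload length in bytes


class StallRecorder:
    """Keep the last `history` packets and snapshot them when the stream stalls."""

    def __init__(self, history: int = 2000, stall_seconds: float = 1.0):
        self.recent: Deque[PacketRecord] = deque(maxlen=history)
        self.stall_seconds = stall_seconds
        self.last_rx: Optional[float] = None
        self.snapshots: List[List[PacketRecord]] = []
        self._stalled = False

    def on_packet(self, rec: PacketRecord) -> None:
        """Record one packet; a packet from the HL2 ends any current stall."""
        self.recent.append(rec)
        if rec.direction == "from_hl2":
            self.last_rx = rec.timestamp
            self._stalled = False

    def on_tick(self, now: float) -> bool:
        """Call periodically. Returns True once per stall, after saving a snapshot."""
        if self.last_rx is None or self._stalled:
            return False
        if now - self.last_rx >= self.stall_seconds:
            self._stalled = True
            self.snapshots.append(list(self.recent))
            return True
        return False
```

Each snapshot then holds the last few thousand packets in both directions leading up to the dropout, which is the window of interest, and the trigger time gives a timestamp for the event.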
Mike
K7MDL
CN88sf and EL87sm
From: herme...@googlegroups.com <herme...@googlegroups.com>
On Behalf Of philip.j.s...@gmail.com
Sent: Monday, December 6, 2021 15:05
To: Hermes-Lite <herme...@googlegroups.com>
Subject: Re: Future SDR Project Survey
One minor point -- my HL-2 does not need to be power-cycled to recover, I press and unpress the soft button in SparkSDR. I still haven't managed to capture what goes wrong as it happens infrequently and I don't have a good way to capture all the traffic between those two boxes.
Philip
On Monday, December 6, 2021 at 12:29:52 PM UTC-5 softerh...@gmail.com wrote:
Hi Philip,
Yes, that is helpful information. Since this is now on the list, I've posted the rest of this thread below.
I also run with a long wire and currently "hermes-lite" has more spots than n1dq on pskreporter. ;) But I think that has more to do with the depth of decode done by SparkSDR versus the depth of the decode possible on the Zynq's ARM A9 cores. I ran some experiments last year with Pavel's FT8 decoder recompiled for a standard PC. I had to change settings in the code related to decode depth to get more spots. There are more details in this github issue.
Sorry to hear that your HL2 still drops ethernet connection. I remember you submitted some patches to try and fix this. Unfortunately I can't replicate the problem. My HL2 on my home network with DHCP and the one at a remote location with direct connection are both rock solid and up for weeks. I haven't heard of any recent reports similar to yours. Do you have suggestions for how I might replicate your problem? Do you want to send your HL2 to me for possible fix or exchange?
Regarding a soft processor, I am a big fan of the work by enjoy-digital with the vex risc-v processor. I would probably use that setup for cpu/ethernet/etc in a larger FPGA. This will run Linux, but you need about 32MB or more to do that reasonably. That requires the FPGA to have external DRAM which requires significant pins and PCB area. I might be able to squeeze in a hyperram which can be used for linux. Bare metal code is more likely. At my day job we use a soft RISC-V for DRAM controller training. It should be able to do DHCP, ICMP and other similar tasks like a MCU would.
73,
Steve
kf7o

Mike,
Maybe you have an application / security software / worse which is port scanning. This will attempt to open every port it can find and just take it from there. A good router / switch with anti-virus can stop and log this.
It could also be a bad switch.
I have DoS attacks at least once a month; a good TP-Link switch / router takes care of this.
Simon Brown, G4ELI
I have the KSZ9031 chip. Gateware version 72, ID 6.
I have been using the default watchdog settings in Quisk and others so far, which appear to leave it enabled. I will set them to off when I use them next. I am primarily running piHPSDR on 2 controllers with Pi4B and 7” touchscreens, as SparkSDR and Quisk do not support the encoders and switch hardware on these units. Over the last week, since my return to this QTH, I have seen only a few disconnects, all within a 1-hour time span.
At both homes I use DHCP reservations, and the networks are set up similarly, with the same newer TP-Link smart GigE switches for local connections in the shack and the same router models. I have swapped cables and GigE switches, bypassed the switches, and used VLANs, with the same problems. Since I can see these many times in hours or a day (at other times it can go many days), DHCP lease expirations would not be the issue: the leases are 1 week or longer, and DHCP is not in the picture for the direct and VLAN test conditions.
I did have an IP address set in the older HL2’s EEPROM to match the DHCP reservation at my previous QTH. I was reminded of that when I moved them to the current QTH network and it showed up after discovery being on a subnet 😊. I use a different subnet at each QTH to allow for easier VPN config. There is no VPN active during any of this.
After a connection loss the HL2 and controllers are all active, you can ping them, and restart a connection within the app.
As mentioned, I ran them for a time direct connected, and more recently on smart switches with VLANs isolating each controller-HL2 pairing. This would eliminate outside forces such as port scanning based intrusions. My router is set to block most inbound ports and well-known ones are changed and port forwarded to specific endpoints. No ports are forwarded to the Pi controllers or HL2s so in theory they and the HL2s should not see any outside influences. Since the disconnects are proven to happen in isolated networks, the problem lies inside the 2 endpoints.
The unpredictable nature of these does make testing difficult. After a connection loss the HL2 LEDs revert to their normal pattern: address acquired and ready for a connection. Perhaps a test build of the gateware could rapidly flash a set of LEDs when a WD timeout occurs, and stay that way until power is cycled and/or a new connection takes place. Then we would know whether improvement attempts are positive or negative, and could rule in or out the WD being activated.
What is the WD timeout period? Is it stored in EEPROM? If one app like Quisk sets the WD timeout to OFF, and another app does not specifically change it, will the setting of OFF stick through each new connection and each power cycle? A real example might be using Quisk to set the WD then start piHPSDR (assuming it does not change that state today). What state will the WD value be in on the HL2?
From: herme...@googlegroups.com <herme...@googlegroups.com>
On Behalf Of Steve Haynal
Sent: Monday, December 6, 2021 20:59
To: Hermes-Lite <herme...@googlegroups.com>
Subject: Re: HL2 Random Loss of Connection
Hi Mike and Philip,
Maybe we can narrow down the problem. One common reason for disconnect is the watchdog timer timing out. The watchdog timer can be disabled in SparkSDR, Quisk and for any software using hermeslite.py. Since you are using SparkSDR, try disabling the watchdog timer as shown in the picture below.
Another possibility is that something happened during DHCP renewal. Are you using DHCP or fixed IP? Please try a fixed IP on your subnet and see if the problem goes away. See this wiki page for details:
Which version of gateware are you using?
Can you ping the computer at the expected address after the event? Can you access the computer with hermeslite.py after the event?
Mike, the last build which used the KSZ9021 had 3 or 4 units with a wrong resistor value. The problem can manifest as you describe. Is your ethernet phy a KSZ9021 or KSZ9031?
73,
Steve
kf7o

--
You received this message because you are subscribed to the Google Groups "Hermes-Lite" group.
To unsubscribe from this group and stop receiving emails from it, send an email to
hermes-lite...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/hermes-lite/22f8e1e6-8ed0-4ef4-b355-34e5adc8ebe7n%40googlegroups.com.
Ah,
Another project on another continent has a problem with loss of connection when the firmware’s DHCP lease is renewed. Maybe use a static address and see if this solves the problem?
That would not account for dropped connections multiple times a day or within an hour. I did have a static IP assigned, matching the reservation address, which was useful for operating with a direct connection and in the VLAN setup. The problem still occurred.
From: herme...@googlegroups.com <herme...@googlegroups.com>
On Behalf Of si...@sdr-radio.com
Sent: Tuesday, December 7, 2021 04:05
To: 'Hermes-Lite' <herme...@googlegroups.com>
Temperature?
There’s a very good reason why I stick to software; when that breaks I can blame the users.
Simon Brown, G4ELI
Yes, I have different MAC addresses set. This was a problem long before the 2nd HL2 arrived, and it happens on direct connect and VLAN, so the HL2 + SDR app machine are isolated from everything. Several times I have been listening to the radio when it suddenly stops. I cannot tell if it is after 13 seconds, since I only know it has happened when I cannot hear audio anymore (or see the spectrum display go blank). There are no log entries or packet errors to timestamp anything; the apps still run happily but with no spectrum display or audio.
At the moment it is not happening, just a few times at this QTH the other day. It comes and goes. I will keep the WD setting in mind once it starts dropping again. I am running these 24/7.
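If a capture of both directions were available, one rough way to test the 13-second question would be to measure how long the PC->HL2 direction had been silent when the HL2->PC stream stopped. The sketch below assumes the timestamps have already been split by direction; the function names and the tolerance are hypothetical. It also assumes the capture point is close to the HL2 (for example a mirrored switch port), since a capture taken on the PC would still show packets that were sent but lost on the way.

```python
from typing import Optional, Sequence

# "Almost 13 seconds" is the watchdog figure quoted in this thread.
WATCHDOG_SECONDS = 13.0


def silence_before_dropout(to_hl2: Sequence[float],
                           from_hl2: Sequence[float]) -> Optional[float]:
    """Seconds between the last PC->HL2 packet and the last HL2->PC packet.

    Returns None if either direction has no packet at or before the dropout.
    """
    if not from_hl2:
        return None
    dropout = max(from_hl2)                       # last packet seen from the HL2
    earlier = [t for t in to_hl2 if t <= dropout]
    if not earlier:
        return None
    return dropout - max(earlier)


def looks_like_watchdog(silence: Optional[float], tolerance: float = 2.0) -> bool:
    """True if the PC->HL2 silence is within `tolerance` of the watchdog period."""
    return silence is not None and abs(silence - WATCHDOG_SECONDS) <= tolerance
```

A silence near 13 seconds would be consistent with the watchdog firing, while a silence near zero would mean the PC was still sending when the HL2 stopped, which points somewhere else.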
From: herme...@googlegroups.com <herme...@googlegroups.com>
On Behalf Of Steve Haynal
Sent: Tuesday, December 7, 2021 20:07
To: Hermes-Lite <herme...@googlegroups.com>
Subject: Re: HL2 Random Loss of Connection
Hi Group,
Yes, setting ADDR=0x39 bits 27:24 to 1001b will disable the watchdog. You can also disable it when you start the radio from software as described here:
Yes, the value is sticky after software stops the radio. You can disable the watchdog with one software, disconnect, and then connect with other software and the watchdog will still be disabled. A complete power cycle of the HL2 does reset the value to the default of watchdog on.
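As a small illustration of the register value mentioned above, the following sketch places 1001b into bits 27:24 of a 32-bit data word. Only the address 0x39, the bit range and the value 1001b come from this thread; the helper names are hypothetical, the other bits are assumed to be zero for the example, and how the word is actually sent to the radio depends on the software in use and is not shown.

```python
WATCHDOG_ADDR = 0x39              # register address quoted in this thread
FIELD_SHIFT = 24                  # low bit of the 27:24 field
FIELD_MASK = 0xF << FIELD_SHIFT   # 0x0F000000
DISABLE_WATCHDOG = 0b1001         # field value quoted in this thread


def set_field(word: int, value: int) -> int:
    """Return the 32-bit `word` with bits 27:24 replaced by the 4-bit `value`."""
    if not 0 <= value <= 0xF:
        raise ValueError("value must fit in 4 bits")
    return (word & ~FIELD_MASK & 0xFFFFFFFF) | (value << FIELD_SHIFT)


def get_field(word: int) -> int:
    """Extract bits 27:24 from a 32-bit word."""
    return (word & FIELD_MASK) >> FIELD_SHIFT


# Assumption for this example: all other bits of the register are zero.
data = set_field(0x00000000, DISABLE_WATCHDOG)
print(f"ADDR=0x{WATCHDOG_ADDR:02X} DATA=0x{data:08X}")  # ADDR=0x39 DATA=0x09000000
```

Masking before OR-ing matters when the register already holds other settings, since it leaves the remaining 28 bits untouched.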
The watchdog counts the packets sent to the PC versus packets received from the PC. If the number of missed but expected received packets from the PC reaches 4096, the HL2 automatically disconnects and turns transmit off. This is currently at almost 13 seconds, and is independent of the number and bandwidth of receivers in use. This means that something must have blocked PC->HL2 packets for 13 seconds, which is very unlikely in my opinion and indicates other problems with your network, computer or software. The intent of the watchdog is to prevent runaway transmit and keep the radio accessible if software crashes. I never disable it.
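The numbers above can be put into a toy model. This is only a sketch of the behaviour as described in this message, not the gateware logic: the class is hypothetical, and the rule that a received packet clears the missed count is an assumption made for illustration.

```python
MISSED_LIMIT = 4096      # missed-packet threshold from the description above
TIMEOUT_SECONDS = 13.0   # "almost 13 seconds", so the rate below is approximate

# Implied rate of expected PC->HL2 packets: roughly 315 per second.
implied_rate = MISSED_LIMIT / TIMEOUT_SECONDS


class WatchdogModel:
    """Toy model: disconnect after `limit` expected packets in a row are missed."""

    def __init__(self, limit: int = MISSED_LIMIT, enabled: bool = True):
        self.limit = limit
        self.enabled = enabled
        self.missed = 0
        self.connected = True

    def expected_packet(self, received: bool) -> None:
        """Account for one slot in which a PC->HL2 packet was expected."""
        if not self.connected:
            return
        if received:
            self.missed = 0   # ASSUMPTION: a received packet clears the count
        else:
            self.missed += 1
            if self.enabled and self.missed >= self.limit:
                self.connected = False   # HL2 disconnects and turns transmit off
```

In this model an occasional lost packet never gets near the limit. Only a continuous block of about 4096 missed packets trips it, which matches the point that PC->HL2 traffic would have to be blocked for the whole 13 seconds.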
Mike, I may be asking a question here that is obvious to you, but have you set different ethernet MAC addresses for your two units? All HL2s clone the same ethernet MAC, and different ethernet MAC addresses must be used for multiple HL2s on the same network. Do you see the problem if only one HL2 is on your network at a time?
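A quick way to check for a cloned MAC on a network is to look for one MAC address appearing at more than one IP address. The sketch below assumes the (IP, MAC) pairs have already been collected, for example from a host's ARP table; the function name is hypothetical and the addresses in the usage example are documentation-range placeholders, not real units.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def duplicate_macs(entries: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each MAC seen at more than one IP to the sorted list of those IPs."""
    by_mac = defaultdict(set)
    for ip, mac in entries:
        # Normalise case and separator so "AA-BB-..." and "aa:bb:..." compare equal.
        by_mac[mac.lower().replace("-", ":")].add(ip)
    return {mac: sorted(ips) for mac, ips in by_mac.items() if len(ips) > 1}


# Placeholder data for illustration only.
table = [
    ("192.0.2.10", "00:00:5E:00:53:01"),
    ("192.0.2.11", "00-00-5e-00-53-01"),
    ("192.0.2.12", "00:00:5e:00:53:02"),
]
print(duplicate_macs(table))
```

An empty result only shows that the host has not seen a duplicate, so it would not rule out a clash on a segment the host cannot observe.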
73,
Steve
kf7o
On Tuesday, December 7, 2021 at 7:11:00 PM UTC-8 Ward Cunningham wrote:
Steve,
Thank you for pointing this out. It has been my experience that SparkSDR stops decoding FT8 after a few hours. I disabled the Watchdog timer as you suggest and have seen 24 hours of uninterrupted decoding. If it runs for a week I will report back here, maybe with some new visualizations.
Best regards -- Ward K9OX
Radio details:
Firmware version 71
Firmware patch 3
Board ID 5
Receivers 4
SparkSDR details:
Version 2.0.7.4
Avalonia Version 0.10.2.0
Looks like 72. I see the newest HL2 is at 73. I should probably upgrade the older one. I have yet to see the newest one lose connection, but it has relatively little run time, so it is too early to know for sure.
My older HL2 (built this summer)

The new HL2:

