DNS-320L temperature stops updating

155 views
Skip to first unread message

Bill R

unread,
Dec 1, 2023, 9:51:04 PM12/1/23
to Alt-F
Hi João, I'm having an issue with a custom Alt-F build I made for my DNS-320L so I'm looking for debugging suggestions. I did a build using all your latest commits, and everything initially works normally. What eventually happens after an extended time though (seems like 25 days or more uptime) is that the system temperature stops updating. I can see in fantemp.log that no new entries are logged, and the fan does not start even though the system is overheating. After rebooting everything works again and the correct temperature is reported. My impression is that the last temp reported was always 1 degree below the point where I would expect the fan to start running.

I've verified that both sysctrl and dns320l-daemon are still running (which appear to be the key processes) and there are no unusual messages in syslog.

Do you have any suggestions how to debug this or get additional logging?

Thanks.

João Cardoso

unread,
Dec 4, 2023, 12:25:49 PM12/4/23
to Alt-F


On Saturday, December 2, 2023 at 2:51:04 AM UTC Bill R wrote:
Hi João,
Hi Bill 
I'm having an issue with a custom Alt-F build I made for my DNS-320L so I'm looking for debugging suggestions. I did a build using all your latest commits,

using "trunk"? it is unstable.
 
and everything initially works normally. What eventually happens after an extended time though (seems like 25 days or more uptime)

it's going to be difficult to debug...
 
is that the system temperature stops updating. I can see in fantemp.log that no new entries are logged, and the fan does not start even though the system is overheating. After rebooting everything works again and the correct temperature is reported. My impression is that the last temp reported was always 1 degree below the point where I would expect the fan to start running.

I've verified that both sysctrl and dns320l-daemon are still running (which appear to be the key processes) and there are no unusual messages in syslog.

Do you have any suggestions how to debug this or get additional logging?

I have been bitten by that  a couple of times, some years ago.

the dns-daemon is the easier to debug.
-If it is running, when you issue a command such as  "dns320l-daemon -x GetTemperature" from the command line,  the program you invoked will try to contact the running one in the background. If you get no answer, it means that the daemon is blocked for some reason.
-if you kill the daemon, 'init' will try to relaunch it, and you will see that in syslog; if after killing it you watch in syslog that init is trying to relaunch it in succession, and you get a message similar to "socket in use", it means that the daemon did not remove /var/run/dns320l.socket at exit, but you can "rm /var/run/dns320l.socket" for yourself.
-if you send a SIGUSR1 signal to the daemon, it will enter debug mode, syslog will have a lot a cryptic messages. A SIGUSR1 will make it return to info mode.

So, a healthy daemon means that you can use the cmd line for sending commands to it and you get answers, the /var/run/dns320l.socket exists, and it reacts to SIGUSR1/2

The daemon communicates via a serial line to the MCU  (micro controller) , a small chip that actually reads buttons, temperature, set leds, fan speed, etc. The daemon regularly reads from the MCU the sensor temperature and writes its values to /tmp/sys/temp1_input; also regularly it reads /tmp/sys/pwm1 and sets the fan accordingly.

sysctrl will regularly read the temperature reading /tmp/sys/temp1_input and sets the fan speed writing to /tmp/sys/pwm1. Before reading and writing to those files a lock is obtained, otherwise wrong/parcial values would result.
sysctrl only has LOG_INFO and LOG_ERROR, and both go to syslog; *changes* in temp or fan speed are logged to fantemp.log, as you know. sysctrl it is not relaunched by init if killed.
You can test if it running OK by pressing e.g. the USB button and watching is syslog the event; also, if you write to the pwm1 file, its previous value will reappear after a few (tens?) of seconds: 'cat /tmp/sys/pwm1; echo 127 > /tmp/sys/pwm1; cat /tmp/sys/pwm1) the values are 0->fan off, 127->medium speed 255->high speed). If dns-daemon is running OK it will react to those changes (you can kill sysctrl to preserve your changes and ear the fan speed changing) -- of course, when you write to the file you did not get a lock first, so mixed results might happen (the lock is obtained be creating the directory /var/lock/temp-lock if it does not exists (an atomic operation) , 'mkdir  /var/lock/temp-lock' will return 0 if the lock is obtained, 0 otherwise , 'rmdir /var/lock/temp-lock' removes the lock -- the dns-daemon uses the same mechanism)

So you now have the big picture.
The problem is that the next time it happens you only have one chance to debug it, the next opportunity will happen only in a couple of weeks... As I told above, I had a couple of similar issues in the past, and was not able to diagnose the *real* cause. Fixing the issue, yes, its easy -- or by killing the daemon, or removing the socket file, or killing and restarting sysctr. The cause for the lock (I believe it's a dead-lock or a race condition) remains unknown to me.

Let me know if you get more highlights on this,
Thanks
João


Thanks.

Bill R

unread,
Dec 10, 2023, 9:53:58 PM12/10/23
to Alt-F
Thanks for all the detailed info. It will take some planning and practice to make sure I check everything correctly when this problem happens again. I will reply again (hopefully in just a few weeks) when I have more to report.

Also, I understand that you consider the trunk unstable but in my personal experience it is working very well. I did have to make a few minor adjustments to get it to build successfully. When I have more time, I will post the details of what I did and maybe you can provide a critique.

Bill

João Cardoso

unread,
Dec 21, 2023, 11:09:30 AM12/21/23
to Alt-F
I wrote:
> Before reading and writing to those files a lock is obtained, otherwise wrong/parcial values would result.
And that is incorrect, there are no sync locks between sysctrl and dns-daemon. As both use one read()/write() to access the tiny sized /tmp/sys/... files, and that operations are atomic, there is no need for lock files. 

Bill R

unread,
Jan 12, 2024, 9:52:19 PMJan 12
to Alt-F
João, the issue happened again (at about 26 days uptime) and your info was invaluable in helping me figure this out.

The problem is in dns320l-daemon and not in sysctrl. Sysctrl continues to read /tmp/sys/temp1_input and update /tmp/sys/pwm1, but dns320l-daemon would not update the temp1_input file or start the fan based on the value in pwm1. However, from the command line I could still execute "dns320l-daemon -x GetTemperature" and have the temperature update. Putting dns320l-daemon in debug mode showed activity when executing that GetTemperature command but no other time.

Looking at the source code, it was obvious that this while loop was no longer calling readset_temp (and probably readset_powerled too but I didn't check that.)

while(1) {
// read desired power led state twice a second
if(((sleepCount * pollTimeMs) % 500) == 0)
readset_powerled();

// read temperature and adjust fan speed every 15 seconds
if(((sleepCount * pollTimeMs) % 15000) == 0)
readset_temp();

sleepCount++;

// sleep, waiting for a client command
poll_wait(pollTimeMs);
}


The issue is that sleepCount and pollTimeMs are both of type int, and sleepCount is always increasing. (pollTimeMs is set to 10.) When sleepCount gets large enough, the expression (sleepCount * pollTimeMs) becomes larger than what can be supported by a signed integer. Whatever result this overflow generates appears to never satisfy the if statements, and so those readset functions are no longer called.

If you do the math with sleepCount increasing every 10 ms, you will hit the maximum integer value for (sleepCount * pollTimeMs) at 24.9 days if you assume an exact 10 ms increment cycle (which probably isn't that precise.)

2,147,483,647 (max signed 32-bit int) * .01 (seconds per increment) / 3600 / 24 / 10 (pollTimeMs) = 24.9

I was wondering why the failure always occurred at about the same uptime and this explains it. I propose the fix below:

while(1) {
// read desired power led state twice a second
if(((sleepCount * pollTimeMs) % 500) == 0)
readset_powerled();

// read temperature and adjust fan speed every 15 seconds
if(((sleepCount * pollTimeMs) % 15000) == 0) {
readset_temp();
sleepCount = 0;
}

sleepCount++;

// sleep, waiting for a client command
poll_wait(pollTimeMs);
}

There is no reason to keep incrementing sleepCount once the longest wait loop has completed, so this code change resets the value back to zero when this happens.

I will try to get my build environment back up to try this out, but I feel confident this is the issue.

Bill

João Cardoso

unread,
Jan 13, 2024, 3:28:25 PMJan 13
to Alt-F

Hi Bill,

 you have hunted it :-)

The code relied on the fact that incrementing an integer variable from its maximum value would wrap it around to zero. And that is true, except that as the variable is declared  as a signed integer it would wrap to a negative number. Eventually, after another 25 days it would become zero again and restart working :-)
Curiously my boxes have generally uptimes of several months, so the bug should have bitten me more often, and I didn't notice it...

As an alternative to your proposed solution, I believe that declaring the  sleepCount as 'unsigned int', or better, 'unsigned short' will also solve the bug.
There seems to exists some implementation or standards details that make the remainder '%' operation result dubious when one of the operands is negative.

Thanks,
Joao

PS-To debug the daemon, send it a SIGUSR1 and watch syslog:
[root@DNS-327L]# kill -USR1 $(pidof dns320l-daemon); logread |tail -1
Jan 13 20:25:49 DNS-327L daemon.info dns320l-daemon: Setting loglevel to debug
[root@DNS-327L]# kill -USR2 $(pidof dns320l-daemon); logread |tail -1
Jan 13 20:25:53 DNS-327L daemon.info dns320l-daemon: Setting loglevel to info

Bill R

unread,
Jan 13, 2024, 10:41:29 PMJan 13
to Alt-F

As an alternative to your proposed solution, I believe that declaring the  sleepCount as 'unsigned int', or better, 'unsigned short' will also solve the bug.
There seems to exists some implementation or standards details that make the remainder '%' operation result dubious when one of the operands is negative.


I agree that using an unsigned value would also solve the issue. Personally, I don't like relying on the wraparound behavior of unsigned integers even though it's defined. I would also worry about (sleepCount * pollTimeMs) wrapping before sleepCount did and how that would impact behavior. I think that what I proposed is a safer approach.

In the next few days, I hope to do a quick build of the daemon and try it out.

Thanks again for the debugging tips, they were so helpful. I'll report back in 30 days or so once I've tested the code change.

Bill

Bill R

unread,
Feb 18, 2024, 7:12:38 PMFeb 18
to Alt-F
I'm following up to say that after 30 days uptime on the modified daemon, I've seen no recurrence of the issue. Thanks again for the help. In a new thread I will post all the steps I followed to build firmware from the current trunk including all the patches I had to make.
Reply all
Reply to author
Forward
0 new messages