VP2 datalogger possible issue (?)

157 views
Skip to first unread message

vince

unread,
Dec 12, 2021, 3:11:51 PM12/12/21
to weewx-user
Looking for thoughts....

I see this every archive period....

Dec 12 08:15:21 debian weewx[30827] DEBUG weewx.drivers.vantage: DMPAFT complete: page timestamp 2021-12-03 11:00:00 PST (1638558000) less than final timestamp 2021-12-12 08:15:00 PST (1639325700)

Given that 9 day offset is pretty close to the point in time where old records should roll off the logger's memory, should I just leave it alone for a few days and see if it goes away ?   Logger is over 10 years old and has never been cleared/reset.

Dec 12 02:18:58 debian weewx[30827] ERROR weewx.drivers.vantage: LOOP try #1; error: Expected to read 99 chars; got 0 instead
Dec 12 02:19:01 debian /USR/SBIN/CRON[4840]: (root) CMD (scp /home/weewx/public_html/current.html pi:/mnt/ramdisk/current.html)
Dec 12 02:19:03 debian weewx[30827] ERROR weewx.drivers.vantage: LOOP try #2; error: Expected to read 99 chars; got 0 instead
Dec 12 02:19:08 debian weewx[30827] ERROR weewx.drivers.vantage: LOOP try #3; error: Expected to read 99 chars; got 0 instead
Dec 12 02:19:13 debian weewx[30827] ERROR weewx.drivers.vantage: LOOP try #4; error: Expected to read 99 chars; got 0 instead
Dec 12 02:19:13 debian weewx[30827] ERROR weewx.drivers.vantage: LOOP max tries (4) exceeded.
Dec 12 02:19:13 debian weewx[30827] INFO weewx.engine: Main loop exiting. Shutting engine down.
Dec 12 02:19:13 debian weewx[30827] DEBUG weewx.restx: Shut down MQTT thread.
Dec 12 02:19:13 debian weewx[30827] DEBUG weewx.restx: Shut down CWOP thread.
Dec 12 02:19:13 debian weewx[30827] DEBUG weewx.restx: Shut down PWSWeather thread.
Dec 12 02:19:13 debian weewx[30827] DEBUG weewx.restx: Shut down Wunderground-PWS thread.
Dec 12 02:19:13 debian weewx[30827] DEBUG weewx.restx: Shut down StationRegistry thread.
Dec 12 02:19:13 debian weewx[30827] INFO user.MQTTSubscribe: (Service) Disconnected with result code 0
Dec 12 02:19:13 debian weewx[30827] CRITICAL __main__: Caught WeeWxIOError: Max tries exceeded while getting LOOP data.
Dec 12 02:19:13 debian weewx[30827] CRITICAL __main__:     ****  Waiting 60 seconds then retrying...
Dec 12 02:20:01 debian /USR/SBIN/CRON[4844]: (root) CMD (scp /home/weewx/public_html/current.html pi:/mnt/ramdisk/current.html)
Dec 12 02:20:13 debian weewx[30827] INFO __main__: retrying...
Dec 12 02:20:14 debian weewx[30827] INFO __main__: Using configuration file /home/weewx/weewx.conf
Dec 12 02:20:14 debian weewx[30827] INFO __main__: Debug is 1
Dec 12 02:20:14 debian weewx[30827] DEBUG __main__: Initializing engine


Looking at the 'wake up' messages in syslog, I see about 40% rude vs. 60% gentle wake ups, if that matters.   A few LOOP CRC errors in the past week but just a handful.

Or should I start thinking about putting weewx on a far faster box ?  It does work pretty hard running Belchertown and a variety of other things but everything fits easily in the 5-minute archive window.  Worst case I've seen is everything taking 3:22 to complete.


Graham Eddy

unread,
Dec 12, 2021, 6:54:54 PM12/12/21
to weewx...@googlegroups.com
i have started seeing same thing on 15yo vp2 starting some months ago:
 * a few consecutive periods of DMPAFT msgs then none for a while, repeat; and
 * high incidence of LOOP try# errors of 2 or more (but rarely 4 causing restart, perhaps 6 months between restarts).

i think the higher rate of try# errors began long ago (couple years?) but i only noticed the DMPAFT msgs after i upgraded to 4.5.1 (might have been happening before - so stable i was not examining production logs often)

--
You received this message because you are subscribed to the Google Groups "weewx-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to weewx-user+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/weewx-user/788d7dfb-b8f3-4639-be16-66528f03ffafn%40googlegroups.com.

Tom Keffer

unread,
Dec 12, 2021, 7:19:33 PM12/12/21
to weewx-user
Vince, you should know better! That's not much of a log excerpt. What was before that DMPAFT log entry? Did weewx put something in the database? If not, that looks like a classic case of memory corruption.

As for the LOOP errors, I see those every once in a while, particularly when the logger is busy after a large catch up. Again, more of the log would help.

-tk

vince

unread,
Dec 12, 2021, 8:23:29 PM12/12/21
to weewx-user
On Sunday, December 12, 2021 at 4:19:33 PM UTC-8 Tom Keffer wrote:
Vince, you should know better! That's not much of a log excerpt. What was before that DMPAFT log entry? Did weewx put something in the database? If not, that looks like a classic case of memory corruption.

As for the LOOP errors, I see those every once in a while, particularly when the logger is busy after a large catch up. Again, more of the log would help.


Pot. Kettle .Black. Moi ? - Guilty as charged, sir :-)

I guess I was trying to get a feel for "occasionally seeing this or that" is not that unusual nor necessarily actionable if everything else works ok.

But to answer - everything seems to be working ok.
The db is getting updated just fine, as are the web pages and images.
All the extensions posting to WU/PWS/CWOP and the MQTT stuff (publish and subscribe) are working fine too.

I've attached a gzipped syslog from a restart of weewx and a couple cycles afterward.  Typically the rsync stuff completes last, so I added a blank line between cycles to make this a little easier to parse.   I did notice a api key error in my forecast setup (fixed) so the initial startup might be a little longer than normal since the forecast stuff had to catch up.   Timings for subsequent runs look normal for here - a little over 3 minutes til the rsync stuff ends a 5-minute archive set of stuff...

syslog.txt.gz

Tom Keffer

unread,
Dec 12, 2021, 8:32:19 PM12/12/21
to weewx-user
Naturally, that excerpt didn't have any LOOP errors.

I can definitely see a scenario where an overloaded machine could miss LOOP packets. The LOOP packets are requested in big bunches, typically 200 at a time, then the driver blocks, waiting for them. If a reporting thread hogs the CPU, the driver may be starved for time and not get back in time to get the waiting packet. The Vantage then assumes the driver has gone away and stops sending the packets. 

You have a lot going on with your system. Is there anything that could prevent the Python runtime engine from switching threads?

--
You received this message because you are subscribed to the Google Groups "weewx-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to weewx-user+...@googlegroups.com.

John Kline

unread,
Dec 12, 2021, 9:00:19 PM12/12/21
to weewx...@googlegroups.com

I’ve been studying the LOOP errors for over a year now.  I see them on three different installs, 1 NUC7i5 (w/ console) and two Raspberry Pi 4s (1 w/ console, 1 w/ envoy).

I’ve got a modified and instrumented driver.  The important mod is to exit genDavisLoopPackets() when the error is hit.  The error is almost always “Expected to read 99 chars; got 0 instead”, but it does not have to be 0 that was actually received.  I you don’t exit out of that loop, and you don’t get lucky to be near the end of the loop, weewx will restart.  With the change, weewx never restarts.

Next is the question of why.  It may very well be what Tom suggests.  I’ve suspected as such and have been meaning to write a C program to read the serial (USB) port and feed weewx.

One thing I can say is that, for me, ON ALL THREE consoles, the error occurs *almost* always after a time set.

Here’s an example from the log:
 Vantage Clock Set/Short Reads Info:
   Dec 12 00:30:04 judy weewx[491] INFO user.vantagenext: Clock set to 2021-12-12 00:30:06 PST (1639297806) (9128, 1639297804.599486, 1.303189)
   Dec 12 00:30:09 judy weewx[491] INFO user.vantagenext: get_packet returned 0 bytes. (9129)
   Dec 12 00:30:13 judy weewx[491] INFO user.vantagenext: get_packet returned 0 bytes. (9130)
   Dec 12 00:30:23 judy weewx[491] INFO user.vantagenext: get_packet returned 0 bytes. (9134)
   Dec 12 00:30:27 judy weewx[491] INFO user.vantagenext: get_packet returned 0 bytes. (9135)
   Dec 12 00:30:31 judy weewx[491] INFO user.vantagenext: get_packet returned 0 bytes. (9136)
   Dec 12 00:30:35 judy weewx[491] INFO user.vantagenext: get_packet returned 0 bytes. (9137)
   Dec 12 00:30:39 judy weewx[491] INFO user.vantagenext: get_packet returned 0 bytes. (9138)
   Dec 12 00:30:44 judy weewx[491] INFO user.vantagenext: get_packet returned 0 bytes. (9139)
   Dec 12 00:30:49 judy weewx[491] INFO user.vantagenext: get_packet returned 0 bytes. (9141)
In the above, when the clock set happened, the driver had served 9128 successful loop packets.  It served one more after the clock set and then got the zero bytes error (9129).  The same is true for the next, one success and then one error.  After that, 4 successes and one error.  And so on.

Sometimes after a time set, I see no errors, but that is not often.  The number of errors varies.  This is on the high side.  1 - 4 errors might be more common.

Sometimes there is an error that occurs NOT after the time set.  This is very rare and, when it does happen, it is very often an error that I see on all three consoles.  And, when it does occur, it’s always a single error.

I’m still studying this (because it bothers the heck out of me).

On Dec 12, 2021, at 5:32 PM, Tom Keffer <tke...@gmail.com> wrote:



vince

unread,
Dec 12, 2021, 10:47:56 PM12/12/21
to weewx-user
On Sunday, December 12, 2021 at 5:32:19 PM UTC-8 Tom Keffer wrote:
You have a lot going on with your system. Is there anything that could prevent the Python runtime engine from switching threads?

Indeed.  And uncertain there. The cpu pegs to 100% when weewx does its every 5-minutes thing.   System is 30 MB in swap always and it's a little interesting that weewx VSS = the box's RAM - how much it's swapping.  That might be a coincidence though.

I do 'not' see the clock set pattern John reported in his reply although I do see varying distributions of rude and gentle wakeups.

Belchertown really pounds the system when it runs.  Let me disable it temporarily to give it a bit more breathing room and see if that matters any.

Stuff is working so at this point I guess it's a bit of a science project....
 

vince

unread,
Dec 12, 2021, 10:51:29 PM12/12/21
to weewx-user
oops => weewx VSS = box's RAM + the amount it's swapping.

Only other thing on the box is nginx which is never getting hit for requests, so it's just paged out I guess.  Zero load from nginx essentially.
 

vince

unread,
Dec 13, 2021, 1:35:49 AM12/13/21
to weewx-user
Tom - update.  I noticed that the 'page timestamp' keeps walking upward every interval.

Think it's time to clear the logger ?

(from "grep timestamp /var/log/syslog)
Dec 12 21:45:17 debian weewx[12817] DEBUG weewx.drivers.vantage: DMPAFT complete: page timestamp 2021-12-04 00:30:00 PST (1638606600) less than final timestamp 2021-12-12 21:45:00 PST (1639374300)
Dec 12 21:50:18 debian weewx[13726] DEBUG weewx.drivers.vantage: DMPAFT complete: page timestamp 2021-12-04 00:35:00 PST (1638606900) less than final timestamp 2021-12-12 21:50:00 PST (1639374600)
Dec 12 21:55:21 debian weewx[13726] DEBUG weewx.drivers.vantage: DMPAFT complete: page timestamp 2021-12-04 00:40:00 PST (1638607200) less than final timestamp 2021-12-12 21:55:00 PST (1639374900)
Dec 12 22:00:20 debian weewx[13726] DEBUG weewx.drivers.vantage: DMPAFT complete: page timestamp 2021-12-04 00:45:00 PST (1638607500) less than final timestamp 2021-12-12 22:00:00 PST (1639375200)
Dec 12 22:10:20 debian weewx[13726] DEBUG weewx.drivers.vantage: DMPAFT complete: page timestamp 2021-12-04 00:55:00 PST (1638608100) less than final timestamp 2021-12-12 22:10:00 PST (1639375800)
Dec 12 22:15:20 debian weewx[13726] DEBUG weewx.drivers.vantage: DMPAFT complete: page timestamp 2021-12-04 01:00:00 PST (1638608400) less than final timestamp 2021-12-12 22:15:00 PST (1639376100)
Dec 12 22:20:20 debian weewx[13726] DEBUG weewx.drivers.vantage: DMPAFT complete: page timestamp 2021-12-04 01:05:00 PST (1638608700) less than final timestamp 2021-12-12 22:20:00 PST (1639376400)
Dec 12 22:25:21 debian weewx[13726] DEBUG weewx.drivers.vantage: DMPAFT complete: page timestamp 2021-12-04 01:10:00 PST (1638609000) less than final timestamp 2021-12-12 22:25:00 PST (1639376700)


gjr80

unread,
Dec 13, 2021, 4:07:26 AM12/13/21
to weewx-user
I believe the ‘DMPAFT complete’ message is often misinterpreted as indicating a problem. For most users that see it it is a sign of normal system operation. As Tom said key in interpreting the message is what else is going on around it - classic case showing why we ask users for a longer log extract not just one liners.

The logger operates in a round robin fashion where the oldest archive record is overwritten with the newest. Under normal operation the vantage driver reads archive records from the logger that are timestamped after a given time, typically the last archive record received by WeeWX. The vantage driver reads archive records from the logger accepting archive records that are newer than this given time, this continues until the driver encounters an archive record that is older than this given time(*) - this is key in implementing the ‘catch up’ function - this indicates the end of the round robin buffer has been reached and signals to the driver to stop reading archive records from the logger. In a properly functioning system each successive archive period will see the timestamp of the ‘old’ archive record encountered increment by the logger archive period - exactly what is shown in the log entries below/above.

In a system with ‘corrupt station memory’ the same process occurs but no new archive records are found and (usually) the ‘old’ archive record timestamp does not increment each archive period. You will almost always see no archive records being saved to database - because no new archive records were read from the logger(compare this with a properly functioning system where one or more archive records will be saved to archive). In a normal log the lack of archive records being saved to database is the first hint of corrupt station memory; looking at a debug log and noting the timestamps in the ‘DMPAFT complete’ entries is the next clue.

It’s worth pointing out this is a debug log event, so even though the ‘DMPAFT complete’ log entry is hit by all vantage users using hardware record generation, it is seldom seen as most users operate with debug = 0. If it was more serious you you’ll likely find it would not be included as a debug level log entry but something more prominent.

So in the case above/below the fix (for the ‘DMPAFT complete’ issue) is to simply set debug = 0. No need to clear the station memory.

Gary

* - actually there is an adjustment for DST as well but for the the purposes of this explanation the simplified approach is adequate

Tom Keffer

unread,
Dec 13, 2021, 7:13:06 AM12/13/21
to weewx-user
John: good sleuthing!

Although, I didn't quite understand your comment about exiting genDavisLoopPackets(). Are you saying you might as well set max_tries=1 because it never recovers?

It's possible that after a time sync, the logger is occupied for a bit and unable to generate new packets. A possible solution might be to sleep for 5 or 10 seconds after the sync.

In any case, I'm 1,200 miles from my station, so I can't offer much for a few months.

-tk



John Kline

unread,
Dec 13, 2021, 9:51:42 AM12/13/21
to weewx...@googlegroups.com
Hi Tom,

In particular, if I get the read too few bytes exception, I simply return from genDavisLoopPackets().  genDavisLoopPackets() will be immediately called again asking for another 200 packets.

Just in case I’m not explaining it well, here’s the code that checks for such an error and simply returns.  With this code, weewx never restarts.
for loop in range(N):
for count in range(self.max_tries):
try:
loop_packet = self._get_packet()
log.debug('%s: Loop packet success!' % weeutil.weeutil.timestamp_to_string(loop_packet['dateTime']))
except weewx.WeeWxIOError as e:
if str(e).endswith('instead'):
if str(e).endswith('got 0 instead'):
log.info('get_packet returned 0 bytes. (%d)' % self.pkt_count)
else:
log.info('get_packet returned too few bytes. (%d)' % self.pkt_count)
# If already on a bad read, log that fact.
if self.on_bad_read:
log.info('genDavisLoopPackets: repeated bad read.')
self.on_bad_read = True
return
log.error("LOOP try #%d; error: %s", count + 1, e)
time.sleep(self.wait_before_retry)
else:
self.on_bad_read = False
self.pkt_count += 1
yield loop_packet
break
else:
log.error("LOOP max tries (%d) exceeded.", self.max_tries)
raise weewx.RetriesExceeded("Max tries exceeded while getting LOOP data.")


On Dec 13, 2021, at 4:13 AM, Tom Keffer <tke...@gmail.com> wrote:



Tom Keffer

unread,
Dec 13, 2021, 10:00:03 AM12/13/21
to weewx-user
Makes sense. That could easily be simplified with a subclass of WeeWxIOError. Call it WeeWxZeroBytesError.

Reply all
Reply to author
Forward
0 new messages