Jun 22 09:33:00 modred radmrouted[398]: <1245677580596> : radBufferRls: trying to release already free buffer or corrupt header
And then things go to crap, and nothing works until I restart everything.
I haven't yet dived into the code to figure out the problem, but does
anyone have an idea where I should start?
--Ken
It's a TS-7200 from Technologic Systems (http://www.embeddedarm.com,
I have no interest in them other than being a customer). It doesn't
_feel_ like a memory problem ... in my experience those crop up as
random problems, and I have certainly run a lot of stuff on this
box. I don't have any issues with it crashing or anything. Nevertheless,
it's not like I can try other memory; it's all soldered on. And this exact
same problem has happened 3 times now. I could be wrong, of course, but
I personally don't like saying "memory problem" until I've eliminated
everything else. I was just hoping not to have to dive into the
radlib code just yet.
--Ken
Respectfully ... yes, I know that. But some of the backstory is
that I've used this box for plenty of other things, and many of
them were memory-intensive. If there was a memory error, I would
have expected it to show up before now. Yes, sure, a memory error
could have developed since I did that work (but it wasn't that long
ago). But in my experience, memory errors are really rare. It
just seems like that's the wrong place to start looking. Others
may differ in that opinion, of course.
>You also apparently assume you are the third or fourth person to ever
>use radlib. Otherwise, your "feel" would direct you to questions
>concerning what is different for your platform than the many other
>systems who run happily with radlib for years (yes years). To assume
>first it is a radlib bug does not give me much confidence in your
>debugging intuition.
Again, respectfully ... I would refer you back to my original note.
In it I said a) I'm getting this error, b) I haven't yet started
to debug it, c) does anyone have any idea where to start? In a
followup note, yes, I did say that I was hoping to not have to "dive
into the radlib code just yet". But I did not ever say anywhere, "Hey,
I think this is a bug in radlib". Sure, I did acknowledge that I was
going to have to dive into the radlib code, but by that I meant I
was going to have to look at the code, understand what that error
message _meant_, and what could be the possible cause of that error.
But I am not going in with any preconceptions as to the cause of this
error, and I don't believe that I implied that a bug in radlib
was the cause.
>Debugging is a process of reduction - that is reducing the number of
>possibilities by ruling them out. It may be a radlib bug, but is that
>the most likely cause given its maturity?
Respectfully ... I have no idea of the history of radlib; I had
never heard of it before I installed wview. I mean that as no
slight against radlib; there are plenty of things I haven't heard
of. I only mention that to indicate that I had no idea how mature
of a package it was. And as for ruling things out ... well, given
the situation with this hardware, short of buying another box I
have no easy way of testing the memory, since it's soldered on.
And I suspect that I'm probably one of the few people using radlib
on a NetBSD/arm system (certainly on this particular hardware). So
I'm sort of on the fringe here, and I know that. The obvious (to
me) debugging solution in this case would be to start with running
a debugger on radmrouted and see where that takes me. I am not
averse to the problem being with NetBSD (I did build a debugging
version of the C library for this system to track down the alignment
problem I reported in an earlier email). If there is something
else you would suggest on how to debug this problem, I would gladly
give it a try.
--Ken
Hrm ... maybe I'm missing something, but will that actually help
in this particular case? I think the "interesting" stuff happens
the first time that radBufferRls() fails; I'm not sure it's easily
possible for me to have gcore run when that happens, especially since no
signal is posted. Perhaps the simplest thing to do is have radlib
call abort() for that error and examine the core file after that happens.
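To make the abort() idea concrete, here is a rough sketch of what I have in mind. It is NOT radlib's code: the header layout, BUF_MAGIC, buf_get(), buf_release() and the abort_on_bad_release switch are all made-up names for illustration, and I'm assuming the release routine detects the problem by checking a magic value and an in-use flag in a header placed in front of the buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical header layout; radlib's real buffer header may differ. */
#define BUF_MAGIC 0x52414442u

typedef struct {
    uint32_t magic;   /* set when the buffer is created */
    int      in_use;  /* 1 while handed out, 0 once released */
} buf_hdr_t;

/* Set to nonzero to turn a bad release into a core dump. */
static int abort_on_bad_release = 0;

/* Hand out a buffer with a header in front of the payload. */
static void *buf_get(size_t size)
{
    buf_hdr_t *hdr = malloc(sizeof(*hdr) + size);
    if (hdr == NULL)
        return NULL;
    hdr->magic = BUF_MAGIC;
    hdr->in_use = 1;
    return hdr + 1;
}

/*
 * Release a buffer. The memory is deliberately not free()d, the way a
 * pool allocator would keep it, so the header is still readable when
 * a second (bogus) release comes in.
 * Returns 0 on success, -1 on a double release or damaged header.
 */
static int buf_release(void *payload)
{
    buf_hdr_t *hdr;

    assert(payload != NULL);
    hdr = (buf_hdr_t *)payload - 1;

    if (hdr->magic != BUF_MAGIC || !hdr->in_use) {
        fprintf(stderr,
                "buf_release: already free buffer or corrupt header\n");
        if (abort_on_bad_release)
            abort();  /* raises SIGABRT; leaves a core if core dumps are enabled */
        return -1;
    }
    hdr->in_use = 0;
    return 0;
}
```

With the switch turned on, the first bad release kills the process on the spot, so the core file should show the call stack of whoever made that release, rather than the state some time after the damage was done.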
One additional data point: before I wrote a startup script for
NetBSD, I wasn't starting up wvpmond (I was just starting up the
daemons by hand), and things ran without any problems (probably at
the most a week or two between reboots). It was only after I wrote
the startup script and started using wvpmond that I ran into this
problem. Starting Friday I decided to disable wvpmond, and so far
I haven't yet had radmrouted fail. I'll give that another week to
see if that affects this problem or not, then I'll go from there.
--Ken