Environment : vxWorks 5.4 with T2.0
I connect via an NE2000 driver and FTP-FS from target to host.
I have a large C++ based application. The application (nearly the same code
except for some low-level system classes) runs fine under Win*.
Under vxWorks my application suddenly runs (after hours without problems)
into page faults (with dead target machines), reboots, and similar
crashes. Sometimes just one task traps and goes into the suspended state.
In this last case I can check the location of the trap etc., but it is always
a different location and sometimes it is deep in the vxWorks kernel.
I built a binary which replaces new, new[], delete, delete[], malloc,
free, realloc and calloc with my own versions built on the vxWorks memory
routines. My variant allocates a few extra bytes for a header and marker
bytes at the beginning / end of the data block, and I save pointers to all
allocated memory in a fixed array. (There are not so many memory
allocations.)
I start another task which checks once a second that all memory blocks are
OK (header + markers at beginning / end).
Additionally I check the correctness whenever one of my functions is called.
I load my code before I load my application.
If I overwrote memory, this would be caught immediately (tested).
No memory problem is caught by these functions.
Everything runs fine until the sudden death.
When I had control over the machine after a task was suspended, I looked at
the stack sizes too. No task is near the limits of its stack.
Does anyone have an idea how I can analyze this problem? I am out of ideas!
--
best regards
mario semo
David
"David Lindauer" <cam...@bluegrass.net> wrote in message
news:419D347A...@bluegrass.net...
> well you know it is either in your low-level classes or somewhere in the I/O
> drivers... I would tend to suspect the latter since you can never pin it down.
> What is your throughput on the ethernet connections? are you loading them down
The more communication I do, the more often the thing traps. We do a lot of
TCP or UDP communication from target to host and vice versa.
(We transfer data from target to host and display the data on the host.)
We cannot reproduce the problem on all (target) machines. On some machines
the stuff crashes, on some it does not.
> enough that the buffers fill up?
> Do you have any custom devices?
no.
> you might do
> a search on the NE2000 driver to see if there are known bugs, I have heard that
Where can I search? We do not have a WindRiver maintenance contract anymore.
> at least one of the VxWorks ethernet drivers can crash the thing under certain
> conditions (no I don't know if it was the NE2000) and I have seen another driver
> distributed by a leading component manufacturer with blatant bugs. I don't know
> of any problems with the FTP but I haven't stressed it, we connect long enough
> to load files onto the board then disconnect.
thx a lot for your reply.
> > well you know it is either in your low-level classes or somewhere in the
> I/O
> > drivers... I would tend to suspect the latter since you can never pin it
> down.
Not sure it is that clearly in these places. Stray pointers from
uninitialised areas of memory, freed and reused areas can manifest in
many strange ways, and perhaps do not show up on Windows in the same
way.
> The more communication i do the more the thing traps. we do a lot of TCP or
> UDP communication from Target to host and v.v.
> (we transfer data from target to host and display the data on the host. )
Do you have a test harness for your low level classes? If not, it might be
an idea to write one that can generate the same types of usage
pattern that you expect in the full application, but in a smaller, more
controlled application. For example, if your application has lots of
tasks all using the network simultaneously through these low level
classes then make sure your test application has lots of tasks too.
> We cannot reproduce the problem on all (target) machines. on some machines
> the stuff crashes, on some not.
My experience here suggests one of the following:
a) Stack overflow (you said you'd checked this and it was OK)
b) Uninitialised pointers and/or index values.
c) Use of previously freed memory
Some ideas for how to deal with this...
1) Look into MemScope from RTI. A copy of this may well pay for itself
in the time saved debugging this one problem alone!
2) You mention that you can build the app on Windoze, could you do the
same on Linux? If so, take a look at Valgrind
(http://valgrind.kde.org/). This is an excellent tool for tracking
down memory problems on Linux systems.
3) You could poison most of the memory in your system using the
bootloader. This is a long shot, but it might catch something
interesting.
4) You could replace the delete operator to poison the memory being
freed.
HTH,
John...
=====
Contribute to the VxWorks Cookbook at: http://books.bluedonkey.org/
>
> 3) You could poison most of the memory in your system using the
> bootloader. This is a long shot, but it might catch something
> interesting.
What kind of problem do you expect here? I have not made many changes in
config.h. I just added the definition of my network card:
#define IO_ADRS_ENE 0x300 /*!!!MS!!!!*/
#define INT_LVL_ENE 0x05 /*!!!MS!!!!*/
#define INCLUDE_PCI /* they all are PCI based */ /*!!!MS!!!*/
As far as I remember, I changed one other part with the help of WindRiver
after we bought the licence (years ago), in syslib.c:
...
#if 0 /* original code */
    pciConfigLibInit (PCI_MECHANISM_1, PCI_CONFIG_ADDR, PCI_CONFIG_DATA,
                      NONE);
#else
    pciConfigLibInit (PCI_MECHANISM_2, PCI_CONFIG_CSE, PCI_CONFIG_FORWARD,
                      PCI_CONFIG_BASE);
#endif
    pciIntLibInit ();
...
and i have disabled the scsi and ncr code:
#if 0
#define INCLUDE_SCSI /* include SCSI driver */
#define INCLUDE_NCR810 /* NCR 810 library */
#undef SCSI_AUTO_CONFIG
#define INCLUDE_SCSI2 /* select SCSI2 not SCSI1 */
#define NCR810_MEMBASE 0xfbfef000
#define NCR810_MEMSIZE 0x1000
#endif
>
> 4) You could replace the delete operator to poison the memory being
> freed.
That's what I have already done.
I replaced new, new[], malloc, realloc, calloc, free, delete and delete[]
with my own versions, which do the following:
a) add some more memory at the beginning and end (marker, additional info
(size, ...))
b) store pointers to all allocated memory in a table and verify the memory
(check that the markers are still OK) at each free and additionally every
second.
c) fill the memory with 0x5a (after allocation and before freeing)
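
For readers following along, steps a) to c) could look roughly like the sketch below. This is not the poster's lcmem code. The names guard_alloc, guard_free, guard_check_all, the marker value and the table size are all made up for illustration, it sits on top of plain malloc/free, and it takes no lock, so a real multitasking version would need a semaphore around the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_MAGIC 0xDEADBEEFu /* hypothetical marker value */
#define GUARD_FILL  0x5a        /* fill byte mentioned in the post */
#define GUARD_MAX   256         /* size of the fixed table (assumption) */

typedef struct {
    uint32_t magic; /* front marker */
    size_t   size;  /* size requested by the caller */
} guard_hdr_t;

static void *guard_table[GUARD_MAX]; /* all live user pointers */

/* Allocate size bytes with a header in front and a marker behind. */
static void *guard_alloc(size_t size)
{
    guard_hdr_t *hdr = malloc(sizeof *hdr + size + sizeof(uint32_t));
    uint32_t magic = GUARD_MAGIC;
    unsigned char *user;
    size_t i;

    if (hdr == NULL)
        return NULL;
    hdr->magic = GUARD_MAGIC;
    hdr->size = size;
    user = (unsigned char *)(hdr + 1);
    memset(user, GUARD_FILL, size);            /* c) fill after allocation */
    memcpy(user + size, &magic, sizeof magic); /* a) marker at the end */

    for (i = 0; i < GUARD_MAX; i++) {          /* b) remember the block */
        if (guard_table[i] == NULL) {
            guard_table[i] = user;
            return user;
        }
    }
    free(hdr); /* table full: treat as allocation failure */
    return NULL;
}

/* Returns 1 if both markers of one block are intact, 0 otherwise. */
static int guard_block_ok(const void *p)
{
    const guard_hdr_t *hdr = (const guard_hdr_t *)p - 1;
    uint32_t tail;

    if (hdr->magic != GUARD_MAGIC)
        return 0;
    memcpy(&tail, (const unsigned char *)p + hdr->size, sizeof tail);
    return tail == GUARD_MAGIC;
}

/* Walks the table; returns the number of damaged blocks. */
static int guard_check_all(void)
{
    int bad = 0;
    size_t i;

    for (i = 0; i < GUARD_MAX; i++)
        if (guard_table[i] != NULL && !guard_block_ok(guard_table[i]))
            bad++;
    return bad;
}

/* Verify, forget, poison and release one block. */
static void guard_free(void *p)
{
    guard_hdr_t *hdr;
    size_t i;

    if (p == NULL)
        return;
    assert(guard_block_ok(p)); /* check markers at each free */
    hdr = (guard_hdr_t *)p - 1;
    for (i = 0; i < GUARD_MAX; i++) {
        if (guard_table[i] == p) {
            guard_table[i] = NULL;
            break;
        }
    }
    memset(p, GUARD_FILL, hdr->size); /* c) fill before freeing */
    free(hdr);
}
```

A checker task would simply call guard_check_all() once a second. Note what such a scheme cannot see: writes through stale pointers into memory that was never handed out by these functions, for example network buffer pools owned by the kernel.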
One thing you haven't mentioned: Have you done anything to prevent
failures when memory becomes so fragmented that contiguous blocks
of the size needed no longer exist? VxWorks, by default, has no
mechanism to prevent this from happening.
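
One common way to sidestep fragmentation, sketched here only as an illustration, is to serve frequent allocations from a pool of fixed-size blocks so that a freed block can always be reused. The names pool_init, pool_get and pool_put and the two size constants are invented for this sketch, they are not VxWorks routines, and no locking is shown.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_BLOCK_SIZE 64 /* bytes per block (assumption) */
#define POOL_BLOCKS     32 /* number of blocks (assumption) */

/* A free block stores the link to the next free block in its own memory. */
typedef union pool_block {
    union pool_block *next;
    unsigned char     payload[POOL_BLOCK_SIZE];
} pool_block_t;

static pool_block_t  pool_storage[POOL_BLOCKS];
static pool_block_t *pool_free_list;

/* Thread every block onto the free list. Call once at startup. */
static void pool_init(void)
{
    size_t i;

    pool_free_list = NULL;
    for (i = 0; i < POOL_BLOCKS; i++) {
        pool_storage[i].next = pool_free_list;
        pool_free_list = &pool_storage[i];
    }
}

/* Returns one block, or NULL if the request is too big or the pool is empty. */
static void *pool_get(size_t size)
{
    pool_block_t *b;

    if (size > POOL_BLOCK_SIZE || pool_free_list == NULL)
        return NULL;
    b = pool_free_list;
    pool_free_list = b->next;
    return b;
}

/* Returns a block to the pool; it must have come from pool_get(). */
static void pool_put(void *p)
{
    pool_block_t *b = p;

    assert(b >= pool_storage && b < pool_storage + POOL_BLOCKS);
    b->next = pool_free_list;
    pool_free_list = b;
}
```

Because every block has the same size, the pool can run out but it cannot fragment. The price is wasted space for small requests and a hard upper limit on the request size.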
Speaking only for myself,
Joe Durusau
1) Are you using C++ with interrupts or other async processes?
2) Can you type "->tt task" when the task dies? It is quite possible
that the "root" place of a crash is inside the vxWorks kernel.
However, the higher calling routines will probably give a better
hint as to what the problem is. A stack trace is always helpful.
Regards,
Bill Pringlemeir.
--
No matter where you go, there you are. - Buckeroo Bonzai.
vxWorks FAQ, "http://www.xs4all.nl/~borkhuis/vxworks/vxworks.html"
>
> 1) Are you using C++ with interrupts or other async processes?
I use C++. I use multiple processes, and I have a watchdog (to post a
semaphore after a given time).
>
> 2) Can you type "->tt task" when the task dies? It is quite possible
> that the "root" place of a crash is inside the vxWork kernel.
> However, the higher calling routines will probably give a better
> hint as to what the problem is. A stack trace is always helpful.
Different things occur:
a) the machine reboots
b) the machine freezes completely
c) just a single task traps. In this situation I always do tt taskName.
Here are the call stacks of the last 2 traps:
tNetTask
ipintr
tcp_input
tcp_output
ip_output
ipDetach
ipDrvCtrl
muxSend
ne2000EndLoad
SysOutWordString
CRASH.
The last 3 (muxSend, ne2000..., sysOutWordString) were at the top of the
stack in the last 10 crashes. But it is not always tNetTask which crashes.
Here is another one (here one of my tasks crashes):
vxTaskEntry
LCoosThread
global Constructors keyed to LCoplcCommHandlerRead::ctor
LCoplcCommHandlerRead::asyncFun
LCoplcCommHandlerRead::handle
LCoplcRouter::handle
LCoplcRouter::sendPacketTo
LCoplcResourceCommandLayer_BackEnd::Handle
LCoplcPacket::sendReply
LCoplcRouter::handle
LCoplcRouter::sendPacketTo
LCoplcRouter::sendPacketToMachine
LCoplcCommunicationService::write
LCoosTcpDriver::write
shutdown
bsdShutdown
sosshutdown
tcp_usrreq
tcp_output
ipDetach
ipAttach
muxSend
ne2000EndLoad
sysOutWordString
-- page fault --
> What kind of problem do you expect here? I have not done many changes in
> config.h. i just added the definition of my network card:
By poisoning the RAM before your application runs you can catch things
like uninitialised class/structure members which might be harder to
detect if the memory is zeroed by the bootloader.
> > 4) You could replace the delete operator to poison the memory being
> > freed.
>
> thats what i have already done.
> i replaced new, new[], malloc, realloc, calloc, free, delete and delete[]
> with my own versions and do the following:
> a) add some more memory at begin and end (marker, additional infos (size,
> ...))
> b) store pointers to all allocated memory in a table and verify the memory
> (check markers are still ok) at each free and additional each second.
> c) init memory with 0x5a (after allocation and before freeing)
If that did not find anything, then I would seriously think about
something like Valgrind or MemScope. Unless you have really good
reason to assume that it is not application code going astray, but
perhaps a driver or something.
If you suspect the ethernet driver (and who wouldn't with the NE2000),
you could try another ethernet card (assuming that your system has
slots of some kind). The Intel fei (82557/9 based) cards always seemed
pretty solid.
> By poisoning the RAM before your application runs you can catch things
> like unitialised class/structure members which might be harder to
> detect if the memory is zeroed by the bootloader.
That's what my lcmem.bin does. Whenever memory is allocated I initialize it
with 0x5a.
>
> If you suspect the ethernet driver (and who wouldn't with the NE2000),
Is the NE2000 driver in vxWorks 5.4 so buggy? I cannot check the WindRiver
bug lists anymore.
Maybe this is the reason that the problems I have do not occur on other
machines which do not run an NE2000.
> you could try another ethernet card (assuming that your system has
> slots of somekind). The Intel fei (82557/9 based cards always seemed
> pretty solid).
I will try a 3com PCI card now.
But a question arises now: what is tNetTask doing? From where / how does
tNetTask get the data to transfer?
I assume that tNetTask is the task which transfers the data at the low level.
So, when I have a socket (high level) and do a send(socket,...), what happens
to the data?
As I wrote, the trap occurs in sysOutWordString, a routine which just sets up
some registers (ecx (counter), esi (pointer), edx (target pointer)) and does a
"rep outsw" (the vxWorks disasm says "rep OSIZE outs").
The rep outsw traps. At the moment of the trap eax is still 0xae389, an
unbelievably large value, and esi (the source pointer) is 0xffff0000, so it
is no wonder that the code traps.
So the question is: where does tNetTask get the data from?
Would it be a problem to do something like:
void myfoo()
{
    char data[50];
    .... // init data
    send(socket, data, size, 0);
}
Here I send a local data item. The send code either has to send the data
immediately or has to buffer the data. If send stores the pointer (data) in
a queue, passes it to tNetTask, and tNetTask uses the pointer a bit later,
tNetTask will access stack memory already in use by someone else.
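
On the question of the local buffer: with ordinary BSD-style sockets, send() copies the caller's bytes into buffers owned by the stack before it returns, so the caller's memory may be reused afterwards. The sketch below demonstrates that behaviour on a POSIX host with socketpair(); it is not VxWorks code, the helper names are invented, and it only assumes that the VxWorks stack, being BSD-derived, follows the same rule for regular (non-zbuf) sockets.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sends 50 bytes from a buffer that lives only on this function's stack. */
static ssize_t send_from_local(int sock)
{
    char data[50];

    memset(data, 'A', sizeof data);
    return send(sock, data, sizeof data, 0);
}

/* Overwrites some stack so the old contents of data[] are likely gone. */
static void scribble_stack(void)
{
    volatile char junk[256];
    size_t i;

    for (i = 0; i < sizeof junk; i++)
        junk[i] = 'Z';
}

/* Returns 1 if the receiver still sees 50 intact 'A' bytes, 0 otherwise. */
static int local_send_survives(void)
{
    int sv[2];
    char in[50];
    ssize_t n;
    size_t i;
    int ok = 1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return 0;
    n = send_from_local(sv[0]); /* sender's buffer is dead after this call */
    scribble_stack();
    if (n != (ssize_t)sizeof in)
        ok = 0;
    if (ok && recv(sv[1], in, sizeof in, MSG_WAITALL) != (ssize_t)sizeof in)
        ok = 0;
    for (i = 0; ok && i < sizeof in; i++)
        if (in[i] != 'A')
            ok = 0;
    close(sv[0]);
    close(sv[1]);
    return ok;
}
```

If the data arrives intact here, the stack took its own copy during send(). A garbage source pointer deep inside the driver would then point at a damaged network buffer rather than at the caller's stack.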
[snip try another ethernet card; good advice]
> but a question raises now: what is tNetTask doing? from where / how
> does tNetTask get the data to transfer? i assume that that tNetTask
> is the task which transfers the data lowlevel. so, when i have a
> socket (highlevel) and do a send(socket,...), what happens with the
> data?
>
> as i wrote the trap occurs in sysOutWordString a routine which just
> sets up some registers (ecx (counter), esi (pointer), edx (target
> pointer) and do a "rep outsw" (the vxWorks disasm says "rep OSIZE
> outs"). the rep outsw traps. at the moment of the trap eax is still
> 0xae389, an unbelievable large value. and esi (the source pointer)
> is 0xffff0000 - who wonders that the code traps.
[snip]
> here i send a local data item. the send code either has to send the
> data immediate or has to buffer the data. when send stores the
> pointer (data) in a queue, sends it to tNetTask and tNetTask uses
> the pointer a bit later, tNetTask will access stack memory already
> used by someone else.
The "sysOutWordString" is like memcpy(), strcpy(), etc. If you pass a
garbage pointer to these functions, they will crash. This is not the
fault of these routines. From your stack traces, it appears that some
network buffers are getting messed up. When you post the stack trace
please include the offset, i.e., is it "tcp_output" or "tcp_output
+0x1daf", etc. The trace takes the closest routine it knows about.
Only if the offset is zero is it really calling that routine... but
still, from experience, I would say a zbuf/mbuf was getting messed up.
Are you mixing socket descriptors between tasks? Are you using "zero
buffered" sockets?
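
To make the point about offsets concrete, here is a small sketch of how a trace tool typically turns an address into "name+offset": it picks the nearest symbol at or below the address. The table and its addresses are invented for illustration, and sym_nearest and sym_format are hypothetical helpers, not part of the vxWorks symbol library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef struct {
    unsigned long addr; /* start address of the routine */
    const char   *name;
} sym_t;

/* Hypothetical symbol table, sorted by address; the values are made up. */
static const sym_t sym_table[] = {
    { 0x00101000ul, "ip_output"  },
    { 0x00102000ul, "tcp_output" },
    { 0x00104000ul, "muxSend"    },
};

/* Returns the last symbol whose address is <= pc, or NULL if none. */
static const sym_t *sym_nearest(unsigned long pc)
{
    const sym_t *best = NULL;
    size_t i;

    for (i = 0; i < sizeof sym_table / sizeof sym_table[0]; i++)
        if (sym_table[i].addr <= pc)
            best = &sym_table[i];
    return best;
}

/* Writes "name+0xoffset" into out; returns the snprintf() result. */
static int sym_format(unsigned long pc, char *out, size_t len)
{
    const sym_t *s = sym_nearest(pc);

    assert(out != NULL && len > 0);
    if (s == NULL)
        return snprintf(out, len, "0x%lx", pc);
    return snprintf(out, len, "%s+0x%lx", s->name, pc - s->addr);
}
```

With this table the address 0x00103daf is reported as tcp_output+0x1daf. If tcp_output is really only a few hundred bytes long, such a large offset means the address belongs to some routine without a symbol (for example a static function) that merely follows tcp_output in memory.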
Please see
"http://www.xs4all.nl/~borkhuis/vxworks/netPerformance.txt", and
"http://www.xs4all.nl/~borkhuis/vxworks/troubleshooting.txt". Along
with "http://www.xs4all.nl/~borkhuis/vxworks/vxw_pt4.html", you should
be able to help yourself. This is much more valuable than having
anyone here help you.
If you read the first link, you will know what the tNetTask is doing.
While lots of people have knowledge of vxWorks networking, only you
know about your system and all of the C++/memory allocation, etc that
you have done. If you read this, you might understand why I asked the
two questions above.
hth,
Bill Pringlemeir.
--
Dear child, go enjoy the world while it is still a mystery. -
Metroloafer
vxWorks FAQ, "http://www.xs4all.nl/~borkhuis/vxworks/vxworks.html"
Thanks a lot for these references. They help me to understand tNetTask etc.
Is it not OK to create a socket in task1 and use it in task2? As far as I
understand it, a socket handle is just a data structure which need not be
created in the same task that uses it. But we only use it in one task at a
time.
And I have no idea what zero buffered sockets are. Does this have something
to do with setsockopt? I use SO_LINGER and SO_REUSEADDR.
best regards,
mario semo
> is it not ok to create a socket in task1 and use it in task2? as far
> as i understand it, a socketHandle is just a data structure which is
> not nec. created in the same task as used. but we just use it in
> one task at one time.
It is supported to use sockets in multiple tasks. However, a
difficulty results when you want to close the socket. In this case,
it is easy for the other thread to still try and use it.
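
A minimal sketch of one way to close that window: keep the descriptor behind a lock and mark it invalid on close, so a late sender gets an error instead of touching a dead (or recycled) socket. This uses POSIX threads on a host purely for illustration; on vxWorks a mutex semaphore would play the role of the pthread mutex. The shared_sock_* names are invented.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

typedef struct {
    pthread_mutex_t lock;
    int             fd; /* -1 once the socket has been closed */
} shared_sock_t;

static void shared_sock_init(shared_sock_t *s, int fd)
{
    pthread_mutex_init(&s->lock, NULL);
    s->fd = fd;
}

/* Sends only if the socket is still open; returns -1 otherwise. */
static ssize_t shared_sock_send(shared_sock_t *s, const void *buf, size_t len)
{
    ssize_t n = -1;

    assert(s != NULL);
    pthread_mutex_lock(&s->lock);
    if (s->fd >= 0)
        n = send(s->fd, buf, len, 0);
    pthread_mutex_unlock(&s->lock);
    return n;
}

/* Closes the socket once; later calls and later sends are harmless. */
static void shared_sock_close(shared_sock_t *s)
{
    pthread_mutex_lock(&s->lock);
    if (s->fd >= 0) {
        close(s->fd);
        s->fd = -1;
    }
    pthread_mutex_unlock(&s->lock);
}
```

Holding the lock across send() keeps the sketch short, but it also means a blocked send delays the close, so a real design would have to decide whether that is acceptable.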
Your problem does sound like a memory corruption of the network pools.
However, it is not clear how the corruption occurs. What happened
with the other drivers?
> and i have no idea what zero buffered sockets are. Has this
> something to do with setSockOpt ? i use SO_LINGEr and SO_REUSEADDR.
Sorry, this was probably a red herring. I was thinking that you could
use zbufs with the regular sockets. Look at the vxWorks documentation
for "zbufLib" and "zbufSockLib". You can't be using these; it would
be obvious to you.
fwiw,
Bill Pringlemeir.
--
Life is a sexually transmitted disease. - Unknown
vxWorks FAQ, "http://www.xs4all.nl/~borkhuis/vxworks/vxworks.html"
> It is supported to use sockets in multiple tasks. However, a
> difficulty results when you want to close the socket. In this case,
> it is easy for the other thread to still try and use it.
I tested this too. I intercepted all calls with socket parameters and
close(). This way I checked that no one uses a socket that no longer exists
(or never existed).
This does not happen.
>
> Your problem does sound like a memory corruption of the network pools.
> However, it is not clear how the corruption occurs. What happened
> with the other drivers?
Nothing. It just traps in tNetTask.
Now I have 2 other environments to test:
a) I have a new 3com PCI card and I will use this card instead of my NE2000
b) I have a vxWorks 5.5 (T2.21) to test my app with NE2000 and 3com.
I will test these 2 environments today.
thx.
mario.