I've got a TCP communications problem in VxWorks (6.5) that has me
stumped.
I've got a Windows XP machine running a LabView program that sends 500-
byte TCP/IP packets to the VxWorks app (running on a single-board
computer w/ a Motorola PowerPC 7447 processor, elsewhere on the
network) at 10Hz. The VxWorks app reads some temperature sensors from
its A/D boards, packages up that data (about 1kB), and sends it back
to the LabView app, also at 10Hz. There is another task on the VxWorks
app that performs some simple calculations on the data, packages up
the results of those calculations (again, into packets about 1kB in
size) and sends them to the LabView app, also at about 10Hz. The
socket for the TCP connection is a global variable that is used by
both VxWorks tasks, and protected by a mutex.
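(In case the structure matters: the sharing arrangement is roughly the
sketch below. It's a simplified illustration, not our actual code, and
the names are placeholders.)

/* Simplified sketch of how the two tasks share the socket; gSock and
 * sendMutex stand in for our real globals. */
#include <vxWorks.h>
#include <semLib.h>
#include <sockLib.h>

int    gSock;       /* accepted TCP socket, shared by both tasks */
SEM_ID sendMutex;   /* created with semMCreate() at startup      */

/* Called from either task at ~10 Hz with a ~1 kB payload. */
STATUS sendPacket (const char *buf, int len)
    {
    STATUS result = OK;

    semTake (sendMutex, WAIT_FOREVER);
    if (send (gSock, (char *)buf, len, 0) != len)
        result = ERROR;
    semGive (sendMutex);

    return result;
    }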
Here is the problem. After about 70 hours of communication, the
connection fails. A packet sniffer (Wireshark) revealed that the
VxWorks app suddenly stops hearing from the LabView app. Both parties
continue to send packets until (1) the LabView app gets nervous that
VxWorks hasn't been incrementing its ACKs and so begins retransmitting
old packets, and (2) the VxWorks app gets nervous that it hasn't
received anything from the LabView app and so begins retransmitting
old packets. Soon the apps fill their respective send buffers, the
connection times out, and it's game over. The VxWorks program seems to
lock up and the app must be restarted.
This problem happened after 70 hours of flawless communication. After
rebooting VxWorks, we ran 22 hours until the next failure. After
rebooting again, 28 hours until the next.
At first, it seemed like this behavior might be caused by an
intermittently-broken receive wire on the VxWorks side, but an
inspection suggests that the hardware (at least outside the single-
board computer) is fine. We are begrudgingly confident that the
problem is not on the Windows side, because the packet data
demonstrates that the problem starts when the VxWorks app first stops
listening. I am hesitant to blame the application-level software,
because it appears from the packet data that the VxWorks TCP stack
simply does not receive (or ignores) the LabView packets, and so the
poor application simply sees the stream of packets stop. Lastly, we
attempted a workaround where the LabView app closes and reopens the
connection every two hours. After implementing this code, the
connection failed 28 hours later. In an attempt to simplify
communications, we enabled TCP_NODELAY on both sides, to no avail.
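(For what it's worth, we set the option the obvious way, roughly as in
this sketch, where the socket passed in is the shared TCP socket:)

/* Roughly how we enabled TCP_NODELAY on the VxWorks side. */
#include <stdio.h>
#include <vxWorks.h>
#include <sockLib.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

STATUS disableNagle (int sock)
    {
    int on = 1;

    if (setsockopt (sock, IPPROTO_TCP, TCP_NODELAY,
                    (char *)&on, sizeof (on)) == ERROR)
        {
        printf ("setsockopt(TCP_NODELAY) failed\n");
        return ERROR;
        }
    return OK;
    }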
I have no more tricks up my sleeve. I am not familiar with the VxWorks
TCP stack; perhaps there are buffers (aside from the TCP send and
receive buffers) that can fill up or degrade over time? Is there any
way I can probe into the TCP stack and determine its health? Do you
think I'm barking up the wrong tree? A friend suggested setting the
TCP window size to something very small in order to minimize the
number of packets "in the air" on the network, but I'm not sure how to
set this in VxWorks. Does anyone know?
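(My only guess so far is that shrinking the socket receive buffer with
setsockopt() would cap the advertised window, something like the
sketch below, but I have no idea whether that's the right approach.)

/* Untested guess: cap the advertised TCP window by shrinking the
 * receive buffer on the listening socket before accept(); the 4 kB
 * value is arbitrary. */
#include <vxWorks.h>
#include <sockLib.h>
#include <sys/socket.h>

STATUS shrinkRxWindow (int listenSock)
    {
    int rcvBuf = 4 * 1024;

    return setsockopt (listenSock, SOL_SOCKET, SO_RCVBUF,
                       (char *)&rcvBuf, sizeof (rcvBuf));
    }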
In short: Any suggestions? Has anyone encountered this behavior
before?
Thank you all in advance for your suggestions and time,
Justin
Oh, and I failed to mention that the VxWorks app is doing the
listening in the TCP connection. The LabView app connects to it.
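(If it matters, the setup on our end is the standard passive open,
roughly like the sketch below; the port number is a placeholder and
error handling is trimmed.)

/* Rough sketch of how the VxWorks side accepts the LabView
 * connection; 5001 is a placeholder port, error handling trimmed. */
#include <vxWorks.h>
#include <sockLib.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>

int acceptLabView (void)
    {
    struct sockaddr_in addr;
    int addrLen    = sizeof (addr);
    int listenSock = socket (AF_INET, SOCK_STREAM, 0);

    memset (&addr, 0, sizeof (addr));
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons (5001);
    addr.sin_addr.s_addr = INADDR_ANY;

    bind (listenSock, (struct sockaddr *)&addr, sizeof (addr));
    listen (listenSock, 1);

    /* The accepted socket is what becomes the shared global. */
    return accept (listenSock, (struct sockaddr *)&addr, &addrLen);
    }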
My application requires sending tons of data at very fast rates over a
Gigabit ethernet. Essentially I am trying to send 4,800,000 bytes per
second (this may prove unfeasible). Anyways, what I do is read voltage
samples off an A/D card connected to my Single Board computer via the
PC104-Plus interface, at a rate of 400k samples per second for 6
channels. Each sample has 16 bits, or 2 bytes of data. However, the A/
D card's data buffer can only hold about 64K samples worth of data
before overflowing, so I need to make sure I am taking data off the A/
D card at a constant rate and then sending that data out in large
packets over the ethernet.
After 6000 samples accumulate in the data buffer, I have a PCI
interrupt trip which copies the sample data off the PCI buffer and
into another buffer, then sends a message to my ethernet application
telling it to send the data on the second buffer out over the TCP
socket.
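(Schematically, the handoff looks like the sketch below. It's a
simplified illustration of my setup, not the actual code, and the
names are placeholders.)

/* Simplified sketch of the interrupt-to-task handoff; the names and
 * single staging buffer are placeholders (real code would want a ring
 * of buffers so the ISR never overwrites data still being sent). */
#include <vxWorks.h>
#include <msgQLib.h>
#include <sockLib.h>
#include <string.h>

#define SAMPLES_PER_BLOCK  6000
#define BLOCK_BYTES        (SAMPLES_PER_BLOCK * 2)

MSG_Q_ID txMsgQ;    /* created with msgQCreate() at startup */
int      txSock;    /* connected TCP socket                 */
char     stagingBuf[BLOCK_BYTES];

/* PCI interrupt handler: copy the block off the card and wake the
 * network task.  Must not block, hence NO_WAIT. */
void adcIsr (char *pciBuf)
    {
    char *p = stagingBuf;

    memcpy (stagingBuf, pciBuf, BLOCK_BYTES);
    msgQSend (txMsgQ, (char *)&p, sizeof (p), NO_WAIT, MSG_PRI_NORMAL);
    }

/* Network write task: block until a buffer is ready, then send it. */
void ethSendTask (void)
    {
    char *buf;

    for (;;)
        {
        msgQReceive (txMsgQ, (char *)&buf, sizeof (buf), WAIT_FOREVER);
        send (txSock, buf, BLOCK_BYTES, 0);
        }
    }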
My connection keeps up fine for the first 10 or so interrupt cycles,
then the network write task begins falling behind. After some
troubleshooting, I determined that after sending about 128k bytes out
over the TCP socket, my problems start. The same holds true when I did
a test with UDP (just to verify that it wasn't the receive side
causing the issue). If I shrink the packets, or send less data at a
time, I still eventually run into the same problem at 128kbytes every
time.
The issue seems to be (just a theory right now) that the VxWorks
network data buffer pool has filled up at this point and needs to free
the memory and re-initialize, so tNetTask is pending and my network
send call has to wait. The 128 kB total seems to match up with the
default network memory block total. I am trying a new method using the
zbuf socket library (zbufSockLib), which basically allows you to send
data over a socket without using VxWorks' network data buffers.
However, Wind River took zbuf support out of VxWorks 6.5, so I am
wondering whether this is a dead end.
Try this link http://slac.stanford.edu/exp/glast/flight/sw/vxdocs/vxworks/netguide/c-tcpip.html
and go to section "4.3.3 Network Memory Pool Configuration". This
covers how the network memory is set up. I am still trying to digest
some of it myself to get a better idea. There does appear to be some
way to diagnose network memory usage.
If I come across something, I'll post a better update.
There are two important points here:
1) You're using VxWorks 6.5, which has a new TCP/IP stack (the IPNET
stack from Interpeak, which Wind River acquired). The documentation
link you posted is for an older version of VxWorks that used a BSD-
derived stack. Some of the information in that documentation is no
longer valid for the new stack, in particular that which pertains to
the stack's internal buffer management: there is no "system pool" and
"data pool" any longer. If you're looking at netPoolShow(), I wouldn't
bother. You can use that to check the ethernet driver's netpool, but
internally IPNET uses a totally different buffer management scheme, so
the netBufLib debugging stuff won't help you.
2) You failed to specify what ethernet controller you're using. I know
that the MPC7447 doesn't include built in networking hardware, so your
board must have some other ethernet chip on it (standalone, or part of
some combined I/O controller device). Please explain what controller
it is, exactly. This is important, because your problem might not be a
general networking issue, but a bug in the ethernet driver. If it
helps, tell us what BSP or single board computer your design is based
on.
And no, you can't use zbufs with the IPNET stack. The major design
difference between IPNET and the BSD-derived stack is that the BSD
code allows internally stored packet data to be fragmented across
multiple buffers (mbufs) while the IPNET stack requires all packets to
fit into a single contiguous buffer. zBufs are a VxWorks-specific
extension to the BSD-derived stack that allow an application supplied
buffer (or buffers) to be directly mated to an mbuf (or mbuf chain)
instead of having to allocate a whole mbuf tuple and copying the data
from the application buffers into the mbuf cluster buffers. The
problem is, sometimes an application may do a large write which has to
be broken up into smaller packets (if an app write()s 64K of data to a
socket, that has to eventually be split up into 1500 byte chunks for
transmission over ethernet). With the BSD mbuf scheme, it's not that
hard to just allocate a bunch of mbufs and set them to point to
different sub-buffers within the bounds of the single large buffer
provided by the application and then chain them together. But because
IPNET requires all packets to fit into a single contiguous buffer, it
doesn't support the ability to chain multiple fragments together. This
means you can't help but copy the data in order to get it formatted
correctly for transmission over the wire, which sort of defeats the
purpose of zero copy buffers. There is talk, however, of finally
implementing scatter/gather within IPNET so that zBufs can be brought
back.
There are arguments for and against both designs. The BSD mbuf-based
design is more flexible and can be more frugal with memory, but the
code is more complex. The IPNET ipnet_packet design is not as
flexible, but the code is simpler, and using its own internal buffer
handling scheme makes it more OS-agnostic, which was one of the IPNET
design requirements. (Personally I prefer the BSD design. Critics
complain that the days when you had to run BSD on a PDP-11 with
minimal memory -- which is what necessitated the more frugal buffer
management scheme in the first place -- are long over, and that I
should learn to stop worrying and love large amounts of RAM. I contend
that just because you have a lot of RAM doesn't mean you shouldn't
make frugal use of it, and besides, VxWorks does sometimes have to run
on hardware with minimal memory.)
-Bill
Hi Bill,
Thanks for your thoughtful post.
Here are some stats on the system we're using:
- Curtiss-Wright 124 single-board computer
- The 124 uses an END Ethernet driver, and the physical device is on a
chip called the MV64460 (or Discovery III). I couldn't find any info
on the driver versions. I figure it's linked to our BSP version
number.
You were very helpful to another poster* regarding a problem that you
determined to be a buggy ethernet driver, made by DY-4 Systems. The
original poster from that thread also seemed to be using a Curtiss-
Wright (then DY-4 Systems) single-board computer. From our packet
logs, we found that the MAC address of the single-board computer
resolves to something that starts with DY-4... coincidence? Do you
think that we're running into the same problem as the poster from the
other thread because we're using the same crummy DY-4 ethernet driver?
Thanks for your continued help,
Justin
No, it's not the same driver.
Drivers usually come from one of two places: either Wind River, or a
3rd party BSP supplier. Wind River supplies drivers for some commonly
available NICs (e.g. the Intel PRO/100 or PRO/1000 PCI cards) and for
some controllers in system-on-chip processors for which they provide
BSPs (e.g. the TSEC ethernet on the Freescale MPC8560). Some board
vendors supply their own BSPs and include their own ethernet drivers
if VxWorks doesn't provide driver support already. (If no driver
exists at all, you can either write one yourself, or pay Wind River
Professional Services to write one for you.)
In the other poster's case, his board had a 10Mbps NatSemi SONIC chip
and VxWorks did include a driver for it (if_sn). Unfortunately, that
driver turned out to be kinda crummy, and didn't hold up well under
load.
In your case, you're probably using the on-board gigabit MACs in the
Discovery III system controller. According to the documentation, it
has 3 gigabit ports. I'm pretty sure the network driver you have was
not written by Wind River, though I don't know if it was done by
Curtiss-Wright or Marvell. If I had to bet, I'd say that at least some
of the code was provided by Marvell.
Looking over your original post, I see that the time that elapses
before the failure is not consistent -- in one case it was 70 hours,
in others 22 and 28 hours. This could be a buffer exhaustion issue,
but to me the variation in reproducibility suggests a race condition
instead. It could be a driver bug, but you'll need to run some more
tests in order to know for sure.
Here are a couple of thoughts:
- When your application and LabView stop communicating, can you still
ping the target from the Windows XP machine? If yes, the ethernet
driver and the IP layer of the stack are still working, at least to
some extent, and it's something at the TCP layer that's gone wrong. If
no, then it could be the driver, or a serious problem in the stack.
- You say that it looks like the VxWorks target stops receiving
traffic from LabView. How did you determine this? I usually check for
receive operation by adding the target shell and the INCLUDE_IFCONFIG
component. At the shell, you can do:
-> ifconfig "motfcc0"
motfcc0 Link type:Ethernet HWaddr 00:04:9f:07:08:09 Queue:none
inet 147.11.46.192 mask 255.255.255.0 broadcast
147.11.46.255
UP RUNNING SIMPLEX BROADCAST MULTICAST
MTU:1500 metric:1 VR:0 ifindex:2
RX packets:24 mcast:7 errors:0 dropped:1
TX packets:6 mcast:0 errors:0
collisions:0 unsupported proto:0
RX bytes:2198 TX bytes:438
value = 0 = 0x0
If you run ifconfig a couple of times and you see the "RX packets" and
"RX bytes" incrementing, then this means the driver is still receiving
frames and passing them into the stack. (The stack maintains the
counters shown by ifconfig, not the driver, so you know the data is
making the transition across the driver/stack boundary.) If this is
the case, it means the receive path is working, but the LabView app
isn't
getting any response back from the target because the transmit path is
stalled. If you don't see the RX counters incrementing, then the
receiver is actually stuck.
Unfortunately, the TX counters are less useful: they indicate that the
stack sourced packets to the underlying driver, but that doesn't tell
you whether or not the outgoing frames were successfully transmitted.
You can test if the driver is still able to send traffic by including
the INCLUDE_PING component in your image. When the target hangs, try
to do:
-> ping "xxx.xxx.xxx.255", 5
from the target shell, where xxx.xxx.xxx is your IP network. This
should (assuming your netmask is 255.255.255.0) cause the target to
send some broadcast packets onto the wire, which you can observe with
Wireshark.
If you see the broadcast packets sent, and the RX counters don't
increment, then the transmit path is working and the receiver is
stalled.
If the broadcast packets don't make it onto the wire, and the RX
counters do increment, then the receive path is working and the
transmitter is stalled.
If the RX counters don't increment and you don't see any broadcast
packets on the wire either, the interface is completely jammed (maybe
it's stopped getting interrupts).
- The board should have at least two ethernet ports (three, if Curtiss-
Wright wired up all 3 MACs). As a test, I would enable a second port
on the target and cable it to another machine. For example, add a
second NIC to your Windows XP host and connect it to the other port on
the target via crossover cable. (Don't cheat by plugging everything
into the same hub/switch/whatever -- ideally, you want the two ports
to be isolated.) Give the second link some dummy IP addresses, like
10.0.0.2 (target) and 10.0.0.1 (Windows XP host). You should be able
to ping the target from your Windows host over both the spare link
(ping 10.0.0.2) and the primary link. Now start your LabView app going
and wait until it fails again. Once it fails, try to ping the target
over both links. If pinging the primary IP address fails, but pinging
the spare IP succeeds, then the problem is almost certainly a driver
bug which has caused the primary interface to become stalled somehow.
(That is, the heavily loaded interface encountered a race condition or
some other error condition from which it did not recover and is now
wedged, while the unloaded spare link is still functional.) If both
links fail to respond to ping, then the problem is more likely a stack
issue. (It could still be a driver issue: each time you send a packet,
the driver takes temporary ownership of it until the TX DMA operation
completes -- if the driver sets up a large TX DMA ring and the
transmitter stalls while it still has ownership of a lot of the
stack's TX buffers, this could prevent the stack from being able to
transmit packets entirely.)
I have found that many vendor supplied drivers follow a pattern. The
companies that make the networking silicon also create an OS-
independent hardware abstraction library for managing the controller.
To make a driver, they port the HAL to the target OS API, and then add
a driver shim over the top of it. This is considered to be more
effective from their perspective since it means that they can support
several OSes with the same piece of core library code. If they find a
bug specifically related to their ethernet hardware, they can then
just patch the HAL code once and fix the bug in all their drivers at
the same time (once they've tested on one OS, they can just recompile
the drivers for all the others to pick up the fix).
This sounds like a great idea, but there are drawbacks. Writing truly
portable code is hard: often the HAL ends up polluted with many
spaghetti #ifdefs to deal with platform differences, which complicates
maintenance. Also, the object code can end up being very large (and
some of it might be dead code that doesn't even apply to your
platform). If you're targeting Windows or UNIX, you might not care,
but with VxWorks, small footprint is key. And VxWorks has some special
requirements compared to other OSes: sometimes "portable" designs fail
to take those differences into account.
From an OS developer's perspective, the best driver is one that's
small, easy to read, easy to maintain and which makes the best use of
available OS (and network stack) facilities.
Anyway, the point is that I wouldn't be surprised if the driver for
the Discovery III ethernet has a bug lurking in it somewhere. I also
bet a quarter that they didn't provide the source code for it
either. :(
-Bill
Bill,
I continue to appreciate your knowledgeable assistance on this
problem! You have been so helpful. To address the points you brought
up:
> I'm pretty sure the network driver you have was
> not written by Wind River, though I don't know if it was done by
> Curtiss-Wright or Marvell. If I had to bet, I'd say that at least some
> of the code was provided by Marvell.
You're absolutely right. To quote one of our contacts at Curtiss-
Wright,
"Each of our boards will have an ethernet driver specific
for the chip and the board. For the 124 board, the driver would have
been based on code from Marvell (the manufacturer of the Discovery III
bridge), and as updated by Curtiss Wright Controls Embedded
Computing/Dy-4 Systems."
So that answers that.
> - When your application and LabView stop communicating, can you still
> ping the target from the Windows XP machine? If yes, the ethernet
> driver and the IP layer of the stack are still working, at least to
> some extent, and it's something at the TCP layer that's gone wrong. If
> no, then it could be the driver, or a serious problem in the stack.
We have not tried this yet, but it's on our (now much longer) list of
things to try once the problem comes up again.
> - You say that it looks like the VxWorks target stops receiving
> traffic from LabView. How did you determine this? I usually check for
> receive operation by adding the target shell and the INCLUDE_IFCONFIG
> component.
We had Wireshark running on a separate machine that was watching all
the traffic on the network. Each time this anomaly happens, it starts
when the VxWorks box stops ACKing packets sent from the Windows box.
To be precise, the ACK number on the "VxWorks --> Windows" packets
stops increasing. Soon, the Windows box, noticing that the VxWorks box
is reporting the same ACK number, begins retransmitting packets.
However, the VxWorks box still does not increment its ACK number. At
the same time, the VxWorks box begins retransmitting data to the
Windows box, as though it didn't hear the incrementing ACKs that the
Windows box was sending to VxWorks.
In short, we deduced it from a bunch of Wireshark data. Do you think
this is a valid conclusion?
We feel much more prepared for the next anomaly, though. We have
enabled ifconfig() in our kernel and are running on the bench with our
fingers crossed.
Per your several suggestions, we'll try
1. Pinging the VxWorks box from another machine on the network
2. Pinging xxx.xxx.xxx.255 from the VxWorks box and seeing what
happens in Wireshark
3. Calling ifconfig() to see what the "Rx packets" and "Tx packets"
counters are doing.
> - The board should have at least two ethernet ports (three, if Curtiss-
> Wright wired up all 3 MACs). As a test, I would enable a second port
> on the target and cable it to another machine. For example,
[snip]
You are correct, Curtiss-Wright did wire up another NIC, but
unfortunately we've had a problem enabling it. Another group of my
coworkers is tackling that problem. If they get it working we'll try
setting up another machine on a two-node network, the way you
suggested.
Thanks again for all your help. We've been in contact with Curtiss-
Wright support and Wind River support, but this thread has provided us
with the most help so far. In fact, our Wind River support contact
provided a cornucopia of advice, almost all of which he copied from
your last post. He omitted to include phrases which would point to a
fault of Wind River's, like your phrase "a serious problem with the
stack." Blah.
Warm regards,
Justin
The two main suspects in this case are the VxWorks network stack and
the ethernet driver on our single-board computer. Does this new data
point to one over another?
Thanks,
Justin
(You don't explain how you implement this "panic if the traffic stops"
behavior. I'm assuming you're using a VxWorks watchdog timer for this,
and that when the watchdog fires, it triggers another task to reset
the target.)
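(If my guess is right, the arrangement would look something like this
minimal sketch, with made-up names and timeout, where the receive path
re-arms the watchdog on every packet:)

/* Minimal sketch of the watchdog arrangement I'm guessing at; the
 * names and the 5-second timeout are made up. */
#include <vxWorks.h>
#include <wdLib.h>
#include <semLib.h>
#include <sysLib.h>
#include <rebootLib.h>

WD_ID  rxWatchdog;   /* created with wdCreate() at startup       */
SEM_ID panicSem;     /* binary semaphore given by the wd handler */

/* Runs at interrupt level if no packet arrives in time; just wake the
 * panic task, don't do real work here. */
void rxTimeout (int arg)
    {
    semGive (panicSem);
    }

/* Call this from the receive path every time a packet arrives. */
void rxKick (void)
    {
    wdStart (rxWatchdog, sysClkRateGet () * 5, (FUNCPTR)rxTimeout, 0);
    }

/* Panic task: blocks until the watchdog fires, then resets. */
void panicTask (void)
    {
    semTake (panicSem, WAIT_FOREVER);
    /* gather whatever state you can here, then reset the target */
    reboot (BOOT_NORMAL);
    }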
In my opinion, it tends to point to some problem with the TX code in
the driver.
You say the target is exchanging data with another box. This implies
that there should be both TX and RX activity. However, you also say
that the app is designed to panic only if it stops receiving data. In
your previous post you said you were using Wireshark to monitor the
traffic from the target and noted that the TCP transmissions from the
target had ceased, but you didn't say if the UDP traffic sent by this
application had stopped as well.
Again, the right way to check for continued RX activity is with the
ifconfig() utility. However, assuming that the app is actually still
receiving UDP traffic (as opposed to having gotten blocked in a call
into the stack), then it means the RX path in the ethernet driver and
the stack are still nominally functional, and the TX path in the
driver has gotten wedged.
By the way, I have a simple and cheap stress test diagnostic I use to
test drivers in VxWorks. It's not as good as using a dedicated traffic
tester, but it can sometimes reveal interesting problems.
There's a very simple utility called TTCP. The UNIX version can be
downloaded from here:
ftp://ftp.sgi.com/src/sgi/ttcp
The Windows version can be downloaded from here:
http://www.pcausa.com/Utilities/pcattcp.htm
(I typically use the UNIX version, but it sounds like you're a Windows
shop.)
What I like to do is use this utility to bombard the target with small
UDP packets as follows:
% ttcp -s -u -l22 -n1000000 -t <IP address of target>
The options are:
-s: source a bunch of garbage data
-u: use UDP instead of TCP
-l22: use a UDP payload length of 22 (should result in a 60 byte
ethernet frame)
-n1000000: The number of UDP datagrams to send (a lot)
-t: transmit (as opposed to -r, receive)
I'm pretty sure the Windows version supports the same options as the
UNIX one.
This exercises both the RX and TX paths of the network driver, and
some parts of the stack. By default, ttcp uses port 5000 to send
traffic. There normally isn't any application running on the VxWorks
target that's listening on this port, so when it receives a UDP
datagram for this port number, the stack responds by sending an ICMP
port unreachable message. If you send it a lot of datagrams, it will
respond with a lot of messages. This will force traffic through the
UDP receive path and the ICMP output path in the stack.
I find that this generates a bit more traffic than using a flood ping,
and it's helped me expose bugs in several VxWorks ethernet drivers in
the past. Using the FreeBSD host in my office, I can generate
something on the order of 200,000 frames/second on my gigabit ethernet
interface. The types of failure modes you might see are:
- exception in tNetTask (or possibly ipnetd using the new IPNET stack
in 6.5)
- exception in interrupt context (buggy interrupt service routine?)
- RX stall (possibly due to mishandled RX overrun in the driver)
- TX stall (possibly due to incorrectly implemented TX cleanup
handling, or mishandled TX underrun)
- RX and TX stall (possibly due to interrupts getting masked off and
not re-enabled, or driver state getting hosed due to a race condition)
- sluggish response on target shell (possibly due to driver doing too
much work in interrupt context, or making excessive use of
intLock()/intUnlock())
If you notice any of these (especially an exception) then you've found
a driver bug.
-Bill
> Thanks,
> Justin
My philosophy is: once you get a target into a failed state, gather as
much data as you can from it before you reset it. Sometimes that's not
a lot, but simple things like ping can often provide helpful clues.
> > - You say that it looks like the VxWorks target stops receiving
> > traffic from LabView. How did you determine this? I usually check for
> > receive operation by adding the target shell and the INCLUDE_IFCONFIG
> > component.
>
> We had Wireshark running on a separate machine that was watching all
> the traffic on the network. Each time this anomaly happens, it starts
> when the VxWorks box stops ACKing packets sent from the Windows box.
> To be precise, the ACK number on the "VxWorks --> Windows" packets
> stop increasing. Soon, the Windows box, noticing that the VxWorks box
> is reporting the same ACK number, begins retransmitting packets.
> However, the VxWorks box still does not increment its ACK number. At
> the same time, the VxWorks box begins retransmitting data to the
> Windows box, as though it didn't hear the incrementing ACKs that the
> Windows box was sending to VxWorks.
>
> In short, we deduced it from a bunch of Wireshark data. Do you think
> this is a valid conclusion?
Oh. That's interesting. Okay, so if I understand you correctly, the
target _is_ able to transmit packets when the problem occurs. This
definitely points to a failure in the RX path somewhere. The fact that
it's continuing to send TCP segments occasionally means that the
stack's TCP timers are still firing, and that the driver can transmit
frames onto the wire. If it stops receiving packets, the stack will
think the peer hasn't acknowledged the current segment yet and will
keep retransmitting it. (There are sometimes oddball cases where the
stack on one side or the other becomes desynchronized, which just
botches a single TCP stream while other traffic continues to flow
normally. These are rare though, and it doesn't sound like you're
doing anything that would trigger such a condition. In any case, this
is why I asked if you could ping the target once the anomaly occurred.
My suspicion at this point is that you won't be able to.) Not being
able to receive packets could mean a couple of things:
- The RX state in the driver may have fallen out of sync with the chip
- The receiver encountered an error from which the driver couldn't
recover
- RX interrupts have stopped firing
- _all_ interrupts for that port have stopped firing (if TX interrupts
have also stopped, the driver may still be able to send packets onto
the wire for a short time)
> We feel much more prepared for the next anomaly, though. We have
> enabled ifconfig() in our kernel and are running on the bench with our
> fingers crossed.
>
> Per your several suggestions, we'll try
> 1. Pinging the VxWorks box from another machine on the network
> 2. Pinging xxx.xxx.xxx.255 from the VxWorks box and seeing what
> happens in Wireshark
> 3. Calling ifconfig() to see what the "Rx packets" and "Tx packets"
> counters are doing.
Good. I'm curious to see the result. (And hopefully adding the
additional components won't just make the problem disappear.)
> > - The board should have at least two ethernet ports (three, if Curtiss-
> > Wright wired up all 3 MACs). As a test, I would enable a second port
> > on the target and cable it to another machine. For example,
>
> [snip]
>
> You are correct, Curtiss-Wright did wire up another NIC, but
> unfortunately we've had a problem enabling it. Another group of my
> coworkers is tackling that problem. If they get it working we'll try
> setting up another machine on a two-node network, the way you
> suggested.
If the second interface shows up when you do "muxShow()" then you
might just be able to do this:
-> ipcom_drv_eth_init "nameofdriver", 1, 0
-> ifconfig "nameofdriver1 10.0.0.1 netmask 255.255.255.0 up"
If it doesn't show up in muxShow(), then it probably needs to be
enabled in the BSP somewhere.
> Thanks again for all your help. We've been in contact with Curtiss-
> Wright support and Wind River support, but this thread has provided us
> with the most help so far. In fact, our Wind River support contact
> provided a cornucopia of advice, almost all of which he copied from
> your last post. He omitted to include phrases which would point to a
> fault of Wind River's, like your phrase "a serious problem with the
> stack." Blah.
That's politics for you.
-Bill
>
> Warm regards,
> Justin
Thanks again for your swift and knowledgeable responses. I've
downloaded pcattcp from here
http://www.pcausa.com/Utilities/ttcpdown1.htm
and tried it out. It appears as though I can blast my target with
zillions of packets and it continues, on the surface, to chug along
happily. The Windows machine receives all the frames it expects in a
timely manner, and running ifconfig() on the target does indeed show
the huge number of dropped Rx packets, as we expected. Does this mean
that none of the exceptions you mentioned, e.g.
- exception in tNetTask (or possibly ipnetd using the new IPNET stack
in 6.5)
- exception in interrupt context (buggy interrupt service routine?)
- RX stall (possibly due to mishandled RX overrun in the driver)
- TX stall (possibly due to incorrectly implemented TX cleanup
handling, or mishandled TX underrun)
- RX and TX stall (possibly due to interrupts getting masked off and
not re-enabled, or driver state getting hosed due to a race
condition)
- sluggish response on target shell (possibly due to driver doing too
much work in interrupt context, or making excessive use of intLock()/
intUnlock())
are occurring?
Also, I've got a tool called Colasoft Capsa Packet Builder, which lets
you construct and edit packets, then send them out. If we're thinking
that the driver or network stack might barf if it gets crummy packets,
I could maybe construct some malformed packets with this program and
ship them out on the network. Do you think this would be a fruitful
approach?
You mentioned some of the causes of being unable to receive packets:
- The RX state in the driver may have fallen out of sync with the chip
- The receiver encountered an error from which the driver couldn't
recover
- RX interrupts have stopped firing
- _all_ interrupts for that port have stopped firing (if TX interrupts
have also stopped, the driver may still be able to send packets onto
the wire for a short time)
Can you please suggest some ways I can test for these cases? My
knowledge of this level of detail of driver/OS architecture is pretty
sparse...!
Lastly, thanks for the heads up with muxShow(). The second NIC didn't
show up, and I remember one of my coworkers mentioning having to
enable it in the BSP.
Thanks again for your time and attention,
Justin
All but the last one: you didn't say if the shell became slow to
respond while ttcp was blasting the target with traffic.
> Also, I've got a tool called Colasoft Capsa Packet Builder, which lets
> you construct and edit packets, then send them out. If we're thinking
> that the driver or network stack might barf if it gets crummy packets,
> I could maybe construct some malformed packets with this program and
> ship them out on the network. Do you think this would be a fruitful
> approach?
I would hold off on this for now. I always tell people I can only
handle one catastrophe at a time. You have one failure scenario
involving your LabView app: focus on that problem rather than trying
too hard to provoke others.
> You mentioned some of the causes of being unable to receive packets:
>
> - The RX state in the driver may have fallen out of sync with the chip
> - The receiver encountered an error from which the driver couldn't
> recover
> - RX interrupts have stopped firing
> - _all_ interrupts for that port have stopped firing (if TX interrupts
> have also stopped, the driver may still be able to send packets onto
> the wire for a short time)
>
> Can you please suggest some ways I can test for these cases? My
> knowledge of this level of detail of driver/OS architecture is pretty
> sparse...!
I wouldn't worry about this just yet. What you really want to do is to
wait for the LabView app to fail again and collect some more data like
I suggested previously. Once you have that data, _then_ you can decide
what to look at next. (This is why I asked if you'd tried to ping the
target once the LabView app stopped working; by using Wireshark you
analyzed the network, but you didn't really do anything to analyze the
target. All you really know about the target is "it stops working."
You need to know more.)
-Bill
>Oh, and I just remembered another piece of the puzzle: The VxWorks
>machine is also exchanging data with another box on the network over
>UDP. We have timers in the VxWorks app that make it panic if it stops
>receiving UDP packets. It appears that during each of these anomalies,
>the VxWorks box continues to receive UDP packets just fine. That is,
>it appears as though it stops hearing from the TCP stream, but
>continues to receive UDP packets as normal.
Perhaps your ARP cache has become corrupt. I had a system which after
about 26 days of continuous connection would respond to ping but not
to telnet; it turned out that the ARP cache had become corrupted by a
nanosecond timer overflow. The mechanism of corruption is probably
not timer-related in your case but the end result seems similar. Can
you devise ARP diagnostics that can run periodically on the sending
device, both before and after the TCP fail?
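Even something crude would do, e.g. a low-priority task that
periodically dumps the ARP table to the console, so you can compare
snapshots from before and after the failure. A rough sketch, assuming
the arpShow() show routine is configured into your VxWorks image:

/* Rough sketch of a periodic ARP dump; assumes the network show
 * routines (arpShow()) are present in the image. */
#include <vxWorks.h>
#include <taskLib.h>
#include <sysLib.h>

extern void arpShow (void);   /* network show routine, if configured */

void arpMonitorTask (void)
    {
    for (;;)
        {
        arpShow ();                        /* dump the ARP cache */
        taskDelay (sysClkRateGet () * 60); /* once a minute       */
        }
    }

/* Spawn with something like:
 *   taskSpawn ("tArpMon", 200, 0, 8192, (FUNCPTR)arpMonitorTask,
 *              0,0,0,0,0,0,0,0,0,0);
 */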
Regards
James Cunnane
Thanks for reminding me. The shell continued at its normal pace and
showed no signs of slowing.
> What you really want to do is to
> wait for the LabView app to fail again and collect some more data like
> I suggested previously. Once you have that data, _then_ you can decide
> what to look at next. (This is why I asked if you'd tried to ping the
> target once the LabView app stopped working; by using Wireshark you
> analyzed the network, but you didn't really do anything to analyze the
> target. All you really know about the target is "it stops working."
> You need to know more.)
Understood. Thanks again for your help. I'll make sure to keep you
posted as we learn more.
Regards,
-Justin
Hmm... In your case you said the system would respond to ping, but
not telnet. It's hard to classify that as a problem with the ARP
cache, _if_ you tried to ping the target from the same host that you
also tried to telnet to it from. If you can ping target A from host
B, then ARP resolution between A and B is working (or at least, the
ARP entries haven't timed out yet). Ping (ICMP over IP) and telnet
(TCP over IP) both rely on ARP, so if it worked for one, it should
have worked for the other.
However, if you tried to ping target A from host B, and that worked,
but trying to telnet to target A from host C did not work, that could
be an ARP problem. (The target still had an unexpired ARP entry for
host B, but was unable to perform ARP resolution for the previously
unknown host C.)
In Justin's case, he said once his app got into its error state, he
could see the target still sending TCP segments to his Windows host
using Wireshark (but not responding to ACKs from the Windows host).
This implies the target's ARP entry for the Windows host was still
valid (otherwise it would have started sending ARP "who has" requests
instead).
-Bill
> Regards
>
> James Cunnane