In our system we get the "workQPanic: Kernel work queue overflow",
though very rarely. This is followed by a reboot.
I understand this is because the processor can't handle all the
interrupts. What I am wondering is if the netTask has anything to do
with it? I read in a letter from 1993 that the netTask is responsible
for clearing the work queue. I wonder if this is still the case, because
I can't find anything about it in the documentation.
Unfortunately, our system has tasks running with higher priorities
than netTask, but according to the documentation this should only
affect the documentation.
Andreas F wrote:
You will crash the system if packets are arriving faster than netTask can
handle them. One possible cause is tasks with priority higher than netTask
that are consuming too much time. With WindView or some such tool,
you can monitor the system idle time. Most folks would suggest that
under normal conditions, a minimum of 50% idle time should exist.
Speaking only for myself,
Joe Durusau
Oops! The last line should be "this should only affect the debugging
capabilities." ;)
Also, if it is not the netTask that services the kernel work
queue, which task does?
> I understand this is because the processor can't handle all the
> interrupts. What I am wondering is if the netTask has anything to do
> with it? I read in a letter from 1993 that the netTask is responsible
> for clearing the work queue. I wonder if this is still the case, because
> I can't find anything about it in the documentation.
Well, you have been misinformed since the netTask has never been
responsible for processing the kernel's work queue (having a network
stack is not even a requirement for VxWorks). The work queue contains
kernel operations, such as a semGive or a msgQSend, that occurred in
the ISR associated with an interrupt that happened while the system
was in kernel state.
OK, that was a complicated sentence, so here's an example:
1. Task 1 calls semGive().
2. semGive() enters what is called "kernel state" - a special protected state
   that prevents corruption of kernel data, but does not require interrupts
   to be disabled.
3. An interrupt occurs.
4. The ISR tries to give a semaphore to release a task later on.
5. Since the system is already in kernel state, that semGive is added to the
   work queue.
6. The ISR exits, returning control to the task level semGive (no rescheduling
   can happen here since we are in kernel state).
7. semGive completes and calls windExit to leave kernel state.
8. windExit processes any jobs that are pending in the work queue, and then
   either returns to the current task's code, or invokes the scheduler (based
   on whether the head of the ready queue has changed as a result of either
   of the semaphore operations).
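
The sequence above can be sketched as a small host-side model. This is not VxWorks source: every name in it (kernel_enter, kernel_exit, sem_give, task_sem_give_interrupted) is hypothetical, the depth of 64 is taken from the limit mentioned elsewhere in this thread, and a "panicked" flag stands in for the real workQPanic and reboot.

```c
/* Host-side model of deferred kernel work; illustrative only, not VxWorks code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WORK_Q_DEPTH 64            /* assumed fixed depth of the work queue */

typedef void (*work_fn)(int arg);
typedef struct { work_fn fn; int arg; } work_job;

static work_job work_q[WORK_Q_DEPTH];
static size_t   work_head, work_count;
static bool     kernel_state;      /* true while a kernel operation is in progress */
static bool     panicked;          /* stands in for workQPanic followed by a reboot */
static int      sem_count;         /* toy counting semaphore */

static void sem_give_core(int arg) { (void)arg; sem_count++; }

static void kernel_enter(void) { kernel_state = true; }

/* Plays the role of windExit: drain deferred jobs, then leave kernel state. */
static void kernel_exit(void)
{
    assert(kernel_state);
    while (work_count > 0) {
        work_job job = work_q[work_head];
        work_head = (work_head + 1) % WORK_Q_DEPTH;
        work_count--;
        job.fn(job.arg);
    }
    kernel_state = false;
}

/* Callable from "task" or "ISR" level in this model. */
static void sem_give(void)
{
    if (kernel_state) {            /* kernel busy: defer the operation */
        if (work_count == WORK_Q_DEPTH) { panicked = true; return; }
        work_q[(work_head + work_count) % WORK_Q_DEPTH] =
            (work_job){ sem_give_core, 0 };
        work_count++;
        return;
    }
    kernel_enter();
    sem_give_core(0);
    kernel_exit();
}

/* A task-level give that is interrupted n_interrupts times while in kernel state. */
static void task_sem_give_interrupted(int n_interrupts)
{
    kernel_enter();
    sem_give_core(0);
    for (int i = 0; i < n_interrupts; i++)
        sem_give();                /* ISR-level give: lands in the work queue */
    kernel_exit();
}

static void model_reset(void)
{
    work_head = work_count = 0;
    kernel_state = panicked = false;
    sem_count = 0;
}
```

In this model, a few interrupts during one kernel operation are absorbed and replayed on exit, while more than WORK_Q_DEPTH deferred operations before the exit sets the panic flag.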
So, in a way any task may process the work queue, although really it
is windExit that does so. It will always happen in the context of a
task though (whichever one entered kernel state just before the
interrupt arrived).
This design, like any, has pros and cons. The pros are much reduced
interrupt latency since there are few places that interrupts are
blocked. One of the cons is that the queue is a fixed depth, and if it
fills up before it can be processed then the system panics and
reboots.
Often these panic reboots are caused by a buggy interrupt handler (one
that does not exit for example, but keeps spinning in an attempt to
handle as many events as possible for a device). High speed network
devices are particularly vulnerable to this since they tend to loop
through the incoming slots in the hardware ring looking for the end -
if the device is filling the slots faster than the CPU processes them
(which can happen for high speed network devices on slower processor
systems), then this ISR might never exit, resulting in a work queue
overflow. A ping flood from a remote machine while a task is running
semGive in a tight loop is often a good way to check for these types
of problem.
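
One common way to keep such a handler from spinning is to cap the work done per interrupt. The following is a host-side sketch only: the rx_ring structure, the RX_BUDGET value and the refill model are all assumptions made for illustration, not part of any real driver or VxWorks API.

```c
/* Host-side sketch of a bounded receive loop; all names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RX_BUDGET 16               /* assumed per-interrupt limit */

typedef struct {
    unsigned pending;              /* filled slots waiting in the hardware ring */
    unsigned refill_per_packet;    /* slots the device fills while one is handled */
    unsigned handled;              /* packets handled so far */
} rx_ring;

/* Handles at most RX_BUDGET packets, then returns.
 * Returns true if slots are still pending, so the caller can defer
 * the remainder (for example to task level) instead of looping. */
static bool rx_isr_bounded(rx_ring *ring)
{
    assert(ring != NULL);
    unsigned budget = RX_BUDGET;

    while (ring->pending > 0 && budget > 0) {
        ring->pending--;
        ring->handled++;
        ring->pending += ring->refill_per_packet;  /* device keeps filling */
        budget--;
    }
    return ring->pending > 0;
}
```

With refill_per_packet set to 1 or more, an unbounded "loop until the ring is empty" version of this handler would never return; the budget guarantees an exit at the cost of leaving some slots for later.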
HTH,
John...
=====
Contribute to the VxWorks Cookbook at: http://books.bluedonkey.org/
I suspect (part of) this misconception comes from the (somewhat) common usage
of netJobAdd by programmers to occasionally kick off something short-lived to
run at the task level while in an ISR.
I have seen a few instances where someone thought it was a good idea to use
netJobAdd to run all non-network sporadic task-level processing, and it ended up
filling up the mailbox that queues these requests.
/Andreas
>Hello,
[x]
>
>Often these panic reboots are caused by a buggy interrupt handler (one
>that does not exit for example, but keeps spinning in an attempt to
>handle as many events as possible for a device). High speed network
>devices are particularly vulnerable to this since they tend to loop
>through the incoming slots in the hardware ring looking for the end -
>
You are right. We experienced such a problem when we used our device in F.O.
redundancy rings. There is a network protocol (the spanning tree protocol) that
routers can use to break those rings logically; without it, a
message could travel around the ring many times before eventually dying out.
Unfortunately, the loop is not broken immediately: it takes from one to
several seconds. When we tested the case of a physically broken
(opened) ring being closed, we found that a single broadcast message sent by a
device during the first second after the re-closure could appear at all the
devices hundreds of thousands of times per second!
These floods made all our devices crash (with workQPanic). And yes, we eventually
found that the problem lay in a buggy network driver.
--
Ignacio G.T.
We have/had a similar problem here on an embedded i960 chip that uses
its PCI bus messaging unit as its network device for token ring
network packets. The other chipset that places messages in this poor
little i960's messaging unit can place messages faster than they can
be serviced, and in this situation the ISR that services the messaging
unit will continuously get called, resulting in this workQ panic.
I know the real solution would be to throttle the host unit to not
place messages so quickly, but does anyone know of a clean way to
protect from this scenario in the ISR of the slower unit? I have
experimented with looking at the size of the workQ and breaking out of
the ISR when approaching the limit of 64, and it seems to prevent
the crash, but packets will begin to experience latency during times
of burst traffic as they won't get serviced until the next messages
are placed.
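
The experiment described above can be modelled roughly as follows. This is a host-side sketch under stated assumptions: the msg_unit structure, the WORK_Q_MARGIN value and the idea that each serviced message posts exactly one deferred kernel job are all hypothetical, and how the current work queue depth is actually read on a real target is not shown.

```c
/* Host-side sketch of an ISR that stops short of the work queue limit.
 * All names and the margin value are hypothetical. */
#include <assert.h>
#include <stddef.h>

#define WORK_Q_LIMIT  64           /* limit mentioned in the post above */
#define WORK_Q_MARGIN 8            /* assumed safety margin */

typedef struct {
    unsigned work_q_used;          /* deferred kernel jobs currently queued */
    unsigned inbound;              /* messages waiting in the messaging unit */
    unsigned serviced;             /* messages serviced so far */
} msg_unit;

/* Services messages only while the queue has headroom.
 * Returns the number serviced in this call; anything left in
 * mu->inbound waits for a later interrupt, which is the added
 * latency described in the post. */
static unsigned msg_isr(msg_unit *mu)
{
    assert(mu != NULL);
    unsigned done = 0;

    while (mu->inbound > 0 &&
           mu->work_q_used + WORK_Q_MARGIN < WORK_Q_LIMIT) {
        mu->inbound--;
        mu->work_q_used++;         /* assumed: one deferred job per message */
        mu->serviced++;
        done++;
    }
    return done;
}
```

The margin trades throughput for safety: a larger value leaves more room for other interrupt handlers that post to the same queue, but makes the handler give up earlier under burst traffic.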
David Strand
according to my observations, it seems to be the excTask that is
responsible for executing those jobs queued by ISRs.
Fritz
> according to my observations, it seems to be the excTask that is
> responsible for executing those jobs queued by ISRs.
Your observations are incorrect. The exception task handles a few
clean up operations for the OS (the end of taskDelete() processing
when a task is deleting itself for example), and some work relating to
the display of h/w exception messages.
The kernel's work queue, as I stated in an earlier post, is processed
by the scheduling code. There are actually several places where the
queue is checked and any contents processed, but all are in the
scheduling code. That essentially means that they are not running in
any task, though they will be using the stack of the interrupted task
in reality for 5.x (in AE things are a little more complex - getting
this right in a multi-address space environment took a lot of care!).
Finally, there is the network task's ring buffer, which is what
netJobAdd() drops things in. This is used by network driver interrupt
routines to defer processing to task level, and also by network
related watchdogs, such as phy/link monitoring tasks.
john_...@yahoo.com (John) wrote in message news:<488e459a.04102...@posting.google.com>...