The box has 8 dual-core CPUs and runs Solaris 10.
If process A spawns 8 or more child threads, the following seems to
happen:
* None of the CPUs is anywhere near saturated; they are mostly idle.
* The child threads are all sitting on a CPU at priority 59, and they
sleep for about 20 seconds before they do anything. Tracing shows
that they are sleeping in reads on a socket which seems to be
connected to parent process A.
* Process A is also at priority 59 but not on a CPU.
This 20-second or so delay doesn't happen if I set Process A to spawn
fewer than 8 threads. I am assuming that the problem is that the
children are waiting for data from the parent which can't get CPU time
because everything is at priority 59. After about 20 seconds,
something kicks one child off the CPU (what timer is this?) and the
parent gets some time to deliver data. I don't understand the "8
threads" limit to this behaviour as the box naturally reports 16 CPUs
because they are dual-core.
I have done a lot of tracing, dtracing etc. and everything seems to
support this hypothesis so far. I can't work out whether this is a
tuning problem or just badly written (expensive, commercial) software
which doesn't multi-thread very well. Of course, binding Process A to a
processor or renicing it to test this hypothesis doesn't work, as the
setting is inherited by the children. Any thoughts?
> I currently have an app which seems to be a victim of some sort of
> process deadlock or possibly just bad design and wondered what
> thoughts others may have about this.
>
> Box has 8 dual-core CPUS running Sol 10.
That would be 16 processors.
> If process A spawns 8 or more child threads, the following seems to
> happen
>
> * All CPUs are not at all saturated - mainly idle.
> * I see the child threads all sat on a CPU at priority 59 and they are
> sleeping for about 20 seconds until they do
> something. Tracing it shows that they are sleeping doing reads on a
> socket which seems to be connected to
> parent process A.
> * Process A is also at priority 59 but not on a CPU.
>
> This 20 second or so delay doesn't happen if I set Process A to spawn
> less than 8 threads.
Given that your box has 16 processors, there is no reason to believe
that the number of processors has an influence on the behaviour of the
program. One of the basic tenets of programming is that application
programmers don't grok threads.
> I am assuming that the problem is that the
> children are waiting for data from the parent which can't get CPU time
> because everything is at priority 59.
Nope. On a 16 processor box, 9 threads cannot be starved for CPU time.
The programmers goofed and got the thread synchronisation wrong. It
happens all the time, even with thread-friendly languages such as Java.
> After about 20 seconds,
> something kicks one child off the CPU (what timer is this?)
Probably a timer in the process. Solaris doesn't have a magical 20
second timer. The application probably has one as a crutch to avoid
solving a real issue. Timers are usually implemented with signals, so
trussing the process for alarm calls would show this.
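
To make the "timer as a crutch" idea concrete, here is a minimal sketch
(in Python, Unix only, main thread only) of a blocking socket read that
is broken out of by a SIGALRM-based timer. The names read_with_alarm
and ReadTimeout are made up for illustration; nothing here is taken
from the application in question, whose internals are unknown. If the
application does something along these lines, calls such as alarm() or
setitimer() are what one would expect to see when tracing it.

```python
import signal
import socket


class ReadTimeout(Exception):
    """Hypothetical exception raised by the alarm handler."""


def _on_alarm(signum, frame):
    # Raising here interrupts the blocking recv() in the main thread.
    raise ReadTimeout()


def read_with_alarm(sock, timeout_s):
    """Return data from sock, or None if the alarm fires first."""
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, timeout_s)  # arm the timer
    try:
        return sock.recv(16)
    except ReadTimeout:
        return None
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)      # disarm the timer
        signal.signal(signal.SIGALRM, old_handler)   # restore old handler
```

With a 20-second timeout in place of the short one used for testing, a
reader that is never sent data would "wake up" after 20 seconds, which
is the sort of pattern described above.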
> and the parent gets some time to deliver data. I don't understand the
> "8 threads" limit to this behaviour as the box naturally reports 16
> CPUs because they are dual-core.
As said above, it's unlikely to have anything at all to do with the box
or the OS, and everything to do with programmers not understanding
multi-threading.
> I have done a lot of tracing, dtracing etc. and everything seems to
> support this hypothesis so far. I can't work out whether this is a
> tuning problem or just badly written (expensive, commercial) software
> which doesn't multi-thread very well?
Unlikely to be a mere tuning problem. Solaris does both multi-threading
and multi-tasking very well, thank you.
> Of course processor binding/renicing Process A to test this
> hypothesis doesn't work as this is inherited by the children. Any
> thoughts?
Children? Is this a multi-threaded program or a program that spawns
child processes? If the latter, then it's likely that the programmers
got the parent/child synchronisation wrong and never tested it on a
genuine multi-processor box, so the bug did not show (co-operating
programs behave very differently when going from one processor to
several). Alternatively, they only have access to an 8-way system and
have produced a kludge that malfunctions when there are more
processors.
If you (or your company) will be sued when you divulge the name of the
piece of crap causing the problem, stay mum. If not, pillory the thing;
you'll probably find someone who will at least commiserate or, with a
bit of luck, offer a workaround or solution.
--
Stefaan A Eeckels
> This 20 second or so delay doesn't happen if I set Process A to spawn
> less than 8 threads. I am assuming that the problem is that the
> children are waiting for data from the parent which can't get CPU time
> because everything is at priority 59.
I don't think this is right. Even if the system had only 8 concurrent
hardware threads (it has 16, since the CPUs are dual-core), a thread
that blocks on a read immediately frees up its CPU (or hardware thread,
or virtual CPU, whatever the right term is now). The parent would then
be able to run and deliver the data, which would break the deadlock.
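
A rough way to convince yourself of this is the sketch below (Python,
purely illustrative; run_demo and its parameters are made-up names and
say nothing about how the real application is written). Several
threads block in recv() on a socket pair while the "parent" is busy
elsewhere; the process burns almost no CPU during the wait, and every
reader returns as soon as the parent writes.

```python
import socket
import threading
import time


def run_demo(n_children=8, parent_delay=0.5):
    """Children block in recv(); the parent writes after parent_delay seconds."""
    pairs = [socket.socketpair() for _ in range(n_children)]
    received = [None] * n_children

    def child(i, sock):
        # Blocks in the kernel until data arrives; no CPU is used meanwhile.
        received[i] = sock.recv(16)

    threads = [
        threading.Thread(target=child, args=(i, child_end))
        for i, (_, child_end) in enumerate(pairs)
    ]
    cpu_start = time.process_time()
    wall_start = time.monotonic()
    for t in threads:
        t.start()
    time.sleep(parent_delay)          # the parent is "busy elsewhere"
    for parent_end, _ in pairs:
        parent_end.sendall(b"go")     # the parent finally delivers data
    for t in threads:
        t.join()
    cpu_used = time.process_time() - cpu_start
    wall = time.monotonic() - wall_start
    for parent_end, child_end in pairs:
        parent_end.close()
        child_end.close()
    return received, cpu_used, wall
```

If the CPU time used comes out far below the wall-clock time, the
readers were asleep rather than occupying processors, so they cannot
have been what kept the parent off a CPU.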
One interesting thing to find out would be what the parent is doing
during the 20-second pause. Is it blocked on something (and if so, on
what), or has it gone to sleep, or something like that?
An initial 20 second pause sounds suspiciously like the initial 30 second
pause that can happen when a client connects to a server and the server
tries and fails to get DNS to resolve the client's IP address to a host
name.
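
If you want to check the DNS theory, timing a reverse lookup on the
box is a quick test. The following is a minimal sketch in Python;
timed_reverse_lookup is a made-up helper name, and the address you pass
in should be whatever peer address the application actually sees,
which I don't know.

```python
import socket
import time


def timed_reverse_lookup(ip):
    """Return (hostname or None, seconds taken) for a reverse lookup of ip."""
    start = time.monotonic()
    try:
        name = socket.gethostbyaddr(ip)[0]
    except OSError:
        # Covers socket.herror / socket.gaierror: the lookup failed.
        name = None
    return name, time.monotonic() - start
```

A lookup that fails only after many seconds would point at name
resolution rather than at scheduling.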
-Greg