thread 0 ----------------

loop
    do_something();
    mutex.lock();
    if ( !signaled) {
        sleeping = true;
        while (sleeping)
            condition.timedwait(1);
    } else {
        sleeping = false;
    }
    mutex.unlock

thread 1 -----------------------

loop
    if (somethingelse_happened) {
        mutex.lock();
        if (sleeping) {
            sleep = false;
            condition.signal();
        }
        signaled = true;
        mutex.unlock();
    }
However, using tusc, we observed that thread 0 did go to sleep first,
and 'something else' did happen afterwards, but thread 0 did not wake
up until it timed out.
Could anybody explain why this can happen, or did I simply miss
something?
Thanks,
Jeremy
thread 0
loop
mutex.lock();
do_something();
if ( !signaled) {
sleeping = true;
while (sleeping)
condition.timedwait(1);
} else {
sleeping = false;
}
mutex.unlock
thread 1 -----------------------
loop
if (somethingelse_happened) {
mutex.lock();
if (sleeping) {
sleep = false;
condition.signal();
}
signaled = true;
mutex.unlock();
}
I guess this should be "sleeping = false" in thread 1, right? It's
difficult to pinpoint the problem using pseudo-code. Can you please post
a minimalistic example (which compiles) that shows your particular
problem?
TIA,
Loïc.
--
My Blog: http://www.domaigne.com/blog
“Computers are good at following instructions, but not at reading your
mind.” – Donald Knuth
Why do you have "while (sleeping)" when what you really want is
"while (!signaled)"? Your "sleeping" variable seems to serve no purpose
but to make this design at least twice as complex as it needs to be.
DS
Thanks for pointing this out. This is just legacy code, and I need to
figure out why this behavior could ever happen. (BTW, the
do_something() should be outside of the critical section.)
Jeremy
Thanks, Loïc. Yes, that was a typo. The problem only happens at a remote
site under particular conditions and could not be readily reproduced
locally. What interests me is whether this sleep/wakeup mechanism is
faulty on HP/Itanium for whatever reason.
Jeremy
> > I guess, this should be "sleeping = false" in thread 1, right? It's
> > difficult to pinpoint problem using pseudo-code. Can you please post a
> > minimalistic example (which compiles) that shows your particular
> > problem?
>
> Thanks Loic. Yes that was a typo. the problem only happens in a remote
> site under particular conditions and could not be readily reproduced
> locally. what interests me is whether this sleep/wakeup mechanism is
> faulty on HP/Itanium for whatever reasons.
I see. As pointed out by David, the scheme used is unnecessarily
complicated, and could have some logic errors or be inefficient. For
instance, I don't see where the signaled variable is reset to false.
Second, wouldn't thread 1 spin as long as sleeping is false?
Regards,
Loïc
--
My blog: http://www.domaigne.com/blog
“The best thing about a boolean is even if you are wrong, you are only
off by a bit.”
> however, using tusc, we observed that thread 0 did go to sleep first,
> and 'something else' did happen afterwards, but thread 0 did not wake
> up until it timed out.
You are misreading the output of tusc. It is not telling that thread 0
did go to sleep first. It is telling you that thread 0 *decided* to go
to sleep first. It may have actually gone to sleep significantly
later. Since "something_happened" occurs outside of the mutex, it is
impossible to tell from the tusc output whether thread 0 actually did
go to sleep before or after something happened. However, the next
mutex acquisition in the other thread cannot occur until thread 0 goes
to sleep, because going to sleep and releasing the mutex is atomic.
DS
Since we have time stamps for each sys call, including when thread 0
woke up and when something_happened (which is really a poll call), we
deduced using thread 0's timeout value that it went to sleep earlier
than when poll was initiated.
Furthermore, if something_happened earlier, thread 0 should never go to
sleep.
To be clearer, I rectified the pseudo-code as below:
thread 0----------------
loop
process_sockets_with_events();
mutex.lock();
if ( !signaled) {
sleeping = true;
while (sleeping)
condition.timedwait(1);
} else {
sleeping = false;
signaled = false;
}
mutex.unlock
thread 1 -----------------------
loop
waitfor_events_poll();
if (event_happened) {
mark_sockets_with_events();
mutex.lock();
if (sleeping) {
sleeping = false;
condition.signal();
}
signaled = true;
mutex.unlock();
}
> since we have time stamps of each sys calls, including when thread 0
> woke up and when something_happened (which is really a poll call), we
> deduced using thread 0 timeout value that it went to sleep earlier
> than when poll was initiated.
This is an erroneous deduction. The time stamp must either be the time
stamp of when the system call was made or when the system call
completed. You can tell which by making a system call that takes
several seconds (such as nanosleep or select).
I'm not specifically familiar with tusc, so it could be either way.
Most such tools don't log the system call until it returns (so they
can include the return value) but the timestamp is the time the system
call was made (because they call some 'get time' function before they
make the system call). So when you see the call that put the thread to
sleep, what you are actually (most likely) seeing is when the thread
woke up.
For example, if you see:
1) Send a signal to wake a thread
2) Thread goes to sleep
What actually happened was:
1) Thread called a sleep function.
2) Another thread called the wake function, returned, and logged it.
3) The first thread returns from the sleep function and logs the
return value.
So it appears that no thread was blocked when the signal was sent. But
actually, the signal woke the thread whose "go to sleep" function
appears *after*.
Similar misreadings are possible if 'tusc' logs in other ways. It is
very dangerous to read the output of such tools without being *very*
careful.
> thread 0----------------
>
> loop
> process_sockets_with_events();
> mutex.lock();
> if ( !signaled) {
> sleeping = true;
> while (sleeping)
> condition.timedwait(1);
> } else {
> sleeping = false;
> signaled = false;
> }
> mutex.unlock
>
> thread 1 -----------------------
>
> loop
> waitfor_events_poll();
>
> if (event_happened) {
> mark_sockets_with_events();
> mutex.lock();
> if (sleeping) {
> sleeping = false;
> condition.signal();
> }
> signaled = true;
> mutex.unlock();
> }
This is overly complicated. You want to wait until signaled, the
sleeping variable serves no purpose but to make the code more
complicated.
There is also something odd about your usage. It seems like you have
associated a condition variable with a mutex that serves no purpose.
While this is legal and sometimes reasonable, this doesn't seem like a
case where it is sensible. These threads communicate some information
about socket events. How is that information protected? If by a mutex,
why isn't that mutex associated with the condition variable instead?
You should just be able to do:
1) Wait for events.
2) Acquire mutex.
3) Mark events.
4) Release mutex.
5) Signal condition variable.
6) Go to step 1
and
1) Acquire mutex
2) If there is an event to do, release the mutex, do the event, go to
step 1.
3) Block on the condition variable, releasing the mutex.
4) Go to step 1.
This eliminates one needless mutex and two needless state variables.
Also, it's hard to comment without seeing your full design, but why do
you bother telling another thread to do something when you have a
perfectly useful thread that has just figured out what needs to be
done? Why not just do it?
A common anti-pattern is:
1) See what needs to be done.
2) Find another thread to do it.
3) Go to step 1.
A better one is:
1) See what needs to be done.
2) Do it.
3) Go to step 1.
(And let other threads do the same thing.)
Not only does the second pattern not require a dispatch on every event
discovery, but it allows all your threads to be following the same
pattern rather than having two different types of threads.
(Though this comment may be completely inapplicable to your larger
design. It's hard to be sure.)
DS
Without knowing the particular semantics you are trying to achieve, it
is difficult to help you specifically. For instance, if the event
happens while thread 0 executes the statements
sleeping = false;
signaled = false;
this leads thread 0 to execute the loop once more without waiting on
the condvar (since signaled == true) and, at the next iteration, to
eventually sleep.
Can you guarantee with tusc that the thread didn't loop twice before
waiting on the condvar?
Further ideas:
- what about posting the tusc output?
- what about raising the subject to a HP specific forum, like
http://forums.itrc.hp.com/ ?
- if it's a OS problem, what about writing a stress test?
HTH,
Loïc.
--
My blog: http://www.domaigne.com/blog
"The trouble with programmers is that you can never tell what a
programmer is doing until it’s too late." -- Seymour Cray.