e100 oops on resume

Stefan Seyfried

unread,

Jan 24, 2006, 6:01:15 PM1/24/06

to Linux Kernel Mailing List, net...@vger.kernel.org

Hi,
since 2.6.16rc1-git3, e100 dies on resume (regardless if from disk, ram
or
runtime powermanagement). Unfortunately i only have a bad photo of
the oops right now, it is available from
https://bugzilla.novell.com/attachment.cgi?id=64761&action=view
I have reproduced this on a second e100 machine and can get a serial
console log from this machine tomorrow if needed.
It did resume fine with 2.6.15-git12
--
Stefan Seyfried \ "I didn't want to write for pay. I
QA / R&D Team Mobile Devices \ wanted to be paid for what I write.
"
SUSE LINUX Products GmbH, Nürnberg \ -- Leonard Co
hen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Mattia Dongili

unread,

Jan 24, 2006, 6:22:05 PM1/24/06

to Stefan Seyfried, Linux Kernel Mailing List, net...@vger.kernel.org

On Tue, Jan 24, 2006 at 11:59:19PM +0100, Stefan Seyfried wrote:
> Hi,
> since 2.6.16rc1-git3, e100 dies on resume (regardless if from disk, ram or
> runtime powermanagement). Unfortunately i only have a bad photo of
> the oops right now, it is available from
> https://bugzilla.novell.com/attachment.cgi?id=64761&action=view
> I have reproduced this on a second e100 machine and can get a serial
> console log from this machine tomorrow if needed.
> It did resume fine with 2.6.15-git12

I experienced the same today, I was planning to get a photo tomorrow :)
I'm running 2.6.16-rc1-mm2 and the last working kernel was 2.6.15-mm4
(didn't try .16-rc1-mm1 being scared of the reiserfs breakage).

--
mattia
:wq!

Olaf Kirch

unread,

Jan 25, 2006, 4:03:39 AM1/25/06

to Stefan Seyfried, Linux Kernel Mailing List, net...@vger.kernel.org

On Wed, Jan 25, 2006 at 12:21:42AM +0100, Mattia Dongili wrote:
> I experienced the same today, I was planning to get a photo tomorrow :)
> I'm running 2.6.16-rc1-mm2 and the last working kernel was 2.6.15-mm4
> (didn't try .16-rc1-mm1 being scared of the reiserfs breakage).

I think that's because the latest driver version wants to wait for
the ucode download, and e100_exec_cb_wait before allocating any
control blocks.

static inline int e100_exec_cb_wait(struct nic *nic, struct sk_buff *skb,
void (*cb_prepare)(struct nic *, struct cb *, struct sk_buff *))
{
int err = 0, counter = 50;
struct cb *cb = nic->cb_to_clean;

if ((err = e100_exec_cb(nic, NULL, e100_setup_ucode)))
DPRINTK(PROBE,ERR, "ucode cmd failed with error %d\n", err);
/* NOTE: the oops shows that e100_exec_cb fails with ENOMEM,
* which also means there are no cbs */

/* ... other stuff...
* and then we die here because cb is NULL: */
while (!(cb->status & cpu_to_le16(cb_complete))) {
msleep(10);
if (!--counter) break;
}

I'm not sure what the right fix would be. e100_resume would probably
have to call e100_alloc_cbs early on, while e100_up should avoid
calling it a second time if nic->cbs_avail != 0. A tentative patch
for testing is attached.

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
ok...@suse.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

e100-resume-fix

Olaf Kirch

unread,

Jan 25, 2006, 7:11:52 AM1/25/06

to Stefan Seyfried, Linux Kernel Mailing List, net...@vger.kernel.org

On Wed, Jan 25, 2006 at 10:02:40AM +0100, Olaf Kirch wrote:
> I'm not sure what the right fix would be. e100_resume would probably
> have to call e100_alloc_cbs early on, while e100_up should avoid
> calling it a second time if nic->cbs_avail != 0. A tentative patch
> for testing is attached.

Reportedly, the patch fixes the crash on resume.

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
ok...@suse.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

Howard Chu

unread,

Jan 25, 2006, 8:52:44 AM1/25/06

to Linux Kernel Mailing List, hanc...@shaw.ca

Robert Hancock wrote:
> Howard Chu wrote:
> > POSIX requires a reschedule to occur, as noted here:
> > http://blog.firetree.net/2005/06/22/thread-yield-after-mutex-unlock/
>
> No, it doesn't:
>
> >
> > The relevant SUSv3 text is here
> >
http://www.opengroup.org/onlinepubs/000095399/functions/pthread_mutex_unlock.html

>
> "If there are threads blocked on the mutex object referenced by mutex
> when pthread_mutex_unlock() is called, resulting in the mutex becoming
> available, the scheduling policy shall determine which thread shall
> acquire the mutex."
>
> This says nothing about requiring a reschedule. The "scheduling policy"
> can well decide that the thread which just released the mutex can
> re-acquire it.

No, because the thread that just released the mutex is obviously not one
of the threads blocked on the mutex. When a mutex is unlocked, one of
the *waiting* threads at the time of the unlock must acquire it, and the
scheduling policy can determine that. But the thread the released the
mutex is not one of the waiting threads, and is not eligible for
consideration.

> > I suppose if pthread_mutex_unlock() actually behaved correctly we
could
> > remove the other sched_yield() hacks that didn't belong there in the
> > first place and go on our merry way.
>
> Generally, needing to implement hacks like this is a sign that there are
> problems with the synchronization design of the code (like a mutex which
> has excessive contention). Programs should not rely on the scheduling
> behavior of the kernel for proper operation when that behavior is not
> defined.
>
> --
> Robert Hancock Saskatoon, SK, Canada
> To email, remove "nospam" from hanc...@nospamshaw.ca
> Home Page: http://www.roberthancock.com/

--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc
OpenLDAP Core Team http://www.openldap.org/project/

Robert Hancock

unread,

Jan 25, 2006, 9:40:59 AM1/25/06

to Howard Chu, Linux Kernel Mailing List

Howard Chu wrote:
> No, because the thread that just released the mutex is obviously not one
> of the threads blocked on the mutex. When a mutex is unlocked, one of
> the *waiting* threads at the time of the unlock must acquire it, and the
> scheduling policy can determine that. But the thread the released the
> mutex is not one of the waiting threads, and is not eligible for
> consideration.

That statement does not imply that any reschedule needs to happen at the
time of the mutex unlock at all, only that the other threads waiting on
the mutex can attempt to reacquire it when the scheduler allows them to.
In all likelihood, what tends to happen is that either the thread that
had the mutex previously still has time left in its timeslice and is
allowed to keep running and reacquire the mutex, or another thread is
woken up (perhaps on another CPU) but doesn't reacquire the mutex before
the original thread carries on and acquires it, and therefore goes back
to sleep.

Forcing the mutex to ping-pong between different threads would be quite
inefficient (especially on SMP machines), and is not something that
POSIX requires.

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from hanc...@nospamshaw.ca
Home Page: http://www.roberthancock.com/

Christopher Friesen

unread,

Jan 25, 2006, 12:51:01 PM1/25/06

to Howard Chu, Linux Kernel Mailing List, hanc...@shaw.ca

Howard Chu wrote:
>
> Robert Hancock wrote:

> > This says nothing about requiring a reschedule. The "scheduling policy"
> > can well decide that the thread which just released the mutex can
> > re-acquire it.
>
> No, because the thread that just released the mutex is obviously not one
> of the threads blocked on the mutex. When a mutex is unlocked, one of
> the *waiting* threads at the time of the unlock must acquire it, and the
> scheduling policy can determine that. But the thread the released the
> mutex is not one of the waiting threads, and is not eligible for
> consideration.

Is it *required* that the new owner of the mutex is determined at the
time of mutex release?

If the kernel doesn't actually determine the new owner of the mutex
until the currently running thread swaps out, it would be possible for
the currently running thread to re-aquire the mutex.

Chris

Howard Chu

unread,

Jan 25, 2006, 1:38:12 PM1/25/06

to Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

Christopher Friesen wrote:
> Howard Chu wrote:
>>
>> Robert Hancock wrote:
>
>> > This says nothing about requiring a reschedule. The "scheduling
>> policy"
>> > can well decide that the thread which just released the mutex can
>> > re-acquire it.
>>
>> No, because the thread that just released the mutex is obviously not
>> one of the threads blocked on the mutex. When a mutex is unlocked,
>> one of the *waiting* threads at the time of the unlock must acquire
>> it, and the scheduling policy can determine that. But the thread the
>> released the mutex is not one of the waiting threads, and is not
>> eligible for consideration.
>
> Is it *required* that the new owner of the mutex is determined at the
> time of mutex release?
>
> If the kernel doesn't actually determine the new owner of the mutex
> until the currently running thread swaps out, it would be possible for
> the currently running thread to re-aquire the mutex.

The SUSv3 text seems pretty clear. It says "WHEN pthread_mutex_unlock()
is called, ... the scheduling policy SHALL decide ..." It doesn't say
MAY, and it doesn't say "some undefined time after the call." There is
nothing optional or implementation-defined here. The only thing that is
not explicitly stated is what happens when there are no waiting threads;
in that case obviously the running thread can continue running.

re: forcing the mutex to ping-pong between different threads - if that
is inefficient, then the thread scheduler needs to be tuned differently.
Threads and thread context switches are supposed to be cheap, otherwise
you might as well just program with fork() instead. (And of course, back
when Unix was first developed, *processes* were lightweight, compared to
other extant OSs.)

--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc
OpenLDAP Core Team http://www.openldap.org/project/

-

Nick Piggin

unread,

Jan 25, 2006, 2:00:45 PM1/25/06

to Howard Chu, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

But it doesn't say the unlocking thread must yield to the new mutex
owner, only that the scheduling policy shall determine the which
thread aquires the lock.

It doesn't say that decision must be made immediately, either (eg.
it could be made as a by product of which contender is chosen to run
next).

I think the intention of the wording is that for deterministic policies,
it is clear that the waiting threads are actually worken and reevaluated
for scheduling. In the case of SCHED_OTHER, it means basically nothing,
considering the scheduling policy is arbitrary.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

Howard Chu

unread,

Jan 25, 2006, 2:36:39 PM1/25/06

to Nick Piggin, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

Nick Piggin wrote:

> Howard Chu wrote:
>> The SUSv3 text seems pretty clear. It says "WHEN
>> pthread_mutex_unlock() is called, ... the scheduling policy SHALL
>> decide ..." It doesn't say MAY, and it doesn't say "some undefined
>> time after the call." There is nothing optional or
>> implementation-defined here. The only thing that is not explicitly
>> stated is what happens when there are no waiting threads; in that
>> case obviously the running thread can continue running.
>>
>
> But it doesn't say the unlocking thread must yield to the new mutex
> owner, only that the scheduling policy shall determine the which
> thread aquires the lock.

True, the unlocking thread doesn't have to yield to the new mutex owner
as a direct consequence of the unlock. But logically, if the unlocking
thread subsequently calls mutex_lock, it must block, because some other
thread has already been assigned ownership of the mutex.

> It doesn't say that decision must be made immediately, either (eg.
> it could be made as a by product of which contender is chosen to run
> next).

A straightforward reading of the language here says the decision happens
"when pthread_mutex_unlock() is called" and not at any later time. There
is nothing here to support your interpretation.

>
> I think the intention of the wording is that for deterministic policies,
> it is clear that the waiting threads are actually worken and reevaluated
> for scheduling. In the case of SCHED_OTHER, it means basically nothing,
> considering the scheduling policy is arbitrary.
>

Clearly the point is that one of the waiting threads is waken and gets
the mutex, and it doesn't matter which thread is chosen. I.e., whatever
thread the scheduling policy chooses. The fact that SCHED_OTHER can
choose arbitrarily is immaterial, it still can only choose one of the
waiting threads.

The fact that SCHED_OTHER's scheduling behavior is undefined is not free
license to implement whatever you want. Scheduling policies are an
optional feature; the basic thread behavior must still be consistent
even on systems that don't implement scheduling policies.

--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc
OpenLDAP Core Team http://www.openldap.org/project/

-

Jesse Brandeburg

unread,

Jan 25, 2006, 2:38:13 PM1/25/06

to Olaf Kirch, Stefan Seyfried, Linux Kernel Mailing List, net...@vger.kernel.org

On 1/25/06, Olaf Kirch <ok...@suse.de> wrote:
> On Wed, Jan 25, 2006 at 10:02:40AM +0100, Olaf Kirch wrote:
> > I'm not sure what the right fix would be. e100_resume would probably
> > have to call e100_alloc_cbs early on, while e100_up should avoid
> > calling it a second time if nic->cbs_avail != 0. A tentative patch
> > for testing is attached.
>
> Reportedly, the patch fixes the crash on resume.

Cool, thanks for the research, I have a concern about this however.

its an interesting patch, but it raises the question why does
e100_init_hw need to be called at all in resume? I looked back
through our history and that init_hw call has always been there. I
think its incorrect, but its taking me a while to set up a system with
the ability to resume.

everywhere else in the driver alloc_cbs is called before init_hw so it
just seems like a long standing bug.

comments? anyone want to test? i compile tested this, but it is untested.

e100_resume_no_init.diff

Olaf Kirch

unread,

Jan 25, 2006, 3:15:19 PM1/25/06

to Jesse Brandeburg, Stefan Seyfried, Linux Kernel Mailing List, net...@vger.kernel.org

On Wed, Jan 25, 2006 at 11:37:40AM -0800, Jesse Brandeburg wrote:
> its an interesting patch, but it raises the question why does
> e100_init_hw need to be called at all in resume? I looked back
> through our history and that init_hw call has always been there. I
> think its incorrect, but its taking me a while to set up a system with
> the ability to resume.

I'll ask the folks here to give it a try tomorrow. But I suspect at
least some of it will be needed. For instance I assume you'll
have to reload to ucode when bringing the NIC back from sleep.

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
ok...@suse.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

Lee Revell

unread,

Jan 25, 2006, 4:07:27 PM1/25/06

to Howard Chu, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

On Wed, 2006-01-25 at 10:26 -0800, Howard Chu wrote:
> The SUSv3 text seems pretty clear. It says "WHEN
> pthread_mutex_unlock()
> is called, ... the scheduling policy SHALL decide ..." It doesn't say
> MAY, and it doesn't say "some undefined time after the call."

This does NOT require pthread_mutex_unlock() to cause the scheduler to
immediately pick a new runnable process. It only says it's up the the
scheduling POLICY what to do. The policy could be "let the unlocking
thread finish its timeslice then reschedule".

Lee

Howard Chu

unread,

Jan 25, 2006, 5:16:08 PM1/25/06

to Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

Lee Revell wrote:
> On Wed, 2006-01-25 at 10:26 -0800, Howard Chu wrote:
>
>> The SUSv3 text seems pretty clear. It says "WHEN
>> pthread_mutex_unlock()
>> is called, ... the scheduling policy SHALL decide ..." It doesn't say
>> MAY, and it doesn't say "some undefined time after the call."
>>
>
> This does NOT require pthread_mutex_unlock() to cause the scheduler to
> immediately pick a new runnable process. It only says it's up the the
> scheduling POLICY what to do. The policy could be "let the unlocking
> thread finish its timeslice then reschedule".
>

This is obviously some very old ground.

http://groups.google.com/groups?threadm=etai7.108188%24B37.2381726%40news1.rdc1.bc.home.com

Kaz's post clearly interprets the POSIX spec differently from you. The
policy can decide *which of the waiting threads* gets the mutex, but the
releasing thread is totally out of the picture. For good or bad, the
current pthread_mutex_unlock() is not POSIX-compliant. Now then, if
we're forced to live with that, for efficiency's sake, that's OK,
assuming that valid workarounds exist, such as inserting a sched_yield()
after the unlock.

http://groups.google.com/group/comp.programming.threads/msg/16c01eac398a1139?hl=en&

But then we have to deal with you folks' bizarre notion that
sched_yield() can legitimately be a no-op, which also defies the POSIX
spec. Again, in SUSv3 "The /sched_yield/() function shall force the
running thread to relinquish the processor until it again becomes the
head of its thread list. It takes no arguments." There is no language
here saying "sched_yield *may* do nothing at all." There are of course
cases where it will have no effect, such as when called in a
single-threaded program, but those are the exceptions that define the
rule. Otherwise, the expectation is that some other runnable thread will
acquire the CPU. Again, note that sched_yield() is a core function of
the Threads specification, while scheduling policies are an optional
feature. The function's core behavior (give up the CPU and make some
other runnable thread run) is invariant; the current thread gives up the
CPU regardless of which scheduling policy is in effect or even if
scheduling policies are implemented at all. The only behavior that's
open to implementors is which *of the other runnable threads* is chosen
to take the place of the current thread.

--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc
OpenLDAP Core Team http://www.openldap.org/project/

-

Jesse Brandeburg

unread,

Jan 25, 2006, 5:29:09 PM1/25/06

to Olaf Kirch, Stefan Seyfried, Linux Kernel Mailing List, net...@vger.kernel.org

On 1/25/06, Olaf Kirch <ok...@suse.de> wrote:

> On Wed, Jan 25, 2006 at 11:37:40AM -0800, Jesse Brandeburg wrote:
> > its an interesting patch, but it raises the question why does
> > e100_init_hw need to be called at all in resume? I looked back
> > through our history and that init_hw call has always been there. I
> > think its incorrect, but its taking me a while to set up a system with
> > the ability to resume.
>
> I'll ask the folks here to give it a try tomorrow. But I suspect at
> least some of it will be needed. For instance I assume you'll
> have to reload to ucode when bringing the NIC back from sleep.

I totally agree thats what it looks like, but unless I'm missing
something e100_up will take care of everything, and if the interface
is not up, e100_open->e100_up afterward will take care of it.

we have to be really careful about what might happen when resuming on
a system with a SMBUS link to a BMC, as there are some tricky
transitions in the hardware that can be easily violated.

Jesse

Robert Hancock

unread,

Jan 25, 2006, 7:12:01 PM1/25/06

to Howard Chu, Christopher Friesen, Linux Kernel Mailing List

Howard Chu wrote:
> The SUSv3 text seems pretty clear. It says "WHEN pthread_mutex_unlock()
> is called, ... the scheduling policy SHALL decide ..." It doesn't say
> MAY, and it doesn't say "some undefined time after the call." There is
> nothing optional or implementation-defined here. The only thing that is
> not explicitly stated is what happens when there are no waiting threads;
> in that case obviously the running thread can continue running.

It says the scheduling policy will decide who gets the mutex. It does
not say that such a decision must be made immediately. That seems rather
implementation defined to me.

>
> re: forcing the mutex to ping-pong between different threads - if that
> is inefficient, then the thread scheduler needs to be tuned differently.
> Threads and thread context switches are supposed to be cheap, otherwise
> you might as well just program with fork() instead. (And of course, back
> when Unix was first developed, *processes* were lightweight, compared to
> other extant OSs.)

This is nothing to do with the thread scheduler being inefficient. It is
inherently inefficient to context-switch repeatedly no matter how good
the kernel is. It trashes the CPU pipeline, at the very least, can cause
thrashing of the CPU caches, and can cause cache lines to be pushed back
and forth across the bus on SMP machines which really kills performance.

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from hanc...@nospamshaw.ca
Home Page: http://www.roberthancock.com/

Robert Hancock

unread,

Jan 25, 2006, 7:19:56 PM1/25/06

to Howard Chu, Lee Revell, Christopher Friesen, Linux Kernel Mailing List

Howard Chu wrote:
> Kaz's post clearly interprets the POSIX spec differently from you. The
> policy can decide *which of the waiting threads* gets the mutex, but the
> releasing thread is totally out of the picture. For good or bad, the
> current pthread_mutex_unlock() is not POSIX-compliant. Now then, if
> we're forced to live with that, for efficiency's sake, that's OK,
> assuming that valid workarounds exist, such as inserting a sched_yield()
> after the unlock.
>
> http://groups.google.com/group/comp.programming.threads/msg/16c01eac398a1139?hl=en&

Did you read the rest of this post?

"In any event, all the mutex fairness in the world won't solve the
problem. Consider if this lock/unlock cycle is inside a larger
lock/unlock cycle. Yielding at the unlock or blocking at the lock will
increase the dreadlock over the larger mutex.

The fact is, the threads library can't read the programmer's mind. So
it shouldn't try to, especially if that makes the common cases much
worse for the benefit of excruciatingly rare cases."

And earlier in that thread ("old behavior" referring to an old
LinuxThreads version which allowed "unfair" locking):

"Notice however that even the old "unfair" behavior is perfectly
acceptable with respect to the POSIX standard: for the default
scheduling policy, POSIX makes no guarantees of fairness, such as "the
thread waiting for the mutex for the longest time always acquires it
first". Properly written multithreaded code avoids that kind of heavy
contention on mutexes, and does not run into fairness problems. If you
need scheduling guarantees, you should consider using the real-time
scheduling policies SCHED_RR and SCHED_FIFO, which have precisely
defined scheduling behaviors. "

If you indeed have some thread which is trying to do an essentially
infinite amount of work, you really should not have that thread locking
a mutex, which other threads need to acquire, for a large part of each
cycle. Correctness aside, this is simply not efficient.

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from hanc...@nospamshaw.ca
Home Page: http://www.roberthancock.com/

Jesse Brandeburg

unread,

Jan 25, 2006, 7:29:21 PM1/25/06

to Olaf Kirch, Stefan Seyfried, Linux Kernel Mailing List, net...@vger.kernel.org

Okay I reproduced the issue on 2.6.15.1 (with S1 sleep) and was able
to show that my patch that just removes e100_init_hw works okay for
me. Let me know how it goes for you, I think this is a good fix.

Jesse

Lee Revell

unread,

Jan 25, 2006, 8:04:51 PM1/25/06

to Howard Chu, Robert Hancock, Christopher Friesen, Linux Kernel Mailing List

On Wed, 2006-01-25 at 16:49 -0800, Howard Chu wrote:
> Basic "fairness" isn't the issue. Fairness is concerned with which of
> *multiple waiting threads* gets the mutex, and that is certainly
> irrelevant here. The issue is that the releasing thread should not be
> a candidate.
>

You seem to be making 2 controversial assertions:

1. pthread_mutex_unlock must cause an immediate reschedule if other
threads are blocked on the mutex, and
2. if the unlocking thread immediately tries to relock the mutex,
another thread must get it first

I disagree with #1, which makes #2 irrelevant. It would lead to
obviously incorrect behavior, pthread_mutex_unlock would no longer be an
RT safe operation for example.

Also consider a SCHED_FIFO policy - static priorities and the scheduler
always runs the highest priority runnable thread - under your
interpretation of POSIX a high priority thread unlocking a mutex would
require the scheduler to run a lower priority thread which violates
SCHED_FIFO semantics.

Lee

Howard Chu

unread,

Jan 25, 2006, 8:07:20 PM1/25/06

to Robert Hancock, Lee Revell, Christopher Friesen, Linux Kernel Mailing List

Robert Hancock wrote:
> Howard Chu wrote:
>> Kaz's post clearly interprets the POSIX spec differently from you.
>> The policy can decide *which of the waiting threads* gets the mutex,
>> but the releasing thread is totally out of the picture. For good or
>> bad, the current pthread_mutex_unlock() is not POSIX-compliant. Now
>> then, if we're forced to live with that, for efficiency's sake,
>> that's OK, assuming that valid workarounds exist, such as inserting a
>> sched_yield() after the unlock.
>>
>> http://groups.google.com/group/comp.programming.threads/msg/16c01eac398a1139?hl=en&
>
>
> Did you read the rest of this post?
>
> "In any event, all the mutex fairness in the world won't solve the
> problem. Consider if this lock/unlock cycle is inside a larger
> lock/unlock cycle. Yielding at the unlock or blocking at the lock will
> increase the dreadlock over the larger mutex.

Basic "fairness" isn't the issue. Fairness is concerned with which of

*multiple waiting threads* gets the mutex, and that is certainly
irrelevant here. The issue is that the releasing thread should not be a
candidate.

The mutex functions are a core part of the thread specification; they
have a fundamental behavior, and the definition says if there are
blocked threads waiting on a mutex when it gets unlocked, one of the
waiting threads gets the mutex. Which of the waiting threads gets it is
unspecified in the core spec. On a system that implements the scheduling
option, the scheduling policy specifies which thread. The scheduling
policy is an optional feature, it serves only to refine the core
functionality. A program written to the basic core specification should
not break when run in an environment that implements optional features.

The spec may be mandating a non-optimal behavior, but that's a
side-issue - someone should file an objection with the Open Group to get
it redefined if it's such a bad idea. But for now, the NPTL
implementation is non-conformant.

Standards aren't just academic exercises. They're meant to be useful. If
the standard is too thinly specified, is ambiguous, or allows
nonsensical behavior, it's not useful and should be fixed at the source,
not just ignored and papered over in implementations.

--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc
OpenLDAP Core Team http://www.openldap.org/project/

-

David Schwartz

unread,

Jan 25, 2006, 8:09:35 PM1/25/06

to Linux Kernel Mailing List, hanc...@shaw.ca

> Robert Hancock wrote:

> > "If there are threads blocked on the mutex object referenced by mutex
> > when pthread_mutex_unlock() is called, resulting in the mutex becoming
> > available, the scheduling policy shall determine which thread shall
> > acquire the mutex."
> >
> > This says nothing about requiring a reschedule. The "scheduling policy"
> > can well decide that the thread which just released the mutex can
> > re-acquire it.

> No, because the thread that just released the mutex is obviously not one
> of the threads blocked on the mutex.

So what?

> When a mutex is unlocked, one of
> the *waiting* threads at the time of the unlock must acquire it, and the
> scheduling policy can determine that.

This is false and is nowhere found in the standard.

> But the thread the released the
> mutex is not one of the waiting threads, and is not eligible for
> consideration.

Where are you getting this from? Nothing requires the scheduler to schedule
any threads when the mutex is released.

All that must happen is that the mutex must be unlocked. The scheduler is
permitted to allow any thread it wants to run at that point, or no thread.
Nothing says the thread that released the mutex can't continue running and
nothing says that it can't call pthread_mutex_lock and re-acquire the mutex
before any other thread gets around to getting it.

In general, it is very bad karma for the scheduler to stop a thread before
its timeslice is up if it doesn't have to. Consider one CPU and two threads,
each needing to do 100 quick lock/unlock cycles. Why force 200 context
switches?

DS

Howard Chu

unread,

Jan 25, 2006, 8:38:08 PM1/25/06

to Lee Revell, Robert Hancock, Christopher Friesen, Linux Kernel Mailing List

Lee Revell wrote:
> On Wed, 2006-01-25 at 16:49 -0800, Howard Chu wrote:
>
>> Basic "fairness" isn't the issue. Fairness is concerned with which of
>> *multiple waiting threads* gets the mutex, and that is certainly
>> irrelevant here. The issue is that the releasing thread should not be
>> a candidate.
>>
>>
>
> You seem to be making 2 controversial assertions:
>
> 1. pthread_mutex_unlock must cause an immediate reschedule if other
> threads are blocked on the mutex, and
> 2. if the unlocking thread immediately tries to relock the mutex,
> another thread must get it first
>
> I disagree with #1, which makes #2 irrelevant. It would lead to
> obviously incorrect behavior, pthread_mutex_unlock would no longer be an
> RT safe operation for example.
>

Actually no, I see that #1 is unnecessary, and already acknowledged as such
http://groups.google.com/group/fa.linux.kernel/msg/89da66017d53d496

But #2 still holds.

> Also consider a SCHED_FIFO policy - static priorities and the scheduler
> always runs the highest priority runnable thread - under your
> interpretation of POSIX a high priority thread unlocking a mutex would
> require the scheduler to run a lower priority thread which violates
> SCHED_FIFO semantics

See the Mutex Initialization Scheduling Attributes section which
specifically addresses priority inversion:
http://www.opengroup.org/onlinepubs/000095399/xrat/xsh_chap02.html#tag_03_02_09

If point #2 were not true, then there would be no need to bother with
any of that. Instead that text ends with "it is important that
IEEE Std 1003.1-2001 provide these interfaces for those cases in which
it is necessary."

--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc
OpenLDAP Core Team http://www.openldap.org/project/

-

David Schwartz

unread,

Jan 25, 2006, 9:06:57 PM1/25/06

to Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

> Kaz's post clearly interprets the POSIX spec differently from you. The
> policy can decide *which of the waiting threads* gets the mutex, but the
> releasing thread is totally out of the picture. For good or bad, the
> current pthread_mutex_unlock() is not POSIX-compliant. Now then, if
> we're forced to live with that, for efficiency's sake, that's OK,
> assuming that valid workarounds exist, such as inserting a sched_yield()
> after the unlock.

My thanks to David Hopwood for providing me with the definitive refutation
of this position. The response is that the as-if rules allows the
implementation to violate the specification internally provided no compliant
application could tell the difference.

When you call 'pthread_mutex_lock', there is no guarantee regarding how
long it will or might take until you are actually waiting for the mutex. So
no conforming application can ever tell whether or not it is waiting for the
mutex or about to wait for the mutex.

So you cannot write an application that can tell the difference.

His exact quote is, "It could have been the case that the other threads ran
more slowly, so that they didn't reach the point of blocking on the mutex
before the pthread_mutex_unlock()."

You can find it on comp.programming.threads if you like.

DS

Mark Lord

unread,

Jan 25, 2006, 9:49:22 PM1/25/06

to dav...@webmaster.com, Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

David Schwartz wrote:
>> Kaz's post clearly interprets the POSIX spec differently from you. The
>> policy can decide *which of the waiting threads* gets the mutex, but the
>> releasing thread is totally out of the picture. For good or bad, the
>> current pthread_mutex_unlock() is not POSIX-compliant. Now then, if
>> we're forced to live with that, for efficiency's sake, that's OK,
>> assuming that valid workarounds exist, such as inserting a sched_yield()
>> after the unlock.
>
> My thanks to David Hopwood for providing me with the definitive refutation
> of this position. The response is that the as-if rules allows the
> implementation to violate the specification internally provided no compliant
> application could tell the difference.
>
> When you call 'pthread_mutex_lock', there is no guarantee regarding how
> long it will or might take until you are actually waiting for the mutex. So
> no conforming application can ever tell whether or not it is waiting for the
> mutex or about to wait for the mutex.
>
> So you cannot write an application that can tell the difference.

Not true. The code for the relinquishing thread could indeed tell the difference.

-ml

David Schwartz

unread,

Jan 25, 2006, 10:31:30 PM1/25/06

to lk...@rtr.ca, Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

> > So you cannot write an application that can tell the difference.

> Not true. The code for the relinquishing thread could indeed
> tell the difference.
>
> -ml

It can tell the difference between the other thread getting the mutex first
and it getting the mutex first. But it cannot tell the difference between an
implementation that puts random sleeps before calls to 'pthread_mutex_lock'
and an implementation that has the allegedly non-compliant behavior. That
makes the behavior compliant under the 'as-if' rule.

If you don't believe me, try to write a program that prints 'non-compliant'
on a system that has the alleged non-compliance but is guaranteed not to do
so on any compliant system. It cannot be done.

In order to claim the alleged compliance, you would have to know that a
thread waiting for a mutex did not get it. But there is no possible way you
can know that another thread is waiting for the mutex (as opposed to being
about to wait for it). So you can never detect the claimed non-compliance,
so it's not non-compliance.

This is definitive, really. It 100% refutes the claim.

DS

Samuel Masham

unread,

Jan 25, 2006, 10:49:36 PM1/25/06

to dav...@webmaster.com, lk...@rtr.ca, Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

On 26/01/06, David Schwartz <dav...@webmaster.com> wrote:
>
> > > So you cannot write an application that can tell the difference.
>
> > Not true. The code for the relinquishing thread could indeed
> > tell the difference.
> >
> > -ml
>
> It can tell the difference between the other thread getting the mutex first
> and it getting the mutex first. But it cannot tell the difference between an
> implementation that puts random sleeps before calls to 'pthread_mutex_lock'
> and an implementation that has the allegedly non-compliant behavior. That
> makes the behavior compliant under the 'as-if' rule.
>
> If you don't believe me, try to write a program that prints 'non-compliant'
> on a system that has the alleged non-compliance but is guaranteed not to do
> so on any compliant system. It cannot be done.

Just putting priority inheritance on then in the running thread check
your priority, if it goes up then the waiting thread in really
waiting.

Then if you can release + get the lock again its non compliant.... no?

ie pthread_mutexattr_setprotocol(pthread_mutexattr_t *attr, int
protocol); with PTHREAD_PRIO_INHERIT

comment:
As a rt person I don't like the idea of scheduler bounce so the way
round seems to be have the mutex lock acquiring work on a FIFO like
basis.

Samuel Masham

unread,

Jan 25, 2006, 11:03:28 PM1/25/06

to dav...@webmaster.com, lk...@rtr.ca, Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

On 26/01/06, Samuel Masham <samuel...@gmail.com> wrote:
> comment:
> As a rt person I don't like the idea of scheduler bounce so the way
> round seems to be have the mutex lock acquiring work on a FIFO like
> basis.

which is obviously wrong...

Howeve my basic point stands but needs to be clarified a bit:

I think I can print non-compliant if the mutex acquisition doesn't
respect the higher priority of the waiter over the current process
even if the mutex is "available".

OK?

Lee Revell

unread,

Jan 25, 2006, 11:54:50 PM1/25/06

to Samuel Masham, dav...@webmaster.com, lk...@rtr.ca, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

On Thu, 2006-01-26 at 13:02 +0900, Samuel Masham wrote:
> On 26/01/06, Samuel Masham <samuel...@gmail.com> wrote:
> > comment:
> > As a rt person I don't like the idea of scheduler bounce so the way
> > round seems to be have the mutex lock acquiring work on a FIFO like
> > basis.
>
> which is obviously wrong...
>
> Howeve my basic point stands but needs to be clarified a bit:
>
> I think I can print non-compliant if the mutex acquisition doesn't
> respect the higher priority of the waiter over the current process
> even if the mutex is "available".
>
> OK?

I don't think using an optional feature (PI) counts...

Lee

Samuel Masham

unread,

Jan 26, 2006, 1:15:30 AM1/26/06

to Lee Revell, dav...@webmaster.com, lk...@rtr.ca, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

On 26/01/06, Lee Revell <rlre...@joe-job.com> wrote:
> On Thu, 2006-01-26 at 13:02 +0900, Samuel Masham wrote:
> > On 26/01/06, Samuel Masham <samuel...@gmail.com> wrote:
> > > comment:
> > > As a rt person I don't like the idea of scheduler bounce so the way
> > > round seems to be have the mutex lock acquiring work on a FIFO like
> > > basis.
> >
> > which is obviously wrong...
> >
> > Howeve my basic point stands but needs to be clarified a bit:
> >
> > I think I can print non-compliant if the mutex acquisition doesn't
> > respect the higher priority of the waiter over the current process
> > even if the mutex is "available".
> >
> > OK?
>
> I don't think using an optional feature (PI) counts...
>
> Lee

So when acquiring a mutex with pi enabled must involve scheduler...

.. and you can skip that bit with it disabled as one can argue that
the user can't tell if the time slice hit between the call to acquire
the mutex and the actual mutex wait itself?

sounds a bit of a fudge to me....

I assume that mutexes will must never support a the wchan (proc)
interface or the like?

On the other hand the basic point about high contention around mutexes
and relying on this being a bad idea is fine by me.

Samuel

Helge Hafting

unread,

Jan 26, 2006, 3:25:49 AM1/26/06

to dav...@webmaster.com, Linux Kernel Mailing List, hanc...@shaw.ca

David Schwartz wrote:

>>Robert Hancock wrote:
>>
>>
>>But the thread the released the
>>mutex is not one of the waiting threads, and is not eligible for
>>consideration.
>>
>>
>
> Where are you getting this from? Nothing requires the scheduler to schedule
>any threads when the mutex is released.
>
>

Correct.

> All that must happen is that the mutex must be unlocked. The scheduler is
>permitted to allow any thread it wants to run at that point, or no thread.
>Nothing says the thread that released the mutex can't continue running and
>
>

Correct. The releasing thread may keep running.

>nothing says that it can't call pthread_mutex_lock and re-acquire the mutex
>before any other thread gets around to getting it.
>
>

Wrong.
The spec says that the mutex must be given to a waiter (if any) at the
moment of release. The waiter don't have to be scheduled at that
point, it may keep sleeping with its freshly unlocked mutex. So the
unlocking thread may continue - but if it tries to reaquire the mutex
it will find the mutex taken and go to sleep at that point. Then other
processes will schedule, and at some time the one now owning the mutex
will wake up and do its work.

> In general, it is very bad karma for the scheduler to stop a thread before
>its timeslice is up if it doesn't have to. Consider one CPU and two threads,
>each needing to do 100 quick lock/unlock cycles. Why force 200 context
>switches?
>

Good point, except it is a strange program that do this. Lock the mutex
once,
do 100 operations, then unlock is the better way. :-)

Helge Hafting

Nick Piggin

unread,

Jan 26, 2006, 3:51:39 AM1/26/06

to Howard Chu, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

Howard Chu wrote:
> Nick Piggin wrote:
>
>> Howard Chu wrote:
>>
>>> The SUSv3 text seems pretty clear. It says "WHEN
>>> pthread_mutex_unlock() is called, ... the scheduling policy SHALL
>>> decide ..." It doesn't say MAY, and it doesn't say "some undefined
>>> time after the call." There is nothing optional or
>>> implementation-defined here. The only thing that is not explicitly
>>> stated is what happens when there are no waiting threads; in that
>>> case obviously the running thread can continue running.
>>>
>>
>> But it doesn't say the unlocking thread must yield to the new mutex
>> owner, only that the scheduling policy shall determine the which
>> thread aquires the lock.
>
>
> True, the unlocking thread doesn't have to yield to the new mutex owner
> as a direct consequence of the unlock. But logically, if the unlocking
> thread subsequently calls mutex_lock, it must block, because some other
> thread has already been assigned ownership of the mutex.
>
>> It doesn't say that decision must be made immediately, either (eg.
>> it could be made as a by product of which contender is chosen to run
>> next).
>
>
> A straightforward reading of the language here says the decision happens
> "when pthread_mutex_unlock() is called" and not at any later time. There
> is nothing here to support your interpretation.
>

OK, so what happens if my scheduling policy decides _right then_, that
the next _running_ thread that was being blocked on or tries to aquire
the mutex, is the next owner?

This is the logical way for a *scheduling* policy to determine which
thread gets the mutex. I don't know any other way that the scheduling
policy could determine the next thread to get the mutex.

>>
>> I think the intention of the wording is that for deterministic policies,
>> it is clear that the waiting threads are actually worken and reevaluated
>> for scheduling. In the case of SCHED_OTHER, it means basically nothing,
>> considering the scheduling policy is arbitrary.
>>
> Clearly the point is that one of the waiting threads is waken and gets
> the mutex, and it doesn't matter which thread is chosen. I.e., whatever
> thread the scheduling policy chooses. The fact that SCHED_OTHER can
> choose arbitrarily is immaterial, it still can only choose one of the
> waiting threads.
>

I don't know that it exactly says one of the waiting threads must get the
mutex.

> The fact that SCHED_OTHER's scheduling behavior is undefined is not free
> license to implement whatever you want. Scheduling policies are an
> optional feature; the basic thread behavior must still be consistent
> even on systems that don't implement scheduling policies.
>

It just so happens that normal tasks in Linux run in SCHED_OTHER. It
is irrelevant whether it might be an optional feature or not.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

-

Nick Piggin

unread,

Jan 26, 2006, 3:55:12 AM1/26/06

to Howard Chu, Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

How many times have we been over this? What do you think the "head of
its thread list" might mean?

> here saying "sched_yield *may* do nothing at all." There are of course

There is language saying SCHED_OTHER is arbitrary, including how the
thread list is implemented and how a task might become on the head of
it.

They obviously don't need to redefine exactly what sched_yield may do
under each scheduling policy, do they?

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

-

Nick Piggin

unread,

Jan 26, 2006, 4:01:45 AM1/26/06

to Helge Hafting, dav...@webmaster.com, Linux Kernel Mailing List, hanc...@shaw.ca

Helge Hafting wrote:
> David Schwartz wrote:

>> nothing says that it can't call pthread_mutex_lock and re-acquire the
>> mutex
>> before any other thread gets around to getting it.
>>
>>
> Wrong.
> The spec says that the mutex must be given to a waiter (if any) at the
> moment of release.

Repeating myself here...

To me it says that the scheduling policy decides at the moment of release.
What if the scheduling policy decides *right then* to give the mutex to
the next running thread that tries to aquire it?

That would be the logical way for a scheduling policy to decide the next
owner of the mutex.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

-

Pavel Machek

unread,

Jan 26, 2006, 4:33:11 AM1/26/06

to Jesse Brandeburg, Olaf Kirch, Stefan Seyfried, Linux Kernel Mailing List, net...@vger.kernel.org

On St 25-01-06 16:28:48, Jesse Brandeburg wrote:
> On 1/25/06, Jesse Brandeburg <jesse.br...@gmail.com> wrote:
> > On 1/25/06, Olaf Kirch <ok...@suse.de> wrote:
> > > On Wed, Jan 25, 2006 at 10:02:40AM +0100, Olaf Kirch wrote:
> > > > I'm not sure what the right fix would be. e100_resume would probably
> > > > have to call e100_alloc_cbs early on, while e100_up should avoid
> > > > calling it a second time if nic->cbs_avail != 0. A tentative patch
> > > > for testing is attached.
> > >
> > > Reportedly, the patch fixes the crash on resume.
> >
> > Cool, thanks for the research, I have a concern about this however.
> >
> > its an interesting patch, but it raises the question why does
> > e100_init_hw need to be called at all in resume? I looked back
> > through our history and that init_hw call has always been there. I
> > think its incorrect, but its taking me a while to set up a system with
> > the ability to resume.
> >
> > everywhere else in the driver alloc_cbs is called before init_hw so it
> > just seems like a long standing bug.
> >
> > comments? anyone want to test? i compile tested this, but it is untested.
>
> Okay I reproduced the issue on 2.6.15.1 (with S1 sleep) and was able
> to show that my patch that just removes e100_init_hw works okay for
> me. Let me know how it goes for you, I think this is a good fix.

S1 preserves hardware state, .suspend/.resume routines can be NULL for
S1. Try with swsusp or S3.
Pavel
--
Thanks, Sharp!

Nikita Danilov

unread,

Jan 26, 2006, 5:39:12 AM1/26/06

to Howard Chu, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

Howard Chu writes:

[...]

>
> A straightforward reading of the language here says the decision happens
> "when pthread_mutex_unlock() is called" and not at any later time. There
> is nothing here to support your interpretation.
> >
> > I think the intention of the wording is that for deterministic policies,
> > it is clear that the waiting threads are actually worken and reevaluated
> > for scheduling. In the case of SCHED_OTHER, it means basically nothing,
> > considering the scheduling policy is arbitrary.
> >
> Clearly the point is that one of the waiting threads is waken and gets
> the mutex, and it doesn't matter which thread is chosen. I.e., whatever

Note that this behavior directly leads to "convoy formation": if that
woken thread T0 does not immediately run (e.g., because there are higher
priority threads) but still already owns the mutex, then other running
threads contending for this mutex will block waiting for T0, forming a
convoy.

> thread the scheduling policy chooses. The fact that SCHED_OTHER can
> choose arbitrarily is immaterial, it still can only choose one of the
> waiting threads.

Looks like a good time to submit Defect Report to the Open Group.

>
> The fact that SCHED_OTHER's scheduling behavior is undefined is not free
> license to implement whatever you want. Scheduling policies are an
> optional feature; the basic thread behavior must still be consistent
> even on systems that don't implement scheduling policies.
>
> --
> -- Howard Chu

Nikita.

Nikita Danilov

unread,

Jan 26, 2006, 5:45:07 AM1/26/06

to Howard Chu, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

Howard Chu writes:

[...]

>
> But then we have to deal with you folks' bizarre notion that
> sched_yield() can legitimately be a no-op, which also defies the POSIX
> spec. Again, in SUSv3 "The /sched_yield/() function shall force the
> running thread to relinquish the processor until it again becomes the
> head of its thread list. It takes no arguments." There is no language
> here saying "sched_yield *may* do nothing at all." There are of course

As have been pointed to you already, while there is no such language,
the effect may be the same, if --for example-- scheduling policy decides
to put current thread back to "the head of its thread list" immediately
after sched_yield(). Which is a valid behavior for SCHED_OTHER.

Nikita.

Nikita Danilov

unread,

Jan 26, 2006, 5:51:07 AM1/26/06

to Helge Hafting, Linux Kernel Mailing List, hanc...@shaw.ca

Helge Hafting writes:

[...]

>
> >nothing says that it can't call pthread_mutex_lock and re-acquire the mutex
> >before any other thread gets around to getting it.
> >
> >
> Wrong.
> The spec says that the mutex must be given to a waiter (if any) at the
> moment of release. The waiter don't have to be scheduled at that
> point, it may keep sleeping with its freshly unlocked mutex. So the
> unlocking thread may continue - but if it tries to reaquire the mutex
> it will find the mutex taken and go to sleep at that point. Then other

You just described a convoy formation: a phenomenon that all reasonable
mutex implementation try to avoid at all costs. If that's what standard
prescribes---the standard has to be amended.

>
> Helge Hafting

Nikita.

Kyle Moffett

unread,

Jan 26, 2006, 9:15:55 AM1/26/06

to Nick Piggin, Howard Chu, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

Haven't you OpenLDAP guys realized that the pthread model you're
actually looking for is this? POSIX mutexes are not designed to
mandate scheduling requirements *precisely* because this achieves
your scheduling goals by explicitly stating what they are.

s: pthread_mutex_lock(&mutex);
s: pthread_cond_wait(&wake_slave, &mutex);

m: [do some work]
m: pthread_cond_signal(&wake_slave);
m: pthread_cond_wait(&wake_master, &mutex);

s: [return from pthread_cond_wait]
s: [do some work]
s: pthread_cond_signal(&wake_master);
s: pthread_cond_wait(&wake_slave, &mutex);

Of course, if that's the model you're looking for, you could always
do this instead:

void master_func() {
while (1) {
[do some work]
slave_func();
}
}

void slave_func() {
[do some work]
}

The semantics are effectively the same.

Cheers,
Kyle Moffett

--
Premature optimization is the root of all evil in programming
-- C.A.R. Hoare

Howard Chu

unread,

Jan 26, 2006, 9:25:17 AM1/26/06

to Nick Piggin, Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

Nick Piggin wrote:
> Howard Chu wrote:

>> But then we have to deal with you folks' bizarre notion that
>> sched_yield() can legitimately be a no-op, which also defies the
>> POSIX spec. Again, in SUSv3 "The /sched_yield/() function shall force
>> the running thread to relinquish the processor until it again becomes
>> the head of its thread list. It takes no arguments." There is no
>> language
>
> How many times have we been over this? What do you think the "head of
> its thread list" might mean?
>
>> here saying "sched_yield *may* do nothing at all." There are of course
>
> There is language saying SCHED_OTHER is arbitrary, including how the
> thread list is implemented and how a task might become on the head of
> it.
>
> They obviously don't need to redefine exactly what sched_yield may do
> under each scheduling policy, do they?
>

As Dave Butenhof says so often, threading is a cooperative programming
model, not a competitive one. The sched_yield function exists for a
specific purpose, to let one thread decide to allow some other thread to
run. No matter what the scheduling policy, or even if there is no
scheduling policy at all, the expectation is that the current thread
will not continue to run unless there are no other runnable threads in
the same process. The other important point here is that the yielding
thread is only cooperating with other threads in its process. The 2.6
kernel behavior effectively causes the entire process to give up its
time slice, since the yielding thread has to wait for other processes in
the system before it can run again. Again, if folks wanted process
scheduling behavior they would have used fork().

By the way, I've already raised an objection with the Open Group asking
for more clarification here.
http://www.opengroup.org/austin/aardvark/latest/xshbug2.txt request
number 120.

--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc
OpenLDAP Core Team http://www.openldap.org/project/

-

Howard Chu

unread,

Jan 26, 2006, 9:45:03 AM1/26/06

to Kyle Moffett, Nick Piggin, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

Kyle Moffett wrote:
> Haven't you OpenLDAP guys realized that the pthread model you're
> actually looking for is this? POSIX mutexes are not designed to
> mandate scheduling requirements *precisely* because this achieves your
> scheduling goals by explicitly stating what they are.

This isn't about OpenLDAP. Yes, we had a lot of yield() calls scattered
through the code, leftovers from when we only supported non-preemptive
threading. Those calls have been removed. There are a few remaining,
that are only in code paths for unusual errors, so what they do has no
real performance impact.

The point of this discussion is that the POSIX spec says one thing and
you guys say another; one way or another that should be resolved. The
2.6 kernel behavior is a noticable departure from previous releases. The
2.4/LinuxThreads guys believed their implementation was correct. If you
believe the 2.6 implementation is correct, then you should get the spec
amended or state up front that the "P" in "NPTL" doesn't really mean
anything.

--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc
OpenLDAP Core Team http://www.openldap.org/project/

-

Nick Piggin

unread,

Jan 26, 2006, 9:54:41 AM1/26/06

to Howard Chu, Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

Howard Chu wrote:
> Nick Piggin wrote:

>> They obviously don't need to redefine exactly what sched_yield may do
>> under each scheduling policy, do they?
>>
> As Dave Butenhof says so often, threading is a cooperative programming
> model, not a competitive one. The sched_yield function exists for a
> specific purpose, to let one thread decide to allow some other thread to
> run. No matter what the scheduling policy, or even if there is no

Yes, and even SCHED_OTHER in Linux attempts to do this as part of
the principle of least surprise. That it doesn't _exactly_ match
what you want it to do just means you need to be using something
else.

> scheduling policy at all, the expectation is that the current thread
> will not continue to run unless there are no other runnable threads in
> the same process. The other important point here is that the yielding
> thread is only cooperating with other threads in its process. The 2.6

No I don't think so. POSIX 1.b where sched_yield is defined are the
realtime extensions, are they not?

sched_yield explicitly makes reference to the realtime priority system
of thread lists does it not? It is pretty clear that it is used for
realtime processes to deterministically give up their timeslices to
others of the same priority level.

Linux's SCHED_OTHER behaviour is arguably the best interpretation,
considering SCHED_OTHER is defined to have a single priority level.

> kernel behavior effectively causes the entire process to give up its
> time slice, since the yielding thread has to wait for other processes in
> the system before it can run again. Again, if folks wanted process

It yields to all other SCHED_OTHER processes (which are all on the
same thread priority list) and not to any other processes of higher
realtime priority.

> scheduling behavior they would have used fork().
>

It so happens that processes and threads use the same scheduling
policy in Linux. Is that forbidden somewhere?

> By the way, I've already raised an objection with the Open Group asking
> for more clarification here.
> http://www.opengroup.org/austin/aardvark/latest/xshbug2.txt request
> number 120.
>

--

Send instant messages to your online friends http://au.messenger.yahoo.com

-

Howard Chu

unread,

Jan 26, 2006, 10:24:44 AM1/26/06

to Nick Piggin, Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

Nick Piggin wrote:
> Howard Chu wrote:

>> scheduling policy at all, the expectation is that the current thread
>> will not continue to run unless there are no other runnable threads
>> in the same process. The other important point here is that the
>> yielding thread is only cooperating with other threads in its
>> process. The 2.6
>
> No I don't think so. POSIX 1.b where sched_yield is defined are the
> realtime extensions, are they not?
>
> sched_yield explicitly makes reference to the realtime priority system
> of thread lists does it not? It is pretty clear that it is used for
> realtime processes to deterministically give up their timeslices to
> others of the same priority level.

The fact that sched_yield came originally from the realtime extensions
is just a historical artifact. There was a pthread_yield() function
specifically for threads and it was merged with sched_yield(). Today
sched_yield() is a core part of the basic Threads specification,
independent of the realtime extensions. The fact that it is defined
solely in the language of the realtime priorities is an obvious flaw in
the spec, since the function itself exists independently of realtime
priorities. The objection I raised with the Open Group specifically
addresses this flaw.

> Linux's SCHED_OTHER behaviour is arguably the best interpretation,
> considering SCHED_OTHER is defined to have a single priority level.

It appears that you just read the spec and blindly followed it without
thinking about what it really said and failed to say. The best
interpretation would come from saying "hey, this spec is only defined
for realtime behavior, WTF is it supposed to do for the default
non-realtime case?" and getting a clear definition in the spec.

--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc
OpenLDAP Core Team http://www.openldap.org/project/

-

Nick Piggin

unread,

Jan 26, 2006, 10:51:46 AM1/26/06

to Howard Chu, Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

Howard Chu wrote:
> Nick Piggin wrote:
>
>> Howard Chu wrote:
>>
>>> scheduling policy at all, the expectation is that the current thread
>>> will not continue to run unless there are no other runnable threads
>>> in the same process. The other important point here is that the
>>> yielding thread is only cooperating with other threads in its
>>> process. The 2.6
>>
>>
>> No I don't think so. POSIX 1.b where sched_yield is defined are the
>> realtime extensions, are they not?
>>
>> sched_yield explicitly makes reference to the realtime priority system
>> of thread lists does it not? It is pretty clear that it is used for
>> realtime processes to deterministically give up their timeslices to
>> others of the same priority level.
>
>
> The fact that sched_yield came originally from the realtime extensions
> is just a historical artifact. There was a pthread_yield() function
> specifically for threads and it was merged with sched_yield(). Today
> sched_yield() is a core part of the basic Threads specification,
> independent of the realtime extensions. The fact that it is defined
> solely in the language of the realtime priorities is an obvious flaw in
> the spec, since the function itself exists independently of realtime
> priorities. The objection I raised with the Open Group specifically
> addresses this flaw.
>

Either way, it by no means says anything about yielding to other
threads in the process but nobody else. Where did you get that
from?

>> Linux's SCHED_OTHER behaviour is arguably the best interpretation,
>> considering SCHED_OTHER is defined to have a single priority level.
>
>
> It appears that you just read the spec and blindly followed it without
> thinking about what it really said and failed to say. The best

No, a spec is something that is written unambiguously, and generally
the wording leads me to believe they attempted to make it so (it
definitely isn't perfect - your mutex unlock example is one that could
be interpreted either way). If they failed to say something that should
be there then the spec needs to be corrected -- however in this case
I don't think you've shown what's missing.

And actually your reading things into the spec that "they failed to say"
is wrong I believe (in the above sched_yield example).

> interpretation would come from saying "hey, this spec is only defined
> for realtime behavior, WTF is it supposed to do for the default
> non-realtime case?" and getting a clear definition in the spec.
>

However they do not omit to say that. They quite explicitly say that
SCHED_OTHER is considered a single priority class in relation to its
interactions with other realtime classes, and is otherwise free to
be implemented in any way.

I can't see how you still have a problem with that...

--
Send instant messages to your online friends http://au.messenger.yahoo.com

-

Howard Chu

unread,

Jan 26, 2006, 11:45:48 AM1/26/06

to Nick Piggin, Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

Nick Piggin wrote:
> No, a spec is something that is written unambiguously, and generally
> the wording leads me to believe they attempted to make it so (it
> definitely isn't perfect - your mutex unlock example is one that could
> be interpreted either way). If they failed to say something that should
> be there then the spec needs to be corrected -- however in this case
> I don't think you've shown what's missing.

What is missing: sched_yield is a core threads function but it's defined
using language that only has meaning in the presence of an optional
feature (Process Scheduling.) Since the function must exist even in the
absence of these options, the definition must be changed to use language
that has meaning even in the absence of these options.

> And actually your reading things into the spec that "they failed to say"
> is wrong I believe (in the above sched_yield example).
>
>> interpretation would come from saying "hey, this spec is only defined
>> for realtime behavior, WTF is it supposed to do for the default
>> non-realtime case?" and getting a clear definition in the spec.
>
> However they do not omit to say that. They quite explicitly say that
> SCHED_OTHER is considered a single priority class in relation to its
> interactions with other realtime classes, and is otherwise free to
> be implemented in any way.
>
> I can't see how you still have a problem with that...
>

I may be missing the obvious, but I couldn't find this explicit
statement in the SUS docs. Also, it would not address the core
complaint, that sched_yield's definition has no meaning when the Process
Scheduling option doesn't exist.

The current Open Group response to my objection reads:
>>>

Add to APPLICATION USAGE
Since there may not be more than one thread runnable in a process
a call to sched_yield() might not relinquish the processor at all.
In a single threaded application this will always be case.

<<<
The interesting point one can draw from this response is that
sched_yield is only intended to yield to other runnable threads within a
single process. This response is also problematic, because restricting
it to threads within a process makes it useless for Process Scheduling.
E.g., the Process Scheduling language would imply that a single-threaded
app could yield the processor to some other process. As such, I think
this response is also flawed, and the definition still needs more work.

--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc
OpenLDAP Core Team http://www.openldap.org/project/

-

linux-os (Dick Johnson)

unread,

Jan 26, 2006, 12:35:01 PM1/26/06

to Howard Chu, Nick Piggin, Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

To fix the current problem, you can substitute usleep(0); It will
give the CPU to somebody if it's computable, then give it back to
you. It seems to work in every case that sched_yield() has
mucked up (perhaps 20 to 30 here).

Cheers,
Dick Johnson
Penguin : Linux version 2.6.13.4 on an i686 machine (5589.66 BogoMips).
Warning : 98.36% of all statistics are fiction.
.

****************************************************************
The information transmitted in this message is confidential and may be privileged. Any review, retransmission, dissemination, or other use of this information by persons or entities other than the intended recipient is prohibited. If you are not the intended recipient, please notify Analogic Corporation immediately - by replying to this message or by sending an email to Deliver...@analogic.com - and destroy all copies of this information, including any attachments, without reading or disclosing them.

Thank you.

Nick Piggin

unread,

Jan 26, 2006, 2:00:42 PM1/26/06

to linux-os (Dick Johnson), Howard Chu, Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

linux-os (Dick Johnson) wrote:
>
> To fix the current problem, you can substitute usleep(0); It will
> give the CPU to somebody if it's computable, then give it back to
> you. It seems to work in every case that sched_yield() has
> mucked up (perhaps 20 to 30 here).
>

That sounds like a terrible hack.

What cases has sched_yield mucked up for you, and why do you
think the problem is sched_yield mucking up? Can you solve it
using mutexes?

Thanks,
Nick

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com

-

Stefan Seyfried

unread,

Jan 26, 2006, 2:04:12 PM1/26/06

to Jesse Brandeburg, Olaf Kirch, Linux Kernel Mailing List, net...@vger.kernel.org

On Wed, Jan 25, 2006 at 04:28:48PM -0800, Jesse Brandeburg wrote:

> Okay I reproduced the issue on 2.6.15.1 (with S1 sleep) and was able
> to show that my patch that just removes e100_init_hw works okay for
> me. Let me know how it goes for you, I think this is a good fix.

worked for me in the Compaq Armada e500 and reportedly also fixed the
SONY that originally uncovered it.

Will be in the next SUSE betas, so if anything breaks, we'll notice
it.

Thanks.
--
Stefan Seyfried \ "I didn't want to write for pay. I
QA / R&D Team Mobile Devices \ wanted to be paid for what I write.
"
SUSE LINUX Products GmbH, Nürnberg \ -- Leonard Co
hen

Olaf Kirch

unread,

Jan 26, 2006, 2:10:39 PM1/26/06

to Stefan Seyfried, Jesse Brandeburg, Linux Kernel Mailing List, net...@vger.kernel.org

On Thu, Jan 26, 2006 at 08:02:37PM +0100, Stefan Seyfried wrote:
> Will be in the next SUSE betas, so if anything breaks, we'll notice
> it.

I doubt it. As Jesse mentioned, e100_hw_init is called from e100_up,
so the call from e100_resume was really superfluous.

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
ok...@suse.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

linux-os (Dick Johnson)

unread,

Jan 26, 2006, 2:15:22 PM1/26/06

to Nick Piggin, Howard Chu, Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

On Thu, 26 Jan 2006, Nick Piggin wrote:

> linux-os (Dick Johnson) wrote:
>>
>> To fix the current problem, you can substitute usleep(0); It will
>> give the CPU to somebody if it's computable, then give it back to
>> you. It seems to work in every case that sched_yield() has
>> mucked up (perhaps 20 to 30 here).
>>
>
> That sounds like a terrible hack.
>
> What cases has sched_yield mucked up for you, and why do you
> think the problem is sched_yield mucking up? Can you solve it
> using mutexes?
>
> Thanks,
> Nick

Somebody wrote code that used Linux Threads. We didn't know
why it was so slow so I was asked to investigate. It was
a user-interface where high-speed image data gets put into
a buffer (using DMA) and one thread manipulates it. Another
thread copies and crunches the data, then displays it. The
writer insisted that he was doing the correct thing, however
the response sucked big time. I ran top and found that the
threaded processes were always grabbing big chunks of
CPU time. Searching for every instance of sched_yield(), I
was going to replace it with a diagnostic. However, the code
ran beautifully when the 'fprintf(stderr, "Message\n"' was
in the code! The call to write() sleeps. That gave the
CPU to somebody who was starving. The 'quick-fix" was
to replace sched_yield() with usleep(0).

The permanent fix was to not use threads at all.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.13.4 on an i686 machine (5589.66 BogoMips).
Warning : 98.36% of all statistics are fiction.
.

****************************************************************
The information transmitted in this message is confidential and may be privileged. Any review, retransmission, dissemination, or other use of this information by persons or entities other than the intended recipient is prohibited. If you are not the intended recipient, please notify Analogic Corporation immediately - by replying to this message or by sending an email to Deliver...@analogic.com - and destroy all copies of this information, including any attachments, without reading or disclosing them.

Thank you.

David Schwartz

unread,

Jan 26, 2006, 2:59:16 PM1/26/06

to h...@symas.com, Linux Kernel Mailing List

> The point of this discussion is that the POSIX spec says one thing and
> you guys say another; one way or another that should be resolved. The
> 2.6 kernel behavior is a noticable departure from previous releases. The
> 2.4/LinuxThreads guys believed their implementation was correct. If you
> believe the 2.6 implementation is correct, then you should get the spec
> amended or state up front that the "P" in "NPTL" doesn't really mean
> anything.

There is disagreement over what the POSIX specification says. You have
already seen three arguments against your interpretation, any one of which
is, IMO, sufficient to demolish it.

First, there's the as-if issue. You cannot write a program that can print
"non-compliant" with the behavior you claim is non-compliant that is
guaranteed not to do so by the standard because there is no way to know that
another thread is blocked on the mutex (except for PI mutexes).

Second, there's the plain langauge of the standard. It says "If X is so at
time T, then Y". This does not require Y to happen at time T. It is X
happening at time T that requires Y, but the time for Y is not specified.

If a law says, for example, "if there are two or more bids with the same
price lower than all other bids at the close of bidding, the first such bid
to be received shall be accepted". The phrase "at the close of bidding"
refers to the time the rule is deteremined to apply to the situation, not
the time at which the decision as to which bid to accept is made.

Third, there's the ambiguity of the standard. It says the "sceduling
policy" shall decide, not that the scheduler shall decide. If the policy is
to make a conditional or delayed decision, that is still perfectly valid
policy. "Whichever thread requests it first" is a valid scheduler policy.

DS

Howard Chu

unread,

Jan 26, 2006, 3:29:18 PM1/26/06

to dav...@webmaster.com, Linux Kernel Mailing List

David Schwartz wrote:
>> The point of this discussion is that the POSIX spec says one thing and
>> you guys say another; one way or another that should be resolved. The
>> 2.6 kernel behavior is a noticable departure from previous releases. The
>> 2.4/LinuxThreads guys believed their implementation was correct. If you
>> believe the 2.6 implementation is correct, then you should get the spec
>> amended or state up front that the "P" in "NPTL" doesn't really mean
>> anything.
>>
>
> There is disagreement over what the POSIX specification says. You have
> already seen three arguments against your interpretation, any one of which
> is, IMO, sufficient to demolish it.
>

> First, there's the as-if issue. You cannot write a program that can print
> "non-compliant" with the behavior you claim is non-compliant that is
> guaranteed not to do so by the standard because there is no way to know that
> another thread is blocked on the mutex (except for PI mutexes).
>

The exception here demolishes this argument, IMO. Moreover, if the
unlocker was a lower priority thread and there are higher priority
threads blocked on the mutex, you really want the higher priority thread
to run.

> Second, there's the plain langauge of the standard. It says "If X is so at
> time T, then Y". This does not require Y to happen at time T. It is X
> happening at time T that requires Y, but the time for Y is not specified.
>
> If a law says, for example, "if there are two or more bids with the same
> price lower than all other bids at the close of bidding, the first such bid
> to be received shall be accepted". The phrase "at the close of bidding"
> refers to the time the rule is deteremined to apply to the situation, not
> the time at which the decision as to which bid to accept is made.
>

The time at which the decision takes effect is immaterial; the point is
that the decision can only be made from the set of options available at
time T.

Per your analogy, if a new bid comes in at time T+1, it can't have any
effect on which of the bids shall be accepted.

> Third, there's the ambiguity of the standard. It says the "sceduling
> policy" shall decide, not that the scheduler shall decide. If the policy is
> to make a conditional or delayed decision, that is still perfectly valid
> policy. "Whichever thread requests it first" is a valid scheduler policy.

I am not debating what the policy can decide. Merely the set of choices
from which it may decide.

--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc
OpenLDAP Core Team http://www.openldap.org/project/

-

Nick Piggin

unread,

Jan 26, 2006, 3:47:39 PM1/26/06

to Howard Chu, dav...@webmaster.com, Linux Kernel Mailing List

Howard Chu wrote:
> David Schwartz wrote:
>

> The time at which the decision takes effect is immaterial; the point is
> that the decision can only be made from the set of options available at
> time T.
>
> Per your analogy, if a new bid comes in at time T+1, it can't have any
> effect on which of the bids shall be accepted.
>
>> Third, there's the ambiguity of the standard. It says the "sceduling
>> policy" shall decide, not that the scheduler shall decide. If the
>> policy is
>> to make a conditional or delayed decision, that is still perfectly valid
>> policy. "Whichever thread requests it first" is a valid scheduler policy.
>
>
> I am not debating what the policy can decide. Merely the set of choices
> from which it may decide.
>

OK, you believe that the mutex *must* be granted to a blocking thread
at the time of the unlock. I don't think this is unreasonable from the
wording (because it does not seem to be completely unambiguous english),
however think about this -

A realtime system with tasks A and B, A has an RT scheduling priority of
1, and B is 2. A and B are both runnable, so A is running. A takes a mutex
then sleeps, B runs and ends up blocked on the mutex. A wakes up and at
some point it drops the mutex and then tries to take it again.

What happens?

I haven't programmed realtime systems of any complexity, but I'd think it
would be undesirable if A were to block and allow B to run at this point.

Now this has nothing to do with PI or SCHED_OTHER, so behaviour is exactly
determined by our respective interpretations of what it means for "the
scheduling policy to decide which task gets the mutex".

What have I proven? Nothing ;) but perhaps my question could be answered
by someone who knows a lot more about RT systems than I.

Nick

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

-

Nick Piggin

unread,

Jan 26, 2006, 4:13:13 PM1/26/06

to linux-os (Dick Johnson), Howard Chu, Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

linux-os (Dick Johnson) wrote:
> On Thu, 26 Jan 2006, Nick Piggin wrote:

>>What cases has sched_yield mucked up for you, and why do you
>>think the problem is sched_yield mucking up? Can you solve it
>>using mutexes?
>>
>>Thanks,
>>Nick
>
>
> Somebody wrote code that used Linux Threads. We didn't know
> why it was so slow so I was asked to investigate. It was
> a user-interface where high-speed image data gets put into
> a buffer (using DMA) and one thread manipulates it. Another
> thread copies and crunches the data, then displays it. The
> writer insisted that he was doing the correct thing, however
> the response sucked big time. I ran top and found that the
> threaded processes were always grabbing big chunks of
> CPU time. Searching for every instance of sched_yield(), I
> was going to replace it with a diagnostic. However, the code
> ran beautifully when the 'fprintf(stderr, "Message\n"' was
> in the code! The call to write() sleeps. That gave the
> CPU to somebody who was starving. The 'quick-fix" was
> to replace sched_yield() with usleep(0).
>
> The permanent fix was to not use threads at all.
>

This sounds like a trivial producer consumer problem that you
would find in any basic books on synchronisation, threading, or
operating systems.

If it was not a realtime system, then I can't believe it has any
usages of sched_yield in there at all. If it is a realtime system,
then replacing them with something else could easily have broken
it.

Also, I'm not sure that you can rely on write or usleep for 0
microseconds to sleep.

> Cheers,
> Dick Johnson
> Penguin : Linux version 2.6.13.4 on an i686 machine (5589.66 BogoMips).
> Warning : 98.36% of all statistics are fiction.
> .
>
> ****************************************************************
> The information transmitted in this message is confidential and may be privileged. Any review, retransmission, dissemination, or other use of this information by persons or entities other than the intended recipient is prohibited. If you are not the intended recipient, please notify Analogic Corporation immediately - by replying to this message or by sending an email to Deliver...@analogic.com - and destroy all copies of this information, including any attachments, without reading or disclosing them.
>
> Thank you.
>

Any chance you can get rid of that crazy disclaimer when posting
to lkml, please?

Thanks,
Nick

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

-

linux-os (Dick Johnson)

unread,

Jan 26, 2006, 4:33:36 PM1/26/06

to Nick Piggin, Howard Chu, Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

On Thu, 26 Jan 2006, Nick Piggin wrote:

[SNIPPED...]

> Any chance you can get rid of that crazy disclaimer when posting
> to lkml, please?
>
> Thanks,
> Nick
> --

> SUSE Labs, Novell Inc.
> Send instant messages to your online friends http://au.messenger.yahoo.com
>

I tried. The "!@#(*$%^~!" IT/Legal Department(s) don't have a clue.
I asked the "mail-filter" guy on linux-kernel if he could just
exclude everything after a "." in the first column, just like
/bin/mail and, for that matter, sendmail. I was just told that
"It doesn't...." even though I can run sendmail by hand, using
telnet port 25, over the network, and know that the "." in the
first column is the way it knows the end-of-message after it
receives the "DATA" command.

Hoping that somebody, sometime, will implement my suggestion,
I continue to put a dot in the first column after my signature.
I know that if I send my mail around the lab without going through
the "*(_!@#&%" MicroWorm mail-grinder, the dot gets rid of
everything thereafter.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.13.4 on an i686 machine (5589.66 BogoMips).
Warning : 98.36% of all statistics are fiction.
.

****************************************************************
The information transmitted in this message is confidential and may be privileged. Any review, retransmission, dissemination, or other use of this information by persons or entities other than the intended recipient is prohibited. If you are not the intended recipient, please notify Analogic Corporation immediately - by replying to this message or by sending an email to Deliver...@analogic.com - and destroy all copies of this information, including any attachments, without reading or disclosing them.

Thank you.

Howard Chu

unread,

Jan 26, 2006, 4:33:36 PM1/26/06

to Nick Piggin, dav...@webmaster.com, Linux Kernel Mailing List

Nick Piggin wrote:
> OK, you believe that the mutex *must* be granted to a blocking thread
> at the time of the unlock. I don't think this is unreasonable from the
> wording (because it does not seem to be completely unambiguous english),
> however think about this -
>
> A realtime system with tasks A and B, A has an RT scheduling priority of
> 1, and B is 2. A and B are both runnable, so A is running. A takes a
> mutex
> then sleeps, B runs and ends up blocked on the mutex. A wakes up and at
> some point it drops the mutex and then tries to take it again.
>
> What happens?
>
> I haven't programmed realtime systems of any complexity, but I'd think it
> would be undesirable if A were to block and allow B to run at this point.

But why does A take the mutex in the first place? Presumably because it
is about to execute a critical section. And also presumably, A will not
release the mutex until it no longer has anything critical to do;
certainly it could hold it longer if it needed to.

If A still needed the mutex, why release it and reacquire it, why not
just hold onto it? The fact that it is being released is significant.

> Now this has nothing to do with PI or SCHED_OTHER, so behaviour is
> exactly
> determined by our respective interpretations of what it means for "the
> scheduling policy to decide which task gets the mutex".
>
> What have I proven? Nothing ;) but perhaps my question could be answered
> by someone who knows a lot more about RT systems than I.

In the last RT work I did 12-13 years ago, there was only one high
priority producer task and it was never allowed to block. The consumers
just kept up as best they could (multi-proc machine of course). I've
seldom seen a need for many priority levels. Probably not much you can
generalzie from this though.

--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc
OpenLDAP Core Team http://www.openldap.org/project/

-

Nick Piggin

unread,

Jan 26, 2006, 4:41:39 PM1/26/06

to Howard Chu, dav...@webmaster.com, Linux Kernel Mailing List

Howard Chu wrote:
> Nick Piggin wrote:
>
>> OK, you believe that the mutex *must* be granted to a blocking thread
>> at the time of the unlock. I don't think this is unreasonable from the
>> wording (because it does not seem to be completely unambiguous english),
>> however think about this -
>>
>> A realtime system with tasks A and B, A has an RT scheduling priority of
>> 1, and B is 2. A and B are both runnable, so A is running. A takes a
>> mutex
>> then sleeps, B runs and ends up blocked on the mutex. A wakes up and at
>> some point it drops the mutex and then tries to take it again.
>>
>> What happens?
>>
>> I haven't programmed realtime systems of any complexity, but I'd think it
>> would be undesirable if A were to block and allow B to run at this point.
>
>
> But why does A take the mutex in the first place? Presumably because it
> is about to execute a critical section. And also presumably, A will not
> release the mutex until it no longer has anything critical to do;
> certainly it could hold it longer if it needed to.
>
> If A still needed the mutex, why release it and reacquire it, why not
> just hold onto it? The fact that it is being released is significant.
>

Regardless of why, that is just the simplest scenario I could think
of that would give us a test case. However...

Why not hold onto it? We sometimes do this in the kernel if we need
to take a lock that is incompatible with the lock already being held,
or if we discover we need to take a mutex which nests outside our
currently held lock in other paths. Ie to prevent deadlock.

Another reason might be because we will be running for a very long
time without requiring the lock. Or we might like to release it because
we expect a higher priority process to take it.

>> Now this has nothing to do with PI or SCHED_OTHER, so behaviour is
>> exactly
>> determined by our respective interpretations of what it means for "the
>> scheduling policy to decide which task gets the mutex".
>>
>> What have I proven? Nothing ;) but perhaps my question could be answered
>> by someone who knows a lot more about RT systems than I.
>
>
> In the last RT work I did 12-13 years ago, there was only one high
> priority producer task and it was never allowed to block. The consumers
> just kept up as best they could (multi-proc machine of course). I've
> seldom seen a need for many priority levels. Probably not much you can
> generalzie from this though.
>

--

SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

-

Howard Chu

unread,

Jan 26, 2006, 4:57:08 PM1/26/06

to Nick Piggin, dav...@webmaster.com, Linux Kernel Mailing List

In those cases, A cannot retake the mutex anyway. I.e., you just said
that you released the first mutex because you want to acquire a
different one. So those cases don't fit this example very well.

> Another reason might be because we will be running for a very long
> time without requiring the lock.

And again in this case, A should not be immediately reacquiring the lock
if it doesn't actually need it.

> Or we might like to release it because
> we expect a higher priority process to take it.

And in this case, the expected behavior is the same as I've been pursuing.

--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc
OpenLDAP Core Team http://www.openldap.org/project/

-

Christopher Friesen

unread,

Jan 26, 2006, 5:00:45 PM1/26/06

to Howard Chu, Nick Piggin, dav...@webmaster.com, Linux Kernel Mailing List

Howard Chu wrote:

> But why does A take the mutex in the first place? Presumably because it
> is about to execute a critical section. And also presumably, A will not
> release the mutex until it no longer has anything critical to do;
> certainly it could hold it longer if it needed to.

Suppose A is pulling job requests off a queue.

A takes the mutex because it is going to modify data protected by the
mutex. It then gives up the mutex when it's done modifying the data.

> If A still needed the mutex, why release it and reacquire it, why not
> just hold onto it? The fact that it is being released is significant.

Suppose A then pulls another job request off the queue. It just so
happens that this job requires touching some data protected by the same
mutex. It would need to take the mutex again.

A doesn't necessarily know what data the various jobs will require it to
access, so it doesn't know a priori what mutexes will be required.

Chris

Nick Piggin

unread,

Jan 26, 2006, 5:25:27 PM1/26/06

to Howard Chu, dav...@webmaster.com, Linux Kernel Mailing List

Howard Chu wrote:
> Nick Piggin wrote:

>> Regardless of why, that is just the simplest scenario I could think
>> of that would give us a test case. However...
>>
>> Why not hold onto it? We sometimes do this in the kernel if we need
>> to take a lock that is incompatible with the lock already being held,
>> or if we discover we need to take a mutex which nests outside our
>> currently held lock in other paths. Ie to prevent deadlock.
>
>
> In those cases, A cannot retake the mutex anyway. I.e., you just said
> that you released the first mutex because you want to acquire a
> different one. So those cases don't fit this example very well.
>

Umm yes, then *after* aquiring the different one, A would like to
retake the original mutex.

>> Another reason might be because we will be running for a very long
>> time without requiring the lock.
>
>
> And again in this case, A should not be immediately reacquiring the lock
> if it doesn't actually need it.
>

No, not immediately, I said "for a very long time". As in: A does not
need the exclusion provided by the lock for a very long time so it
drops it to avoid needless contention, then reaquires it when it finally
does need the lock.

>> Or we might like to release it because
>> we expect a higher priority process to take it.
>
>
> And in this case, the expected behavior is the same as I've been pursuing.
>

No, we're talking about what happens when A tries to aquire it again.

Just accept that my described scenario is legitimate then consider it in
isolation rather than getting caught up in the superfluous details of how
such a situation might come about.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

-

David Schwartz

unread,

Jan 26, 2006, 9:18:02 PM1/26/06

to h...@symas.com, Linux Kernel Mailing List

> David Schwartz wrote:

> > First, there's the as-if issue. You cannot write a program
> > that can print
> > "non-compliant" with the behavior you claim is non-compliant that is
> > guaranteed not to do so by the standard because there is no way
> > to know that
> > another thread is blocked on the mutex (except for PI mutexes).

> The exception here demolishes this argument, IMO.

You're saying the authors of the standard intended that clause to be read
in light of the possibility of PI mutexes?! That's just nuts.

> Moreover, if the
> unlocker was a lower priority thread and there are higher priority
> threads blocked on the mutex, you really want the higher priority thread
> to run.

Yes, I agree.

> > Second, there's the plain langauge of the standard. It says
> > "If X is so at
> > time T, then Y". This does not require Y to happen at time T. It is X
> > happening at time T that requires Y, but the time for Y is not
> specified.

> > If a law says, for example, "if there are two or more bids
> > with the same
> > price lower than all other bids at the close of bidding, the
> > first such bid
> > to be received shall be accepted". The phrase "at the close of bidding"
> > refers to the time the rule is deteremined to apply to the
> > situation, not
> > the time at which the decision as to which bid to accept is made.

> The time at which the decision takes effect is immaterial; the point is
> that the decision can only be made from the set of options available at
> time T.
>
> Per your analogy, if a new bid comes in at time T+1, it can't have any
> effect on which of the bids shall be accepted.

Only because of the specifics of this analogy. If the rule said "if there
are two or more such bids with the same price at the close of bidding, the
winning bad shall be determined by the board of directors policy", nothing
prevents the board of directors from having a policy of going back to the
bidders and asking if they can lower their bids further.

Nothing prevents them from rebidding the project if they want. In other
words, it doesn't place any restrictions on what the board can do.

> > Third, there's the ambiguity of the standard. It says the "sceduling
> > policy" shall decide, not that the scheduler shall decide. If
> > the policy is
> > to make a conditional or delayed decision, that is still perfectly valid
> > policy. "Whichever thread requests it first" is a valid
> > scheduler policy.

> I am not debating what the policy can decide. Merely the set of choices
> from which it may decide.

Which is a restriction not found in the standard. A "policy" is a way of
deciding, not a decision. Scheduling policy can be to let whoever asks first
get it.

DS

Steven Rostedt

unread,

Jan 26, 2006, 11:14:45 PM1/26/06

to Howard Chu, Linux Kernel Mailing List, dav...@webmaster.com, Nick Piggin

On Thu, 2006-01-26 at 13:32 -0800, Howard Chu wrote:
> Nick Piggin wrote:
> > OK, you believe that the mutex *must* be granted to a blocking thread
> > at the time of the unlock. I don't think this is unreasonable from the
> > wording (because it does not seem to be completely unambiguous english),
> > however think about this -
> >
> > A realtime system with tasks A and B, A has an RT scheduling priority of
> > 1, and B is 2. A and B are both runnable, so A is running. A takes a
> > mutex
> > then sleeps, B runs and ends up blocked on the mutex. A wakes up and at
> > some point it drops the mutex and then tries to take it again.
> >
> > What happens?
> >
> > I haven't programmed realtime systems of any complexity, but I'd think it
> > would be undesirable if A were to block and allow B to run at this point.
>
> But why does A take the mutex in the first place? Presumably because it
> is about to execute a critical section. And also presumably, A will not
> release the mutex until it no longer has anything critical to do;
> certainly it could hold it longer if it needed to.

A while back I discovered that the -rt patch did just this with the
spin_lock to rt_mutexes. Here's the scenario that happened amazingly too
much.

Three tasks A, B, C: A with highest prio (say 3), B is middle (say 2)
and C is lowest (say 1). And all this with PI (although without PI it
can happen even easier. see my explanation here:
http://marc.theaimsgroup.com/?l=linux-kernel&m=111165425915947&w=4 )

C grabs mutex X
B preempts C and tries to grab mutex X and blocks (C inherits from B)
A comes along and preempts C and blocks on X (C now inherits from A)
C lets go of mutex X and gives it to A.
A does some work then releases mutex X (B although not running aquires
it).
A needs to grab X again but B owns it. Since B has the lock, high
priority task A must give up the CPU for a lower priority task B.

I implemented a "lock stealing" for this very case and cut down
unnecessary schedules and latencies tremendously. If A goes to grab X
again, but B has it (but hasn't woken up yet) it can "steal" it from B
and continue.

Hmm, this may still be under the POSIX if what you say is that a
"waiting" process must get the lock. If A comes back before B wakes up,
A is now a waiting process and may take it. OK maybe I'm stretching it a
little, but that's what RT wants.

>
> If A still needed the mutex, why release it and reacquire it, why not
> just hold onto it? The fact that it is being released is significant.

There's several reasons. Why hold a mutex when you don't need to. This
could be a SMP machine and B could grab the mutex in the small time that
A releases it. Also locks are released and reaquired a lot to prevent
deadlocks.

It's good practice to always release a mutex (or any lock) when not
needed, even if you plan on grabbing it again right a way. For anything,
a higher priority process my be waiting to get it.

>
> > Now this has nothing to do with PI or SCHED_OTHER, so behaviour is
> > exactly
> > determined by our respective interpretations of what it means for "the
> > scheduling policy to decide which task gets the mutex".
> >
> > What have I proven? Nothing ;) but perhaps my question could be answered
> > by someone who knows a lot more about RT systems than I.
>
> In the last RT work I did 12-13 years ago, there was only one high
> priority producer task and it was never allowed to block. The consumers
> just kept up as best they could (multi-proc machine of course). I've
> seldom seen a need for many priority levels. Probably not much you can
> generalzie from this though.

That seems to be a very simple system. I usually deal with 4 or 5
priority levels and that can easily create headaches.

-- Steve

Steven Rostedt

unread,

Jan 26, 2006, 11:28:53 PM1/26/06

to Howard Chu, Linux Kernel Mailing List, dav...@webmaster.com, Nick Piggin

On Thu, 2006-01-26 at 13:56 -0800, Howard Chu wrote:
> Nick Piggin wrote:

> >>
> >> But why does A take the mutex in the first place? Presumably because
> >> it is about to execute a critical section. And also presumably, A
> >> will not release the mutex until it no longer has anything critical
> >> to do; certainly it could hold it longer if it needed to.
> >>
> >> If A still needed the mutex, why release it and reacquire it, why not
> >> just hold onto it? The fact that it is being released is significant.
> >>
> >
> > Regardless of why, that is just the simplest scenario I could think
> > of that would give us a test case. However...
> >
> > Why not hold onto it? We sometimes do this in the kernel if we need
> > to take a lock that is incompatible with the lock already being held,
> > or if we discover we need to take a mutex which nests outside our
> > currently held lock in other paths. Ie to prevent deadlock.
>
> In those cases, A cannot retake the mutex anyway. I.e., you just said
> that you released the first mutex because you want to acquire a
> different one. So those cases don't fit this example very well.

Lets say you have two locks X and Y. Y nests inside of X. To do block1
you need to have lock Y and to do block2 you need to have both locks X
and Y, and block 1 must be done first without holding lock X.

func()
{
again:
mutex_lock(Y);
block1();
if (!mutex_try_lock(X)) {
mutex_unlock(Y);
mutex_lock(X);
mutex_lock(Y);
if (block1_has_changed()) {
mutex_unlock(Y);
mutex_unlock(X);
goto again;
}
}
block2();
mutex_unlock(X);
mutex_unlock(Y);
}

Stuff like the above actually is done (it's done in the kernel). So you
can see here that Y can be released and reacquired right away. If
another task was waiting on Y (of lower priority) we don't want to give
up the lock, since we would then block and the chances of
block1_has_changed goes up even more.

>
> > Another reason might be because we will be running for a very long
> > time without requiring the lock.
>
> And again in this case, A should not be immediately reacquiring the lock
> if it doesn't actually need it.

I'm not sure what Nick means here, but I'm sure he didn't mean it to
come out that way ;)

>
> > Or we might like to release it because
> > we expect a higher priority process to take it.
>
> And in this case, the expected behavior is the same as I've been pursuing.

But you can't know if a higher or lower priority process is waiting.
Sure it works like what you say when a higher priority process is
waiting, but it doesn't when it's a lower priority process waiting.

-- Steve

Valdis.K...@vt.edu

unread,

Jan 27, 2006, 2:07:40 AM1/27/06

to linux-os (Dick Johnson), Nick Piggin, Howard Chu, Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

On Thu, 26 Jan 2006 16:31:28 EST, "linux-os (Dick Johnson)" said:

> "It doesn't...." even though I can run sendmail by hand, using
> telnet port 25, over the network, and know that the "." in the
> first column is the way it knows the end-of-message after it
> receives the "DATA" command.

Right. That's how an MTA talks to another MTA. However, your mail
needs to be properly escaped. RFC821, section 4.5.2:

4.5.2. TRANSPARENCY

Without some provision for data transparency the character
sequence "<CRLF>.<CRLF>" ends the mail text and cannot be sent
by the user. In general, users are not aware of such
"forbidden" sequences. To allow all user composed text to be
transmitted transparently the following procedures are used.

1. Before sending a line of mail text the sender-SMTP checks
the first character of the line. If it is a period, one
additional period is inserted at the beginning of the line.

2. When a line of mail text is received by the receiver-SMTP
it checks the line. If the line is composed of a single
period it is the end of mail. If the first character is a
period and there are other characters on the line, the first
character is deleted.

In other words, the on-the-wire protocol is specifically designed so that
you *cant* accidentally lose the rest of the message by sending a bare '.'.
The fact that some programs implement it when talking to the user is
merely a convenience hack on the program's part.

Howard Chu

unread,

Jan 27, 2006, 3:08:52 AM1/27/06

to Nick Piggin, dav...@webmaster.com, Linux Kernel Mailing List

Nick Piggin wrote:
> Howard Chu wrote:
>

>>> Another reason might be because we will be running for a very long
>>> time without requiring the lock.
>>
>>
>> And again in this case, A should not be immediately reacquiring the
>> lock if it doesn't actually need it.
>>
>
> No, not immediately, I said "for a very long time". As in: A does not
> need the exclusion provided by the lock for a very long time so it
> drops it to avoid needless contention, then reaquires it when it finally
> does need the lock.

OK. I think this is really a separate situation. Just to recap: A takes
lock, does some work, releases lock, a very long time passes, then A
takes the lock again. In the "time passes" part, that mutex could be
locked and unlocked any number of times by other threads and A won't
know or care. Particularly on an SMP machine, other threads that were
blocked on that mutex could do useful work in the interim without
impacting A's progress at all. So here, when A leaves the mutex unlocked
for a long time, it's desirable to give the mutex to one of the waiters
ASAP.

>>> Or we might like to release it because
>>> we expect a higher priority process to take it.
>>
>>
>> And in this case, the expected behavior is the same as I've been
>> pursuing.
>>
>
> No, we're talking about what happens when A tries to aquire it again.
>
> Just accept that my described scenario is legitimate then consider it in
> isolation rather than getting caught up in the superfluous details of how
> such a situation might come about.

OK. I'm not trying to be difficult here. In much of life, context is
everything; very little can be understood in isolation.

Back to the scenario:

> A realtime system with tasks A and B, A has an RT scheduling priority of
> 1, and B is 2. A and B are both runnable, so A is running. A takes a
> mutex
> then sleeps, B runs and ends up blocked on the mutex. A wakes up and at
> some point it drops the mutex and then tries to take it again.
>
> What happens?

As I understand the spec, A must block because B has acquired the mutex.
Once again, the SUS discussion of priority inheritance would never need
to have been written if this were not the case:

>>>
In a priority-driven environment, a direct use of traditional primitives
like mutexes and condition variables can lead to unbounded priority
inversion, where a higher priority thread can be blocked by a lower
priority thread, or set of threads, for an unbounded duration of time.
As a result, it becomes impossible to guarantee thread deadlines.
Priority inversion can be bounded and minimized by the use of priority
inheritance protocols. This allows thread deadlines to be guaranteed
even in the presence of synchronization requirements.
<<<

The very first sentence indicates that a higher priority thread can be
blocked by a lower priority thread. If your interpretation of the spec
were correct, then such an instance would never occur. Since your
scenario is using realtime threads, then we can assume that the Priority
Ceiling feature is present and you can use it if needed. (
http://www.opengroup.org/onlinepubs/000095399/xrat/xsh_chap02.html#tag_03_02_09_06
Realtime Threads option group )

--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc
OpenLDAP Core Team http://www.openldap.org/project/

-

Howard Chu

unread,

Jan 27, 2006, 3:20:09 AM1/27/06

to dav...@webmaster.com, Linux Kernel Mailing List

David Schwartz wrote:
>>> Third, there's the ambiguity of the standard. It says the "sceduling
>>> policy" shall decide, not that the scheduler shall decide. If
>>> the policy is
>>> to make a conditional or delayed decision, that is still perfectly valid
>>> policy. "Whichever thread requests it first" is a valid
>>> scheduler policy.
>>>

>> I am not debating what the policy can decide. Merely the set of choices
>> from which it may decide.
>>
>
> Which is a restriction not found in the standard. A "policy" is a way of
> deciding, not a decision. Scheduling policy can be to let whoever asks first
> get it.
>

If we just went with "whoever asks first" then clearly one of the
blocked threads asked before the unlocker made its new request. You're
arguing for my point, then.

Other ambiguities aside, one thing is clear - a decision is triggered by
the unlock. What you seem to be arguing is the equivalent of saying that
the decision is made based on the next lock operation. The spec doesn't
say that mutex_lock is to behave this way. Why do you suppose that is?
Perhaps you should raise this question with the Open Group.

--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc
OpenLDAP Core Team http://www.openldap.org/project/

-

Philipp Matthias Hahn

unread,

Jan 27, 2006, 2:25:52 PM1/27/06

to Howard Chu, dav...@webmaster.com, Linux Kernel Mailing List

Hello!

On Fri, Jan 27, 2006 at 12:08:13AM -0800, Howard Chu wrote:
> >No, not immediately, I said "for a very long time". As in: A does not
> >need the exclusion provided by the lock for a very long time so it
> >drops it to avoid needless contention, then reaquires it when it finally
> >does need the lock.
>
> OK. I think this is really a separate situation. Just to recap: A takes
> lock, does some work, releases lock, a very long time passes, then A
> takes the lock again. In the "time passes" part, that mutex could be
> locked and unlocked any number of times by other threads and A won't
> know or care. Particularly on an SMP machine, other threads that were
> blocked on that mutex could do useful work in the interim without
> impacting A's progress at all. So here, when A leaves the mutex unlocked
> for a long time, it's desirable to give the mutex to one of the waiters
> ASAP.

When you release a lock, you unblock at most one thread, which is
waiting for that lock and put that released thread in the runnable
state.
Than it's up to the scheduler, what happens next:
- if you have multiple processors, you _can_ run the released thread on
anther processor, so both thread run.
- if you are single processor or don't want to schedule the released
thread on a second cpu, you must decide to
- either _continue running the releasing thread_ and let the released
thread stay some more time in the runnable queue,
- or _preempt the releasing thread_ to the runnable queue and make the
released thread running.
If you have different priorities, your decision is easy: run the most
important thread.
But if you don't have priorities, you base your decision on other
metrics: Since it takes more time to switch a thread (save/restore
state) compared to continue running the same thread, from a throuput
perspective you'll prefer to not change threads.

Similar thinking for yield(): You put the running thread back to the
runnable queue and choose one thread from it as the new running thread.
Note, that you might choose the old thread as the new thread again,
since with SCHED_OTHER this is perfectly fine, if you decided to honor
throuput more than fairness.
Other with SCHED_FIFO/RR, since there you are forced to put the old
thread at the end of your runnable queue and choose the new one from the
front of the queue, so all other threads with the same priority will run
before you yielding thread gets the cpu again.

Summary: yield() only makes sense with a SCHED_FIFO/RR policy, because
with SCHED_OTHER you know too little about the exact policy to make any
use of it.

BYtE
Philipp
--
Dipl.-Inform. Philip...@informatik.uni-oldenburg.de
Abteilung Systemsoftware und verteilte Systeme, Fk. II
Carl von Ossietzky Universitaet Oldenburg, 26111 Oldenburg, Germany
http://www.svs.informatik.uni-oldenburg.de/contact/pmhahn/
Telefon: +49 441 798-2866 Telefax: +49 441 798-2756

David Schwartz

unread,

Jan 27, 2006, 2:51:43 PM1/27/06

to h...@symas.com, Linux Kernel Mailing List

> If we just went with "whoever asks first" then clearly one of the
> blocked threads asked before the unlocker made its new request. You're
> arguing for my point, then.

Huh? I am saying the policy can be anything at all. We could just go with
"whoever asks first", but we are not required to. And, in any event, I meant
whoever asks for the mutex first, not whoever blocks first. (Note that I
didn't say "whoever asked first" which would mean something totally
different.)

> Other ambiguities aside, one thing is clear - a decision is triggered by
> the unlock. What you seem to be arguing is the equivalent of saying that
> the decision is made based on the next lock operation.

The spec says that the decision is triggered by a particular condition that
exists at the time of the unlock. That does not mean the decision is made at

the time of the unlock.

> The spec doesn't

> say that mutex_lock is to behave this way.

We don't agree on what the specification says.

> Why do you suppose that is?

Why do I suppose what? I find the specification perfectly clear and your
reading of it incredibly strained for the three reasons I stated.

> Perhaps you should raise this question with the Open Group.

I don't think it's unclear.

DS

Howard Chu

unread,

Jan 27, 2006, 3:13:54 PM1/27/06

to dav...@webmaster.com, Linux Kernel Mailing List

David Schwartz wrote:
> We don't agree on what the specification says.
>
>
>> Why do you suppose that is?
>>
>
> Why do I suppose what? I find the specification perfectly clear and your
> reading of it incredibly strained for the three reasons I stated.
>

Oddly enough, you said
http://groups.google.com/group/comp.programming.threads/msg/28b58e91886a3602?hl=en&
"Unfortunately, it sounds reasonable" so I can't lend credence to your
stating that my reading is incredibly strained. The fact that
LinuxThreads historically adhered to my reading of it lends more weight
to my argument. The fact that people accepted this interpretation for so
many years lends further weight. In light of this, it is your current
interpretation that is incredibly strained, and I would say, broken.

You have essentially created a tri-state mutex. (Locked, unlocked, and
sort-of-unlocked-but-really-reserved.) That may be a good and useful
thing in its own right, but it should not be the default behavior.

--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc
OpenLDAP Core Team http://www.openldap.org/project/

-

David Schwartz

unread,

Jan 27, 2006, 4:07:45 PM1/27/06

to h...@symas.com, Linux Kernel Mailing List

> David Schwartz wrote:
> > We don't agree on what the specification says.
> >
> >
> >> Why do you suppose that is?
> >>
> >
> > Why do I suppose what? I find the specification perfectly
> clear and your
> > reading of it incredibly strained for the three reasons I stated.
> >

> Oddly enough, you said
> http://groups.google.com/group/comp.programming.threads/msg/28b58e
> 91886a3602?hl=en&
> "Unfortunately, it sounds reasonable" so I can't lend credence to your
> stating that my reading is incredibly strained. The fact that
> LinuxThreads historically adhered to my reading of it lends more weight
> to my argument. The fact that people accepted this interpretation for so
> many years lends further weight. In light of this, it is your current
> interpretation that is incredibly strained, and I would say, broken.

After collecting other opinions from comp.programming.threads, and being
unable to find other people who considered it reasonable, I've changed my
opinion. I was far too generous and deferential before.

The more I consider it, the more absurd I find it. POSIX and SuS were so
careful not to dictate scheduler policy (or even hint at any notion of
fairness) that to argue that they intended to prohibit a thread from
releasing and reacquiring a mutex while another thread was blocked on it is
not tenable.

You are essentially arguing that they intended to prohibit the most natural
and highest performing implementation. This is totally inconsistent with
POSIX's overall design intention to provide the lightest and
highest-performing primitives and allow users to add features with overhead
if they needed those features and could tolerate the overhead.

> You have essentially created a tri-state mutex. (Locked, unlocked, and
> sort-of-unlocked-but-really-reserved.) That may be a good and useful
> thing in its own right, but it should not be the default behavior.

Huh?

I'm suggesting the most natural implementation: When a thread tries to
acquire a mutex, it is blocked if a higher-priority thread is already
waiting for a mutex. When a thread releases a mutex, the highest-priority
thread waiting for the mutex is woken (but not necessarily guaranteed the
mutex, the mutex is simply marked available). When a thread tries to acquire
a mutex, it gets it unless a higher-priority thread is already registered as
wanting it. When a thread tries to acquire a mutex, it loops until it
acquires it and on each iteration blocks if the mutex is taken or a
higher-priority thread is waiting for it, otherwise it takes the mutex.

A thread that is descheduled should never get priority over a thread that
is already running (unless a scheduling priority mechanism requires it).

DS

Howard Chu

unread,

Jan 27, 2006, 4:24:15 PM1/27/06

to dav...@webmaster.com, Linux Kernel Mailing List

David Schwartz wrote:
> After collecting other opinions from comp.programming.threads, and being
> unable to find other people who considered it reasonable, I've changed my
> opinion. I was far too generous and deferential before.
>

David, you specifically have been faced with this question before:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/2184ba84f911d9dd/a6e4f7cf13bbec2d#a6e4f7cf13bbec2d
and you didn't dispute the interpretation then. The wording for
pthread_mutex_unlock hasn't changed between 2001 and now.

And here:
http://groups.google.com/group/comp.programming.threads/msg/89cc5d600e34e88a?hl=en&

If those statements were incorrect, I have a feeling someone would have
corrected them at the time. Certainly you can attest to that.
http://groups.google.com/group/comp.programming.threads/msg/d5b2231ca57bb102?hl=en&

Clearly at this point there's nothing to be gained from pursuing this
any further. The 2.6 kernel has been out for too long; if it were to be
"fixed" again it would just make life ugly for another group of people,
and I don't want to write the autoconf tests to detect the
flavor-of-the-week. We've wasted enough time arguing futilely over it,
I'll stop.

--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc
OpenLDAP Core Team http://www.openldap.org/project/

-

David Schwartz

unread,

Jan 27, 2006, 6:33:14 PM1/27/06

to h...@symas.com, Linux Kernel Mailing List

> David, you specifically have been faced with this question before:
> http://groups.google.com/group/comp.programming.threads/browse_frm
> /thread/2184ba84f911d9dd/a6e4f7cf13bbec2d#a6e4f7cf13bbec2d
> and you didn't dispute the interpretation then. The wording for
> pthread_mutex_unlock hasn't changed between 2001 and now.

This was a totally different question. This was about the implementation,
not the interpretation. You'll note that I objected to the implementation.

> And here:
>
http://groups.google.com/group/comp.programming.threads/msg/89cc5d600e34e88a
?hl=en&

Again, I don't see that I commented on the interpretation. This was an
unfortunate missed oppurtunity. Kaz is incorrect here.

> If those statements were incorrect, I have a feeling someone would have
> corrected them at the time. Certainly you can attest to that.

Obviously not, since they are incorrect and nobody did.

>
http://groups.google.com/group/comp.programming.threads/msg/d5b2231ca57bb102
?hl=en&

Again, this had nothing whatsoever to do with whether the interpretation is
correct or not.

> Clearly at this point there's nothing to be gained from pursuing this
> any further. The 2.6 kernel has been out for too long; if it were to be
> "fixed" again it would just make life ugly for another group of people,
> and I don't want to write the autoconf tests to detect the
> flavor-of-the-week. We've wasted enough time arguing futilely over it,
> I'll stop.

The problem is that this interpretation is simply incorrect and results in
maximally inefficient implementations.

David Butenhof recently posted to comp.programming.threads and indicated
that disagreed with this implementation. That's about as close to
authoritative as you're likely to get.

POSIX had no intention to constrain the scheduler to compel inefficient
behavior. In fact, they went out of their way to create the lightest
possible primitives.

DS

Mattia Dongili

unread,

Jan 28, 2006, 6:53:38 AM1/28/06

to Stefan Seyfried, Jesse Brandeburg, Olaf Kirch, Linux Kernel Mailing List, net...@vger.kernel.org

On Thu, Jan 26, 2006 at 08:02:37PM +0100, Stefan Seyfried wrote:

> On Wed, Jan 25, 2006 at 04:28:48PM -0800, Jesse Brandeburg wrote:
>
> > Okay I reproduced the issue on 2.6.15.1 (with S1 sleep) and was able
> > to show that my patch that just removes e100_init_hw works okay for
> > me. Let me know how it goes for you, I think this is a good fix.
>
> worked for me in the Compaq Armada e500 and reportedly also fixed the
> SONY that originally uncovered it.

confirmed here too. The patch fixes S3 resume on this Sony (GR7/K)
running 2.6.16-rc1-mm3.

0000:02:08.0 Ethernet controller: Intel Corporation 82801CAM (ICH3) PRO/100 VE (LOM) Ethernet Controller (rev 41)
Subsystem: Sony Corporation Vaio PCG-GR214EP/GR214MP/GR215MP/GR314MP/GR315MP
Flags: bus master, medium devsel, latency 66, IRQ 9
Memory at d0204000 (32-bit, non-prefetchable) [size=4K]
I/O ports at 4000 [size=64]
Capabilities: <available only to root>

thanks
--
mattia
:wq!

Jesse Brandeburg

unread,

Jan 28, 2006, 2:53:28 PM1/28/06

to Jeff Garzik, Stefan Seyfried, Jesse Brandeburg, Olaf Kirch, Linux Kernel Mailing List, net...@vger.kernel.org, Jesse Brandeburg, Jeff Kirsher

On 1/28/06, Mattia Dongili <mala...@linux.it> wrote:
> On Thu, Jan 26, 2006 at 08:02:37PM +0100, Stefan Seyfried wrote:
> > On Wed, Jan 25, 2006 at 04:28:48PM -0800, Jesse Brandeburg wrote:
> >
> > > Okay I reproduced the issue on 2.6.15.1 (with S1 sleep) and was able
> > > to show that my patch that just removes e100_init_hw works okay for
> > > me. Let me know how it goes for you, I think this is a good fix.
> >
> > worked for me in the Compaq Armada e500 and reportedly also fixed the
> > SONY that originally uncovered it.
>
> confirmed here too. The patch fixes S3 resume on this Sony (GR7/K)
> running 2.6.16-rc1-mm3.

excellent news! thanks for testing.

Jeff, could you please apply to 2.6.16-rcX

Jesse

e100_resume_no_init.patch

Helge Hafting

unread,

Jan 30, 2006, 3:26:53 AM1/30/06

to dav...@webmaster.com, h...@symas.com, Linux Kernel Mailing List

David Schwartz wrote:

> Third, there's the ambiguity of the standard. It says the "sceduling
>policy" shall decide, not that the scheduler shall decide. If the policy is
>to make a conditional or delayed decision, that is still perfectly valid
>policy. "Whichever thread requests it first" is a valid scheduler policy.
>
>

Sure. And with a "whichever thread aquires it first" policy, then
it is obvious what happens when a mutex is released when someone
is blocked on it: Whoever blocked on it first is then the one
who requested it first - that cannot change as the request was made
before the mutex even was released. So then, the releasing thread has
no chance of getting the mutex back until the others have had a
go at it - no matter what threads actually gets scheduled.

Helge Hafting

unread,

Jan 30, 2006, 3:31:27 AM1/30/06

to Nikita Danilov, Howard Chu, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

Nikita Danilov wrote:

>Howard Chu writes:
>
>[...]
>
> >
> > A straightforward reading of the language here says the decision happens
> > "when pthread_mutex_unlock() is called" and not at any later time. There
> > is nothing here to support your interpretation.
> > >
> > > I think the intention of the wording is that for deterministic policies,
> > > it is clear that the waiting threads are actually worken and reevaluated
> > > for scheduling. In the case of SCHED_OTHER, it means basically nothing,
> > > considering the scheduling policy is arbitrary.
> > >
> > Clearly the point is that one of the waiting threads is waken and gets
> > the mutex, and it doesn't matter which thread is chosen. I.e., whatever
>
>Note that this behavior directly leads to "convoy formation": if that
>woken thread T0 does not immediately run (e.g., because there are higher
>priority threads) but still already owns the mutex, then other running
>threads contending for this mutex will block waiting for T0, forming a
>convoy.
>
I just wonder - what is the problem with this convoy formation?
It can only happen when the cpu is overloaded, and in that case
someone has to wait. In this case, the mutex waiters.

Aggressively handing the cpu to whoever holds a mutex will mean the
mutexes are free more of the time - but it will *not* mean less waiting in
tghe system. You just changes who waits.

Helge Hafting

unread,

Jan 30, 2006, 3:39:29 AM1/30/06

to linux-os (Dick Johnson), Howard Chu, Nick Piggin, Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

linux-os (Dick Johnson) wrote:

>To fix the current problem, you can substitute usleep(0); It will
>give the CPU to somebody if it's computable, then give it back to
>you. It seems to work in every case that sched_yield() has
>mucked up (perhaps 20 to 30 here).
>
>
Isn't that dangerous? Someday, someone working on linux (or some
other unixish os) might come up with an usleep implementation where
usleep(0) just returns and becomes a no-op. Which probably is ok
with the usleep spec - it did sleep for zero time . . .

Howard Chu

unread,

Jan 30, 2006, 3:51:16 AM1/30/06

to Helge Hafting, linux-os (Dick Johnson), Nick Piggin, Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

Helge Hafting wrote:
> linux-os (Dick Johnson) wrote:
>
>> To fix the current problem, you can substitute usleep(0); It will
>> give the CPU to somebody if it's computable, then give it back to
>> you. It seems to work in every case that sched_yield() has
>> mucked up (perhaps 20 to 30 here).
>>
>>
> Isn't that dangerous? Someday, someone working on linux (or some
> other unixish os) might come up with an usleep implementation where
> usleep(0) just returns and becomes a no-op. Which probably is ok
> with the usleep spec - it did sleep for zero time . . .
>

We actually experimented with usleep(0) and select(...) with a zeroed
timeval. Both of these approaches performed worse than just using
sched_yield(), depending on the system and some other conditions.
Dual-core AMD64 vs single-CPU had quite different behaviors. Also, if
the slapd main event loop was using epoll() instead of select(), the
select's used for yields slowed down by a couple orders of magnitude. (A
test that normally took ~30 seconds took as long as 45 minutes in one
case, it was quite erratic.)

It turned out that most of those yield's were leftovers inherited from
when we only supported non-preemptive threads, and simply deleting them
was the best approach.

--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc
OpenLDAP Core Team http://www.openldap.org/project/

-

Nikita Danilov

unread,

Jan 30, 2006, 6:15:24 AM1/30/06

to Helge Hafting, Nikita Danilov, Howard Chu, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

Helge Hafting writes:
> Nikita Danilov wrote:
>
> >Howard Chu writes:
> >
> >[...]
> >
> > >
> > > A straightforward reading of the language here says the decision happens
> > > "when pthread_mutex_unlock() is called" and not at any later time. There
> > > is nothing here to support your interpretation.
> > > >
> > > > I think the intention of the wording is that for deterministic policies,
> > > > it is clear that the waiting threads are actually worken and reevaluated
> > > > for scheduling. In the case of SCHED_OTHER, it means basically nothing,
> > > > considering the scheduling policy is arbitrary.
> > > >
> > > Clearly the point is that one of the waiting threads is waken and gets
> > > the mutex, and it doesn't matter which thread is chosen. I.e., whatever
> >
> >Note that this behavior directly leads to "convoy formation": if that
> >woken thread T0 does not immediately run (e.g., because there are higher
> >priority threads) but still already owns the mutex, then other running
> >threads contending for this mutex will block waiting for T0, forming a
> >convoy.
> >
> I just wonder - what is the problem with this convoy formation?
> It can only happen when the cpu is overloaded, and in that case
> someone has to wait. In this case, the mutex waiters.

The obvious problem is extra context switch: if mutex is left unlocked,
then first thread (say, T0) that tries to acquire it, succeeds and
continues to run, whereas if mutex is directly handed to the runnable
(but not running) thread T1, T0 has to block, until T1 runs.

What's worse, convoys tend to grow once formed.

>
> Aggressively handing the cpu to whoever holds a mutex will mean the
> mutexes are free more of the time - but it will *not* mean less waiting in
> tghe system. You just changes who waits.
>
> Helge Hafting

Nikita.

linux-os (Dick Johnson)

unread,

Jan 30, 2006, 8:29:12 AM1/30/06

to Helge Hafting, Howard Chu, Nick Piggin, Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

On Mon, 30 Jan 2006, Helge Hafting wrote:

> linux-os (Dick Johnson) wrote:
>
>> To fix the current problem, you can substitute usleep(0); It will
>> give the CPU to somebody if it's computable, then give it back to
>> you. It seems to work in every case that sched_yield() has
>> mucked up (perhaps 20 to 30 here).
>>
>>
> Isn't that dangerous? Someday, someone working on linux (or some
> other unixish os) might come up with an usleep implementation where
> usleep(0) just returns and becomes a no-op. Which probably is ok
> with the usleep spec - it did sleep for zero time . . .
>
> Helge Hafting

Dangerous?? You have a product that needs to ship. You can make
it work by adding a hack. You add a hack. I don't see danger at
all. I see getting the management off the back of the software
engineers so that they can fix the code. Further, you __test__ the
stuff before you ship. If usleep(0) just spins, then you use
usleep(1).

Also, I don't think any Engineer would use threads for anything
that could be potentially dangerous anyway. You create step-by-step
ordered procedures with explicit state-machines for things that
really need to happen as written. You use threads for things that
must occur, but you don't give a damn when they occur (like updating
a window on the screen or sorting keys in a database).

Cheers,
Dick Johnson
Penguin : Linux version 2.6.13.4 on an i686 machine (5589.66 BogoMips).
Warning : 98.36% of all statistics are fiction.
.

****************************************************************
The information transmitted in this message is confidential and may be privileged. Any review, retransmission, dissemination, or other use of this information by persons or entities other than the intended recipient is prohibited. If you are not the intended recipient, please notify Analogic Corporation immediately - by replying to this message or by sending an email to Deliver...@analogic.com - and destroy all copies of this information, including any attachments, without reading or disclosing them.

Thank you.

Helge Hafting

unread,

Jan 30, 2006, 10:11:11 AM1/30/06

to linux-os (Dick Johnson), Howard Chu, Nick Piggin, Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

linux-os (Dick Johnson) wrote:

>On Mon, 30 Jan 2006, Helge Hafting wrote:
>
>
>
>>linux-os (Dick Johnson) wrote:
>>
>>
>>
>>>To fix the current problem, you can substitute usleep(0); It will
>>>give the CPU to somebody if it's computable, then give it back to
>>>you. It seems to work in every case that sched_yield() has
>>>mucked up (perhaps 20 to 30 here).
>>>
>>>
>>>
>>>
>>Isn't that dangerous? Someday, someone working on linux (or some
>>other unixish os) might come up with an usleep implementation where
>>usleep(0) just returns and becomes a no-op. Which probably is ok
>>with the usleep spec - it did sleep for zero time . . .
>>
>>
>

>Dangerous?? You have a product that needs to ship. You can make
>it work by adding a hack. You add a hack. I don't see danger at
>all. I see getting the management off the back of the software
>engineers so that they can fix the code. Further, you __test__ the
>stuff before you ship. If usleep(0) just spins, then you use
>usleep(1).
>
>

The dangerous part was that usleep(0) works as a "yield"
today, as your testing will confirm before you ship the product.
But it may break next year if someone changes this part of
the kernel. Then your customer suddenly have a broken product.

Helge Hafting

Kyle Moffett

unread,

Jan 30, 2006, 10:34:30 AM1/30/06

to Howard Chu, Helge Hafting, linux-os (Dick Johnson), Nick Piggin, Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hanc...@shaw.ca

On Jan 30, 2006, at 03:50, Howard Chu wrote:
> Helge Hafting wrote:
>> linux-os (Dick Johnson) wrote:
>>> To fix the current problem, you can substitute usleep(0); It will
>>> give the CPU to somebody if it's computable, then give it back to
>>> you. It seems to work in every case that sched_yield() has mucked
>>> up (perhaps 20 to 30 here).
>>
>> Isn't that dangerous? Someday, someone working on linux (or some
>> other unixish os) might come up with an usleep implementation
>> where usleep(0) just returns and becomes a no-op. Which probably
>> is ok with the usleep spec - it did sleep for zero time . . .
>
> We actually experimented with usleep(0) and select(...) with a
> zeroed timeval. Both of these approaches performed worse than just
> using sched_yield(), depending on the system and some other
> conditions. Dual-core AMD64 vs single-CPU had quite different
> behaviors. Also, if the slapd main event loop was using epoll()
> instead of select(), the select's used for yields slowed down by a
> couple orders of magnitude. (A test that normally took ~30 seconds
> took as long as 45 minutes in one case, it was quite erratic.)
>
> It turned out that most of those yield's were leftovers inherited
> from when we only supported non-preemptive threads, and simply
> deleting them was the best approach.

I would argue that in a non realtime environment sched_yield() is not
useful at all. When you want to wait for another process, you wait
explicitly for that process using one of the various POSIX-defined
methods, such as mutexes, condition variables, etc. There are very
clearly and thoroughly defined ways to wait for other processes to
complete work, why rely on usleep(0) giving CPU to some other task
when you can explicitly tell the scheduler "I am waiting for task foo
to release this mutex" or "I can't run until somebody signals this
condition variable".

Cheers,
Kyle Moffett

--
Unix was not designed to stop people from doing stupid things,
because that would also stop them from doing clever things.
-- Doug Gwyn

li...@horizon.com

unread,

Jan 30, 2006, 5:02:15 PM1/30/06

to dav...@webmaster.com, linux-...@vger.kernel.org

> It can tell the difference between the other thread getting
> the mutex first and it getting the mutex first. But it cannot tell the
> difference between an implementation that puts random sleeps before calls
> to 'pthread_mutex_lock' and an implementation that has the allegedly
> non-compliant behavior. That makes the behavior compliant under the
> 'as-if' rule.
>
> If you don't believe me, try to write a program that prints
> 'non-compliant' on a system that has the alleged non-compliance but is
> guaranteed not to do so on any compliant system. It cannot be done.
>
> In order to claim the alleged compliance, you would have to
> know that a thread waiting for a mutex did not get it. But there is no
> possible way you can know that another thread is waiting for the mutex
> (as opposed to being about to wait for it). So you can never detect the
> claimed non-compliance, so it's not non-compliance.

An excellent point, but the existence of pthread_mutex_trylock()
invalidates it.

To be very specific, the following will do the job:

volatile unsigned shared_variable;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void
thread_function()
{
unsigned prev_value = shared_variable;

for (;;) {
unsigned cur_value, delta;
if (pthread_mutex_trylock(&lock) == 0) {
cur_value = ++shared_variable;
pthread_mutex_unlock(&lock);
delta = cur_value - prev_value;
} else {
/* Another thread is holding the lock. */
pthread_mutex_lock(&lock);
cur_value = ++shared_variable;
pthread_mutex_unlock(&lock);
delta = cur_value - prev_value;
if (delta == 1)
fatal("non-compliant");
}
/* Assuming we don't wrap */
if (delta == 0)
fatal("buggy as a roach motel");
}
}

You need to run more than one instance of the thread_function()
to have a chance of triggering the non-compliant message, of course.

li...@horizon.com

unread,

Jan 30, 2006, 6:37:47 PM1/30/06

to dav...@webmaster.com, linux-...@vger.kernel.org

Thinking some more on my example, for SCHED_OTHER threads, it is possible
to define the problem away by making pthread_mutex_trylock behave
compatibly with pthread_mutex_lock. That is, threads in pthread_mutex_lock()
are actually descheduled just before waiting for the lock (SCHED_OTHER is
allowed to do that), and when the lock becomes available, the scheduler
then decides who to run through the acquisition code.

As long as pthread_mutex_trylock() succeeds in such a case, some may call
it weird, but it's conformant, and the performance arguments for the
"unfair" case might easily win the day.

This is assuming that SCHED_OTHER can block a process for an arbitrary
time for no good reason. Otherwise, if the lock holder is waiting for
device I/O and no other processes are competing for the CPU, perhaps
blocking on the edge like that for an unbounded time is illegal.

However, if you have priorities and can't redefine locking using creative
scheduling policies, it's less clear. If I have a couple of real-time
tasks, I can't decide arbitrarily to run one in lieu of the other.
For example, suppose that without priority inheritance, you have three
tasks, A (highest priority), B, and C (lowest).

There are three locks. Initially, A holds lock 1 and C holds lock 2.
Then A tries to acquire lock 2. A blocks, so B runs until it blocks trying
to get lock 1. Then C runs and drops lock 2. A gets it, then drops lock 1
and tries to re-acquire it.

It seems to me that the Posix spec mandates that B gets lock 1 (and A
must block) before A can re-acquire it.

(This can also be done with priority inheritance, although it's a bit
different. A version that works whether priority inheritance is
implemented or not is probably possible, too.)

I'm not saying that this is a good thing, but it's distinguishable, and I
don't have any language-lawyer way to escape the obligation.

David Schwartz

unread,

Jan 31, 2006, 6:19:06 PM1/31/06

to Nikita Danilov, Howard Chu, Linux Kernel Mailing List

> I just wonder - what is the problem with this convoy formation?
> It can only happen when the cpu is overloaded, and in that case
> someone has to wait. In this case, the mutex waiters.

The problem is that you need to become more efficient as load increases, not less. If you get more efficient as load increases, you can get into a situation where even though you have an amount of load you can handle, you will never catch up on the load that backed up before.

> Aggressively handing the cpu to whoever holds a mutex will mean the
> mutexes are free more of the time - but it will *not* mean less waiting in
> tghe system. You just changes who waits.

It will mean fewer context switches and more effective use of caches as load increases. Even a very small amount of "gets more efficient as load goes up" can mean the difference between a system that handles load spikes smoothly (with a temporary reduction in responsiveness) and a system that backs up in a load spike and never recovers (with a per,amently increasing reduction in responsiveness even with load that's normally tolerable).

As load goes up, you need your threads to use more of their timeslice. This means not descheduling a running thread unless it is unavoidable.

DS

Nick Piggin

unread,

Feb 1, 2006, 7:32:07 AM2/1/06

to Howard Chu, dav...@webmaster.com, Linux Kernel Mailing List

Howard Chu wrote:
> Nick Piggin wrote:
>> Howard Chu wrote:
>>
>>>
>>> And again in this case, A should not be immediately reacquiring the
>>> lock if it doesn't actually need it.
>>>
>>
>> No, not immediately, I said "for a very long time". As in: A does not
>> need the exclusion provided by the lock for a very long time so it
>> drops it to avoid needless contention, then reaquires it when it finally
>> does need the lock.
>
>
> OK. I think this is really a separate situation. Just to recap: A takes
> lock, does some work, releases lock, a very long time passes, then A
> takes the lock again. In the "time passes" part, that mutex could be
> locked and unlocked any number of times by other threads and A won't
> know or care. Particularly on an SMP machine, other threads that were
> blocked on that mutex could do useful work in the interim without
> impacting A's progress at all. So here, when A leaves the mutex unlocked
> for a long time, it's desirable to give the mutex to one of the waiters
> ASAP.
>

But how do you quantify "a long time"? And what happens if process A is
a very high priority and which nothing else is allowed to run?

>> Just accept that my described scenario is legitimate then consider it in
>> isolation rather than getting caught up in the superfluous details of how
>> such a situation might come about.
>
>
> OK. I'm not trying to be difficult here. In much of life, context is
> everything; very little can be understood in isolation.
>

OK, but other valid examples were offered up - lock inversion avoidance,
and externally driven systems (ie. where it is not known which lock will
be taken next).

> Back to the scenario:
>
>> A realtime system with tasks A and B, A has an RT scheduling priority of
>> 1, and B is 2. A and B are both runnable, so A is running. A takes a
>> mutex
>> then sleeps, B runs and ends up blocked on the mutex. A wakes up and at
>> some point it drops the mutex and then tries to take it again.
>>
>> What happens?
>
>
> As I understand the spec, A must block because B has acquired the mutex.
> Once again, the SUS discussion of priority inheritance would never need
> to have been written if this were not the case:
>
> >>>
> In a priority-driven environment, a direct use of traditional primitives
> like mutexes and condition variables can lead to unbounded priority
> inversion, where a higher priority thread can be blocked by a lower
> priority thread, or set of threads, for an unbounded duration of time.
> As a result, it becomes impossible to guarantee thread deadlines.
> Priority inversion can be bounded and minimized by the use of priority
> inheritance protocols. This allows thread deadlines to be guaranteed
> even in the presence of synchronization requirements.
> <<<
>
> The very first sentence indicates that a higher priority thread can be
> blocked by a lower priority thread. If your interpretation of the spec
> were correct, then such an instance would never occur. Since your

Wrong. It will obviously occur if the lower priority process is able
to take a lock before a higher priority process.

The situation will not exist in "the scenario" though, if we follow
my reading of the spec, because *the scheduler* determines the next
process to gain the mutex. This makes perfect sense to me.

> scenario is using realtime threads, then we can assume that the Priority
> Ceiling feature is present and you can use it if needed. (
> http://www.opengroup.org/onlinepubs/000095399/xrat/xsh_chap02.html#tag_03_02_09_06
> Realtime Threads option group )
>

Any kind of priority boost / inherentance like this is orthogonal to
the issue. They still do not prevent B from acquiring the mutex and
thereby blocking the execution of the higher priority A. I think this
is against the spirit of the spec, especially the part where it says
*the scheduler* will choose which process to gain the lock.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

Lee Schermerhorn

unread,

Feb 1, 2006, 12:06:48 PM2/1/06

to Nick Piggin, linux-kernel

Hi, Nick, All:

It is with some trepedation that I jump into this already looong thread...

I spent ~6 years as secretary of the Posix.4/.4a [realtime and threads]
working group where we discussed this stuff ad nauseum. I was also
"technical reviewer" of a few chapters of the .4 drafts [Posix.1a spec].
I only bring this up to point out where my "understanding" of the
"intent" of the spec, such as it is, comes from. A caveat: my
involvement was with IEEE/Posix. The Open Group has adopted the Posix
specs into the SUS and has, rightfully so, imposed additional
interpretation and requirements on it. E.g., the SUS can require that
certain features that are optional in Posix.

There are ambiguities in the spec. We tried to avoid these. One reason
for some of the ambiguities is that this is the best that we could get
the various factions, corporations, ... represented in the working
groups and the balloting groups [not necessarily the same folks] to
agree upon.

The drafts went through a couple of years [maybe longer] of balloting
where folks would object to the wording based on all sorts of real and
imagined scenarios. Still, when all is said and done, the spec says
what it says. I've been told that any interpretation that meets the
letter of the specification [and English is notorious for it's
ambiguity, leaving the field wide open, here] is valid, intentions of
the authors notwithstanding.

Also, note that, for Posix, at least, there are 2 parts of the spec:
The main body of the spec--so called "normative" text or mandatory
requirements--and the rationale that attempts to explain some of the
background and, well, rationale. The rationale is non-normative/non-
binding.

With that background:

[note: the single '>' indents are from Nick. I grabbed this from the
archive. sorry].

My copy of the Posix spec and the on-line SUSv2 say *the scheduling
policy* determines the next process/thread. I.e, the scheduler, per se,
isn't required to get involved--the selected waiter doesn't need to run
to obtain the mutex.

The intention of the authors [again, not binding] was that for
SCHED_OTHER, all bets are off; and for RT/FIFO policies, the highest
priority, longest waiting thread would be chosen. Think of the queue of
waiters being ordered the same way as the run queue for the policy in
effect. Then, the selection of the next thread to get the mutex is
simply the head of the list of waiters. Presumably the scheduling
policy determined how the waiters were queued--paying the price at wait
time, when you're going to suffer a context switch anyway. However, one
could queue in any order at wait time and pay the price to scan the
queue at unlock time.

Now, this harks back to Howard Chu's [I think?] discussion of
[paraphrasing] "what is the eligible set of threads from which the
system selects." The spec is ambiguous here.

One consideration that I haven't heard discussed in this thread [and I
might have missed it] is the notion of "forward progress". I checked my
copy and the spec seems to be silent on this topic. Most
implementations that I've worked with work hard to guarantee forward
progress through the mutex. Generally, I've seen this implemented by
handing off the lock to the "most eligible" thread [as determined by the
scheduling policy]. This is, in my reading of the spec and based on my
understanding of intent, a perfectly conforming implementation. It also
avoids the "thundering herd" phenomenon.

>
>> scenario is using realtime threads, then we can assume that the Priority
>> Ceiling feature is present and you can use it if needed. (
>> http://www.opengroup.org/onlinepubs/000095399/xrat/xsh_chap02.html#tag_03_02_09_06
>> Realtime Threads option group )
>>
>
>Any kind of priority boost / inherentance like this is orthogonal to
>the issue. They still do not prevent B from acquiring the mutex and
>thereby blocking the execution of the higher priority A. I think this
>is against the spirit of the spec, especially the part where it says
>*the scheduler* will choose which process to gain the lock.
>

I don't agree. Priority inheritance and priority ceiling options were
provided for just this purpose. Some of the "hard real-time" working
group members insisted on this feature to avoid the dreaded priority
inversion. However, the non-real-time folks interested in threads for
concurrent programming, didn't need nor want the associated overhead.
Thus, the options. If, in the absence of priority inheritance, a lower
priority thread [B] gains a mutex [e.g., because of direct hand-off at
unlock time] and a higher priority thread [A] must then wait for the
mutex, that's allowed. [Disclaimer: the priority inheritance and other
mutex options were added in a later version of the spec, after I ended
my involvement in the working grou. However, they were discussed at
length in the original .4/.4a working groups and deferred to the later
update.]

The scenario that most of the real time folks [that I've talked to or
heard discuss it] worry about is when a 3rd process, C, with priority
between A and B, preempts B while it's holding the mutex. C can run
for an unbounded time [*unbounded* priority inversion is the big
bugaboo], as long as it doesn't try to obtain the mutex, effectively
preventing B from finishing its use and allowing A to proceed. This is
what priority inheritance is supposed to prevent.

With priority inheritance, when A attempts to take the mutex held by B,
B inherits A's priority such that any C [between A and B in priority]
can't preempt B. When, B releases the the mutex, it's priority is
dropped [to the lower of it's native priority or the highest priority of
any waiters on any other mutexes that B or any of those aforementioned
waiters might hold, yada, yada, yada--an ugly requirement, no? more
rationale for making it optional!] and now A, if it's the highest
priority waiter, can grab the mutex. In a uniprocessor, it probably
doesn't matter whether we hand the mutex off directly to A or just wake
up A, let it preempt B and grab the mutex. In an SMP, some other thread
of lower priority that A could sneak in, requiring another heavy-weight
priority inheritance transaction if we don't hand off.

So, ....

The current implementation, that apparently doesn't hand off, but
requires a waiter to run to grab the mutex is probably conforming, in a
strict sense. But, to say that this is the intent of the spec is, IMO,
a stretch. I suspect it violates the "Principle of Least Astonishment"
for a lot of practioners. I know is does for me, but I've learned that
my vote doesn't count for much in any venue...

Trying to be helpful, but probably just muddying the water...,
Lee

Jeff Garzik

unread,

Feb 7, 2006, 1:57:40 AM2/7/06

to Jesse Brandeburg, Stefan Seyfried, Olaf Kirch, Linux Kernel Mailing List, net...@vger.kernel.org, Jesse Brandeburg, Jeff Kirsher

SIGH. In your last patch submission you had it right, but Intel has yet
again regressed in patch submission form.

Your fixes will be expedited if they can be applied by script, and then
quickly whisked upstream to Linus/Andrew. This one had to be applied by
hand (so yes, its applied) for several reasons:

* Unreviewable in mail reader, due to MIME type application/octet-stream.

* In general, never use MIME (attachments), they decrease the audience
that can easily review your patch.

* Your patch's description and signed-off-by were buried inside the
octet-stream attachment.

* Please review http://linux.yyz.us/patch-format.html (I probably
should add MIME admonitions to that)

Jeff