
[RFC] New locking primitive for 2.5


Martin Wirth

Feb 7, 2002, 10:38:57 AM
This is a request for comment on a new locking primitive
called a combilock.

The goal of this development is:

1. To allow better SMP scalability for semaphores used as mutexes
2. To replace long-held spinlocks in a preemptible kernel

The new lock uses a combination of a spinlock and a (mutex-)semaphore.
You can lock it for short-term issues in a spin-lock mode:

combi_spin_lock(struct combilock *x)
combi_spin_unlock(struct combilock *x)

and for longer lasting tasks in a sleeping mode by:

combi_mutex_lock(struct combilock *x)
combi_mutex_unlock(struct combilock *x)

If a spin_lock request is blocked by a mutex_lock call, the spin_lock
attempt also sleeps, i.e. it behaves like a semaphore.
Once you have gained ownership of the lock, you can switch between
spin mode and mutex (i.e. sleeping) mode by calling:

combi_to_mutex_mode(struct combilock *x)
combi_to_spin_mode(struct combilock *x)

without losing the lock. So you may start with a spin-lock and relax
to a sleeping lock if, for example, you need to call a non-atomic kmalloc.
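The semantics described here can be modeled in userspace. The following is an illustrative sketch using POSIX threads, not the kernel patch below; the names and the pthread mapping are mine. The inner pthread mutex plays the role of the wait-queue spinlock, and the owner flag marks mutex (sleeping) mode:

```c
#include <pthread.h>
#include <stddef.h>

/* Userspace model of a combilock: the wait-queue spinlock becomes a
 * pthread mutex, the owner field becomes a flag, and sleeping waiters
 * use a condition variable in place of the kernel wait queue. */
struct combilock_model {
    pthread_mutex_t lock;   /* plays the role of x->wait.lock   */
    pthread_cond_t  wait;   /* plays the role of the wait queue */
    int             owner;  /* nonzero while held in mutex mode */
};

void combi_init(struct combilock_model *x)
{
    pthread_mutex_init(&x->lock, NULL);
    pthread_cond_init(&x->wait, NULL);
    x->owner = 0;
}

/* Short-term mode: take the inner lock; if a mutex-mode holder
 * owns the combilock, sleep until it is released. */
void combi_spin_lock(struct combilock_model *x)
{
    pthread_mutex_lock(&x->lock);
    while (x->owner)
        pthread_cond_wait(&x->wait, &x->lock);
}

void combi_spin_unlock(struct combilock_model *x)
{
    pthread_mutex_unlock(&x->lock);
}

/* Switch to sleeping mode without losing ownership: mark the
 * combilock owned, then drop the inner lock so contenders can queue.
 * Caller must currently hold the lock in spin mode. */
void combi_to_mutex_mode(struct combilock_model *x)
{
    x->owner = 1;
    pthread_mutex_unlock(&x->lock);
}

/* Switch back: re-take the inner lock, then clear the owner flag. */
void combi_to_spin_mode(struct combilock_model *x)
{
    pthread_mutex_lock(&x->lock);
    x->owner = 0;
}

void combi_mutex_lock(struct combilock_model *x)
{
    combi_spin_lock(x);
    combi_to_mutex_mode(x);
}

void combi_mutex_unlock(struct combilock_model *x)
{
    pthread_mutex_lock(&x->lock);
    x->owner = 0;
    pthread_cond_broadcast(&x->wait);
    pthread_mutex_unlock(&x->lock);
}
```

A holder can thus start in spin mode, call combi_to_mutex_mode() before a blocking allocation, and combi_to_spin_mode() afterwards, exactly as described above.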

This approach is less automatic than a first_spin_then_sleep mutex,
but normally the programmer knows best whether the critical section
will be quick or potentially unbounded.

Note: in a preemptible kernel this approach can lead to much less
scheduling ping-pong, even on UP, if a spinlock is replaced by a
combilock instead of a semaphore.


Limitations:
1. In the current implementation the combilock is not irq-safe.

2. Although the combilock shares some advantages with a
spin-lock (no unnecessary scheduling for short-term locking), it may
still behave like a semaphore on entry, even if you call combi_spin_lock.
For example

spin_lock(&slock);
combi_spin_lock(&clock);

is a BUG because combi_spin_lock may sleep while holding slock!


Open questions:

* Does it make sense to also provide irq-safe versions of the
locking functions? This would mean the unlock functions could be used
from interrupt context. But the main use in this situation is
completion handling, and there are already (new) completion handlers
available. So I don't think this is a must-have.

* Is there any need to provide non-exclusive versions of the waiting
functions? For real-time applications this would lead to an
automatic selection of the highest-priority task that is waiting
for the lock. But on the other hand it leads to a lot of unnecessary
scheduling, and I doubt it's really worth it. But maybe there are
other good reasons to wake up all waiters.

Possible optimizations:

If a further extension to a priority-inheritance scheme is ruled out,
the owner may be replaced by a simple flag. And if this is done,
one could possibly group the owner flag and the waitqueue
spinlock together into a single 3-state lock. But on architectures
without a cmpxchg instruction this may give no real performance win.
And even on i386 this would require tweaking the waitqueue code.
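The suggested grouping can be sketched with a single compare-and-swap word. This is an illustrative C11 model; the states and names are invented here, and a real implementation would park C_MUTEX contenders on a wait queue instead of spinning:

```c
#include <stdatomic.h>

/* One word replaces both the owner flag and the waitqueue spinlock. */
enum { C_FREE = 0, C_SPIN = 1, C_MUTEX = 2 };

typedef atomic_int trilock_t;

/* Try to move FREE -> wanted mode with a single compare-exchange. */
static int tri_trylock(trilock_t *l, int mode)
{
    int expected = C_FREE;
    return atomic_compare_exchange_strong(l, &expected, mode);
}

static void tri_lock(trilock_t *l, int mode)
{
    while (!tri_trylock(l, mode))
        ;  /* a real implementation would sleep here when the
              current holder is in C_MUTEX mode, not burn CPU */
}

static void tri_unlock(trilock_t *l)
{
    atomic_store(l, C_FREE);  /* plus a wakeup in the real thing */
}
```

The win is that the uncontended paths touch a single cache line with one atomic operation; the cost, as noted above, is that architectures without cmpxchg gain little.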

To really benefit from a preemptible kernel, a lot of spinlocks
will have to be replaced by mutex locks. The combi-lock approach may
convince more people who typically fear the higher scheduling pressure
of sleeping locks to do so, if they can decide for each instance which
approach (spin or sleep) will be taken.

Here comes the code:

diff -ruP linux-2.5.3/include/linux/combilock.h linux/include/linux/combilock.h
--- linux-2.5.3/include/linux/combilock.h Thu Jan 1 01:00:00 1970
+++ linux/include/linux/combilock.h Wed Feb 6 14:09:23 2002
@@ -0,0 +1,86 @@
+#ifndef __LINUX_COMBILOCK_H
+#define __LINUX_COMBILOCK_H
+
+/*
+ * combilock data structure.
+ * See kernel/sched.c for details.
+ */
+
+
+#include <linux/wait.h>
+#include <asm/current.h>
+
+struct combilock {
+ struct task_struct volatile *owner;
+ wait_queue_head_t wait;
+};
+
+#define COMBILOCK_INITIALIZER(work) \
+ { NULL, __WAIT_QUEUE_HEAD_INITIALIZER((work).wait) }
+
+#define DECLARE_COMBILOCK(work) \
+ struct combilock work = COMBILOCK_INITIALIZER(work)
+
+static inline void init_combilock(struct combilock *x)
+{
+ x->owner = NULL;
+ init_waitqueue_head(&x->wait);
+}
+
+extern int FASTCALL(combi_mutex_trylock(struct combilock *x));
+extern void FASTCALL(combi_mutex_lock(struct combilock *x));
+extern int FASTCALL(combi_mutex_lock_interruptible(struct combilock *x));
+extern void FASTCALL(combi_mutex_unlock(struct combilock *x));
+extern void FASTCALL(__combi_wait(struct combilock *x));
+
+static inline struct task_struct volatile *combi_owner(struct combilock *x)
+{
+ return x->owner;
+}
+
+static inline void combi_spin_lock(struct combilock *x)
+{
+ spin_lock(&x->wait.lock);
+ if (unlikely(x->owner))
+ __combi_wait(x);
+}
+
+static inline int combi_spin_trylock(struct combilock *x)
+{
+ if (!spin_trylock(&x->wait.lock))
+ return 1; /* busy: same convention as combi_mutex_trylock */
+ if (unlikely(x->owner))
+ __combi_wait(x);
+ return 0;
+}
+
+static inline void combi_spin_unlock(struct combilock *x)
+{
+ spin_unlock(&x->wait.lock);
+}
+
+static inline void combi_to_mutex_mode(struct combilock *x)
+{
+ if (likely(!x->owner)) {
+ x->owner=current;
+ spin_unlock(&x->wait.lock);
+ }
+}
+
+static inline void combi_to_spin_mode(struct combilock *x)
+{
+ if (likely(x->owner)) {
+ spin_lock(&x->wait.lock);
+ x->owner=NULL;
+ }
+}
+
+static inline void combi_generic_unlock(struct combilock *x)
+{
+ if (likely(!x->owner))
+ combi_spin_unlock(x);
+ else
+ combi_mutex_unlock(x);
+}
+
+#endif
diff -ruP -X /home/adlex/linuxdiffpattern.txt linux-2.5.3/kernel/ksyms.c linux/kernel/ksyms.c
--- linux-2.5.3/kernel/ksyms.c Tue Jan 29 19:47:10 2002
+++ linux/kernel/ksyms.c Wed Feb 6 15:35:26 2002
@@ -46,6 +46,7 @@
#include <linux/tty.h>
#include <linux/in6.h>
#include <linux/completion.h>
+#include <linux/combilock.h>
#include <linux/seq_file.h>
#include <asm/checksum.h>

@@ -371,6 +372,13 @@
/* completion handling */
EXPORT_SYMBOL(wait_for_completion);
EXPORT_SYMBOL(complete);
+
+/* combilock non-inline functions */
+EXPORT_SYMBOL(combi_mutex_trylock);
+EXPORT_SYMBOL(combi_mutex_lock);
+EXPORT_SYMBOL(combi_mutex_lock_interruptible);
+EXPORT_SYMBOL(combi_mutex_unlock);
+EXPORT_SYMBOL(__combi_wait);

/* The notion of irq probe/assignment is foreign to S/390 */

diff -ruP -X /home/adlex/linuxdiffpattern.txt linux-2.5.3/kernel/sched.c linux/kernel/sched.c
--- linux-2.5.3/kernel/sched.c Tue Jan 29 00:12:47 2002
+++ linux/kernel/sched.c Wed Feb 6 13:36:19 2002
@@ -19,6 +19,7 @@
#include <linux/smp_lock.h>
#include <linux/interrupt.h>
#include <asm/mmu_context.h>
+#include <linux/combilock.h>

#define BITMAP_SIZE ((((MAX_PRIO+7)/8)+sizeof(long)-1)/sizeof(long))

@@ -782,6 +783,89 @@
x->done--;
spin_unlock_irq(&x->wait.lock);
}
+
+
+
+
+
+
+/*
+ * Helper functions assume we hold x->wait.lock
+ */
+void __combi_wait(struct combilock *x)
+{
+ DECLARE_WAITQUEUE(wait, current);
+
+ wait.flags |= WQ_FLAG_EXCLUSIVE;
+ __add_wait_queue_tail(&x->wait, &wait);
+ do {
+ __set_current_state(TASK_UNINTERRUPTIBLE);
+ spin_unlock(&x->wait.lock);
+ schedule();
+ spin_lock(&x->wait.lock);
+ } while (x->owner);
+ __remove_wait_queue(&x->wait, &wait);
+}
+
+int combi_mutex_lock_interruptible(struct combilock *x)
+{
+ int res=0;
+ spin_lock(&x->wait.lock);
+ if (unlikely(x->owner)) {
+ DECLARE_WAITQUEUE(wait, current);
+
+ wait.flags |= WQ_FLAG_EXCLUSIVE;
+ __add_wait_queue_tail(&x->wait, &wait);
+ for (;;) {
+ __set_current_state(TASK_INTERRUPTIBLE);
+ spin_unlock(&x->wait.lock);
+ schedule();
+ spin_lock(&x->wait.lock);
+ if (likely(!x->owner)) {
+ x->owner=current;
+ break;
+ }
+ if (unlikely(signal_pending(current))) {
+ res=1;
+ break;
+ }
+ }
+ __remove_wait_queue(&x->wait, &wait);
+ }
+ spin_unlock(&x->wait.lock);
+ return res;
+}
+
+int combi_mutex_trylock(struct combilock *x)
+{
+ spin_lock(&x->wait.lock);
+ if (unlikely(x->owner)) {
+ spin_unlock(&x->wait.lock);
+ return 1;
+ }
+ x->owner=current;
+ spin_unlock(&x->wait.lock);
+ return 0;
+}
+
+void combi_mutex_lock(struct combilock *x)
+{
+ spin_lock(&x->wait.lock);
+ if (unlikely(x->owner))
+ __combi_wait(x);
+ x->owner=current;
+ spin_unlock(&x->wait.lock);
+}
+
+void combi_mutex_unlock(struct combilock *x)
+{
+ spin_lock(&x->wait.lock);
+ x->owner=NULL;
+ __wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1, 0);
+ spin_unlock(&x->wait.lock);
+}
+
+

#define SLEEP_ON_VAR \
unsigned long flags; \
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Richard Gooch

Feb 7, 2002, 1:06:23 PM
Daniel Phillips writes:

> On February 7, 2002 04:38 pm, Martin Wirth wrote:
> > The new lock uses a combination of a spinlock and a (mutex-)semaphore.
>
> Spinaphore :-)

Or mutabloat :-(

Regards,

Richard....
Permanent: rgo...@atnf.csiro.au
Current: rgo...@ras.ucalgary.ca

Christoph Hellwig

Feb 7, 2002, 1:22:42 PM
In article <3C629F91...@dlr.de> you wrote:
> The new lock uses a combination of a spinlock and a (mutex-)semaphore.
> You can lock it for short-term issues in a spin-lock mode:
>
> combi_spin_lock(struct combilock *x)
> combi_spin_unlock(struct combilock *x)
>
> and for longer lasting tasks in a sleeping mode by:
>
> combi_mutex_lock(struct combilock *x)
> combi_mutex_unlock(struct combilock *x)

I think this API is really ugly. If both paths actually do the same,
just with different defaults, one lock function with a flag would be
much nicer. Also, why do we need two unlock functions?

What about the following instead:

combi_lock(struct combilock *x, int spin);
combi_unlock(struct combilock *x);

> If a spin_lock request is blocked by a mutex_lock call, the spin_lock
> attempt also sleeps i.e. behaves like a semaphore.
> If you gained ownership of the lock, you can switch between spin-mode
> and mutex-(ie.e sleeping) mode by calling:
>
> combi_to_mutex_mode(struct combilock *x)
> combi_to_spin_mode(struct combilock *x)
>
> without loosing the lock. So you may start with a spin-lock and relax
> to a sleeping lock if for example you need to call a non-atomic kmalloc.

This looks really ugly. I'd really prefer an automatic fallback from
spinning to sleeping after some timeout, like e.g. the Solaris adaptive
mutexes.

> * Does it make sense to also provide irq-save versions of the
> locking functions? This means you could use the unlock functions
> from interrupt context. But the main use in this situation is
> completion handling and there are already (new) completion handlers
> available. So I don't think this is a must have.

You are not supposed to sleep in irq context, so irq-safe combi-locks
don't make that much sense, IMHO.

Christoph

--
Of course it doesn't work. We've performed a software upgrade.

Ingo Molnar

Feb 7, 2002, 4:27:14 PM

On Thu, 7 Feb 2002, Andrew Morton wrote:

> Quite a few. Significant ones. pagemap_lru_lock and lru_list_lock
> come to mind.

ugh. Are you sure we want to *sleep* with something like pagemap_lru_lock
held? That pretty much brings all pagecache-related operations to a
grinding halt. I think complex spinlocked sections should rather be
simplified.

Ingo

Robert Love

Feb 7, 2002, 1:40:59 PM
On Thu, 2002-02-07 at 10:38, Martin Wirth wrote:
> This is a request for comment on a new locking primitive
> called a combilock.

Interesting ...

The question I raise is: how many locks do we have protecting a
single resource, where some codepaths hold the lock for a short
duration and others hold it for a long duration?

It would be useful to identify a few locks where this would benefit and
apply the appropriate combi variant and do some benchmarking.

Some of the talk I've heard has been toward an adaptive lock. These are
locks like Solaris's that can spin or sleep, usually depending on the
state of the lock's holder. Another alternative, which I prefer since
it is much less overhead, is a lock that spins-then-sleeps
unconditionally.
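The unconditional spin-then-sleep lock mentioned here can be sketched as follows. This is illustrative userspace code; the spin budget and the function name are invented:

```c
#include <pthread.h>

#define SPIN_BUDGET 1000  /* arbitrary; a real lock would tune this */

/* Spin-then-sleep: try the lock a bounded number of times in the
 * hope the holder is quick, then give up and block. */
void spin_then_sleep_lock(pthread_mutex_t *m)
{
    for (int i = 0; i < SPIN_BUDGET; i++)
        if (pthread_mutex_trylock(m) == 0)
            return;             /* got it while spinning */
    pthread_mutex_lock(m);      /* holder is slow: sleep */
}
```

The appeal is that no per-call-site decision is needed; the cost is that the budget is a runtime guess rather than a design decision.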

> If a spin_lock request is blocked by a mutex_lock call, the spin_lock
> attempt also sleeps i.e. behaves like a semaphore.
> If you gained ownership of the lock, you can switch between spin-mode
> and mutex-(ie.e sleeping) mode by calling:

This can be bad. What if I grab a spinlock in a codepath where only a
spinlock is appropriate (i.e. somewhere I can't sleep, like an irq
handler) -- and then I sleep?

> Note: For a preemptible kernel this approach could lead to much less
> scheduling ping-pong also for UP if a spinlock is replaced by a
> combilock instead of a semaphore.

Very true ... but only assuming we can find locks where there are
differing profiles of use.

> Open questions:


>
> * Does it make sense to also provide irq-save versions of the
> locking functions? This means you could use the unlock functions
> from interrupt context. But the main use in this situation is
> completion handling and there are already (new) completion handlers
> available. So I don't think this is a must have.

You can't sleep in an interrupt request handler, so this wouldn't make a
lot of sense.

> * I there any need to provide non exclusive versions of the waiting
> functions? For real-time applications this would lead to an
> automatic selection of the highest priority task that's waiting
> for the lock. But on the other hand it leads to a lot of unnecessary
> scheduling and I doubt it's really worth it. But maybe there are
> other good reasons to wakeup all waiters.

Non-exclusive wakeups are common ... see the uses of normal wake_up vs
wake_up_exclusive.

> To really take any benefit from a preemptible kernel a lot of spin locks
> will have to be replaced by mutex locks. The combi-lock approach may
> convince more people who typically fear the higher scheduling pressure
> of sleeping locks to do so, if they can decide on each instance which
> approach (spin of sleep) will be taken.

We shouldn't engage in wholesale changing of spinlocks to semaphores
without a priority-inheritance mechanism. And _that_ is the bigger
issue ...

Robert Love

Andrew Morton

Feb 7, 2002, 2:25:14 PM
Robert Love wrote:
>
> On Thu, 2002-02-07 at 10:38, Martin Wirth wrote:
> > This is a request for comment on a new locking primitive
> > called a combilock.
>
> Interesting ...
>
> The question I raise is, how many locks do we have where we have a
> single resource we lock where in some codepaths the lock is used for
> short duration and in other places the lock is long-duration?

Quite a few. Significant ones. pagemap_lru_lock and lru_list_lock
come to mind.

> It would be useful to identify a few locks where this would benefit and
> apply the appropriate combi variant and do some benchmarking.
>
> Some of the talk I've heard has been toward an adaptive lock. These are
> locks like Solaris's that can spin or sleep, usually depending on the
> state of the lock's holder. Another alternative, which I prefer since
> it is much less overhead, is a lock that spins-then-sleeps
> unconditionally.

I dunno. The spin-a-bit-then-sleep lock has always struck me as
i_dont_know_what_the_fuck_im_doing_lock(). Martin's approach puts
the decision in the hands of the programmer, rather than saying
"Oh gee I goofed" at runtime.

I need to think about all of this some more...

> ...


>
> > To really take any benefit from a preemptible kernel a lot of spin locks
> > will have to be replaced by mutex locks. The combi-lock approach may
> > convince more people who typically fear the higher scheduling pressure
> > of sleeping locks to do so, if they can decide on each instance which
> > approach (spin of sleep) will be taken.
>
> We shouldn't engage in wholesale changing of spinlocks to semaphores
> without a priority-inheritance mechanism. And _that_ is the bigger
> issue ...

hmmm.

Let's back off a bit. What are we trying to achieve here? What
problem are we trying to solve? Is it to allow preemptability
inside the infamous long-held locks? If so then I'd favour
a piecemeal approach to handling each one, rather than magic
bullets. Now it may be that certain of the locks are best handled
via a new primitive, but that's not obviously true at this time, to me.


Robert Love

Feb 7, 2002, 3:08:02 PM
On Thu, 2002-02-07 at 14:58, yoda...@fsmlabs.com wrote:

> On Thu, Feb 07, 2002 at 01:40:59PM -0500, Robert Love wrote:
> > We shouldn't engage in wholesale changing of spinlocks to semaphores
> > without a priority-inheritance mechanism. And _that_ is the bigger
> > issue ...
>

> Cool. We can then have the Solaris "this usually doesn't fail on test" priority
> inherit read/write lock. I can hardly wait.

Or, we could do things right and not.

Robert Love

Andrew Morton

Feb 7, 2002, 2:59:19 PM
Ingo Molnar wrote:
>
> On Thu, 7 Feb 2002, Andrew Morton wrote:
>
> > Quite a few. Significant ones. pagemap_lru_lock and lru_list_lock
> > come to mind.
>
> ugh. Are you sure we want to *sleep* with something like pagemap_lru_lock
> held?

Not guilty :) I was answering rml's question.

I suspect lru_list_lock is the shining example. We often
take it for very short periods and occasionally take it
for enormous periods.

> That pretty much brings all pagecache related operations to a
> grinding halt.

yup. We'd go into a scheduling storm until we find the process
which holds the lock.

> I think complex spinlocked sections should be simplified
> rather.

yes.


Mark Frazer

Feb 7, 2002, 2:55:46 PM
Christoph Hellwig <h...@ns.caldera.de> [02/02/07 14:41]:

> What about the following instead:
>
> combi_lock(struct combilock *x, int spin);
> combi_unlock(struct combilock *x);

how about
combi_lock (struct combilock *x, int canblock, int shouldblock);

Where the shouldblock flag is copied into the lock once it is held by
the caller, to indicate to new threads grabbing the lock how long the
lock will be held.

For locks that are held for some short-duration tasks and some
long-duration tasks, the holder should indicate how long the lock will
be held. I'd consider that an improvement over the "spin for a while
then block" idea.

Interrupt handlers can't block, so we need a flag to never block.
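This canblock/shouldblock scheme could look roughly like the following. This is a hypothetical userspace sketch; all names are invented, and the hint is deliberately read without the lock held, since it is only advisory:

```c
#include <pthread.h>

/* A lock whose holder publishes whether it expects to be slow;
 * contenders use the hint to decide between spinning and sleeping. */
struct hintlock {
    pthread_mutex_t m;
    int holder_is_slow;   /* hint published by the current holder */
};

void hintlock_init(struct hintlock *l)
{
    pthread_mutex_init(&l->m, NULL);
    l->holder_is_slow = 0;
}

void hintlock_acquire(struct hintlock *l, int canblock, int shouldblock)
{
    if (!canblock || !l->holder_is_slow) {
        /* holder claims to be quick, or we must not sleep: spin */
        while (pthread_mutex_trylock(&l->m) != 0)
            ;  /* in a kernel this would be cpu_relax() */
    } else {
        pthread_mutex_lock(&l->m);  /* holder is slow: sleep */
    }
    l->holder_is_slow = shouldblock;  /* publish our own hint */
}

void hintlock_release(struct hintlock *l)
{
    l->holder_is_slow = 0;
    pthread_mutex_unlock(&l->m);
}
```

An interrupt-context caller would pass canblock=0 and always spin, per the point above.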

-mark

yoda...@fsmlabs.com

Feb 7, 2002, 2:56:01 PM
On Thu, Feb 07, 2002 at 04:38:57PM +0100, Martin Wirth wrote:
> This is a request for comment on a new locking primitive
> called a combilock.
>
> The goal of this development is:
>
> 1. To allow for a better SMP scalability of semaphores used as Mutex
> 2. As a replacement for long held spinlocks in an preemptible kernel
>
> The new lock uses a combination of a spinlock and a (mutex-)semaphore.
> You can lock it for short-term issues in a spin-lock mode:
>
> combi_spin_lock(struct combilock *x)
> combi_spin_unlock(struct combilock *x)
>
> and for longer lasting tasks in a sleeping mode by:
>
> combi_mutex_lock(struct combilock *x)
> combi_mutex_unlock(struct combilock *x)
>
> If a spin_lock request is blocked by a mutex_lock call, the spin_lock
> attempt also sleeps i.e. behaves like a semaphore.

So what's the difference between combi_spin and combi_mutex?
combi_spin becomes
if not mutex locked, spin
else sleep
Bizarre

The entire concept is revolting.

Ingo Molnar

Feb 7, 2002, 5:09:16 PM

On Thu, 7 Feb 2002, yodaiken wrote:

> So what's the difference between combi_spin and combi_mutex?
> combi_spin becomes
> if not mutex locked, spin
> else sleep
> Bizzare

no, the real optimization is that when spin meets spin, they will not
mutex. If a mutex-user has it then spins turn into mutex, but that (is
supposed to) happen rarely.

i think one example *could* be to turn inode->i_sem into a combi-lock. Eg.
generic_file_llseek() could use the spin variant.

this is a real performance problem, i've seen scheduling storms in
dbench-type runs due to llseek taking the inode semaphore.

whether combi-locks truly bring performance benefits remains to be seen,
but the patch definitely needs to provide some working example and some
hard numbers for some real workload.

Ingo

Robert Love

Feb 7, 2002, 3:11:46 PM
On Thu, 2002-02-07 at 15:06, Andrew Morton wrote:

> A dynamic lock which says "we've spun for too long, let's sleep"
> seems to be a tradeoff between programmer effort and efficiency,
> and a bad one at that.

I'm not so sure. What if we can't _know_ how long the lock will be held
because we don't know the status of the holder? What if _he_ is
sleeping on some other lock, or there are a lot of contending processes?

Certainly I agree, we need to put forth effort into designing things
right and with a minimal amount of lock held time.

> Possibly the locks could become more adaptive, and could, at
> each call site, "learn" the expected spintime. But it all seems
> too baroque to me.

Agreed, this is much too much ;-)

Robert Love

yoda...@fsmlabs.com

Feb 7, 2002, 2:58:53 PM
On Thu, Feb 07, 2002 at 01:40:59PM -0500, Robert Love wrote:
> > To really take any benefit from a preemptible kernel a lot of spin locks
> > will have to be replaced by mutex locks. The combi-lock approach may
> > convince more people who typically fear the higher scheduling pressure
> > of sleeping locks to do so, if they can decide on each instance which
> > approach (spin of sleep) will be taken.
>
> We shouldn't engage in wholesale changing of spinlocks to semaphores
> without a priority-inheritance mechanism. And _that_ is the bigger
> issue ...

Cool. We can then have the Solaris "this usually doesn't fail on test" priority
inherit read/write lock. I can hardly wait.


--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com www.rtlinux.com

Robert Love

Feb 7, 2002, 3:20:24 PM
On Thu, 2002-02-07 at 15:15, yoda...@fsmlabs.com wrote:

> I'd love to hear how things could be done right here.
> There seem to be 3 choices for reader writer locks

Assuming there is no

4. a solution that works

(and I do not assume that) we can just not do inheritance under
reader-writer locks and that means they remain as spin locks. Normal
spin locks remain proper candidates.

I never mentioned anything about reader-writer locks in my original
email. Most of the long-held locks I am considering are not in this
category anyway ...

Robert Love

P.S. If this is going to turn into another priority-inheritance flame, I
am stopping here. Let's take it off-list or just drop it, please. I'd
much prefer to discuss the current combilock issue which is at hand. ;)

Dave Hansen

Feb 7, 2002, 2:51:22 PM
Andrew Morton wrote:
> Robert Love wrote:
>>On Thu, 2002-02-07 at 10:38, Martin Wirth wrote:
>>Some of the talk I've heard has been toward an adaptive lock. These are
>>locks like Solaris's that can spin or sleep, usually depending on the
>>state of the lock's holder. Another alternative, which I prefer since
>>it is much less overhead, is a lock that spins-then-sleeps
>>unconditionally.
> I dunno. The spin-a-bit-then-sleep lock has always struck me as
> i_dont_know_what_the_fuck_im_doing_lock(). Martin's approach puts
> the decision in the hands of the programmer, rather than saying
> "Oh gee I goofed" at runtime.

The spin-then-sleep lock could be interesting as a replacement for the
BKL in places where a semaphore causes performance degradation. In
quite a few places where we replaced the BKL with a more finely grained
semaphore (not a spinlock, because we needed to sleep during the hold),
instead of spinning for a bit it would schedule. This was bad
:). Spin-then-sleep would be great behaviour in this situation.

--
Dave Hansen
have...@us.ibm.com

Andrew Morton

Feb 7, 2002, 3:06:22 PM
Dave Hansen wrote:
>
> Andrew Morton wrote:
> > Robert Love wrote:
> >>On Thu, 2002-02-07 at 10:38, Martin Wirth wrote:
> >>Some of the talk I've heard has been toward an adaptive lock. These are
> >>locks like Solaris's that can spin or sleep, usually depending on the
> >>state of the lock's holder. Another alternative, which I prefer since
> >>it is much less overhead, is a lock that spins-then-sleeps
> >>unconditionally.
> > I dunno. The spin-a-bit-then-sleep lock has always struck me as
> > i_dont_know_what_the_fuck_im_doing_lock(). Martin's approach puts
> > the decision in the hands of the programmer, rather than saying
> > "Oh gee I goofed" at runtime.
>
> The spin-then-sleep lock could be interesting as a replacement for the
> BKL in places where a semaphore causes performance degredation. In
> quite a few places where we replaced the BKL with a more finely grained
> semapore (not a spinlock because we needed to sleep during the hold),
> instead of spinning for a bit, it would schedule instead. This was bad
> :). Spin-then-sleep would be great behaviour in this situation.

But surely you *knew*, from inspection, which code paths needed
a spinning lock, and which code paths needed a sleeping lock?

Assuming the answer is "yes" then a nice fix would be to use
two separate locks - one which spins and one which sleeps.

Now, if the resource which is being protected truly cannot
be split up into spin-protected and sleep-protected sections
then a lock which can be atomically converted from spinning to
sleeping at the programmer's discretion seems appropriate.

A dynamic lock which says "we've spun for too long, let's sleep"
seems to be a tradeoff between programmer effort and efficiency,
and a bad one at that.

Possibly the locks could become more adaptive, and could, at
each call site, "learn" the expected spintime. But it all seems
too baroque to me.


yoda...@fsmlabs.com

Feb 7, 2002, 3:15:50 PM
On Thu, Feb 07, 2002 at 03:08:02PM -0500, Robert Love wrote:

> On Thu, 2002-02-07 at 14:58, yoda...@fsmlabs.com wrote:
>
> > On Thu, Feb 07, 2002 at 01:40:59PM -0500, Robert Love wrote:
> > > We shouldn't engage in wholesale changing of spinlocks to semaphores
> > > without a priority-inheritance mechanism. And _that_ is the bigger
> > > issue ...
> >
> > Cool. We can then have the Solaris "this usually doesn't fail on test" priority
> > inherit read/write lock. I can hardly wait.
>
> Or, we could do things right and not.

I'd love to hear how things could be done right here.

There seem to be 3 choices for reader/writer locks:

1. Do the right thing and say no to inheritance: and this
means no inheritance on mutexes either.
2. Use the Solaris "sometimes kinda works" method.
3. Make reader/writer locks very slow and expensive, e.g.
a complete list of reader identities with atomic insert/delete
and a check for uniqueness on insert! Not to mention write
promotion, any interactions between the "favor writes" design it
should have and inheritance, links for a mutex-inheriting lock to
follow down the complete tree of paths from the r/w lock ...


--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com www.rtlinux.com


yoda...@fsmlabs.com

Feb 7, 2002, 3:36:02 PM
On Thu, Feb 07, 2002 at 03:20:24PM -0500, Robert Love wrote:

> On Thu, 2002-02-07 at 15:15, yoda...@fsmlabs.com wrote:
>
> > I'd love to hear how things could be done right here.
> > There seem to be 3 choices for reader writer locks
>
> Assuming there is no
>
> 4. a solution that works
>
> (and I do not assume that) we can just not do inheritance under
> reader-writer locks and that means they remain as spin locks. Normal
> spin locks remain proper candidates.
>
> I never mentioned anything about reader-writer locks in my original
> email. Most of the long-held locks I am considering are not in this
> category anyway ...

I'm content to let it drop here, but I simply observe that you keep bringing
up the glorious future of inheritance without addressing any of the hard
problems. My contention is that the very capable Solaris engineers did not find the
(4) above because it does not exist.

> P.S. If this is going to turn into another priority-inheritance flame, I
> am stopping here. Let's take it off-list or just drop it, please. I'd
> much prefer to discuss the current combilock issue which is at hand. ;)

It's the same issue.

yoda...@fsmlabs.com

Feb 7, 2002, 3:31:09 PM
On Thu, Feb 07, 2002 at 11:09:16PM +0100, Ingo Molnar wrote:
>
> On Thu, 7 Feb 2002, yodaiken wrote:
>
> > So what's the difference between combi_spin and combi_mutex?
> > combi_spin becomes
> > if not mutex locked, spin
> > else sleep
> > Bizzare
>
> no, the real optimization is that when spin meets spin, they will not
> mutex. If a mutex-user has it then spins turn into mutex, but that (is
> supposed to) happen rarely.

It seems like what you want is:
 if the lock is about to be released, spin, else sleep
but what is proposed is:
 if the lock is locked as a mutex, sleep, else spin.
I doubt either of these works - they seem like attempts to avoid
designing the code.

>
> i think one example *could* be to turn inode->i_sem into a combi-lock. Eg.
> generic_file_llseek() could use the spin variant.
>
> this is a real performance problem, i've seen scheduling storms in
> dbench-type runs due to llseek taking the inode semaphore.

llseek:
    atomic_enqueue request
    if no room, gotta sleep
    else if trylock mutex    (nonzero = busy, as with down_trylock)
        return
    else
        do work
loop:
    process any pending requests
    release lock;
    if pending_requests && !(trylock mutex) goto loop
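This enqueue-then-trylock pattern can be modeled in userspace roughly as follows (an illustrative sketch; the names and the fixed-size ring are mine): whoever wins the trylock drains everybody's queued requests, so other submitters never sleep.

```c
#include <pthread.h>
#include <string.h>

#define QMAX 1024

/* The current holder of 'work' drains all pending requests before
 * releasing, so a submitter that loses the trylock can just return. */
struct workqueue {
    pthread_mutex_t qlock;  /* guards the request ring only      */
    pthread_mutex_t work;   /* the long-held lock being modelled */
    int pending[QMAX];
    unsigned head, tail;
    long done;              /* requests processed so far         */
};

void wq_init(struct workqueue *w)
{
    memset(w, 0, sizeof(*w));
    pthread_mutex_init(&w->qlock, NULL);
    pthread_mutex_init(&w->work, NULL);
}

static int wq_dequeue(struct workqueue *w, int *req)
{
    int have;
    pthread_mutex_lock(&w->qlock);
    have = (w->head != w->tail);
    if (have)
        *req = w->pending[w->head++ % QMAX];
    pthread_mutex_unlock(&w->qlock);
    return have;
}

static int wq_pending(struct workqueue *w)
{
    pthread_mutex_lock(&w->qlock);
    int have = (w->head != w->tail);
    pthread_mutex_unlock(&w->qlock);
    return have;
}

void wq_submit(struct workqueue *w, int req)
{
    int r;

    /* atomically enqueue the request */
    pthread_mutex_lock(&w->qlock);
    w->pending[w->tail++ % QMAX] = req;
    pthread_mutex_unlock(&w->qlock);

    /* if the lock is busy, the holder will process our request */
    if (pthread_mutex_trylock(&w->work) != 0)
        return;

    /* otherwise drain; recheck after unlocking so a request that
     * arrived between the last drain and the unlock is not stranded */
    do {
        while (wq_dequeue(w, &r))
            w->done++;              /* "do work" */
        pthread_mutex_unlock(&w->work);
    } while (wq_pending(w) && pthread_mutex_trylock(&w->work) == 0);
}
```

The re-check loop after the unlock is what prevents a request enqueued at the wrong moment from being stranded with nobody processing it.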



> whether combi-locks truly bring performance benefits remains to be seen,
> but the patch definitely needs to provide some working example and some
> hard numbers for some real workload.

I think it's a lot easier to propose lock structures than to work on
reducing synchronization problems.

>
> Ingo

Andrew Morton

Feb 7, 2002, 3:57:19 PM
yoda...@fsmlabs.com wrote:
>
> llseek:
> atomic_enquee request
> if no room gotta sleep
> else if trylock mutex
> return
> else
> do work
> loop:
> process any pending requests
> release lock;
> if pending_requests && !(trylock mutex) goto loop

This is how printk() works. It was a very powerful and satisfactory
solution to a nasty locking/atomicity problem. It'd be nice to have
a more generic way of expressing that solution.


yoda...@fsmlabs.com

Feb 7, 2002, 4:02:19 PM
On Thu, Feb 07, 2002 at 12:57:19PM -0800, Andrew Morton wrote:
> yoda...@fsmlabs.com wrote:
> >
> > llseek:
> > atomic_enquee request
> > if no room gotta sleep
> > else if trylock mutex
> > return
> > else
> > do work
> > loop:
> > process any pending requests
> > release lock;
> > if pending_requests && !(trylock mutex) goto loop
>
> This is how printk() works. It was a very powerful and satisfactory
> solution to a nasty locking/atomicity problem. It'd be nice to have
> a more generic way of expressing that solution.

note how I put in the goto so Ingo would be more happy with it -)

>
> -

--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com www.rtlinux.com


yoda...@fsmlabs.com

Feb 7, 2002, 4:00:56 PM
On Thu, Feb 07, 2002 at 09:57:35PM +0100, Daniel Phillips wrote:

> On February 7, 2002 09:36 pm, yoda...@fsmlabs.com wrote:
> > > P.S. If this is going to turn into another priority-inheritance flame, I
> > > am stopping here. Let's take it off-list or just drop it, please. I'd
> > > much prefer to discuss the current combilock issue which is at hand. ;)
> >
> > It's the same issue.
>
> Not necessarily, look at Ingo's observation about replacing semaphores with
> combi-locks as opposed to replacing spinlocks with combi-locks.

The underlying issue is an attempt to find a magic trick that will make
hard synchronization problems go away. The result is usually something
that makes hard synchronization problems more obscure. Ingo points to a
case, apparently triggered only by a weird benchmark artefact, where a
queue of very short-term operations builds up a queue of expensive
process reschedules. The problem is that the same semaphore is used
for slow and fast operations, and the solution is to split them apart
somehow, or to conclude that it's not an important case.

As Ingo points out, you need some actual positive results here, not a plausibility
argument.

Martin Wirth

Feb 7, 2002, 3:49:53 PM

Christoph Hellwig wrote:
> I think this API is really ugly. If both paths actually do the same,
> just with different defaults, one lock function with a flag would be
> much nicer.
Just to use plain numbers is not very instructive, so are you asking for
a macro definition like COMBI_LOCK_SPIN_MODE?


> Also why do we need two unlock functions?

There is the generic_unlock function, if you forgot in which mode you
are. The main reason is performance for the spin mode: combi_spin_unlock
is just a spin_unlock, no test, no branch. So you are faster if you know
what you did a few lines of code before ;-)


Robert Love wrote:
> > If a spin_lock request is blocked by a mutex_lock call, the spin_lock
> > attempt also sleeps i.e. behaves like a semaphore.

> > If you gained ownership of the lock, you can switch between spin-mode
> > and mutex- (i.e. sleeping) mode by calling:
>
> This can be bad. What if I grab a spinlock in a codepath where only a
> spinlock is appropriate (i.e. somewhere I can't sleep, like an irq
> handler) -- and then I sleep?

As noted in my initial e-mail the current implementation is not for
use in irq-handlers or BHs etc.

>
> > Open questions:
> >
> > * Does it make sense to also provide irq-save versions of the
> > locking functions? This means you could use the unlock functions
> > from interrupt context. But the main use in this situation is
> > completion handling and there are already (new) completion handlers
> > available. So I don't think this is a must have
>

> You can't sleep in an interrupt request handler, so this wouldn't make a
> lot of sense.

Of course you would only be allowed to call the unlock() functions!
Therefore you could use them to free a resource from the handler
(but that's very much completion handling, see above).

> We shouldn't engage in wholesale changing of spinlocks to semaphores
> without a priority-inheritance mechanism. And _that_ is the bigger
> issue ...

The combilock at least can be used to narrow the time windows for
priority inversion, because for most purposes you would use the spin
mode. I'm thinking about some extensions in this direction (that's why
the owner field is a pointer to the owning task, btw).

> As for combi lock itself, it would be great, if it were possible to
> detect whether lock is held by thread running on the same CPU and sleep
> if so. This would allow for implementing interrupts as separate threads,
> etc.

That is e.g. the approach of Solaris, which results in about 5 times
higher latencies from a hardware interrupt to the waiting process.


Martin Wirth

Nigel Gamble

Feb 8, 2002, 3:20:37 AM
On Thu, 7 Feb 2002, Andrew Morton wrote:
> I dunno. The spin-a-bit-then-sleep lock has always struck me as
> i_dont_know_what_the_fuck_im_doing_lock(). Martin's approach puts
> the decision in the hands of the programmer, rather than saying
> "Oh gee I goofed" at runtime.

I completely agree, and I couldn't have put it better! Kernel
programmers really should know exactly why, what, where and for how long
they are holding a lock.

This is why, incidentally, I don't like any of the so-called lockless
schemes, including the original unix kernel monitor lock (i.e. only one
kernel thread active at a time), because they encourage unmaintainable
code where the critical sections are invisible to everyone and are
easily broken when someone accidentally inserts a blocking function into
one of the invisible critical sections.

Nigel Gamble ni...@nrg.org
Mountain View, CA, USA. http://www.nrg.org/

Martin Wirth

Feb 8, 2002, 3:34:58 AM
Robert Love wrote:

> Some of the talk I've heard has been toward an adaptive lock. These
> are locks like Solaris's that can spin or sleep, usually depending on
> the state of the lock's holder. Another alternative, which I prefer
> since it is much less overhead, is a lock that spins-then-sleeps
> unconditionally.

Dave Hanson wrote:

> The spin-then-sleep lock could be interesting as a replacement for the
> BKL in places where a semaphore causes performance degradation. In
> quite a few places where we replaced the BKL with a more finely grained
> semaphore (not a spinlock because we needed to sleep during the hold),
> instead of spinning for a bit, it would schedule instead. This was bad
> :). Spin-then-sleep would be great behaviour in this situation.


Wouldn't it be sufficient to include the following piece of code
at the beginning of __combi_wait(...):

if (smp_processor_id() != owner->cpu) {
	int cnt = MAX_LOOP_CNT;
retry:
	spin_unlock(&x->wait.lock);
	do {
		barrier();
	} while (--cnt && x->owner);
	spin_lock(&x->wait.lock);
	if (!x->owner)
		return;
	if (cnt)
		goto retry;
}
then the sleep code of __combi_wait(...)

If one fears that the owner (or current, if the kernel is made
preemptible) migrated to the same cpu while we are spinning for
x->owner, and hence we may make no progress, one could let the waiting
loop last about a typical process switch time and add an outer loop
that checks if the cpu of the owner is still different.


Martin Wirth

Christoph Hellwig

Feb 8, 2002, 7:31:07 AM
In article <Pine.LNX.4.33.02020...@localhost.localdomain> you wrote:
> i think one example *could* be to turn inode->i_sem into a combi-lock. Eg.
> generic_file_llseek() could use the spin variant.

No. i_sem should be split into a spinlock for short-time accessed
fields that get written to even if the file content is only read (i.e.
atime) and a read-write semaphore.

yoda...@fsmlabs.com

Feb 8, 2002, 10:13:00 AM
On Fri, Feb 08, 2002 at 10:47:24AM -0200, Denis Vlasenko wrote:

> On 7 February 2002 17:56, yoda...@fsmlabs.com wrote:
> > > If a spin_lock request is blocked by a mutex_lock call, the spin_lock
> > > attempt also sleeps i.e. behaves like a semaphore.
> >
> > So what's the difference between combi_spin and combi_mutex?
> > combi_spin becomes
> > if not mutex locked, spin
> > else sleep
> > Bizarre
>
> combi_spin_lock():
> If not mutex locked, spin - will be released shortly
> else sleep - may take long time before released
> * lock released *
> spin lock it! <=== this is the difference -
> combi_mutex_lock would mutex lock it here
>
> What's wrong with this?

In the elegant words of Andrew Morton, this is an "I don't know what
the fuck I'm doing" lock.


--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com www.rtlinux.com

-

Nigel Gamble

Feb 8, 2002, 11:51:36 AM
On Fri, 8 Feb 2002, Christoph Hellwig wrote:
> In article <Pine.LNX.4.33.02020...@localhost.localdomain> you wrote:
> > i think one example *could* be to turn inode->i_sem into a combi-lock. Eg.
> > generic_file_llseek() could use the spin variant.
>
> No. i_sem should be split into a spinlock for short-time accessed
> fields that get written to even if the file content is only read (i.e.
> atime) and a read-write semaphore.

Read-write semaphores should never be used. As others have pointed out,
they cause really intractable priority inversion problems (because a
high priority writer will often have to wait for an unbounded number of
lower priority readers, some of which may have called a blocking
function while holding the read lock).

Note that I'm not talking about read-write spinlocks, which are (or
should be) held for a short, bounded time and can't be held over a
blocking call, so they are not quite so problematic.

Nigel Gamble ni...@nrg.org
Mountain View, CA, USA. http://www.nrg.org/

-

Linus Torvalds

Feb 8, 2002, 1:28:44 PM

Now, before we go further, can people explain _why_ we want this?

If something is getting a lot of short contention as a semaphore, maybe
it's just broken locking. Let's not work around it with a new locking
primitive just because we can.

What is the _existing_ problem this is trying to solve, and why?

Linus

Larry McVoy

Feb 8, 2002, 12:06:34 PM
On Fri, Feb 08, 2002 at 12:20:37AM -0800, Nigel Gamble wrote:
> On Thu, 7 Feb 2002, Andrew Morton wrote:
> > I dunno. The spin-a-bit-then-sleep lock has always struck me as
> > i_dont_know_what_the_fuck_im_doing_lock(). Martin's approach puts
> > the decision in the hands of the programmer, rather than saying
> > "Oh gee I goofed" at runtime.
>
> I completely agree, and I couldn't have put it better! Kernel
> programmers really should know exactly why, what, where and for how long
> they are holding a lock.

Should != do.

And any kernel programmer who says they do in a fine grained multithreaded
kernel is full of it. Look at IRIX, look at Solaris, and show me someone
who says they know for a fact how long they hold each lock and I'll show
you a liar.

Furthermore, while adaptive spin-then-sleep locks may look stupid, I think
you may be missing the point. If you are running an SMP kernel on a UP,
you want the lock to sleep immediately. If you are running an SMP kernel
on an SMP, then you want to spin if the lock is held by some other CPU
but sleep if it is held by this CPU. I suspect that that is what was
really meant by spin-a-bit-then-sleep, it just got lost in translation.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

Alexander Viro

Feb 8, 2002, 1:33:31 PM

On Fri, 8 Feb 2002, Martin Wirth wrote:

> But this approach needs a lot of proper documentation and discipline.

... and this attitude is my main beef with B^WLSE crowd.

Folks, if you have nothing else to do - *don't* mess with locking just
for the heck of it. And yes, you are supposed to understand WTF you
are doing.

Linus Torvalds

Feb 8, 2002, 3:02:25 PM

On Fri, 8 Feb 2002, Martin Wirth wrote:
>

> There are currently several attempts discussed to push out the
> BKL and replace it by a semaphore e.g. the next step Robert Love
> planned for his ll_seek patch (replace the BKL by inode i_sem).

But that won't have any contention anyway, so it's a non-issue.

Linus

Andrew Morton

Feb 8, 2002, 1:41:18 PM
Christoph Hellwig wrote:
>
> In article <Pine.LNX.4.33.02020...@localhost.localdomain> you wrote:
> > i think one example *could* be to turn inode->i_sem into a combi-lock. Eg.
> > generic_file_llseek() could use the spin variant.
>
> No. i_sem should be split into a spinlock for short-time accessed
> fields that get written to even if the file content is only read (i.e.
> atime) and a read-write semaphore.

I don't see any strong reason for taking i_sem in the generic seek
functions. The only thing we seem to need to protect in there
is the non-atomic access to 64-bit i_size on 32-bit platforms,
for which a spinlock is appropriate.

I'd be interested in hearing more details on the regression which
Ingo has seen due to the introduction of i_sem locking in llseek.
Not just for "I told you so" value, but for the body of knowledge :)

-

Martin Wirth

Feb 8, 2002, 1:12:33 PM
Linus Torvalds wrote:
>
> Now, before we go further, can people explain _why_ we want this?
>
> If something is getting a lot of short contention as a semaphore, maybe
> it's just broken locking. Let's not work around it with a new locking
> primitive just because we can.
>
> What is the _existing_ problem this is trying to solve, and why?
>
> Linus

There are currently several attempts discussed to push out the
BKL and replace it by a semaphore, e.g. the next step Robert Love
planned for his ll_seek patch (replace the BKL by the inode i_sem).

The naive replacement BKL -> semaphore is surely bad.
Now on the other extreme you may always find a splitting of
the data you want to protect into short-locked and long-locked
sections, like Christoph Hellwig suggested for i_sem:


>
> No. i_sem should be split into a spinlock for short-time accessed
> fields that get written to even if the file content is only read (i.e.
> atime) and a read-write semaphore.
>

But this approach needs a lot of proper documentation and discipline.

So for most BKL removal work I suggested the combi-lock, which scales
better than a semaphore but is more manageable than split locking.

Martin

P.S.: I posted the combi-lock as a practical RFC to bring the discussion
about lock scalability, latency and possible preemptibility of the
linux kernel back to a concrete technical level (if you look at the
"[2.4.17/18] VM and swap - it's really unusable" thread some weeks
ago you can surely imagine why). And the simple reason is that for
my real-time data acquisition systems I really would like to get rid
of my SunOS 5.8 and W2K machines. But the truth is that the standard
linux kernel is a real (worst case) latency hog even when compared with
these two bloated OSes, at least for my applications. Only Andrew
Morton's (full)-ll patch boosts linux to comparable performance
(less than 1-2 ms latency under heavy io-load). (I know that SunOS 5.8
and W2K can also have larger latencies under special circumstances,
but not for a simple streaming data acquisition with some online
visualization.)

Andrew Morton

Feb 8, 2002, 1:54:33 PM
Linus Torvalds wrote:
>
> On Fri, 8 Feb 2002, Martin Wirth wrote:
> >
> > There are currently several attempts discussed to push out the
> > BKL and replace it by a semaphore e.g. the next step Robert Love
> > planned for his ll_seek patch (replace the BKL by inode i_sem).
>
> But that won't have any contention anyway, so it's a non-issue.
>

Yesterday, Ingo said:

> i think one example *could* be to turn inode->i_sem into a combi-lock. Eg.
> generic_file_llseek() could use the spin variant.
>

> this is a real performance problem, i've seen scheduling storms in
> dbench-type runs due to llseek taking the inode semaphore.

-

Ingo Molnar

Feb 8, 2002, 3:47:23 PM

On Fri, 8 Feb 2002, Andrew Morton wrote:

> I'd be interested in hearing more details on the regression which Ingo
> has seen due to the introduction of i_sem locking in llseek. [...]

i saw heavy scheduling during dbench runs (even if just running 6 threads
on an 8 CPU box), and checked out the source of the scheduling storm -
most of it was due to llseek()'s down().

i also wrote a dbench-alike load simulator for pagecache scalability,
there i saw this in an even more prominent way, 200k/sec reschedules in a
situation when there should be none.

i'd suggest 64-bit update instructions on x86 as well, they do exist.
spinlock only for the truly hopeless cases like SMP boxes composed of
i486's. We really want llseek() to scale ...

Ingo

Alexander Viro

Feb 8, 2002, 1:56:07 PM

On Fri, 8 Feb 2002, Ingo Molnar wrote:

> i'd suggest 64-bit update instructions on x86 as well, they do exist.
> spinlock only for the truly hopeless cases like SMP boxes composed of
> i486's. We really want llseek() to scale ...

Ingo, are you sure that you actually saw llseek() causing problems?
And not, say it, ext2_get_block()?

If you've got a heavy holder of some lock + lots of guys who grab it
for a short periods, the real trouble is the former, not the latter.

I'm going to send ext2-without-BKL patches to Linus - tonight or tomorrow.
I really wonder what effect that would have on the things.

Linus Torvalds

Feb 8, 2002, 2:11:50 PM

On Fri, 8 Feb 2002, Andrew Morton wrote:
>
> Yesterday, Ingo said:
>
> > i think one example *could* be to turn inode->i_sem into a combi-lock. Eg.
> > generic_file_llseek() could use the spin variant.
> >
> > this is a real performance problem, i've seen scheduling storms in
> > dbench-type runs due to llseek taking the inode semaphore.

... so just make it a spinlock instead.

The semaphore is overkill, as the only thing we're really protecting is
one 64-bit access against other updates.

Linus

Alexander Viro

Feb 8, 2002, 2:10:25 PM

On Fri, 8 Feb 2002, Ingo Molnar wrote:

> > I'm going to send ext2-without-BKL patches to Linus - tonight or
> > tomorrow. I really wonder what effect that would have on the things.
>

> oh, that is a really cool thing!
>
> llseek() is unrelated, and i think pretty gross. Is there any other reason
> to llseek()'s i_sem usage other than the 64-bit atomic update of the file
> offset value? We can do lockless, SMP-correct 64-bit updates on x86 pretty
> easily.

Umm... Wait a second. You've seen the problems on ->i_sem variant
of llseek()? My apologies - I've misparsed you.

I seriously suspect that BKL-for-lseek() will be good enough once we
kill BKL in ext2_get_block() and friends. IOW, I doubt that
generic_file_lseek() showing up in BKL contention is the real
problem...

Ingo Molnar

Feb 8, 2002, 4:23:27 PM

On Fri, 8 Feb 2002, Alexander Viro wrote:

> Had anyone actually seen lseek() vs. lseek() contention prior to the
> switch to ->i_sem-based variant? [...]

yes, i've seen this for years. (if you accept dbench overhead.)

and regarding the reintroduction of BKL, *please* do not just use
global locks around such pieces of code, lock bouncing sucks on SMP,
even if there is no overhead.

Ingo

Ingo Molnar

Feb 8, 2002, 3:59:38 PM

On Fri, 8 Feb 2002, Alexander Viro wrote:

> > i'd suggest 64-bit update instructions on x86 as well, they do exist.
> > spinlock only for the truly hopeless cases like SMP boxes composed of
> > i486's. We really want llseek() to scale ...
>
> Ingo, are you sure that you actually saw llseek() causing problems?

yes.

> And not, say it, ext2_get_block()?

no. I saw ext2_get_block() overhead in other workloads, but not this one.
(this was a fully cached pagecache-only workload.)

> If you've got a heavy holder of some lock + lots of guys who grab it
> for a short periods, the real trouble is the former, not the latter.

no, i simply had code that called llseek() pretty often for the same inode
(big database file), and on multiple CPUs it was just way too easy for one
semaphore user to cause another one to block.

> I'm going to send ext2-without-BKL patches to Linus - tonight or
> tomorrow. I really wonder what effect that would have on the things.

oh, that is a really cool thing!

llseek() is unrelated, and i think pretty gross. Is there any other reason
to llseek()'s i_sem usage other than the 64-bit atomic update of the file
offset value? We can do lockless, SMP-correct 64-bit updates on x86 pretty
easily.

Ingo

Robert Love

Feb 8, 2002, 2:36:26 PM
On Fri, 2002-02-08 at 14:21, Alexander Viro wrote:

> Had anyone actually seen lseek() vs. lseek() contention prior to the
> switch to ->i_sem-based variant?

Yes, I did, even on my 2-way.

Additionally, when I posted the remove-bkl-llseek patch, someone from
SGI noted that on a 24-processor NUMA IA-64 machine, _50%_ of machine
time was spent spinning on the BKL in llseek-intense operations.

The bkl is not held for a long time, but it is acquired often, and there
are definitely workloads that show a big hit with the BKL in there.

Robert Love

Alexander Viro

Feb 8, 2002, 2:21:58 PM

On Fri, 8 Feb 2002, Linus Torvalds wrote:

> ... so just make it a spinlock instead.
>
> The semaphore is overkill, as the only thing we're really protecting is
> one 64-bit access against other updates.

I'm not sure that we really need a separate spinlock here. BKL might
be just fine, provided that we remove it from real hogs. And we can
do it now.

Had anyone actually seen lseek() vs. lseek() contention prior to the
switch to ->i_sem-based variant? If the mix looked like
infrequently called BKL hog + many lseeks()
almost all contention cases would have lseek() spinning while
a hog holds BKL. And real problem here is a hog...

Linus Torvalds

Feb 8, 2002, 4:36:27 PM

On Fri, 8 Feb 2002, Ingo Molnar wrote:
>
> > and regarding the reintroduction of BKL, *please* do not just use
> > global locks around such pieces of code, lock bouncing sucks on SMP,
> > even if there is no overhead.

I'd suggest not having a lock at all, but instead add two functions: one
to read a 64-bit value atomically, the other to write it atomically (and
they'd be atomic only wrt each other, no memory barriers etc implied).

On 64-bit architectures that's just a direct dereference, and even on x86
it's just a "cmpxchg8b".

Linus

Jeff Garzik

Feb 8, 2002, 3:04:34 PM
Linus Torvalds wrote:
>
> On Fri, 8 Feb 2002, Ingo Molnar wrote:
> >
> > > and regarding the reintroduction of BKL, *please* do not just use
> > > global locks around such pieces of code, lock bouncing sucks on SMP,
> > > even if there is no overhead.
>
> I'd suggest not having a lock at all, but instead add two functions: one
> to read a 64-bit value atomically, the other to write it atomically (and
> they'd be atomic only wrt each other, no memory barriers etc implied).
>
> On 64-bit architectures that's just a direct dereference, and even on x86
> it's just a "cmpxchg8b".

Are there architectures out there that absolutely must implement this
with a spinlock? Your suggested API of functions to read/write 64-bit
values atomically would work for such a case, but still I am just
curious.

Jeff


--
Jeff Garzik | "I went through my candy like hot oatmeal
Building 1024 | through an internally-buttered weasel."
MandrakeSoft | - goats.com

yoda...@fsmlabs.com

Feb 8, 2002, 3:38:51 PM
On Fri, Feb 08, 2002 at 08:14:32PM +0000, Anton Altaparmakov wrote:
> At 16:51 08/02/02, Nigel Gamble wrote:
> >Read-write semaphores should never be used. As others have pointed out,
> >they cause really intractable priority inversion problems (because a
...
>
> Read-write semaphores have their use and the current Linux implementation

Here's the context: the preemption patch puts pressure on Linux to move
from the BKL to semaphores, and then it will be seen that semaphores
need dynamic priority inheritance to sort-of-work, and then it will be
seen that the read/write lock is a problem!


> The value of allowing multiple cpus to read the same data simultaneously by
> far offsets the priority problems IMVHO. At least the way I am using rw
> semaphores in ntfs it is. Readlocks are grabbed loads and loads of times to
> serialize meta data access in the page cache while writelocks are a minute
> number in comparison and because the data required to be accessed may not

this is absolutely correct. However, once the decision has been made, or
fallen into, to go to a priority inheritance scheme, Linux will find
itself in the same bind as Solaris.

> be cached in memory (page cache page is not read in, is swapped out,
> whatever) a disk access may be required which means a rw spin lock is no
> good. In fact ntfs would be the perfect candidate for automatic rw combi
> locks where the locking switches from spinning to sleeping if the code path
> reaches a disk access. I can't use a manually controlled lock as the page

Seems like the lock is simply grabbed way too far up.


--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com www.rtlinux.com

-

Anton Altaparmakov

Feb 8, 2002, 3:14:32 PM
At 16:51 08/02/02, Nigel Gamble wrote:
> > No. i_sem should be split into a spinlock for short-time accessed
> > fields that get written to even if the file content is only read (i.e.
> > atime) and a read-write semaphore.
>
>Read-write semaphores should never be used. As others have pointed out,
>they cause really intractable priority inversion problems (because a
>high priority writer will often have to wait for an unbounded number of
>lower priority readers, some of which may have called a blocking
>function while holding the read lock).
>
>Note that I'm not talking about read-write spinlocks, which are (or
>should be) held for a short, bounded time and can't be held over a
>blocking call, so they are not quite so problematic.

Read-write semaphores have their use, and the current Linux
implementation (big reader/occasional writer) guarantees that the writer
is not starved, as incoming read locks are put to sleep as soon as a
write lock request comes in, even if that means sleeping waiting for the
old read locks to be released (unless the readers are holding the
semaphore forever, in which case this is the programmer's fault and not
the rw semaphore implementation's fault). [I should add I am only
familiar with the i386 implementation, but I assume the others are the
same.]

The value of allowing multiple cpus to read the same data simultaneously
by far outweighs the priority problems IMVHO. At least the way I am
using rw semaphores in ntfs it is. Read locks are grabbed loads and
loads of times to serialize metadata access in the page cache, while
write locks are a minute number in comparison, and because the data
required to be accessed may not be cached in memory (page cache page is
not read in, is swapped out, whatever) a disk access may be required,
which means a rw spin lock is no good. In fact ntfs would be the perfect
candidate for automatic rw combi locks where the locking switches from
spinning to sleeping if the code path reaches a disk access. I can't use
a manually controlled lock as the page cache lookups are done via the
mm/filemap.c access functions, which are the only ones who can know if a
disk access is required or not, so ntfs would never know if it is going
to sleep or not; unless the locks had autodetection of whether to spin
or sleep they would be useless.

I guess the point I am trying to make is that both rw semaphores and
combi locks are not bad per se but, as with all other locking
mechanisms, they should be used in situations appropriate for their lock
type and their implementation...

Anton


--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/

Mike Fedyk

Feb 8, 2002, 4:16:28 PM
On Fri, Feb 08, 2002 at 03:04:34PM -0500, Jeff Garzik wrote:
> Linus Torvalds wrote:
> >
> > On Fri, 8 Feb 2002, Ingo Molnar wrote:
> > >
> > > and regarding the reintroduction of BKL, *please* do not just use
> > > global locks around such pieces of code, lock bouncing sucks on SMP,
> > > even if there is no overhead.
> >
> > I'd suggest not having a lock at all, but instead add two functions: one
> > to read a 64-bit value atomically, the other to write it atomically (and
> > they'd be atomic only wrt each other, no memory barriers etc implied).
> >
> > On 64-bit architectures that's just a direct dereference, and even on x86
> > it's just a "cmpxchg8b".
>
> Are there architectures out there that absolutely must implement this
> with a spinlock? Your suggested API of functions to read/write 64-bit
> values atomically would work for such a case, but still I am just
> curious.
>

SMP 486s would need that (if there is such a beast). At what point does
x86 get the 64-bit instructions? If after the 586, then it would
definitely need a spinlock or some such in those functions.

Mike

Anton Altaparmakov

Feb 8, 2002, 4:55:32 PM
At 20:38 08/02/02, yoda...@fsmlabs.com wrote:
>On Fri, Feb 08, 2002 at 08:14:32PM +0000, Anton Altaparmakov wrote:
> > The value of allowing multiple cpus to read the same data
> simultaneously by
> > far offsets the priority problems IMVHO. At least the way I am using rw
> > semaphores in ntfs it is. Readlocks are grabbed loads and loads of
> times to
> > serialize meta data access in the page cache while writelocks are a minute
> > number in comparison and because the data required to be accessed may not
>
>this is absolutely correct. However, once the decision has been made or
>fallen into to go to a priority inherit scheme, Linux will find itself
>in the same bind as Solaris.
>
> > be cached in memory (page cache page is not read in, is swapped out,
> > whatever) a disk access may be required which means a rw spin lock is no
> > good. In fact ntfs would be the perfect candidate for automatic rw combi
> > locks where the locking switches from spinning to sleeping if the code
> path
> > reaches a disk access. I can't use a manually controlled lock as the page
>
>Seems like the lock is simply grabbed way too far up.

It cannot be taken any later. NTFS is a database both in memory and on
disk, and in fact the in-memory metadata is the on-disk structures (i.e.
no conversion between the two is performed, except in a few speed
optimization cases and in cases where metadata is compressed on disk).

Because everything is organized in database records, I need to grab the
lock as soon as I intend to access the record (in fact the lock itself
is taken by map_mft_record(READ or WRITE), which returns a locked
record, mapped and pinned in memory, to the caller). I can then do all
sorts of things with the record, like search it, parse data from it,
etc. When finished I unmap_mft_record(READ or WRITE), which undoes
everything map_mft_record() did, i.e. unpins the record, unmaps it and
releases the lock.

All transactions with the on disk storage are done via the page cache and
read_cache_page() using a special readpage() and a special async io
completion handler.

Because of the abstraction of access layers it is undesirable to operate
on the same locks in different layers, hence the topmost layer is the
one locking and unlocking the records. The lower layers don't even know
the locks exist. Also, pushing the lock deep into the layers is not
actually possible, as it would probably need to be taken in
read_cache_page() (the only place where the code knows if the page is
present in memory or not and a disk access is required), which is a
VFS/MM function and hence that is not an option.

Further, I may need to allocate memory in order to store decompressed
metadata for example but I won't know if I need to do that until I have
already locked the record and accessed it to determine if the metadata is
compressed or not.

So basically I can only obtain sufficient info for deciding what lock I
need after I have already accessed parts of the record, but to access it I
need to have locked it. Chicken and egg situation... So the locks are
always semaphores and are taken in the top layer.

A combi lock would allow spinning in the case where it turns out the code
paths will not hit disk or sleep in kmalloc, etc and sleeping in the
disk/kmalloc hit case.

(Yes I know atomic kmalloc exists but I actually need vmalloc in some cases
depending how much memory I need to handle the metadata...)

Best regards,

Anton


--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/

-

Alan Cox

Feb 8, 2002, 7:09:04 PM
> SMP 486s would need that (if there is such a beast). At what point does
> x86 get the 64-bit instructions? If after the 586, then it would
> definitely need a spinlock or some such in those functions.

There are SMP 486-class x86 machines that are MP 1.1 compliant. They are
sufficiently rare that I think it's quite acceptable to "implement" a
cmpxchg8b macro on 486 as spin_lock_irqsave/blah/spin_unlock_irqrestore.
It would be wrong to cripple the other 99.99% of SMP users.

Mike Fedyk

Feb 8, 2002, 7:05:33 PM
On Sat, Feb 09, 2002 at 12:09:04AM +0000, Alan Cox wrote:
> > SMP 486s would need that (if there is such a beast). What point does x86
> > get the 64 bit instructions? If after 586, then it would definitely need a
> > spin-lock or some-such in those functions.

>
> There are SMP 486 class x86 machines that are MP 1.1 compliant. They are
> sufficiently rare that I think its quite acceptable to "implement" a
> cmpxchg8b macro on 486 as spin_lock_irqsave/blah/spin_unlock_irqrestore. It
> would be wrong to cripple the other 99.99% of SMP users
>

Sorry, I only meant to say that the only question is where the split should
be between the spin-lock and the 64-bit instruction...

This would be included in the appropriate config option.

Mike

Alexander Viro

unread,
Feb 8, 2002, 7:18:02 PM2/8/02
to

On 8 Feb 2002, Robert Love wrote:

> On Fri, 2002-02-08 at 14:21, Alexander Viro wrote:
>

> > Had anyone actually seen lseek() vs. lseek() contention prior to the
> > switch to ->i_sem-based variant?
>

> Yes, I did, even on my 2-way.
>
> Additionally, when I posted the remove-bkl-llseek patch, someone from
> SGI noted that on a 24-processor NUMA IA-64 machine, _50%_ of machine
> time was spent spinning on the BKL in llseek-intense operations.

That doesn't say _anything_ about the source of contention.



> The bkl is not held for a long time, but it is acquired often, and there
> are definitely workloads that show a big hit with the BKL in there.

Sigh... OK, here's an exercise:

You have a spinlock and two functions - hog() and light(). Both grab that
spinlock. light() releases it and returns fast. hog() spends quite a
while in the critical area and then releases the lock.

a) show that if lock gets contended then most of the time it will be
held by hog()

b) show that time spent spinning in light() depends mostly on the amount
of time spent in hog()

c) replace light() with light1(), ..., lightN(). Compare the effect
of removing lock from one of them with that of removing it from hog().

d) characterize the dependence of light()/light() contention upon the
amount of time spent in hog(). (hint: attempts to call light()
while hog() is running will lead to accumulation of contenders, and when
hog() is done they _will_ fight each other).

General rules:
	* go for hogs first, don't try to pick the light ones.
	* the one spinning most is not necessarily the cause of contention.
	* even when you see real light/light contention - check the
history immediately before; you may find the hog that had held them
back.

Horst von Brand

unread,
Feb 8, 2002, 2:22:01 PM2/8/02
to
Martin Wirth <Martin...@dlr.de> said:
> This is a request for comment on a new locking primitive
> called a combilock.
>
> The goal of this development is:
>
> 1. To allow for a better SMP scalability of semaphores used as Mutex
> 2. As a replacement for long held spinlocks in a preemptible kernel
>
> The new lock uses a combination of a spinlock and a (mutex-)semaphore.
> You can lock it for short-term issues in a spin-lock mode:
>
> combi_spin_lock(struct combilock *x)
> combi_spin_unlock(struct combilock *x)
>
> and for longer lasting tasks in a sleeping mode by:
>
> combi_mutex_lock(struct combilock *x)
> combi_mutex_unlock(struct combilock *x)

Can you sleep if you acquired it as the spinlock?

Is there any measurable performance improvement (or at least a plausible
reason why there should be one)? (No, "should make the preemptible kernel
faster" doesn't cut it at all for me.) Or any hope that it will substantially
simplify kernel programming with _no_ performance degradation by replacing
both semaphores and spinlocks?
--
Horst von Brand http://counter.li.org # 22616

Benjamin Herrenschmidt

unread,
Feb 8, 2002, 4:40:12 PM2/8/02
to
>Are there architectures out there that absolutely must implement this
>with a spinlock? Your suggested API of functions to read/write 64-bit
>values atomically would work for such a case, but still I am just
>curious.

At least PPC32 can't do that without a spinlock_irq

Ben.

Linus Torvalds

unread,
Feb 9, 2002, 2:32:26 PM2/9/02
to

On Fri, 8 Feb 2002, Benjamin Herrenschmidt wrote:
>
> At least PPC32 can't do that without a spinlock_irq

We don't need to be irq-safe for this, though. Just specify it to be
process safe - which means that on UP it boils down to at most maybe
having to protect against preemption.

Linus
