- VFS: converted sb->s_lock to a mutex too. This improves the create+unlink benchmark on ext3.
- further simplification of __mutex_lock_common(): no more gotos, and only one atomic_xchg() is done. Code size is now extremely small on both UP and SMP:
text data bss dec hex filename 398 0 0 398 18e mutex.o.UP
text data bss dec hex filename 463 0 0 463 1cf mutex.o.SMP
- synchro-test updates: max # of threads of 64, fairness stats, better defaults if =y, cleanups. (David Howells, me)
- FASTCALL -> fastcall in mutex.h (suggested by David Howells)
Took a glance at this on ppc64. Would it be useful if I contributed an arch specific version like arm has? We'll either need an arch specific version or have the generic changed.
Anyway, here is some disassembly of some of the code generated with my comments:
The eieio is completly unnecessary, it got picked up from atomic_dec_return (Anton, why is there an eieio at the start of atomic_dec_return in the first place?).
Ignore the + on the bne, the disassembler is wrong, it is really a -
That eieio should be an lwsync to avoid data corruption. And I think the isync is superfluous.
Ditto the disassembler being wrong about the + vs -. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
> Took a glance at this on ppc64. Would it be useful if I contributed an arch > specific version like arm has? We'll either need an arch specific version or > have the generic changed.
Don't change the generic version. You should provide a ppc specific version if the generic ones don't look so good.
> > Took a glance at this on ppc64. Would it be useful if I contributed an arch > > specific version like arm has? We'll either need an arch specific version or > > have the generic changed.
> Don't change the generic version. You should provide a ppc specific > version if the generic ones don't look so good.
Well, if the generic one generates _buggy_ code on ppc64, that means that either the generic version is buggy, or one of the atomics that it uses is buggily implemented on ppc64.
And I think it's the generic mutex stuff that is buggy. It seems to assume memory barriers that aren't valid to assume.
A mutex is more than just updating the mutex count properly. You also have to have the proper memory barriers there to make sure that the things that the mutex _protects_ actually stay inside the mutex.
So while a ppc64-optimized mutex is probably a good idea per se, I think the generic mutex code had better be fixed first and regardless of any optimized version.
On x86/x86-64, the locked instructions automatically imply the proper memory barriers, but that was just lucky, I think.
>>>Took a glance at this on ppc64. Would it be useful if I contributed an arch >>>specific version like arm has? We'll either need an arch specific version or >>>have the generic changed.
>>Don't change the generic version. You should provide a ppc specific >>version if the generic ones don't look so good.
> Well, if the generic one generates _buggy_ code on ppc64, that means that > either the generic version is buggy, or one of the atomics that it uses is > buggily implemented on ppc64.
> And I think it's the generic mutex stuff that is buggy. It seems to assume > memory barriers that aren't valid to assume.
> A mutex is more than just updating the mutex count properly. You also have > to have the proper memory barriers there to make sure that the things that > the mutex _protects_ actually stay inside the mutex.
> So while a ppc64-optimized mutex is probably a good idea per se, I think > the generic mutex code had better be fixed first and regardless of any > optimized version.
> On x86/x86-64, the locked instructions automatically imply the proper > memory barriers, but that was just lucky, I think.
I think the generic code is correct according to Documentation/atomic_ops.txt which basically defines any atomic_xxx operation which both modifies its operand and returns something to have a full memory barrier before and after its load/store operations.
Side note, why can't powerpc use lwsync for smp_wmb? The only problem seems to be that it allows loads to be reordered before stores, but that's OK with smp_wmb, right?
And why is smp_wmb() (ie. the non I/O barrier) doing eieio, while wmb() does not? And rmb() does lwsync, which apparently does not order IO at all...
> Side note, why can't powerpc use lwsync for smp_wmb? The only problem seems > to be that it allows loads to be reordered before stores, but that's > OK with smp_wmb, right?
lwsync implies more ordering than eieio and so may take longer. lwsync orders everything except store - load, and eieio just orders store - store.
On power3 an lwsync is a full sync which takes forever, although in newer chips both lwsync and eieio tend to take the same number of cycles.
> And why is smp_wmb() (ie. the non I/O barrier) doing eieio, while wmb() does > not? And rmb() does lwsync, which apparently does not order IO at all...
Because people love to abuse the barrier macros :) grep for wmb in drivers/net and look for the number of places wmb() is being used to order memory and IO. Architecturally eieio is a store - store ordering for IO and memory but not between the two. sync is slow but does guarantee this.
SGIs mmiowb() might be useful for some of these cases but every time its brought up everyone ends up confused as to its real use.
Really we should have io_*mb() and smp_*mb(). At that stage we may even be able to kill the base *mb() macros.
> Took a glance at this on ppc64. Would it be useful if I contributed > an arch specific version like arm has? We'll either need an arch > specific version or have the generic changed.
feel free to implement an assembly mutex fastpath, and it would certainly be welcome and useful - but i think you are wrong about SMP synchronization:
> Anyway, here is some disassembly of some of the code generated with my > comments:
> c00000000049bf9c <.mutex_lock>: > c00000000049bf9c: 7c 00 06 ac eieio > c00000000049bfa0: 7d 20 18 28 lwarx r9,r0,r3 > c00000000049bfa4: 31 29 ff ff addic r9,r9,-1 > The eieio is completly unnecessary, it got picked up from > atomic_dec_return (Anton, why is there an eieio at the start of > atomic_dec_return in the first place?).
a mutex is like a spinlock, it must prevent loads and stores within the critical section from 'leaking outside the critical section' [they must not be reordered to before the mutex_lock(), nor to after the mutex_unlock()] - hence the barriers added by atomic_dec_return() are very much needed.
> Well, if the generic one generates _buggy_ code on ppc64, that means > that either the generic version is buggy, or one of the atomics that > it uses is buggily implemented on ppc64.
> And I think it's the generic mutex stuff that is buggy. It seems to > assume memory barriers that aren't valid to assume.
in those headers i'm only using atomic_dec_return(), atomic_cmpxchg() and atomic_xchg() - all of which imply a barrier. It is atomic_inc/add/sub/dec() that doesnt default to an implied SMP barrier.
but it's certainly not documented too well. If atomic_dec_return() is not supposed to imply a barrier, then all the affected architectures (sparc64, ppc64, mips, alpha, etc.) are overdoing synchronization currently: they all have barriers for these primitives. [They also have an implicit barrier for atomic_dec_test() - which is being relied on for correctness - no kernel code adds an smp_mb__before_atomic_dec() barrier to around atomic_dec_test().]
the patch below adds the barriers to the asm-generic mutex routines, so it's not like i'm lazy ;), but i really think this is unnecessary. Adding this patch would add a second, unnecessary barrier for all the arches that have barrier-less atomic ops.
it also makes sense: the moment you are interested in the 'previous value' of the atomic counter in an atomic fashion, you very likely want to use it for a critical section. (e.g. all the put-the-resource ops that use atomic_dec_test() rely on this implicit barrier.)
Ingo
Index: linux/include/asm-generic/mutex-dec.h =================================================================== --- linux.orig/include/asm-generic/mutex-dec.h +++ linux/include/asm-generic/mutex-dec.h @@ -19,6 +19,7 @@ */ #define __mutex_fastpath_lock(count, fail_fn) \ do { \ + smp_mb__before_atomic_dec(); \ if (unlikely(atomic_dec_return(count) < 0)) \ fail_fn(count); \ } while (0) @@ -36,6 +37,7 @@ do { \ static inline int __mutex_fastpath_lock_retval(atomic_t *count, int (*fail_fn)(atomic_t *)) { + smp_mb__before_atomic_dec(); if (unlikely(atomic_dec_return(count) < 0)) return fail_fn(count); else @@ -59,6 +61,8 @@ __mutex_fastpath_lock_retval(atomic_t *c do { \ if (unlikely(atomic_inc_return(count) <= 0)) \ fail_fn(count); \ + else \ + smp_mb__after_atomic_inc(); \ } while (0)
#define __mutex_slowpath_needs_to_unlock() 1 @@ -92,6 +96,7 @@ __mutex_fastpath_trylock(atomic_t *count * the mutex state would be. */ #ifdef __HAVE_ARCH_CMPXCHG + smp_mb__before_atomic_dec(); if (likely(atomic_cmpxchg(count, 1, 0)) == 1) return 1; return 0; Index: linux/include/asm-generic/mutex-xchg.h =================================================================== --- linux.orig/include/asm-generic/mutex-xchg.h +++ linux/include/asm-generic/mutex-xchg.h @@ -24,6 +24,7 @@ */ #define __mutex_fastpath_lock(count, fail_fn) \ do { \ + smp_mb__before_atomic_dec(); \ if (unlikely(atomic_xchg(count, 0) != 1)) \ fail_fn(count); \ } while (0) @@ -42,6 +43,7 @@ do { \ static inline int __mutex_fastpath_lock_retval(atomic_t *count, int (*fail_fn)(atomic_t *)) { + smp_mb__before_atomic_dec(); if (unlikely(atomic_xchg(count, 0) != 1)) return fail_fn(count); else @@ -64,6 +66,8 @@ __mutex_fastpath_lock_retval(atomic_t *c do { \ if (unlikely(atomic_xchg(count, 1) != 0)) \ fail_fn(count); \ + else \ + smp_mb__after_atomic_inc(); \ } while (0)
#define __mutex_slowpath_needs_to_unlock() 0 @@ -86,7 +90,10 @@ do { \ static inline int __mutex_fastpath_trylock(atomic_t *count, int (*fail_fn)(atomic_t *)) { - int prev = atomic_xchg(count, 0); + int prev; + + smp_mb__before_atomic_dec(); + prev = atomic_xchg(count, 0);
> the patch below adds the barriers to the asm-generic mutex routines, so > it's not like i'm lazy ;), but i really think this is unnecessary. > Adding this patch would add a second, unnecessary barrier for all the > arches that have barrier-less atomic ops.
> it also makes sense: the moment you are interested in the 'previous > value' of the atomic counter in an atomic fashion, you very likely want > to use it for a critical section. (e.g. all the put-the-resource ops > that use atomic_dec_test() rely on this implicit barrier.)
Ok, fair enough. However, that still leaves the question of which way the barrier works. Traditionally, we have only cared about one thing: that all preceding writes have finished, because the "atomic_dec_return" thing is used as a _reference_counter_, and we're going to release the thing.
However, that's not the case in a mutex. A mutex locking operation works exactly the other way around: it doesn't really care about the previous writes at all, since those operations were unlocked. It cares about the _subsequent_ writes, since those have to be seen by others as being in the critical region, and never be seen as happening before the lock.
So I _think_ your argument is bogus, and your patch is bogus. The use of "atomic_dec_return()" in a mutex is _not_ the same barrier as using it for reference counting. Not at all. Memory barriers aren't just one thing: they are semi-permeable things in two different directions and with two different operations: there are several different kinds of them.
> #define __mutex_fastpath_lock(count, fail_fn) \ > do { \ > + smp_mb__before_atomic_dec(); \ > if (unlikely(atomic_dec_return(count) < 0)) \ > fail_fn(count); \ > } while (0)
So I think the barrier has to come _after_ the atomic decrement (or exchange).
Because as it is written now, any writes in the locked region could percolate up to just before the atomic dec - ie _outside_ the region. Which is against the whole point of a lock - it would allow another CPU to see the write even before it sees that the lock was successful, as far as I can tell.
But memory ordering is subtle, so maybe I'm missing something..
>>The eieio is completly unnecessary, it got picked up from >>atomic_dec_return (Anton, why is there an eieio at the start of >>atomic_dec_return in the first place?).
> a mutex is like a spinlock, it must prevent loads and stores within the > critical section from 'leaking outside the critical section' [they must > not be reordered to before the mutex_lock(), nor to after the > mutex_unlock()] - hence the barriers added by atomic_dec_return() are > very much needed.
The bne- and isync together form a sufficient import barrier. See PowerPC Book2 Appendix B.2.1.1
And if the eieio was necessary it should come after not before twidling the lock bits. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Wednesday, January 4, 2006 7:39 pm, Anton Blanchard wrote: > SGIs mmiowb() might be useful for some of these cases but every time > its brought up everyone ends up confused as to its real use.
It's documented in Documentation/DocBook/deviceiobook.tmpl. If the documentation isn't clear, we should fix it, rather than avoid using the primitive altogether. If drivers/net really means mmiowb() in some places, we should change it, and like you said maybe get rid of some of these primitives so that their usage is clearer.
> > the patch below adds the barriers to the asm-generic mutex routines, so > > it's not like i'm lazy ;), but i really think this is unnecessary. > > Adding this patch would add a second, unnecessary barrier for all the > > arches that have barrier-less atomic ops.
> > it also makes sense: the moment you are interested in the 'previous > > value' of the atomic counter in an atomic fashion, you very likely want > > to use it for a critical section. (e.g. all the put-the-resource ops > > that use atomic_dec_test() rely on this implicit barrier.)
> Ok, fair enough. However, that still leaves the question of which way > the barrier works. Traditionally, we have only cared about one thing: > that all preceding writes have finished, because the > "atomic_dec_return" thing is used as a _reference_counter_, and we're > going to release the thing.
yeah, i think you are right. Here's a detailed analysis of why you are right about atomic_dec_return():
there are 8 types of instruction reordering that can happen at the beginning and at the end of a critical section. Firstly, here's the programmed order of instructions:
a pre-read crossing forwards into (and over) the critical section: OK a pre-write crossing forwards into (and over) the critical section: OK a critical-read crossing backwards across critical-START: BAD a critical-write crossing backwards across critical-START: BAD a critical-read crossing forwards across critical-END: BAD a critical-write crossing forwards across critical-END: BAD a post-read crossing backwards into (and over) the critical section: OK a post-write crossing backwards into (and over) the critical section: OK
so critical-START needs to be a read and a write barrier for reads and writes happening after it. I.e. it's a memory barrier that only lets instruction reordering in a forward direction, not in a backwards direction.
critical-END needs to be a read and write barrier that only lets instructions into the critical section in a backwards direction - and it's a full memory barrier otherwise.
AFAICS, currently we dont have such a 'half-conductor / diode' memory barrier primitive: smp_mb() is a full barrier for both directions. But lets assume they existed, and lets call them smp_mb_forwards() and smp_mb_backwards().
furthermore, the locking and unlocking instruction must not cross into the critical section, so the lock sequence must be at least:
i also think this is the 'absolute minimum memory ordering requirement' for critical sections: relaxing this any further is not possible without breaking critical sections.
i doubt many (in fact, any) CPUs are capable of expressing it in such a finegrained way. With our existing primitives, probably the closest one would be:
lock smp_mb(); ... smp_mb(); unlock
as long as the CPU always executes the lock and unlock stores (which go to the same address) in program order. (is there any CPU doesnt do that?)
in that sense, both atomic_dec_return() and atomic_inc_return() are in indeed incorrect (for the use of mutexes) e.g. on ppc64. They are both done via:
eioio ... atomic-dec ... isync
eioio is a stronger than smp_wmb() - it is a barrier for system memory and IO space memory writes. isync is a read barrier - it throws away all speculative register contents. So it is roughly equivalent to:
smp_wmb(); ... atomic-dec ... smp_rmb();
this fulfills the requirement of the critical section not leaking out of the lock sequence itself, but (if i got the ppc64 semantics right) it doesnt protect a write within the critical section to cross out over the smp_rmb(), and to get reordered with the atomic-dec - violating the critical section rules.
some other architectures are safe by accident, e.g. Alpha's atomic_dec_return() does:
smp_mb(); ... atomic-dec ... smp_mb();
which is overkill, full read and write barrier on both sides.
Sparc64's atomic_dec_return() does yet another thing:
AFAICS this violates the requirements: a load from within the critical section may go before the atomic-conditional-store, and may thus be executed before the critical section acquires the lock.
on MIPS, atomic_dec_return() does what is equivalent to:
... atomic-dec ... smp_mb();
which is fine for a lock sequence, but atomic_inc_return() is not adequate for an unlock sequence:
... atomic-inc ... smp_mb();
because this allows reads and writes within the critical section to reorder with the atomic-inc instructions.
to sum it up: atomic_dec/inc_return() alone is not enough to implement critical sections, on a number of architectures. atomic_xchg() seems to have similar problems too.
the patch below adds the smp_mb() barriers to the generic headers, which should now fulfill all the ordering requirements, on every architecture. It only relies on one property of the atomic primitives: that they wont get reordered with respect to themselves, so an atomic_inc_ret() and an atomic_dec_ret() cannot switch place.
Can you see any hole in this reasoning?
Ingo
Index: linux/include/asm-generic/mutex-dec.h =================================================================== --- linux.orig/include/asm-generic/mutex-dec.h +++ linux/include/asm-generic/mutex-dec.h @@ -21,6 +21,8 @@ do { \ if (unlikely(atomic_dec_return(count) < 0)) \ fail_fn(count); \ + else \ + smp_mb(); \ } while (0)
> to sum it up: atomic_dec/inc_return() alone is not enough to implement > critical sections, on a number of architectures. atomic_xchg() seems to > have similar problems too.
Yes.
> the patch below adds the smp_mb() barriers to the generic headers, which > should now fulfill all the ordering requirements, on every architecture. > It only relies on one property of the atomic primitives: that they wont > get reordered with respect to themselves, so an atomic_inc_ret() and an > atomic_dec_ret() cannot switch place.
> Can you see any hole in this reasoning?
No. The alternative is to just make the ordering requirements for "atomic_dec_return()" and "atomic_xchg()" be absolute. Say that they have to be full memory barriers, and push the problem into the low-level architecture.
I _think_ your patch is the right approach, because most architectures are likely to do their own fast-paths for mutexes, and as such the generic ones are more of a template for how to do it, and hopefilly aren't that performance critical.
> The bne- and isync together form a sufficient import barrier. See > PowerPC Book2 Appendix B.2.1.1
ok. Please correct me if i'm wrong: the question is, could we on ppc64 use atomic_dec_return() for mutex_lock(), and atomic_inc_return() for mutex_unlock().
the EIEIO_ON_SMP is in essence smp_wmb(), correct? (it's a bit stronger because it also orders IO-space writes, but it doesnt impose any restrictions on reads)
ISYNC_ON_SMP flushes all speculative reads currently in the queue - and is hence a smp_rmb_backwards() primitive [per my previous mail] - but does not affect writes - correct?
if that's the case, what prevents a store from within the critical section going up to right after the EIEIO_ON_SMP, but before the atomic-dec instructions? Does any of those instructions imply some barrier perhaps? Are writes always ordered perhaps (like on x86 CPUs), and hence the store before the bne is an effective write-barrier?
> I _think_ your patch is the right approach, because most architectures > are likely to do their own fast-paths for mutexes, and as such the > generic ones are more of a template for how to do it, and hopefilly > aren't that performance critical.
yeah, i think so too. We've got 3 architectures done in assembly so far, and it seems people like optimizing such code. Also, since the generic code does all the boring slowpath stuff, the architecture can concentrate on the fun part alone: to make the fastpath really fast. The generic code is still in full control of all the mutex semantics, and can ignore asm/mutex.h when it wants/needs to. So i'm quite happy with the current design and i'm not against more per-arch assembly fun, at all.
there's one exception i think: atomic-xchg.h was pretty optimal on ARM, and i'd expect it to be pretty optimal on the other atomic-swap platforms too. So maybe we should specify atomic_xchg() to _not_ imply a full barrier - it's a new API anyway. We cannot embedd the barrier within atomic_xchg(), because the barrier is needed at different ends for lock and for unlock, and adding two barriers would be unnecessary.
asm-generic/mutex-dec.h is less optimal (and thus less critical), and i can see no easy way to modify it, because i think it would be quite confusing to enforce 'lock' ordering for atomic_dec_return(), and 'unlock' ordering for atomic_inc_return(). We cannot remove the existing barriers either (and add them explicitly), because there are existing users of these primitives. (although we could add explicit barriers to those too - but probably not worth the trouble)
> ISYNC_ON_SMP flushes all speculative reads currently in the queue - and > is hence a smp_rmb_backwards() primitive [per my previous mail] - but > does not affect writes - correct?
> if that's the case, what prevents a store from within the critical > section going up to right after the EIEIO_ON_SMP, but before the > atomic-dec instructions? Does any of those instructions imply some > barrier perhaps? Are writes always ordered perhaps (like on x86 CPUs), > and hence the store before the bne is an effective write-barrier?
While isync technically doesn't order stores it does order instructions. The previous bne- must complete, that bne- is dependent on the previous stwcx being complete. So no stores are slipping up. To get a better explanation you will have to read the document yourself.
Here is a first pass at a powerpc file for the fast paths just as an FYI/RFC. It is completely untested, but compiles.
Signed-off-by: Joel Schopp <jsch...@austin.ibm.com>
>>Here is a first pass at a powerpc file for the fast paths just as an FYI/RFC. >>It is completely untested, but compiles.
> Shouldn't you make that "isync" dependent on SMP too? UP doesn't need it, > since DMA will never matter, and interrupts are precise.
> Linus
I think the isync is necessary to keep heavily out of order processors from getting ahead of themselves even on UP. Scanning back through the powerpc spinlock code they seem to take the same view there as well. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
> > Shouldn't you make that "isync" dependent on SMP too? UP doesn't > > need it, since DMA will never matter, and interrupts are precise.
> I think the isync is necessary to keep heavily out of order processors > from getting ahead of themselves even on UP. Scanning back through > the powerpc spinlock code they seem to take the same view there as > well.
the asm/spinlock.h ops are only built on SMP kernels. mutex.h is for both UP and SMP. On UP you should need no synchronization, because the only way another context could interfere with your critical section is by getting interrupted, and interrupts are fully synchronizing, right? On UP the only synchronization needed is when a device reads/writes memory in parallel to the CPU.
On Thu, Jan 05, 2006 at 05:06:26PM -0600, Joel Schopp wrote: > Here is a first pass at a powerpc file for the fast paths just as an > FYI/RFC. It is completely untested, but compiles.
You really should test it, it saves reviewers time. It's not that hard to at least try booting it.
Besides the isync comments earlier, there's a bunch of whitespace issues going on. Did you copy and paste the code from somewhere? If so, you should move the original copyright over too.
All your macros use spaces instead of tabs up to the \, should be changed.
All tmp variables should be ints, since the atomic_t counter is a 32-bit variable. If you use longs, and lwarx (loads 32-bit without sign extend), the comparison with < 0 will never be true.
> Index: 2.6.15-mutex14/include/asm-powerpc/mutex.h > =================================================================== > --- 2.6.15-mutex14.orig/include/asm-powerpc/mutex.h 2006-01-04 14:46:31.%N -0600 > +++ 2.6.15-mutex14/include/asm-powerpc/mutex.h 2006-01-05 16:25:41.%N -0600 > @@ -1,9 +1,83 @@ > /* > - * Pull in the generic implementation for the mutex fastpath. > + * include/asm-powerpc/mutex.h
Ingo Molnar (on Thu, 5 Jan 2006 23:43:57 +0100) wrote:
>there's one exception i think: atomic-xchg.h was pretty optimal on ARM, >and i'd expect it to be pretty optimal on the other atomic-swap >platforms too. So maybe we should specify atomic_xchg() to _not_ imply a >full barrier - it's a new API anyway. We cannot embedd the barrier >within atomic_xchg(), because the barrier is needed at different ends >for lock and for unlock, and adding two barriers would be unnecessary.
IA64 defines two qualifiers for cmpxchg, specifically for distinguishing between acquiring and releasing the lock.
cmpxchg<sz>.<sem>
<sz> is the data size, 1, 2, 4 or 8. <sem> is one of 'acq' or 'rel'.
sem Ordering Semaphore Operation Completer Semantics acq Acquire The memory read/write is made visible prior to all subsequent data memory accesses. rel Release The memory read/write is made visible after all previous data memory accesses.
cmpxchg4.acq prevents following data accesses from migrating before taking the lock (critical R/W cannot precede critical-START). cmpxchg4.rel prevents preceding data accesses from migrating after releasing the lock (critical R/W cannot follow critical-END). I suggest adding acq and rel hints to atomic_xchg, and let architectures that implement suitable operations use those hints.
> > the patch below adds the barriers to the asm-generic mutex routines, so > > it's not like i'm lazy ;), but i really think this is unnecessary. > > Adding this patch would add a second, unnecessary barrier for all the > > arches that have barrier-less atomic ops.
> > it also makes sense: the moment you are interested in the 'previous > > value' of the atomic counter in an atomic fashion, you very likely want > > to use it for a critical section. (e.g. all the put-the-resource ops > > that use atomic_dec_test() rely on this implicit barrier.)
> So I _think_ your argument is bogus, and your patch is bogus. The use of > "atomic_dec_return()" in a mutex is _not_ the same barrier as using it for > reference counting. Not at all. Memory barriers aren't just one thing: > they are semi-permeable things in two different directions and with two > different operations: there are several different kinds of them.
> So I think the barrier has to come _after_ the atomic decrement (or > exchange).
> Because as it is written now, any writes in the locked region could > percolate up to just before the atomic dec - ie _outside_ the region. > Which is against the whole point of a lock - it would allow another CPU to > see the write even before it sees that the lock was successful, as far as > I can tell.
> But memory ordering is subtle, so maybe I'm missing something..
We mere humans^W device driver people get more confused with barriers every day, as CPUs get more subtle in their out-of-order-ness.
I think adding longer-named-but-self-explanatory aliases for memory and io barrier functions can help.
mmiowb => barrier_memw_iow ... => barrier_memw_memw (a store-store barrier to mem) ...
General template for the name may be something like
Are there even more subtle cases? -- vda - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
This is the second pass at optimizing the fastpath for the new mutex subsystem on PowerPC. I think it is ready to be included in the series with the other mutex patches now. Tested on a 4 core (2 SMT threads/core) Power5 machine with gcc 3.3.2.
> This is the second pass at optimizing the fastpath for the new mutex subsystem > on PowerPC. I think it is ready to be included in the series with the other > mutex patches now. Tested on a 4 core (2 SMT threads/core) Power5 machine with > gcc 3.3.2.
Doens't this mean that the sped-up mutexes are still slower than semaphores? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/