Re: [PATCH] Document Linux's memory barriers

Stephen Hemminger

Mar 7, 2006, 12:50:11 PM

This has been needed for quite some time but needs some more
additions:

1) Access to i/o mapped memory does not need memory barriers.

2) Explain difference between mb() and barrier().

3) Explain wmb() versus mmiowb()

Give some more examples of correct usage in drivers.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

David Howells

Mar 7, 2006, 12:50:13 PM

The attached patch documents the Linux kernel's memory barriers.

Signed-Off-By: David Howells <dhow...@redhat.com>
---
warthog>diffstat -p1 mb.diff
Documentation/memory-barriers.txt | 359 ++++++++++++++++++++++++++++++++++++++
1 files changed, 359 insertions(+)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
new file mode 100644
index 0000000..c2fc51b
--- /dev/null
+++ b/Documentation/memory-barriers.txt
@@ -0,0 +1,359 @@
+ ============================
+ LINUX KERNEL MEMORY BARRIERS
+ ============================
+
+Contents:
+
+ (*) What are memory barriers?
+
+ (*) Linux kernel memory barrier functions.
+
+ (*) Implied kernel memory barriers.
+
+ (*) i386 and x86_64 arch specific notes.
+
+
+=========================
+WHAT ARE MEMORY BARRIERS?
+=========================
+
+Memory barriers are instructions to both the compiler and the CPU to impose a
+partial ordering between the memory access operations specified either side of
+the barrier.
+
+Older and less complex CPUs will perform memory accesses in exactly the order
+specified, so if one is given the following piece of code:
+
+ a = *A;
+ *B = b;
+ c = *C;
+ d = *D;
+ *E = e;
+
+It can be guaranteed that it will complete the memory access for each
+instruction before moving on to the next line, leading to a definite sequence
+of operations on the bus:
+
+ read *A, write *B, read *C, read *D, write *E.
+
+However, with newer and more complex CPUs, this isn't always true because:
+
+ (*) they can rearrange the order of the memory accesses to promote better use
+ of the CPU buses and caches;
+
+ (*) reads are synchronous and may need to be done immediately to permit
+ progress, whereas writes can often be deferred without a problem;
+
+ (*) and they are able to combine reads and writes to improve performance when
+ talking to the SDRAM (modern SDRAM chips can do batched accesses of
+ adjacent locations, cutting down on transaction setup costs).
+
+So what you might actually get from the above piece of code is:
+
+ read *A, read *C+*D, write *E, write *B
+
+Under normal operation, this is probably not going to be a problem; however,
+there are two circumstances where it definitely _can_ be a problem:
+
+ (1) I/O
+
+ Many I/O devices can be memory mapped, and so appear to the CPU as if
+ they're just memory locations. However, to control the device, the driver
+ has to make the right accesses in exactly the right order.
+
+ Consider, for example, an ethernet chipset such as the AMD PCnet32. It
+ presents to the CPU an "address register" and a bunch of "data registers".
+ The way it's accessed is to write the index of the internal register you
+ want to access to the address register, and then read or write the
+ appropriate data register to access the chip's internal register:
+
+ *ADR = ctl_reg_3;
+ reg = *DATA;
+
+ The problem with a clever CPU or a clever compiler is that the write to
+ the address register isn't guaranteed to happen before the access to the
+ data register, if the CPU or the compiler thinks it is more efficient to
+ defer the address write:
+
+ read *DATA, write *ADR
+
+ then things will break.
+
+ The way to deal with this is to insert an I/O memory barrier between the
+ two accesses:
+
+ *ADR = ctl_reg_3;
+ mb();
+ reg = *DATA;
+
+ In this case, the barrier makes a guarantee that all memory accesses
+ before the barrier will happen before all the memory accesses after the
+ barrier. It does _not_ guarantee that all memory accesses before the
+ barrier will be complete by the time the barrier is complete.
+
+ (2) Multiprocessor interaction
+
+ When there's a system with more than one processor, these may be working
+ on the same set of data, but attempting not to use locks as locks are
+ quite expensive. This means that accesses that affect both CPUs may have
+ to be carefully ordered to prevent error.
+
+ Consider the R/W semaphore slow path. In that, a waiting process is
+ queued on the semaphore, as noted by it having a record on its stack
+ linked to the semaphore's list:
+
+ struct rw_semaphore {
+ ...
+ struct list_head waiters;
+ };
+
+ struct rwsem_waiter {
+ struct list_head list;
+ struct task_struct *task;
+ };
+
+ To wake up the waiter, the up_read() or up_write() functions have to read
+ the pointer from this record to find the next waiter record, read and
+ clear the task pointer, call wake_up_process() on the task, and release
+ the task struct reference held:
+
+ READ waiter->list.next;
+ READ waiter->task;
+ WRITE waiter->task;
+ CALL wakeup
+ RELEASE task
+
+ If any of these steps occur out of order, then the whole thing may fail.
+
+ Note that the waiter does not get the semaphore lock again - it just waits
+ for its task pointer to be cleared. Since the record is on its stack, this
+ means that if the task pointer is cleared _before_ the next pointer in the
+ list is read, then another CPU might start processing the waiter and it
+ might clobber its stack before up*() functions have a chance to read the
+ next pointer.
+
+ CPU 0 CPU 1
+ =============================== ===============================
+ down_xxx()
+ Queue waiter
+ Sleep
+ up_yyy()
+ READ waiter->task;
+ WRITE waiter->task;
+ <preempt>
+ Resume processing
+ down_xxx() returns
+ call foo()
+ foo() clobbers *waiter
+ </preempt>
+ READ waiter->list.next;
+ --- OOPS ---
+
+ This could be dealt with using a spinlock, but then the down_xxx()
+ function has to get the spinlock again after it's been woken up, which is
+ a waste of resources.
+
+ The way to deal with this is to insert an SMP memory barrier:
+
+ READ waiter->list.next;
+ READ waiter->task;
+ smp_mb();
+ WRITE waiter->task;
+ CALL wakeup
+ RELEASE task
+
+ In this case, the barrier makes a guarantee that all memory accesses
+ before the barrier will happen before all the memory accesses after the
+ barrier. It does _not_ guarantee that all memory accesses before the
+ barrier will be complete by the time the barrier is complete.
+
+ SMP memory barriers are normally no-ops on a UP system because the CPU
+ orders overlapping accesses with respect to itself.
+
+
+=====================================
+LINUX KERNEL MEMORY BARRIER FUNCTIONS
+=====================================
+
+The Linux kernel has six basic memory barriers:
+
+ MANDATORY (I/O) SMP
+ =============== ================
+ GENERAL mb() smp_mb()
+ READ rmb() smp_rmb()
+ WRITE wmb() smp_wmb()
+
+General memory barriers make a guarantee that all memory accesses specified
+before the barrier will happen before all memory accesses specified after the
+barrier.
+
+Read memory barriers make a guarantee that all memory reads specified before
+the barrier will happen before all memory reads specified after the barrier.
+
+Write memory barriers make a guarantee that all memory writes specified before
+the barrier will happen before all memory writes specified after the barrier.
+
+SMP memory barriers are no-ops on uniprocessor compiled systems because it is
+assumed that a CPU will be self-consistent, and will order overlapping accesses
+with respect to itself.
+
+There is no guarantee that any of the memory accesses specified before a memory
+barrier will be complete by the completion of a memory barrier; the barrier can
+be considered to draw a line in the access queue that accesses of the
+appropriate type may not cross.
+
+There is no guarantee that issuing a memory barrier on one CPU will have any
+direct effect on another CPU or any other hardware in the system. The indirect
+effect will be the order the first CPU commits its accesses to the bus.
+
+Note that these are the _minimum_ guarantees. Different architectures may give
+more substantial guarantees, but they may not be relied upon outside of arch
+specific code.
+
+
+There are some more advanced barriering functions:
+
+ (*) set_mb(var, value)
+ (*) set_wmb(var, value)
+
+ These assign the value to the variable and then insert at least a write
+ barrier after it, depending on the function.
+
+
+==============================
+IMPLIED KERNEL MEMORY BARRIERS
+==============================
+
+Some of the other functions in the Linux kernel imply memory barriers. For
+instance, all the following (pseudo-)locking functions imply barriers.
+
+ (*) interrupt disablement and/or interrupts
+ (*) spin locks
+ (*) R/W spin locks
+ (*) mutexes
+ (*) semaphores
+ (*) R/W semaphores
+
+In all cases there are variants on a LOCK operation and an UNLOCK operation.
+
+ (*) LOCK operation implication:
+
+ Memory accesses issued after the LOCK will be completed after the LOCK
+ accesses have completed.
+
+ Memory accesses issued before the LOCK may be completed after the LOCK
+ accesses have completed.
+
+ (*) UNLOCK operation implication:
+
+ Memory accesses issued before the UNLOCK will be completed before the
+ UNLOCK accesses have completed.
+
+ Memory accesses issued after the UNLOCK may be completed before the UNLOCK
+ accesses have completed.
+
+ (*) LOCK vs UNLOCK implication:
+
+ The LOCK accesses will be completed before the UNLOCK accesses.
+
+Locks and semaphores may not provide any guarantee of ordering on UP compiled
+systems, and so can't be counted on in such a situation to actually do
+anything at all, especially with respect to I/O memory barriering.
+
+Either interrupt disablement (LOCK) and enablement (UNLOCK) will barrier
+memory and I/O accesses individually, or interrupt handling will barrier
+memory and I/O accesses on entry and on exit. This prevents an interrupt
+routine interfering with accesses made in a disabled-interrupt section of code
+and vice versa.
+
+This specification is a _minimum_ guarantee; any particular architecture may
+provide more substantial guarantees, but these may not be relied upon outside
+of arch specific code.
+
+
+As an example, consider the following:
+
+ *A = a;
+ *B = b;
+ LOCK
+ *C = c;
+ *D = d;
+ UNLOCK
+ *E = e;
+ *F = f;
+
+The following sequence of events on the bus is acceptable:
+
+ LOCK, *F+*A, *E, *C+*D, *B, UNLOCK
+
+But none of the following are:
+
+ *F+*A, *B, LOCK, *C, *D, UNLOCK, *E
+ *A, *B, *C, LOCK, *D, UNLOCK, *E, *F
+ *A, *B, LOCK, *C, UNLOCK, *D, *E, *F
+ *B, LOCK, *C, *D, UNLOCK, *F+*A, *E
+
+
+Consider also the following (going back to the AMD PCnet example):
+
+ DISABLE IRQ
+ *ADR = ctl_reg_3;
+ mb();
+ x = *DATA;
+ *ADR = ctl_reg_4;
+ mb();
+ *DATA = y;
+ *ADR = ctl_reg_5;
+ mb();
+ z = *DATA;
+ ENABLE IRQ
+ <interrupt>
+ *ADR = ctl_reg_7;
+ mb();
+ q = *DATA;
+ </interrupt>
+
+What's to stop "z = *DATA" crossing "*ADR = ctl_reg_7" and reading from the
+wrong register? (There's no guarantee that the process of handling an
+interrupt will barrier memory accesses in any way).
+
+
+==============================
+I386 AND X86_64 SPECIFIC NOTES
+==============================
+
+Earlier i386 CPUs (pre-Pentium-III) are fully ordered - the operations on the
+bus appear in program order - and so there's no requirement for any sort of
+explicit memory barriers.
+
+From the Pentium-III onwards there are three new memory barrier instructions:
+LFENCE, SFENCE and MFENCE which correspond to the kernel memory barrier
+functions rmb(), wmb() and mb(). However, there are additional implicit memory
+barriers in the CPU implementation:
+
+ (*) Interrupt processing implies mb().
+
+ (*) The LOCK prefix adds implication of mb() on whatever instruction it is
+ attached to.
+
+ (*) Normal writes to memory imply wmb() [and so SFENCE is normally not
+ required].
+
+ (*) Normal writes imply a semi-rmb(): reads before a write may not complete
+ after that write, but reads after a write may complete before the write
+ (ie: reads may go _ahead_ of writes).
+
+ (*) Non-temporal writes imply no memory barrier, and are the intended target
+ of SFENCE.
+
+ (*) Accesses to uncached memory imply mb() [eg: memory mapped I/O].
+
+
+======================
+POWERPC SPECIFIC NOTES
+======================
+
+The powerpc is weakly ordered, and its read and write accesses may be
+completed generally in any order. Its memory barriers are also to some extent
+more substantial than the minimum requirement, and may directly affect
+hardware outside of the CPU.

Andi Kleen

Mar 7, 2006, 1:10:08 PM

On Tuesday 07 March 2006 18:40, David Howells wrote:

> +Older and less complex CPUs will perform memory accesses in exactly the order
> +specified, so if one is given the following piece of code:
> +
> + a = *A;
> + *B = b;
> + c = *C;
> + d = *D;
> + *E = e;
> +
> +It can be guaranteed that it will complete the memory access for each
> +instruction before moving on to the next line, leading to a definite sequence
> +of operations on the bus:

Actually gcc is free to reorder it
(often it will not when it cannot prove that they don't alias, but sometimes
it can)
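
The split between compiler reordering and CPU reordering can be sketched in userspace C. This is only a rough illustration, assuming GCC or Clang and C11 `<stdatomic.h>`; `compiler_barrier()`, `full_barrier()`, `slot_a`, `slot_b` and `publish_pair()` are hypothetical names, not the kernel's own barrier() and mb() definitions, although they follow the same division of labour.

```c
#include <stdatomic.h>

/* Hypothetical userspace stand-ins, assuming GCC or Clang. */

/* Compiler-only barrier: emits no instruction, but the compiler may not
 * move memory accesses across it or keep values cached in registers. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Full barrier: constrains the compiler and also emits whatever fence
 * instruction the target CPU needs. */
#define full_barrier() atomic_thread_fence(memory_order_seq_cst)

static int slot_a, slot_b;

/* Without compiler_barrier() the compiler may emit the two stores in
 * either order. With it, the store to slot_a is emitted first, but the
 * CPU may still reorder the stores; only full_barrier() constrains that. */
static int publish_pair(int a, int b)
{
	slot_a = a;
	compiler_barrier();
	slot_b = b;
	full_barrier();
	return slot_a + slot_b;
}
```

The point of keeping the two macros separate is that the first costs nothing at run time, while the second may be expensive on some CPUs.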

> +
> + Consider, for example, an ethernet chipset such as the AMD PCnet32. It
> + presents to the CPU an "address register" and a bunch of "data registers".
> + The way it's accessed is to write the index of the internal register you
> + want to access to the address register, and then read or write the
> + appropriate data register to access the chip's internal register:
> +
> + *ADR = ctl_reg_3;
> + reg = *DATA;

You're not supposed to do it this way anyways. The official way to access
MMIO space is using read/write[bwlq]

Haven't read all of it sorry, but thanks for the work of documenting
it.

-Andi

Alan Cox

Mar 7, 2006, 1:40:15 PM

On Tue, 2006-03-07 at 17:40 +0000, David Howells wrote:
> +Older and less complex CPUs will perform memory accesses in exactly the order
> +specified, so if one is given the following piece of code:

Not really true. Some of the fairly old dumb processors don't do this to
the bus, and just about anything with a cache won't (as it'll burst cache
lines to main memory).

> + want to access to the address register, and then read or write the
> + appropriate data register to access the chip's internal register:
> +
> + *ADR = ctl_reg_3;
> + reg = *DATA;

Not allowed anyway

> + In this case, the barrier makes a guarantee that all memory accesses
> + before the barrier will happen before all the memory accesses after the
> + barrier. It does _not_ guarantee that all memory accesses before the
> + barrier will be complete by the time the barrier is complete.

Better meaningful example would be barriers versus an IRQ handler. Which
leads nicely onto section 2

> +General memory barriers make a guarantee that all memory accesses specified
> +before the barrier will happen before all memory accesses specified after the
> +barrier.

No. They guarantee that to an observer also running on that set of
processors the accesses to main memory will appear to be ordered in that
manner. They don't guarantee I/O related ordering for non main memory
due to things like PCI posting rules and NUMA goings on.

As an example of the difference here a Geode will reorder stores as it
feels but snoop the bus such that it can ensure an external bus master
cannot observe this by holding it off the bus to fix up ordering
violations first.
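
The "observer on another processor" view of a barrier can be sketched with C11 fences. This is a userspace analogue under stated assumptions, not kernel code: `payload`, `ready`, `publish()` and `try_consume()` are hypothetical names, and the release and acquire fences only roughly correspond to smp_wmb() and smp_rmb(). It says nothing about ordering of I/O accesses.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared state; in the kernel this would be ordinary main
 * memory shared between CPUs. */
static int payload;
static atomic_bool ready;

/* Writer: the release fence keeps the payload store ahead of the flag
 * store as seen by any CPU that observes the flag. */
static void publish(int value)
{
	payload = value;
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/* Reader: the acquire fence keeps the payload load behind the flag load.
 * Returns false if nothing has been published yet. */
static bool try_consume(int *out)
{
	if (!atomic_load_explicit(&ready, memory_order_relaxed))
		return false;
	atomic_thread_fence(memory_order_acquire);
	*out = payload;
	return true;
}
```

Neither fence forces the stores out to memory by any particular time; each only restricts the order in which another CPU can observe them.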

> +Read memory barriers make a guarantee that all memory reads specified before
> +the barrier will happen before all memory reads specified after the barrier.
> +
> +Write memory barriers make a guarantee that all memory writes specified before
> +the barrier will happen before all memory writes specified after the barrier.

Both with the caveat above

> +There is no guarantee that any of the memory accesses specified before a memory
> +barrier will be complete by the completion of a memory barrier; the barrier can
> +be considered to draw a line in the access queue that accesses of the
> +appropriate type may not cross.

CPU generated accesses to main memory

> + (*) interrupt disablement and/or interrupts
> + (*) spin locks
> + (*) R/W spin locks
> + (*) mutexes
> + (*) semaphores
> + (*) R/W semaphores

Should probably cover schedule() here.

> +Locks and semaphores may not provide any guarantee of ordering on UP compiled
> +systems, and so can't be counted on in such a situation to actually do
> +anything at all, especially with respect to I/O memory barriering.

_irqsave/_irqrestore ...

> +==============================
> +I386 AND X86_64 SPECIFIC NOTES
> +==============================
> +
> +Earlier i386 CPUs (pre-Pentium-III) are fully ordered - the operations on the
> +bus appear in program order - and so there's no requirement for any sort of
> +explicit memory barriers.

Actually they are not. Processors prior to Pentium Pro ensure that the
perceived ordering between processors of writes to main memory is
preserved. The Pentium Pro is supposed to but does not in SMP cases. Our
spin_unlock code knows about this. It also has some problems with this
situation when handling write combining memory. The IDT Winchip series
processors are run in out of order store mode and our lock functions and
dmamappers should know enough about this.

On x86, read memory barriers serialize ordering using locked instructions;
for writes, the Winchip at least generates serializing instructions.

barrier() is pure CPU level of course

> + (*) Normal writes to memory imply wmb() [and so SFENCE is normally not
> + required].

Only at an on processor level and not for all clones, also there are
errata here for PPro.

> + (*) Accesses to uncached memory imply mb() [eg: memory mapped I/O].

Not always. MMIO ordering is outside of the CPU ordering rules and into
PCI and other bus ordering rules. Consider

writel(STOP_DMA, &foodev->ctrl);
free_dma_buffers(foodev);

This leads to horrible disasters.

> +
> +======================
> +POWERPC SPECIFIC NOTES

Can't comment on PPC

David Howells

Mar 7, 2006, 1:40:16 PM

Andi Kleen <a...@suse.de> wrote:

> Actually gcc is free to reorder it
> (often it will not when it cannot prove that they don't alias, but sometimes
> it can)

Yeah... I have mentioned the fact that compilers can reorder too, but
obviously not enough.

> You're not supposed to do it this way anyways. The official way to access
> MMIO space is using read/write[bwlq]

True, I suppose. I should make it clear that these accessor functions imply
memory barriers, if indeed they do, and that you should use them rather than
accessing I/O registers directly (at least, outside the arch you should).

David

Jesse Barnes

Mar 7, 2006, 1:50:12 PM

On Tuesday, March 7, 2006 10:30 am, David Howells wrote:
> True, I suppose. I should make it clear that these accessor functions
> imply memory barriers, if indeed they do, and that you should use them
> rather than accessing I/O registers directly (at least, outside the
> arch you should).

But they don't, that's why we have mmiowb(). There are lots of cases to
handle:
1) memory vs. memory
2) memory vs. I/O
3) I/O vs. I/O
(reads and writes for every case).

AFAIK, we have (1) fairly well handled with a plethora of barrier ops.
(2) is a bit fuzzy with the current operations I think, and for (3) all
we have is mmiowb() afaik. Maybe one of the ppc64 guys can elaborate on
the barriers their hw needs for the above cases (I think they're the
pathological case, so covering them should be good enough for everybody).

Btw, thanks for putting together this documentation, it's desperately
needed.

Jesse

Andi Kleen

Mar 7, 2006, 1:50:16 PM

On Tuesday 07 March 2006 19:30, David Howells wrote:

> > You're not supposed to do it this way anyways. The official way to access
> > MMIO space is using read/write[bwlq]
>
> True, I suppose. I should make it clear that these accessor functions imply
> memory barriers, if indeed they do,

I don't think they do.

> and that you should use them rather than
> accessing I/O registers directly (at least, outside the arch you should).

Even inside the architecture it's a good idea.

-Andi

linux-os (Dick Johnson)

Mar 7, 2006, 2:00:07 PM

On Tue, 7 Mar 2006, Alan Cox wrote:
[SNIPPED...]

>
> Not always. MMIO ordering is outside of the CPU ordering rules and into
> PCI and other bus ordering rules. Consider
>
> writel(STOP_DMA, &foodev->ctrl);
> free_dma_buffers(foodev);
>
> This leads to horrible disasters.

This might be a good place to document:
dummy = readl(&foodev->ctrl);

Will flush all pending writes to the PCI bus and that:
(void) readl(&foodev->ctrl);
... won't because `gcc` may optimize it away. In fact, variable
"dummy" should be global or `gcc` may make it go away as well.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.15.4 on an i686 machine (5589.50 BogoMips).
Warning : 98.36% of all statistics are fiction, book release in April.
_

****************************************************************
The information transmitted in this message is confidential and may be privileged. Any review, retransmission, dissemination, or other use of this information by persons or entities other than the intended recipient is prohibited. If you are not the intended recipient, please notify Analogic Corporation immediately - by replying to this message or by sending an email to Deliver...@analogic.com - and destroy all copies of this information, including any attachments, without reading or disclosing them.

Thank you.

Matthew Wilcox

Mar 7, 2006, 2:10:14 PM

On Tue, Mar 07, 2006 at 01:54:33PM -0500, linux-os (Dick Johnson) wrote:
> This might be a good place to document:
> dummy = readl(&foodev->ctrl);
>
> Will flush all pending writes to the PCI bus and that:
> (void) readl(&foodev->ctrl);
> ... won't because `gcc` may optimize it away. In fact, variable
> "dummy" should be global or `gcc` may make it go away as well.

static inline unsigned int readl(const volatile void __iomem *addr)
{
return *(volatile unsigned int __force *) addr;
}

The cast is volatile, so gcc knows not to optimise it away.
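
A userspace analogue of the accessor quoted above shows the shape of a read whose result is thrown away. It is a sketch only: `fake_readl()` and `read_and_discard()` are hypothetical names, the kernel's `__iomem` and `__force` annotations are omitted, and the comment describes how compilers such as GCC are documented to treat volatile accesses rather than a guarantee about any particular build.

```c
#include <stdint.h>

/* Hypothetical userspace analogue of the accessor quoted above. */
static inline uint32_t fake_readl(const volatile void *addr)
{
	return *(const volatile uint32_t *)addr;
}

/* Reads the register purely for its side effect and discards the value.
 * The access goes through a volatile lvalue, so the compiler is expected
 * to emit the load even though the result is unused. */
static inline void read_and_discard(const volatile void *addr)
{
	(void)fake_readl(addr);
}
```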

linux-os (Dick Johnson)

Mar 7, 2006, 2:20:12 PM

On Tue, 7 Mar 2006, Matthew Wilcox wrote:

> On Tue, Mar 07, 2006 at 01:54:33PM -0500, linux-os (Dick Johnson) wrote:
>> This might be a good place to document:
>> dummy = readl(&foodev->ctrl);
>>
>> Will flush all pending writes to the PCI bus and that:
>> (void) readl(&foodev->ctrl);
>> ... won't because `gcc` may optimize it away. In fact, variable
>> "dummy" should be global or `gcc` may make it go away as well.
>
> static inline unsigned int readl(const volatile void __iomem *addr)
> {
> return *(volatile unsigned int __force *) addr;
> }
>
> The cast is volatile, so gcc knows not to optimise it away.
>

When the assignment is not made a.k.a., cast to void, or when the
assignment is made to an otherwise unused variable, `gcc` does,
indeed make it go away. These problems caused weeks of chagrin
after it was found that a PCI DMA operation took 20 or more times
than it should. The writel(START_DMA, &control), followed by
a dummy = readl(&control), ended up with the readl() missing.
That meant that the DMA didn't start until some timer code
read a status register, wondering why it hadn't completed yet.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.15.4 on an i686 machine (5589.50 BogoMips).
Warning : 98.36% of all statistics are fiction, book release in April.

Alan Cox

Mar 7, 2006, 2:30:10 PM

On Tue, 2006-03-07 at 13:54 -0500, linux-os (Dick Johnson) wrote:
> On Tue, 7 Mar 2006, Alan Cox wrote:
> > writel(STOP_DMA, &foodev->ctrl);
> > free_dma_buffers(foodev);
> >
> > This leads to horrible disasters.
>
> This might be a good place to document:
> dummy = readl(&foodev->ctrl);

Absolutely. And this falls outside of the memory barrier functions.

>
> Will flush all pending writes to the PCI bus and that:
> (void) readl(&foodev->ctrl);
> ... won't because `gcc` may optimize it away. In fact, variable
> "dummy" should be global or `gcc` may make it go away as well.

If they were ordinary functions then maybe, but they are not so a simple
readl(&foodev->ctrl) will be sufficient and isn't optimised away.
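
Why a read-back flushes a posted write can be shown with a toy model. Everything here is hypothetical (`struct toy_dev`, `toy_writel()`, `toy_readl()`); it simulates a bridge holding one posted write in plain memory and is not how any real bus or kernel accessor is implemented.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a device behind a bridge that posts writes. */
struct toy_dev {
	uint32_t ctrl;          /* register value as the device sees it */
	uint32_t posted_value;  /* write still sitting in the bridge */
	bool     posted;        /* true while a write is in flight */
};

/* A posted write returns to the CPU before it reaches the device. */
static void toy_writel(uint32_t value, struct toy_dev *dev)
{
	dev->posted_value = value;
	dev->posted = true;
}

/* A read may not pass earlier posted writes to the same device, so the
 * bridge drains them before the read completes. */
static uint32_t toy_readl(struct toy_dev *dev)
{
	if (dev->posted) {
		dev->ctrl = dev->posted_value;
		dev->posted = false;
	}
	return dev->ctrl;
}
```

In this model, freeing DMA buffers straight after `toy_writel()` would happen while `ctrl` still holds the old value; a `toy_readl()` in between is what makes the stop command reach the device first.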

Alan

David Howells

Mar 7, 2006, 2:30:19 PM

Andi Kleen <a...@suse.de> wrote:

> > > You're not supposed to do it this way anyways. The official way to access
> > > MMIO space is using read/write[bwlq]
> >
> > True, I suppose. I should make it clear that these accessor functions imply
> > memory barriers, if indeed they do,
>
> I don't think they do.

Hmmm.. Seems Stephen Hemminger disagrees:

| > > 1) Access to i/o mapped memory does not need memory barriers.
| >

| > There's no guarantee of that. On FRV you have to insert barriers as
| > appropriate when you're accessing I/O mapped memory if ordering is required
| > (accessing an ethernet card vs accessing a frame buffer), but support for
| > inserting the appropriate barriers is built into gcc - which knows the rules
| > for when to insert them.
| >
| > Or are you referring to the fact that this should be implicit in inX(),
| > outX(), readX(), writeX() and similar?
|
| yes

David

Bryan O'Sullivan

Mar 7, 2006, 2:30:19 PM

On Tue, 2006-03-07 at 18:30 +0000, David Howells wrote:

> True, I suppose. I should make it clear that these accessor functions imply
> memory barriers, if indeed they do,

They don't, but according to Documentation/DocBook/deviceiobook.tmpl
they are performed by the compiler in the order specified.

They also convert between PCI byte order and CPU byte order. If you
want to avoid that, you need the __raw_* versions, which are not
guaranteed to be provided by all arches.
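
The byte order conversion amounts to the following, sketched here as a standalone helper. PCI is little-endian; `le32_bytes_to_cpu()` is a hypothetical name used for illustration and is not one of the kernel's byte-order helpers.

```c
#include <stdint.h>

/* Hypothetical helper: assemble a 32-bit value from four bytes stored in
 * little-endian (PCI) order, independent of the host CPU's byte order. */
static uint32_t le32_bytes_to_cpu(const uint8_t bytes[4])
{
	return (uint32_t)bytes[0] |
	       ((uint32_t)bytes[1] << 8) |
	       ((uint32_t)bytes[2] << 16) |
	       ((uint32_t)bytes[3] << 24);
}
```

On a little-endian CPU this is equivalent to a plain 32-bit load; on a big-endian CPU it corresponds to the swap that the raw variants skip.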

<b

Andi Kleen

Mar 7, 2006, 2:30:20 PM

On Tuesday 07 March 2006 20:23, Bryan O'Sullivan wrote:
> On Tue, 2006-03-07 at 18:30 +0000, David Howells wrote:
>
> > True, I suppose. I should make it clear that these accessor functions imply
> > memory barriers, if indeed they do,
>
> They don't, but according to Documentation/DocBook/deviceiobook.tmpl
> they are performed by the compiler in the order specified.

I don't think that's correct. Probably the documentation should
be fixed.

-Andi

Stephen Hemminger

Mar 7, 2006, 2:50:09 PM

On Tue, 07 Mar 2006 19:24:03 +0000
David Howells <dhow...@redhat.com> wrote:

> Andi Kleen <a...@suse.de> wrote:
>
> > > > You're not supposed to do it this way anyways. The official way to access
> > > > MMIO space is using read/write[bwlq]
> > >
> > > True, I suppose. I should make it clear that these accessor functions imply
> > > memory barriers, if indeed they do,
> >
> > I don't think they do.
>
> Hmmm.. Seems Stephen Hemminger disagrees:
>
> | > > 1) Access to i/o mapped memory does not need memory barriers.
> | >
> | > There's no guarantee of that. On FRV you have to insert barriers as
> | > appropriate when you're accessing I/O mapped memory if ordering is required
> | > (accessing an ethernet card vs accessing a frame buffer), but support for
> | > inserting the appropriate barriers is built into gcc - which knows the rules
> | > for when to insert them.
> | >
> | > Or are you referring to the fact that this should be implicit in inX(),
> | > outX(), readX(), writeX() and similar?
> |

The problem with all this is that, like physics, it is all relative to the observer.
I get confused and lost when talking about the general case because there are so
many possible specific examples where a barrier is or is not needed.

David Howells

Mar 7, 2006, 3:10:09 PM

Alan Cox <al...@lxorguk.ukuu.org.uk> wrote:

> Better meaningful example would be barriers versus an IRQ handler. Which
> leads nicely onto section 2

Yes, except that I can't think of one that's feasible that doesn't have to do
with I/O - which isn't a problem if you are using the proper accessor
functions.

Such an example has to involve more than one CPU, because you don't tend to
get memory/memory ordering problems on UP.

The obvious one might be circular buffers, except there's no problem there
provided you have a memory barrier between accessing the buffer and updating
your pointer into it.
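
The circular buffer case might look like this in userspace C11. It is a minimal sketch assuming one producer and one consumer; `struct ring`, `RING_SIZE`, `ring_put()` and `ring_get()` are hypothetical names, and the release/acquire operations stand in for the memory barrier between accessing the buffer and updating the index.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 8u	/* hypothetical capacity; must be a power of two */

struct ring {
	int         slots[RING_SIZE];
	atomic_uint head;	/* next slot to write, owned by the producer */
	atomic_uint tail;	/* next slot to read, owned by the consumer */
};

/* Producer: fill the slot, then publish it. The release store keeps the
 * slot write ahead of the index update. */
static bool ring_put(struct ring *r, int value)
{
	unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
	unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head - tail == RING_SIZE)
		return false;	/* full */
	r->slots[head % RING_SIZE] = value;
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

/* Consumer: the acquire load of head keeps the slot read behind it; the
 * release store of tail hands the slot back to the producer. */
static bool ring_get(struct ring *r, int *out)
{
	unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);

	if (head == tail)
		return false;	/* empty */
	*out = r->slots[tail % RING_SIZE];
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return true;
}
```

If the index update could be observed before the slot contents, the other side would read a stale or half-written slot, which is the failure the barrier prevents.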

David

Jesse Barnes

Mar 7, 2006, 3:10:16 PM

On Tuesday, March 7, 2006 3:57 am, Andi Kleen wrote:
> On Tuesday 07 March 2006 20:23, Bryan O'Sullivan wrote:
> > On Tue, 2006-03-07 at 18:30 +0000, David Howells wrote:
> > > True, I suppose. I should make it clear that these accessor
> > > functions imply memory barriers, if indeed they do,
> >
> > They don't, but according to Documentation/DocBook/deviceiobook.tmpl
> > they are performed by the compiler in the order specified.
>
> I don't think that's correct. Probably the documentation should
> be fixed.

On ia64 I'm pretty sure it's true, and it seems like it should be in the
general case too. The compiler shouldn't reorder uncached memory
accesses with volatile semantics...

Jesse

Bryan O'Sullivan

Mar 7, 2006, 4:20:14 PM

On Tue, 2006-03-07 at 12:57 +0100, Andi Kleen wrote:

> > > True, I suppose. I should make it clear that these accessor functions imply
> > > memory barriers, if indeed they do,
> >
> > They don't, but according to Documentation/DocBook/deviceiobook.tmpl
> > they are performed by the compiler in the order specified.
>
> I don't think that's correct. Probably the documentation should
> be fixed.

That's why I hedged my words with "according to ..." :-)

But on most arches those accesses do indeed seem to happen in-order. On
i386 and x86_64, it's a natural consequence of program store ordering.
On at least some other arches, there are explicit memory barriers in the
implementation of the access macros to force this ordering to occur.

<b

Andi Kleen

Mar 7, 2006, 4:30:09 PM

On Tuesday 07 March 2006 22:14, Bryan O'Sullivan wrote:
> On Tue, 2006-03-07 at 12:57 +0100, Andi Kleen wrote:
> > > > True, I suppose. I should make it clear that these accessor functions
> > > > imply memory barriers, if indeed they do,
> > >
> > > They don't, but according to Documentation/DocBook/deviceiobook.tmpl
> > > they are performed by the compiler in the order specified.
> >
> > I don't think that's correct. Probably the documentation should
> > be fixed.
>
> That's why I hedged my words with "according to ..." :-)
>
> But on most arches those accesses do indeed seem to happen in-order. On
> i386 and x86_64, it's a natural consequence of program store ordering.

Not true for reads on x86.

-Andi

Chuck Ebbert

Mar 7, 2006, 6:30:25 PM

In-Reply-To: <31492.11...@warthog.cambridge.redhat.com>

On Tue, 07 Mar 2006 17:40:45 +0000, David Howells wrote:

> The attached patch documents the Linux kernel's memory barriers.

References:

AMD64 Architecture Programmer's Manual Volume 2: System Programming
Chapter 7.1: Memory-Access Ordering
Chapter 7.4: Buffering and Combining Memory Writes

IA-32 Intel Architecture Software Developer’s Manual, Volume 3:
System Programming Guide
Chapter 7.1: Locked Atomic Operations
Chapter 7.2: Memory Ordering
Chapter 7.4: Serializing Instructions

--
Chuck
"Penguins don't come from next door, they come from the Antarctic!"

David S. Miller

Mar 7, 2006, 7:20:14 PM

From: Chuck Ebbert <76306...@compuserve.com>
Date: Tue, 7 Mar 2006 18:17:19 -0500

> In-Reply-To: <31492.11...@warthog.cambridge.redhat.com>
>
> On Tue, 07 Mar 2006 17:40:45 +0000, David Howells wrote:
>
> > The attached patch documents the Linux kernel's memory barriers.
>
> References:

Here are some good ones for Sparc64:

The SPARC Architecture Manual, Version 9
Chapter 8: Memory Models
Appendix D: Formal Specification of the Memory Models
Appendix J: Programming with the Memory Models

UltraSPARC Programmer Reference Manual
Chapter 5: Memory Accesses and Cacheability
Chapter 15: Sparc-V9 Memory Models

UltraSPARC III Cu User's Manual
Chapter 9: Memory Models

UltraSPARC IIIi Processor User's Manual
Chapter 8: Memory Models

UltraSPARC Architecture 2005
Chapter 9: Memory
Appendix D: Formal Specifications of the Memory Models

UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005
Chapter 8: Memory Models
Appendix F: Caches and Cache Coherency

Robert Hancock

Mar 7, 2006, 7:30:13 PM

Jesse Barnes wrote:

> On Tuesday, March 7, 2006 10:30 am, David Howells wrote:
>> True, I suppose. I should make it clear that these accessor functions
>> imply memory barriers, if indeed they do, and that you should use them
>> rather than accessing I/O registers directly (at least, outside the
>> arch you should).
>
> But they don't, that's why we have mmiowb().

I don't think that is why that function exists.. It's a no-op on most
architectures, even where you would need to be able to do write barriers
on IO accesses (i.e. x86_64 using CONFIG_UNORDERED_IO). I believe that
function is intended for a more limited special case.

I think any complete memory barrier description should document that
function as well as EXPLICITLY specifying whether or not the
readX/writeX, etc. functions imply barriers or not.

> Btw, thanks for putting together this documentation, it's desperately
> needed.

Seconded.. The fact that there's debate over what the rules even are
shows why this is needed so badly.

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from hanc...@nospamshaw.ca
Home Page: http://www.roberthancock.com/

Roberto Nibali

Mar 7, 2006, 7:30:17 PM

>>The attached patch documents the Linux kernel's memory barriers.
>
> References:
>
> AMD64 Architecture Programmer's Manual Volume 2: System Programming
> Chapter 7.1: Memory-Access Ordering
> Chapter 7.4: Buffering and Combining Memory Writes
>
> IA-32 Intel Architecture Software Developer’s Manual, Volume 3:
> System Programming Guide
> Chapter 7.1: Locked Atomic Operations
> Chapter 7.2: Memory Ordering
> Chapter 7.4: Serializing Instructions

Do you guys reckon it might be worthwhile adding Sparc's sequential
consistency, TSO, RMO and PSO models, although I think only RMO is used
in the Linux kernel? References can be found for example in:

Solaris Internals, Core Kernel Architecture, p63-68:
Chapter 3.3: Hardware Considerations for Locks and
Synchronization

Unix Systems for Modern Architectures, Symmetric Multiprocessing
and Caching for Kernel Programmers:
Chapter 13 : Other Memory Models

Or is DaveM the only one fiddling with Sparc memory barriers implementation?

Regards,
Roberto Nibali, ratz
--
echo
'[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc

Alan Cox

Mar 7, 2006, 7:40:05 PM

On Maw, 2006-03-07 at 22:24 +0100, Andi Kleen wrote:
> > But on most arches those accesses do indeed seem to happen in-order. On
> > i386 and x86_64, it's a natural consequence of program store ordering.
>
> Not true for reads on x86.

You must have a strange kernel Andi. Mine marks them as volatile
unsigned char * references.

Alan

Alan Cox

Mar 7, 2006, 7:40:07 PM

On Maw, 2006-03-07 at 12:57 +0100, Andi Kleen wrote:
> > They don't, but according to Documentation/DocBook/deviceiobook.tmpl
> > they are performed by the compiler in the order specified.
>
> I don't think that's correct. Probably the documentation should
> be fixed.

It would be wiser to ensure they are performed in the order specified.
As far as I can see this is currently true due to the volatile cast and
most drivers rely on this property so the brown and sticky will impact
the rotating air impeller pretty fast if it isn't.

Alan Cox

Mar 7, 2006, 7:40:11 PM

On Maw, 2006-03-07 at 20:09 +0000, David Howells wrote:
> Alan Cox <al...@lxorguk.ukuu.org.uk> wrote:
>
> > Better meaningful example would be barriers versus an IRQ handler. Which
> > leads nicely onto section 2
>
> Yes, except that I can't think of one that's feasible that doesn't have to do
> with I/O - which isn't a problem if you are using the proper accessor
> functions.

We get them off bus masters for one and you can construct silly versions
of the other.

There are several kernel instances of

while(*ptr != HAVE_RESPONDED && time_before(jiffies, timeout))
rmb();

where we wait for bus-mastering hardware to respond when it is fast and
doesn't raise an IRQ.
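
For anyone who wants to play with that pattern outside the kernel, here is a
rough userspace analogue. It is only a sketch: HAVE_RESPONDED's value,
wait_for_response() and the spin budget are made-up names (assumptions, not
taken from any driver), and a C11 acquire fence stands in for rmb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HAVE_RESPONDED 0x1u	/* hypothetical status value written by the device */

/*
 * Poll a status word that another agent (a bus master, in the kernel case)
 * is expected to update. Returns true if the response was seen within
 * max_spins polls, false otherwise.
 *
 * The relaxed atomic load forces a fresh read on every iteration. The
 * acquire fence plays the role of rmb(): it orders the status load before
 * any later loads of data the other agent wrote ahead of the status update.
 */
static bool wait_for_response(_Atomic uint32_t *status, unsigned long max_spins)
{
	assert(status != NULL);

	while (max_spins--) {
		if (atomic_load_explicit(status, memory_order_relaxed) ==
		    HAVE_RESPONDED) {
			atomic_thread_fence(memory_order_acquire);
			return true;
		}
	}
	return false;
}
```

In the kernel the loop would bound itself with time_before(jiffies, timeout)
as in the fragment above; the spin count here is only a stand-in for that.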

Robert Hancock

Mar 7, 2006, 8:20:14 PM

Alan Cox wrote:
> On Maw, 2006-03-07 at 22:24 +0100, Andi Kleen wrote:
>>> But on most arches those accesses do indeed seem to happen in-order. On
>>> i386 and x86_64, it's a natural consequence of program store ordering.
>> Not true for reads on x86.
>
> You must have a strange kernel Andi. Mine marks them as volatile
> unsigned char * references.

Well, that and the fact that IO memory should be mapped as uncacheable
in the MTRRs should ensure that readl and writel won't be reordered on
i386 and x86_64.. except in the case where CONFIG_UNORDERED_IO is
enabled on x86_64 which can reorder writes since it uses nontemporal
stores..

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from hanc...@nospamshaw.ca
Home Page: http://www.roberthancock.com/

Nick Piggin

Mar 7, 2006, 9:10:10 PM

David Howells wrote:

>The attached patch documents the Linux kernel's memory barriers.
>

>Signed-Off-By: David Howells <dhow...@redhat.com>
>---
>
>

Good :)

>+==============================
>+IMPLIED KERNEL MEMORY BARRIERS
>+==============================
>+
>+Some of the other functions in the linux kernel imply memory barriers. For
>+instance all the following (pseudo-)locking functions imply barriers.
>+

>+ (*) interrupt disablement and/or interrupts
>

Is this really the case? I mean interrupt disablement only synchronises with
the local CPU, so it probably should not _have_ to imply barriers (eg. some
architectures are playing around with "virtual" interrupt disablement).

[...]

>+
>+Either interrupt disablement (LOCK) and enablement (UNLOCK) will barrier
>+memory and I/O accesses individually, or interrupt handling will barrier
>+memory and I/O accesses on entry and on exit. This prevents an interrupt
>+routine interfering with accesses made in a disabled-interrupt section of code
>+and vice versa.
>+
>

But CPUs should always be consistent WRT themselves, so I'm not sure that
it is needed?

Thanks,
Nick

Paul Mackerras

Mar 7, 2006, 10:20:08 PM

David Howells writes:

> The attached patch documents the Linux kernel's memory barriers.

Thanks for venturing into this particular lion's den. :)

> +Memory barriers are instructions to both the compiler and the CPU to impose a
> +partial ordering between the memory access operations specified either side of
> +the barrier.

... as observed from another agent in the system - another CPU or a
bus-mastering I/O device. A given CPU will always see its own memory
accesses in order.

> + (*) reads are synchronous and may need to be done immediately to permit

Leave out the "are synchronous and". It's not true.

I also think you need to avoid talking about "the bus". Some systems
don't have a bus, but rather have an interconnection fabric between
the CPUs and the memories. Talking about a bus implies that all
memory accesses in fact get serialized (by having to be sent one after
the other over the bus) and that you can therefore talk about the
order in which they get to memory. In some systems, no such order
exists.

It's possible to talk sensibly about the order in which memory
accesses get done without talking about a bus or requiring a total
ordering on the memory access. The PowerPC architecture spec does
this by specifying that in certain circumstances one load or store has
to be "performed with respect to other processors and mechanisms"
before another. A load is said to be performed with respect to
another agent when a store by that agent can no longer change the
value returned by the load. Similarly, a store is performed w.r.t.
an agent when any load done by the agent will return the value stored
(or a later value).

> + The way to deal with this is to insert an I/O memory barrier between the
> + two accesses:
> +
> + *ADR = ctl_reg_3;
> + mb();
> + reg = *DATA;

Ummm, this implies mb() is "an I/O memory barrier". I can see people
getting confused if they read this and then see mb() being used when
no I/O is being done.

> +The Linux kernel has six basic memory barriers:
> +
> +               MANDATORY (I/O) SMP
> +               =============== ================
> +       GENERAL mb()            smp_mb()
> +       READ    rmb()           smp_rmb()
> +       WRITE   wmb()           smp_wmb()
> +
> +General memory barriers make a guarantee that all memory accesses specified
> +before the barrier will happen before all memory accesses specified after the
> +barrier.

By "memory accesses" do you mean accesses to system memory, or do you
mean loads and stores - which may be to system memory, memory on an I/O
device (e.g. a framebuffer) or to memory-mapped I/O registers?

Linus explained recently that wmb() on x86 does not order stores to
system memory w.r.t. stores to prefetchable I/O memory (at
least that's what I think he said ;).

> +Some of the other functions in the linux kernel imply memory barriers. For
> +instance all the following (pseudo-)locking functions imply barriers.
> +
> + (*) interrupt disablement and/or interrupts

Enabling/disabling interrupts doesn't imply a barrier on powerpc, and
nor does taking an interrupt or returning from one.

> + (*) spin locks

I think it's still an open question as to whether spin locks do any
ordering between accesses to system memory and accesses to I/O
registers.

> + (*) R/W spin locks
> + (*) mutexes
> + (*) semaphores
> + (*) R/W semaphores

> +
> +In all cases there are variants on a LOCK operation and an UNLOCK operation.
> +
> + (*) LOCK operation implication:
> +
> + Memory accesses issued after the LOCK will be completed after the LOCK
> + accesses have completed.
> +
> + Memory accesses issued before the LOCK may be completed after the LOCK
> + accesses have completed.
> +
> + (*) UNLOCK operation implication:
> +
> + Memory accesses issued before the UNLOCK will be completed before the
> + UNLOCK accesses have completed.
> +
> + Memory accesses issued after the UNLOCK may be completed before the UNLOCK
> + accesses have completed.

And therefore an UNLOCK followed by a LOCK is equivalent to a full
barrier, but a LOCK followed by an UNLOCK isn't.

> +Either interrupt disablement (LOCK) and enablement (UNLOCK) will barrier
> +memory and I/O accesses individually, or interrupt handling will barrier
> +memory and I/O accesses on entry and on exit. This prevents an interrupt
> +routine interfering with accesses made in a disabled-interrupt section of code
> +and vice versa.

I don't think this is right, and I don't think it is necessary to
achieve the end you state, since a CPU will always see its own memory
accesses in program order.

> +The following sequence of events on the bus is acceptable:
> +
> + LOCK, *F+*A, *E, *C+*D, *B, UNLOCK

What does *F+*A mean?

> +Consider also the following (going back to the AMD PCnet example):
> +
> + DISABLE IRQ
> + *ADR = ctl_reg_3;
> + mb();
> + x = *DATA;
> + *ADR = ctl_reg_4;
> + mb();
> + *DATA = y;
> + *ADR = ctl_reg_5;
> + mb();
> + z = *DATA;
> + ENABLE IRQ
> + <interrupt>
> + *ADR = ctl_reg_7;
> + mb();
> + q = *DATA
> + </interrupt>
> +
> +What's to stop "z = *DATA" crossing "*ADR = ctl_reg_7" and reading from the
> +wrong register? (There's no guarantee that the process of handling an
> +interrupt will barrier memory accesses in any way).

Well, the driver should *not* be doing *ADR at all, it should be using
read[bwl]/write[bwl]. The architecture code has to implement
read*/write* in such a way that the accesses generated can't be
reordered. I _think_ it also has to make sure the write accesses
can't be write-combined, but it would be good to have that clarified.

> +======================
> +POWERPC SPECIFIC NOTES
> +======================
> +
> +The powerpc is weakly ordered, and its read and write accesses may be
> +completed generally in any order. Its memory barriers are also to some extent
> +more substantial than the minimum requirement, and may directly affect
> +hardware outside of the CPU.

Unfortunately mb()/smp_mb() are quite expensive on PowerPC, since the
only instruction we have that implies a strong enough barrier is sync,
which also performs several other kinds of synchronization, such as
waiting until all previous instructions have completed executing to
the point where they can no longer cause an exception.

Paul.

Linus Torvalds

Mar 7, 2006, 10:40:09 PM

On Wed, 8 Mar 2006, Paul Mackerras wrote:
>
> Linus explained recently that wmb() on x86 does not order stores to
> system memory w.r.t. stores to prefetchable I/O memory (at
> least that's what I think he said ;).

In fact, it won't order stores to normal memory even wrt any
_non-prefetchable_ IO memory.

PCI (and any other sane IO fabric, for that matter) will do IO posting, so
the fact that the CPU _core_ may order them due to a wmb() doesn't
actually mean anything.

The only way to _really_ synchronize with a store to an IO device is
literally to read from that device (*). No amount of memory barriers will
do it.

So you can really only order stores to regular memory wrt each other, and
stores to IO memory wrt each other. For the former, "smp_wmb()" does it.

For IO memory, normal IO memory is _always_ supposed to be in program
order (at least for PCI. It's part of how the bus is supposed to work),
unless the IO range allows prefetching (and you've set some MTRR). And if
you do that, currently you're kind of screwed. mmiowb() should do it, but
nobody really uses it, and I think it's broken on x86 (it's a no-op, it
really should be an "sfence").

A full "mb()" is probably most likely to work in practice. And yes, we
should clean this up.

Linus

(*) The "read" can of course be any event that tells you that the store
has happened - it doesn't necessarily have to be an actual "read[bwl]()"
operation. Eg the store might start a command, and when you get the
completion interrupt, you obviously know that the store is done, just from
a causal reason.
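
To make the posting point concrete, here is a toy model in plain C. It is not
kernel code and not how any real bridge is programmed: struct toy_dev,
toy_writel(), toy_readl() and toy_write_flush() are invented for illustration,
and the "bridge" is just a one-entry buffer. The only thing it is meant to
show is that a store is not known to have reached the device until something
is read back from it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of a device sitting behind a posting bridge (names hypothetical). */
struct toy_dev {
	uint32_t reg;		/* value the device has actually seen */
	uint32_t posted;	/* write still sitting in the bridge */
	bool has_posted;	/* true while a write is in flight */
};

/* A store to the device: it only gets as far as the posting buffer. */
static void toy_writel(struct toy_dev *dev, uint32_t val)
{
	assert(dev != NULL);
	dev->posted = val;
	dev->has_posted = true;
}

/* A load from the device: the bridge has to drain posted writes first. */
static uint32_t toy_readl(struct toy_dev *dev)
{
	assert(dev != NULL);
	if (dev->has_posted) {
		dev->reg = dev->posted;
		dev->has_posted = false;
	}
	return dev->reg;
}

/* Write, then read back, so the write is known to have landed. */
static void toy_write_flush(struct toy_dev *dev, uint32_t val)
{
	toy_writel(dev, val);
	(void)toy_readl(dev);
}
```

No barrier call appears anywhere in the model, which is the point: in this
picture only the read (or some other event caused by the store) tells you the
store has arrived.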

Nick Piggin

Mar 8, 2006, 2:50:09 AM

Paul Mackerras wrote:
> David Howells writes:

>>+ The way to deal with this is to insert an I/O memory barrier between the
>>+ two accesses:
>>+
>>+ *ADR = ctl_reg_3;
>>+ mb();
>>+ reg = *DATA;
>
>
> Ummm, this implies mb() is "an I/O memory barrier". I can see people
> getting confused if they read this and then see mb() being used when
> no I/O is being done.
>

Isn't it? Why wouldn't you just use smp_mb() if no IO is being done?

--
SUSE Labs, Novell Inc.

Duncan Sands

Mar 8, 2006, 3:30:21 AM

On Tuesday 7 March 2006 21:09, David Howells wrote:
> Alan Cox <al...@lxorguk.ukuu.org.uk> wrote:
>
> > Better meaningful example would be barriers versus an IRQ handler. Which
> > leads nicely onto section 2
>
> Yes, except that I can't think of one that's feasible that doesn't have to do
> with I/O - which isn't a problem if you are using the proper accessor
> functions.
>
> Such an example has to involve more than one CPU, because you don't tend to
> get memory/memory ordering problems on UP.

On UP you at least need compiler barriers, right? You're in trouble if you think
you are writing in a certain order, and expect to see the same order from an
interrupt handler, but the compiler decided to rearrange the order of the writes...
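
Something along these lines is what I have in mind. It is a minimal sketch:
the barrier() definition is the usual GCC empty-asm form with a "memory"
clobber, while struct mailbox and publish() are hypothetical names standing in
for whatever the driver shares with its interrupt handler.

```c
#include <assert.h>
#include <stddef.h>

/*
 * GCC-style compiler barrier: emits no instruction, but the "memory"
 * clobber forbids the compiler from moving memory accesses across it.
 * It does nothing to constrain the CPU.
 */
#define barrier() __asm__ __volatile__("" ::: "memory")

struct mailbox {	/* hypothetical data shared with an interrupt handler */
	int payload;
	int ready;
};

static void publish(struct mailbox *box, int value)
{
	assert(box != NULL);
	box->payload = value;
	barrier();	/* payload store is emitted before the flag store */
	box->ready = 1;
}
```

On UP that is enough for a handler on the same CPU to see payload set whenever
it sees ready set; on SMP the barrier() would have to become a real write
barrier.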

> The obvious one might be circular buffers, except there's no problem there
> provided you have a memory barrier between accessing the buffer and updating
> your pointer into it.
>
> David

Ciao,

Duncan.

Alan Cox

Mar 8, 2006, 6:40:47 AM

On Maw, 2006-03-07 at 19:10 -0600, Robert Hancock wrote:

> Alan Cox wrote:
> > You must have a strange kernel Andi. Mine marks them as volatile
> > unsigned char * references.
>
> Well, that and the fact that IO memory should be mapped as uncacheable
> in the MTRRs should ensure that readl and writel won't be reordered on
> i386 and x86_64.. except in the case where CONFIG_UNORDERED_IO is
> enabled on x86_64 which can reorder writes since it uses nontemporal
> stores..

You need both

readl/writel need the volatile to stop gcc removing/reordering the
accesses at compiler level, and the mtrr/pci bridge stuff then deals
with bus level ordering for that CPU.
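
The compiler half of that can be sketched as below. sketch_readl() and
sketch_writel() are illustrative names, not the kernel's accessors, and they
cover only the volatile part; the MTRR/bridge half is hardware and page-table
setup that no C code here can show.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Volatile-cast accessors: each call is a volatile access, so the compiler
 * may not drop it, merge it with a neighbour, or reorder it against other
 * volatile accesses. Nothing here constrains the CPU or the bus.
 */
static inline uint32_t sketch_readl(const volatile void *addr)
{
	assert(addr != NULL);
	return *(const volatile uint32_t *)addr;
}

static inline void sketch_writel(uint32_t val, volatile void *addr)
{
	assert(addr != NULL);
	*(volatile uint32_t *)addr = val;
}
```

The argument order (value first, address second) follows the writel()
convention.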

David Howells

Mar 8, 2006, 7:40:13 AM

Linus Torvalds <torv...@osdl.org> wrote:

> > Linus explained recently that wmb() on x86 does not order stores to
> > system memory w.r.t. stores to prefetchable I/O memory (at
> > least that's what I think he said ;).

On i386 and x86_64, do IN and OUT instructions imply MFENCE? It's not obvious
from the x86_64 docs.

David

David Howells

Mar 8, 2006, 8:30:22 AM

Paul Mackerras <pau...@samba.org> wrote:

> By "memory accesses" do you mean accesses to system memory, or do you
> mean loads and stores - which may be to system memory, memory on an I/O
> device (e.g. a framebuffer) or to memory-mapped I/O registers?

Well, I meant all loads and stores, irrespective of their destination.

However, on i386, for example, you've actually got at least two different I/O
access domains, and I don't know how they impinge upon each other (IN/OUT vs
MOV).

> Enabling/disabling interrupts doesn't imply a barrier on powerpc, and
> nor does taking an interrupt or returning from one.

Surely it ought to, otherwise what's to stop accesses done with interrupts
disabled crossing with accesses done inside an interrupt handler?

> > +Either interrupt disablement (LOCK) and enablement (UNLOCK) will barrier

> ...

> I don't think this is right, and I don't think it is necessary to
> achieve the end you state, since a CPU will always see its own memory
> accesses in program order.

But what about a driver accessing some memory that its device is going to
observe under irq disablement, and then getting an interrupt immediately after
from that same device, the handler for which communicates with the device,
possibly then being broken because the CPU hasn't completed all the memory
accesses that the driver made while interrupts are disabled?

Alternatively, might it be possible for communications between two CPUs to be
stuffed because one took an interrupt that also modified common data before
it had committed the memory accesses done under interrupt disablement?
This would suggest using a lock though.

I'm not sure that I can come up with a feasible example for this, but Alan Cox
seems to think that it's a valid problem too.

The only likely way I can see this being a problem is with unordered I/O
writes, which would suggest you have to place an mmiowb() before unlocking the
spinlock in such a case, assuming it is possible to get unordered I/O writes
(which I think it is).

> What does *F+*A mean?

Combined accesses.

> Well, the driver should *not* be doing *ADR at all, it should be using
> read[bwl]/write[bwl]. The architecture code has to implement
> read*/write* in such a way that the accesses generated can't be
> reordered. I _think_ it also has to make sure the write accesses
> can't be write-combined, but it would be good to have that clarified.

Then what use is mmiowb()?

Surely write combining and out-of-order reads are reasonable for cacheable
devices like framebuffers.

David

David Howells

Mar 8, 2006, 9:40:15 AM

The attached patch documents the Linux kernel's memory barriers.

I've updated it from the comments I've been given.

Note that the per-arch notes sections are gone because it's clear that there
are so many exceptions, that it's not worth having them.

I've added a list of references to other documents.

I've tried to get rid of the concept of memory accesses appearing on the bus;
what matters is apparent behaviour with respect to other observers in the
system.

I'm not sure that any mention of interrupts vs interrupt disablement should be
retained... it's unclear that there is actually anything that guarantees that
stuff won't leak out of an interrupt-disabled section and into an interrupt
handler. Paul Mackerras says this isn't valid on powerpc, and looking at the
code seems to confirm that, barring implicit enforcement by the CPU.

Signed-Off-By: David Howells <dhow...@redhat.com>
---

warthog>diffstat -p1 /tmp/mb.diff
Documentation/memory-barriers.txt | 589 ++++++++++++++++++++++++++++++++++++++
1 files changed, 589 insertions(+)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
new file mode 100644
index 0000000..1340c8d
--- /dev/null
+++ b/Documentation/memory-barriers.txt
@@ -0,0 +1,589 @@
+ ============================
+ LINUX KERNEL MEMORY BARRIERS
+ ============================
+
+Contents:
+
+ (*) What are memory barriers?
+
+ (*) Where are memory barriers needed?
+
+ - Accessing devices.
+ - Multiprocessor interaction.
+ - Interrupts.
+
+ (*) Linux kernel compiler barrier functions.
+
+ (*) Linux kernel memory barrier functions.
+
+ (*) Implicit kernel memory barriers.
+
+ - Locking functions.
+ - Interrupt disablement functions.
+ - Miscellaneous functions.
+
+ (*) Linux kernel I/O barriering.
+
+ (*) References.
+
+
+=========================
+WHAT ARE MEMORY BARRIERS?
+=========================
+
+Memory barriers are instructions to both the compiler and the CPU to impose an
+apparent partial ordering between the memory access operations specified either
+side of the barrier. They request that the sequence of memory events generated
+appears to other components of the system as if the barrier is effective on
+that CPU.
+
+Note that:
+
+ (*) there's no guarantee that the sequence of memory events is _actually_ so
+ ordered. It's possible for the CPU to do out-of-order accesses _as long
+ as no-one is looking_, and then fix up the memory if someone else tries to
+ see what's going on (for instance a bus master device); what matters is
+ the _apparent_ order as far as other processors and devices are concerned;
+ and
+
+ (*) memory barriers are only guaranteed to act within the CPU processing them,
+ and are not, for the most part, guaranteed to percolate down to other CPUs
+ in the system or to any I/O hardware that that CPU may communicate with.
+
+
+For example, a programmer might take it for granted that the CPU will perform
+memory accesses in exactly the order specified, so that if a CPU is, for
+example, given the following piece of code:
+
+ a = *A;
+ *B = b;
+ c = *C;
+ d = *D;
+ *E = e;
+
+They would then expect that the CPU will complete the memory access for each
+instruction before moving on to the next one, leading to a definite sequence of
+operations as seen by external observers in the system:
+
+ read *A, write *B, read *C, read *D, write *E.
+
+
+Reality is, of course, much messier. With many CPUs and compilers, this isn't
+always true because:
+
+ (*) reads are more likely to need to be completed immediately to permit
+ execution progress, whereas writes can often be deferred without a
+ problem;
+
+ (*) reads can be done speculatively, and then the result discarded should it
+ prove not to be required;
+
+ (*) the order of the memory accesses may be rearranged to promote better use
+ of the CPU buses and caches;
+
+ (*) reads and writes may be combined to improve performance when talking to
+ the memory or I/O hardware that can do batched accesses of adjacent
+ locations, thus cutting down on transaction setup costs (memory and PCI
+ devices may be able to do this); and
+
+ (*) the CPU's data cache may affect the ordering, though cache-coherency
+ mechanisms should alleviate this - once the write has actually hit the
+ cache.
+
+So what another CPU, say, might actually observe from the above piece of code
+is:
+
+ read *A, read {*C,*D}, write *E, write *B
+
+ (By "read {*C,*D}" I mean a combined single read).
+
+
+It is also guaranteed that a CPU will be self-consistent: it will see its _own_
+accesses appear to be correctly ordered, without the need for a memory
+barrier. For instance with the following code:
+
+ X = *A;
+ *A = Y;
+ Z = *A;
+
+assuming no intervention by an external influence, it can be taken that:
+
+ (*) X will hold the old value of *A, and will never happen after the write and
+ thus end up being given the value that was assigned to *A from Y instead;
+ and
+
+ (*) Z will always be given the value in *A that was assigned there from Y, and
+ will never happen before the write, and thus end up with the same value
+ that was in *A initially.
+
+(This is ignoring the fact that the value initially in *A may appear to be the
+same as the value assigned to *A from Y).
+
+
+=================================
+WHERE ARE MEMORY BARRIERS NEEDED?
+=================================
+
+Under normal operation, access reordering is probably not going to be a problem
+as a linear program will still appear to operate correctly. There are,
+however, three circumstances where reordering definitely _could_ be a problem:
+
+
+ACCESSING DEVICES
+-----------------
+
+Many devices can be memory mapped, and so appear to the CPU as if they're just
+memory locations. However, to control the device, the driver has to make the
+right accesses in exactly the right order.
+
+Consider, for example, an ethernet chipset such as the AMD PCnet32. It
+presents to the CPU an "address register" and a bunch of "data registers". The
+way it's accessed is to write the index of the internal register to be accessed
+to the address register, and then read or write the appropriate data register
+to access the chip's internal register, which could - theoretically - be done
+by:
+
+ *ADR = ctl_reg_3;
+ reg = *DATA;
+
+The problem with a clever CPU or a clever compiler is that the write to the
+address register isn't guaranteed to happen before the access to the data
+register, if the CPU or the compiler thinks it is more efficient to defer the
+address write:
+
+ read *DATA, write *ADR
+
+then things will break.
+
+
+In the Linux kernel, however, I/O should be done through the appropriate
+accessor routines - such as inb() or writel() - which know how to make such
+accesses appropriately sequential.
+
+On some systems, I/O writes are not strongly ordered across all CPUs, and so
+locking should be used, and mmiowb() should be issued prior to unlocking the
+critical section.
+
+See Documentation/DocBook/deviceiobook.tmpl for more information.
+
+
+MULTIPROCESSOR INTERACTION
+--------------------------
+
+When there's a system with more than one processor, these may be working on the
+same set of data, but attempting not to use locks as locks are quite expensive.
+This means that accesses that affect both CPUs may have to be carefully ordered
+to prevent error.
+
+Consider the R/W semaphore slow path. In that, a waiting process is queued on
+the semaphore, as noted by it having a record on its stack linked to the
+semaphore's list:
+
+ struct rw_semaphore {
+ ...
+ struct list_head waiters;
+ };
+
+ struct rwsem_waiter {
+ struct list_head list;
+ struct task_struct *task;
+ };
+
+To wake up the waiter, the up_read() or up_write() functions have to read the
+pointer from this record to know as to where the next waiter record is, clear
+the task pointer, call wake_up_process() on the task, and release the reference
+held on the waiter's task struct:
+
+ READ waiter->list.next;
+ READ waiter->task;
+ WRITE waiter->task;
+ CALL wakeup
+ RELEASE task
+
+If any of these steps occur out of order, then the whole thing may fail.
+
+Note that the waiter does not get the semaphore lock again - it just waits for
+its task pointer to be cleared. Since the record is on its stack, this means
+that if the task pointer is cleared _before_ the next pointer in the list is
+read, another CPU might start processing the waiter and it might clobber its
+stack before up*() functions have a chance to read the next pointer.
+
+       CPU 0                           CPU 1
+       =============================== ===============================
+                                       down_xxx()
+                                       Queue waiter
+                                       Sleep
+       up_yyy()
+       READ waiter->task;
+       WRITE waiter->task;
+       <preempt>
+                                       Resume processing
+                                       down_xxx() returns
+                                       call foo()
+                                       foo() clobbers *waiter
+       </preempt>
+       READ waiter->list.next;
+       --- OOPS ---
+
+This could be dealt with using a spinlock, but then the down_xxx() function has
+to get the spinlock again after it's been woken up, which is a waste of
+resources.
+
+The way to deal with this is to insert an SMP memory barrier:
+
+ READ waiter->list.next;
+ READ waiter->task;
+ smp_mb();
+ WRITE waiter->task;
+ CALL wakeup
+ RELEASE task
+
+In this case, the barrier makes a guarantee that all memory accesses before the
+barrier will appear to happen before all the memory accesses after the barrier
+with respect to the other CPUs on the system. It does _not_ guarantee that all
+the memory accesses before the barrier will be complete by the time the barrier
+itself is complete.
+
+SMP memory barriers are normally mere compiler barriers on a UP system because
+the CPU orders overlapping accesses with respect to itself.
+
+
+INTERRUPTS
+----------
+
+A driver may be interrupted by its own interrupt service routine, and thus they
+may interfere with each other's attempts to control or access the device.
+
+This may be alleviated - at least in part - by disabling interrupts (a form of
+locking), such that the critical operations are all contained within the
+disabled-interrupt section in the driver. Whilst the driver's interrupt
+routine is executing, the driver's core may not run on the same CPU, and its
+interrupt is not permitted to happen again until the current interrupt has been
+handled, thus the interrupt handler does not need to lock against that.
+
+
+However, consider the following example:
+
+       CPU 1                           CPU 2
+       =============================== ===============================
+       [A is 0 and B is 0]
+       DISABLE IRQ
+       *A = 1;
+       smp_wmb();
+       *B = 2;
+       ENABLE IRQ
+       <interrupt>
+       *A = 3;
+                                       a = *A;
+                                       b = *B;
+       smp_wmb();
+       *B = 4;
+       </interrupt>
+
+CPU 2 might see *A == 3 and *B == 0, when what it probably ought to see is *B
+== 2 and *A == 1 or *A == 3, or *B == 4 and *A == 3.
+
+This might happen because the write "*B = 2" might occur after the write "*A =
+3" - in which case the former write has leaked from the interrupt-disabled
+section into the interrupt handler. In this case a lock of some
+description should very probably be used.
+
+
+This sort of problem might also occur with relaxed I/O ordering rules, if it's
+permitted for I/O writes to cross. For instance, if a driver was talking to an
+ethernet card that sports an address register and a data register:
+
+ DISABLE IRQ
+ writew(ctl_reg_3, ADR);
+ writew(y, DATA);
+ ENABLE IRQ
+ <interrupt>
+ writew(ctl_reg_4, ADR);
+ q = readw(DATA);
+ </interrupt>
+
+In such a case, an mmiowb() is needed, firstly to prevent the first write to
+the address register from occurring after the write to the data register, and
+secondly to prevent the write to the data register from happening after the
+second write to the address register.
+
+
+=======================================
+LINUX KERNEL COMPILER BARRIER FUNCTIONS
+=======================================
+
+The Linux kernel has an explicit compiler barrier function that prevents the
+compiler from moving the memory accesses either side of it to the other side:
+
+ barrier();
+
+This has no direct effect on the CPU, which may then reorder things however it
+wishes.
+
+In addition, accesses to "volatile" memory locations and volatile asm
+statements act as implicit compiler barriers.
+
+
+=====================================
+LINUX KERNEL MEMORY BARRIER FUNCTIONS
+=====================================
+
+The Linux kernel has six basic CPU memory barriers:
+
+                MANDATORY       SMP CONDITIONAL
+                =============== ===============
+        GENERAL mb()            smp_mb()
+        READ    rmb()           smp_rmb()
+        WRITE   wmb()           smp_wmb()
+
+General memory barriers give a guarantee that all memory accesses specified
+before the barrier will appear to happen before all memory accesses specified
+after the barrier with respect to the other components of the system.
+
+Read and write memory barriers give similar guarantees, but only for memory
+reads versus memory reads and memory writes versus memory writes respectively.
+
+All memory barriers imply compiler barriers.
+
+SMP memory barriers are only compiler barriers on uniprocessor compiled systems
+because it is assumed that a CPU will be apparently self-consistent, and will
+order overlapping accesses correctly with respect to itself.
+
+There is no guarantee that any of the memory accesses specified before a memory
+barrier will be complete by the completion of a memory barrier; the barrier can
+be considered to draw a line in that CPU's access queue that accesses of the
+appropriate type may not cross.
+
+There is no guarantee that issuing a memory barrier on one CPU will have any
+direct effect on another CPU or any other hardware in the system. The indirect
+effect will be the order in which the second CPU sees the first CPU's accesses
+occur.
+
+There is no guarantee that some intervening piece of off-the-CPU hardware will
+not reorder the memory accesses. CPU cache coherency mechanisms should
+propagate the indirect effects of a memory barrier between CPUs.
+
+Note that these are the _minimum_ guarantees. Different architectures may give
+more substantial guarantees, but they may not be relied upon outside of arch
+specific code.
+
+
+There are some more advanced barriering functions:
+
+ (*) set_mb(var, value)
+ (*) set_wmb(var, value)
+
+ These assign the value to the variable and then insert at least a write
+ barrier after it, depending on the function.
+
+
+===============================
+IMPLICIT KERNEL MEMORY BARRIERS
+===============================
+
+Some of the other functions in the linux kernel imply memory barriers, amongst
+them are locking and scheduling functions and interrupt management functions.
+
+This specification is a _minimum_ guarantee; any particular architecture may
+provide more substantial guarantees, but these may not be relied upon outside
+of arch specific code.
+
+
+LOCKING FUNCTIONS
+-----------------
+
+For instance all the following locking functions imply barriers:
+
+ (*) spin locks
+ (*) R/W spin locks
+ (*) mutexes
+ (*) semaphores
+ (*) R/W semaphores
+
+In all cases there are variants on a LOCK operation and an UNLOCK operation.
+
+ (*) LOCK operation implication:
+
+ Memory accesses issued after the LOCK will be completed after the LOCK
+ accesses have completed.
+
+ Memory accesses issued before the LOCK may be completed after the LOCK
+ accesses have completed.
+
+ (*) UNLOCK operation implication:
+
+ Memory accesses issued before the UNLOCK will be completed before the
+ UNLOCK accesses have completed.
+
+ Memory accesses issued after the UNLOCK may be completed before the UNLOCK
+ accesses have completed.
+
+ (*) LOCK vs UNLOCK implication:
+
+ The LOCK accesses will be completed before the UNLOCK accesses.
+
+And therefore an UNLOCK followed by a LOCK is equivalent to a full barrier, but
+a LOCK followed by an UNLOCK isn't.
+
+Locks and semaphores may not provide any guarantee of ordering on UP compiled
+systems, and so can't be counted on in such a situation to actually do anything
+at all, especially with respect to I/O barriering, unless combined with
+interrupt disablement operations.
+
+
+As an example, consider the following:
+
+ *A = a;
+ *B = b;
+ LOCK
+ *C = c;
+ *D = d;
+ UNLOCK
+ *E = e;
+ *F = f;
+
+The following sequence of events is acceptable:
+
+ LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK
+
+But none of the following are:
+
+ {*F,*A}, *B, LOCK, *C, *D, UNLOCK, *E
+ *A, *B, *C, LOCK, *D, UNLOCK, *E, *F
+ *A, *B, LOCK, *C, UNLOCK, *D, *E, *F
+ *B, LOCK, *C, *D, UNLOCK, {*F,*A}, *E
+
+
+INTERRUPT DISABLEMENT FUNCTIONS
+-------------------------------
+
+Interrupt disablement (LOCK equivalent) and enablement (UNLOCK equivalent) will
+barrier memory and I/O accesses versus memory and I/O accesses done in the
+interrupt handler. This prevents an interrupt routine interfering with
+accesses made in a disabled-interrupt section of code and vice versa.
+
+Note that whilst interrupt disablement barriers all act as compiler barriers,
+they only act as memory barriers with respect to interrupts, not with respect
+to nested sections.
+
+Consider the following:
+
+ <interrupt>
+ *X = x;
+ </interrupt>
+ *A = a;
+ SAVE IRQ AND DISABLE
+ *B = b;
+ SAVE IRQ AND DISABLE
+ *C = c;
+ RESTORE IRQ
+ *D = d;
+ RESTORE IRQ
+ *E = e;
+ <interrupt>
+ *Y = y;
+ </interrupt>
+
+It is acceptable to observe the following sequences of events:
+
+ { INT, *X }, *A, SAVE, *B, SAVE, *C, REST, *D, REST, *E, { INT, *Y }
+ { INT, *X }, *A, SAVE, *B, SAVE, *C, REST, *D, REST, { INT, *Y, *E }
+ { INT, *X }, SAVE, SAVE, *A, *B, *C, *D, *E, REST, REST, { INT, *Y }
+ { INT }, *X, SAVE, SAVE, *A, *B, *C, *D, *E, REST, REST, { INT, *Y }
+ { INT }, *A, *X, SAVE, SAVE, *B, *C, *D, *E, REST, REST, { INT, *Y }
+
+But not the following:
+
+ { INT }, SAVE, *A, *B, *X, SAVE, *C, REST, *D, REST, *E, { INT, *Y }
+ { INT, *X }, *A, SAVE, *B, SAVE, *C, REST, REST, { INT, *Y, *D, *E }
+
+
+MISCELLANEOUS FUNCTIONS
+-----------------------
+
+Other functions that imply barriers:
+
+ (*) schedule() and similar imply full memory barriers.
+
+
+===========================
+LINUX KERNEL I/O BARRIERING
+===========================
+
+When accessing I/O memory, drivers should use the appropriate accessor
+functions:
+
+ (*) inX(), outX():
+
+ These are intended to talk to legacy i386 hardware using an alternate bus
+ addressing mode. They are synchronous as far as the x86 CPUs are
+ concerned, but other CPUs and intermediary bridges may not honour that.
+
+ They are guaranteed to be fully ordered with respect to each other.
+
+ (*) readX(), writeX():
+
+ These are guaranteed to be fully ordered and uncombined with respect to
+ each other on the issuing CPU, provided they're not accessing a
+ prefetchable device. However, intermediary hardware (such as a PCI
+ bridge) may indulge in deferral if it so wishes; to flush a write, a read
+ from the same location must be performed.
+
+ Used with prefetchable I/O memory, an mmiowb() barrier may be required to
+ force writes to be ordered.
+
+ (*) readX_relaxed()
+
+ These are not guaranteed to be ordered in any way. There is no I/O read
+ barrier available.
+
+ (*) ioreadX(), iowriteX()
+
+ These will perform as appropriate for the type of access they're actually
+ doing, be it in/out or read/write.
+
+
+==========
+REFERENCES
+==========
+
+AMD64 Architecture Programmer's Manual Volume 2: System Programming
+ Chapter 7.1: Memory-Access Ordering
+ Chapter 7.4: Buffering and Combining Memory Writes
+
+IA-32 Intel Architecture Software Developer's Manual, Volume 3:
+System Programming Guide
+ Chapter 7.1: Locked Atomic Operations
+ Chapter 7.2: Memory Ordering
+ Chapter 7.4: Serializing Instructions
+
+The SPARC Architecture Manual, Version 9
+ Chapter 8: Memory Models
+ Appendix D: Formal Specification of the Memory Models
+ Appendix J: Programming with the Memory Models
+
+UltraSPARC Programmer Reference Manual
+ Chapter 5: Memory Accesses and Cacheability
+ Chapter 15: Sparc-V9 Memory Models
+
+UltraSPARC III Cu User's Manual
+ Chapter 9: Memory Models
+
+UltraSPARC IIIi Processor User's Manual
+ Chapter 8: Memory Models
+
+UltraSPARC Architecture 2005
+ Chapter 9: Memory
+ Appendix D: Formal Specifications of the Memory Models
+
+UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005
+ Chapter 8: Memory Models
+ Appendix F: Caches and Cache Coherency
+
+Solaris Internals, Core Kernel Architecture, p63-68:
+ Chapter 3.3: Hardware Considerations for Locks and
+ Synchronization
+
+Unix Systems for Modern Architectures, Symmetric Multiprocessing and Caching
+for Kernel Programmers:
+ Chapter 13: Other Memory Models

Alan Cox

Mar 8, 2006, 10:00:24 AM

On Wed, Mar 08, 2006 at 02:37:58PM +0000, David Howells wrote:
> + (*) reads can be done speculatively, and then the result discarded should it
> + prove not to be required;

That might be worth an example with an if() because PPC will do this and if
it's a read with a side effect (eg I/O space) you get singed..

> +same set of data, but attempting not to use locks as locks are quite expensive.

s/are quite/is quite

and is quite confusing to read

> +SMP memory barriers are normally mere compiler barriers on a UP system because

s/mere//

Makes it easier to read if you are not 1st language English.

> +In addition, accesses to "volatile" memory locations and volatile asm
> +statements act as implicit compiler barriers.

Add

The use of volatile generates poorer code and hides the serialization in
type declarations that may be far from the code. The Linux coding style therefore
strongly favours the use of explicit barriers except in small and specific cases.

> +SMP memory barriers are only compiler barriers on uniprocessor compiled systems
> +because it is assumed that a CPU will be apparently self-consistent, and will
> +order overlapping accesses correctly with respect to itself.

Is this true of IA-64 ??

> +There is no guarantee that some intervening piece of off-the-CPU hardware will
> +not reorder the memory accesses. CPU cache coherency mechanisms should
> +propagate the indirect effects of a memory barrier between CPUs.

[For information on bus mastering DMA and coherency please read ....]

since we have a doc on this

> +There are some more advanced barriering functions:

"barriering" ... ick, barrier.

> +LOCKING FUNCTIONS
> +-----------------
> +
> +For instance all the following locking functions imply barriers:

s/For instance//

s/disablement/disabling/

Should clarify local ordering v SMP ordering for locks implied here.

> +INTERRUPT DISABLEMENT FUNCTIONS
> +-------------------------------

s/Disablement/Disabling/

> +Interrupt disablement (LOCK equivalent) and enablement (UNLOCK equivalent) will

disable

> +===========================
> +LINUX KERNEL I/O BARRIERING

/barriering/barriers

> + (*) inX(), outX():
> +
> + These are intended to talk to legacy i386 hardware using an alternate bus
> + addressing mode. They are synchronous as far as the x86 CPUs are

Not really true. Lots of PCI devices use them. Need to talk about "I/O space"

> + concerned, but other CPUs and intermediary bridges may not honour that.
> +
> + They are guaranteed to be fully ordered with respect to each other.

And make clear I/O space is a CPU property and that inX()/outX() may well map
to read/write variant functions on many processors

> + (*) readX(), writeX():
> +
> + These are guaranteed to be fully ordered and uncombined with respect to
> + each other on the issuing CPU, provided they're not accessing a

MTRRs

> + prefetchable device. However, intermediary hardware (such as a PCI
> + bridge) may indulge in deferral if it so wishes; to flush a write, a read
> + from the same location must be performed.

False. It's not so tightly restricted, and on many devices the location you
write is not safe to read, so you must use another. I'd have to dig the PCI spec
out but I believe it says the same devfn. It also says stuff about rules for
visibility of bus mastering relative to these accesses and PCI config space
accesses relative to the lot (the latter several chipsets get wrong). We
should probably point people at the PCI 2.2 spec.

Looks much much better than the first version and just goes to prove how complex
this all is

Andi Kleen

Mar 8, 2006, 10:00:46 AM

Robert Hancock <hanc...@shaw.ca> writes:

> Alan Cox wrote:
> > On Maw, 2006-03-07 at 22:24 +0100, Andi Kleen wrote:
> >>> But on most arches those accesses do indeed seem to happen in-order. On
> >>> i386 and x86_64, it's a natural consequence of program store ordering.
> >> Not true for reads on x86.
> > You must have a strange kernel Andi. Mine marks them as volatile
> > unsigned char * references.
>
> Well, that and the fact that IO memory should be mapped as uncacheable
> in the MTRRs should ensure that readl and writel won't be reordered on
> i386 and x86_64.. except in the case where CONFIG_UNORDERED_IO is
> enabled on x86_64 which can reorder writes since it uses nontemporal
> stores..

CONFIG_UNORDERED_IO is a failed experiment. I just removed it.

-Andi

Matthew Wilcox

Mar 8, 2006, 10:50:33 AM

On Wed, Mar 08, 2006 at 09:55:06AM -0500, Alan Cox wrote:
> On Wed, Mar 08, 2006 at 02:37:58PM +0000, David Howells wrote:
> > + (*) reads can be done speculatively, and then the result discarded should it
> > + prove not to be required;
>
> That might be worth an example with an if() because PPC will do this and if
> its a read with a side effect (eg I/O space) you get singed..

PPC does speculative memory accesses to IO? Are you *sure*?

> > +same set of data, but attempting not to use locks as locks are quite expensive.
>
> s/are quite/is quite
>
> and is quite confusing to read

His grammar's right ... but I'd just leave out the 'as' part. As
you're right that it's confusing ;-)

> > +SMP memory barriers are normally mere compiler barriers on a UP system because
>
> s/mere//
>
> Makes it easier to read if you are not 1st language English.

Maybe s/mere/only/?

> > +SMP memory barriers are only compiler barriers on uniprocessor compiled systems
> > +because it is assumed that a CPU will be apparently self-consistent, and will
> > +order overlapping accesses correctly with respect to itself.
>
> Is this true of IA-64 ??

Yes:

#else
# define smp_mb() barrier()
# define smp_rmb() barrier()
# define smp_wmb() barrier()
# define smp_read_barrier_depends() do { } while(0)
#endif

> > + (*) inX(), outX():
> > +
> > + These are intended to talk to legacy i386 hardware using an alternate bus
> > + addressing mode. They are synchronous as far as the x86 CPUs are
>
> Not really true. Lots of PCI devices use them. Need to talk about "I/O space"

Port space is deprecated though. PCI 2.3 says:

"Devices are recommended always to map control functions into Memory Space."

> > +
> > + These are guaranteed to be fully ordered and uncombined with respect to
> > + each other on the issuing CPU, provided they're not accessing a
>
> MTRRs
>
> > + prefetchable device. However, intermediary hardware (such as a PCI
> > + bridge) may indulge in deferral if it so wishes; to flush a write, a read
> > + from the same location must be performed.
>
> False. It's not so tightly restricted, and on many devices the location you
> write is not safe to read, so you must use another. I'd have to dig the PCI spec
> out but I believe it says the same devfn. It also says stuff about rules for
> visibility of bus mastering relative to these accesses and PCI config space
> accesses relative to the lot (the latter several chipsets get wrong). We
> should probably point people at the PCI 2.2 spec.

3.2.5 of PCI 2.3 seems most relevant:

Since memory write transactions may be posted in bridges anywhere
in the system, and I/O writes may be posted in the host bus bridge,
a master cannot automatically tell when its write transaction completes
at the final destination. For a device driver to guarantee that a write
has completed at the actual target (and not at an intermediate bridge),
it must complete a read to the same device that the write targeted. The
read (memory or I/O) forces all bridges between the originating master
and the actual target to flush all posted data before allowing the
read to complete. For additional details on device drivers, refer to
Section 6.5. Refer to Section 3.10., item 6, for other cases where a
read is necessary.

Appendix E is also of interest:

2. Memory writes can be posted in both directions in a bridge. I/O and
Configuration writes are not posted. (I/O writes can be posted in the
Host Bridge, but some restrictions apply.) Read transactions (Memory,
I/O, or Configuration) are not posted.

5. A read transaction must push ahead of it through the bridge any posted
writes originating on the same side of the bridge and posted before
the read. Before the read transaction can complete on its originating
bus, it must pull out of the bridge any posted writes that originated
on the opposite side and were posted before the read command completes
on the read-destination bus.

I like the way they contradict each other slightly wrt config reads and
whether you have to read from the same device, or merely the same bus.
One thing that is clear is that a read of a status register on the bridge
isn't enough, it needs to be *through* the bridge, not *to* the bridge.
I wonder if a config read of a non-existent device on the other side of
the bridge would force the write to complete ...
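
As a rough illustration of the posting behaviour quoted above, here is a toy
model in plain C. Nothing in it is a real PCI or kernel interface: toy_target,
toy_bridge and the three bridge_*() functions are hypothetical names, and the
fixed-depth queue is an assumption made purely for the sketch. It shows a
write that has completed as far as the master is concerned while still
sitting in the bridge, and a read through the bridge pushing it out to the
target.

```c
#include <assert.h>

#define POST_DEPTH 4

/* Toy device with a single register (hypothetical, not a real device). */
struct toy_target {
	unsigned int reg;
};

/* Toy bridge that may hold (post) writes before forwarding them. */
struct toy_bridge {
	struct toy_target *target;
	unsigned int posted[POST_DEPTH];
	unsigned int count;
};

/* Forward every posted write to the target, oldest first. */
static void bridge_flush(struct toy_bridge *b)
{
	unsigned int i;

	for (i = 0; i < b->count; i++)
		b->target->reg = b->posted[i];
	b->count = 0;
}

/* A write completes at the master as soon as the bridge accepts it. */
static void bridge_write(struct toy_bridge *b, unsigned int v)
{
	if (b->count == POST_DEPTH)
		bridge_flush(b);	/* queue full: drain before posting */
	assert(b->count < POST_DEPTH);
	b->posted[b->count++] = v;
}

/* A read through the bridge pushes posted writes ahead of it. */
static unsigned int bridge_read(struct toy_bridge *b)
{
	bridge_flush(b);
	return b->target->reg;
}
```

In this model a bridge_write() of 7 leaves the target register at its old
value until a bridge_read() goes through the bridge and returns 7.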

Christoph Lameter

Mar 8, 2006, 11:30:31 AM

You need to explain the difference between the compiler reordering and the
control of the compiler's arrangement of loads and stores and the CPU's
reordering of stores and loads. Note that IA64 has a much more complete
set of means to reorder stores and loads. i386 and x86_64 processors can
only do limited reordering. So it may make sense to deal with general
reordering and then explain i386 as a specific limited case.

See the "Intel Itanium Architecture Software Developer's Manual"
(available from Intel's website). Look at Volume 1 section 2.6
"Speculation" and 4.4 "Memory Access"

Also the specific barrier functions of various locking elements vary to
some extent.

Bryan O'Sullivan

Mar 8, 2006, 11:50:35 AM

On Wed, 2006-03-08 at 12:34 +0000, David Howells wrote:

> On i386 and x86_64, do IN and OUT instructions imply MFENCE?

No.

<b

David Howells

Mar 8, 2006, 12:10:33 PM

Alan Cox <al...@redhat.com> wrote:

> [For information on bus mastering DMA and coherency please read ....]
>
> since we have a doc on this

Documentation/pci.txt?

> The use of volatile generates poorer code and hides the serialization in
> type declarations that may be far from the code.

I'm not sure what you mean by that.

> Is this true of IA-64 ??

Are you referring to non-temporal loads and stores?

> > +There are some more advanced barriering functions:
>
> "barriering" ... ick, barrier.

Picky:-)

> Should clarify local ordering v SMP ordering for locks implied here.

Do you mean explain what each sort of lock does?

> > + (*) inX(), outX():
> > +
> > + These are intended to talk to legacy i386 hardware using an alternate bus
> > + addressing mode. They are synchronous as far as the x86 CPUs are
>
> Not really true. Lots of PCI devices use them. Need to talk about "I/O space"

Which bit is not really true?

David

David Howells

Mar 8, 2006, 12:30:31 PM

Matthew Wilcox <mat...@wil.cx> wrote:

> > That might be worth an example with an if() because PPC will do this and
> > if its a read with a side effect (eg I/O space) you get singed..
>
> PPC does speculative memory accesses to IO? Are you *sure*?

Can you do speculative reads from frame buffers?

> # define smp_read_barrier_depends() do { } while(0)

What's this one meant to do?

> Port space is deprecated though. PCI 2.3 says:

That's sort of irrelevant here. I still need to document the
interaction.

> Since memory write transactions may be posted in bridges anywhere
> in the system, and I/O writes may be posted in the host bus bridge,

I'm not sure whether this is beyond the scope of this document. Maybe the
document's scope needs to be expanded.

David

David Howells

Mar 8, 2006, 12:40:19 PM

Christoph Lameter <clam...@engr.sgi.com> wrote:

> You need to explain the difference between the compiler reordering and the
> control of the compilers arrangement of loads and stores and the cpu
> reordering of stores and loads.

Hmmm... I would hope people looking at this doc would understand that, but
I'll see what I can come up with.

> Note that IA64 has a much more complete set of means to reorder stores and
> loads. i386 and x86_64 processors can only do limited reordering. So it may
> make sense to deal with general reordering and then explain i386 as a
> specific limited case.

Don't you need to use sacrifice_goat() for controlling the IA64? :-)

Besides, I'm not sure that I need to explain that any CPU is a limited case;
I'm primarily trying to define the basic minimal guarantees you can expect
from using a memory barrier, and what might happen if you don't. It shouldn't
matter which arch you're dealing with, especially if you're writing a driver.

I tried to create arch-specific sections for describing arch-specific implicit
barriers and the extent of the explicit memory barriers on each arch, but the
i386 section was generating so many exceptions that it looked infeasible to
describe them; besides, you aren't allowed to rely on such features outside of
arch code (I count arch-specific drivers as "arch code" for this).

> See the "Intel Itanium Architecture Software Developer's Manual"
> (available from intels website). Look at Volume 1 section 2.6
> "Speculation" and 4.4 "Memory Access"

I've added that to the refs, thanks.

> Also the specific barrier functions of various locking elements vary to
> some extent.

Please elaborate.

David

Alan Cox

Mar 8, 2006, 12:40:28 PM

On Wed, Mar 08, 2006 at 05:04:51PM +0000, David Howells wrote:
> > [For information on bus mastering DMA and coherency please read ....]
> > since we have a doc on this
>
> Documentation/pci.txt?

and:

Documentation/DMA-mapping.txt
Documentation/DMA-API.txt

>
> > The use of volatile generates poorer code and hides the serialization in
> > type declarations that may be far from the code.
>
> I'm not sure what you mean by that.

in foo.h

struct blah {
    volatile int x; /* need serialization */
    int y;
};

2 million miles away

blah.x = 1;
blah.y = 4;

And you've no idea that it's magically serialized due to a type declaration
in a header you've never read. Hence the "don't use volatile" rule
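
For contrast, here is a minimal sketch of the explicit style, assuming GCC or
Clang. barrier() is defined locally with the usual empty-asm memory clobber so
that the block builds in userspace; struct blah_plain and publish() are
hypothetical names, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Compiler-only barrier, assuming GCC/Clang inline asm syntax. It emits no
 * instructions; it only stops the compiler from moving memory accesses
 * across this point. */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical structure: no volatile hidden in the type declaration. */
struct blah_plain {
	int x;
	int y;
};

/* The ordering requirement is visible at the point of use. */
static void publish(struct blah_plain *b)
{
	assert(b != NULL);
	b->x = 1;
	barrier();	/* compiler must emit the store to x before the store to y */
	b->y = 4;
}
```

This constrains only the compiler. Where another CPU or a device observes the
two stores, a CPU barrier such as wmb() would still be needed.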

> > Is this true of IA-64 ??
>
> Are you referring to non-temporal loads and stores?

Yep. But Matthew answered that

> > Should clarify local ordering v SMP ordering for locks implied here.
>
> Do you mean explain what each sort of lock does?

spin_unlock ensures that local CPU writes before the lock are visible
to all processors before the lock is dropped but it has no effect on
I/O ordering. Just a need for clarity.

> > > + (*) inX(), outX():
> > > +
> > > + These are intended to talk to legacy i386 hardware using an alternate bus
> > > + addressing mode. They are synchronous as far as the x86 CPUs are
> >
> > Not really true. Lots of PCI devices use them. Need to talk about "I/O space"
>
> Which bit is not really true?

The "legacy i386 hardware" bit. Many processors have an I/O space.

Christoph Lameter

Mar 8, 2006, 12:50:32 PM

On Wed, 8 Mar 2006, David Howells wrote:

> Hmmm... I would hope people looking at this doc would understand that, but
> I'll see what I can come up with.
>
> > Note that IA64 has a much more complete set of means to reorder stores and
> > loads. i386 and x86_64 processors can only do limited reordering. So it may
> > make sense to deal with general reordering and then explain i386 as a
> > specific limited case.
>
> Don't you need to use sacrifice_goat() for controlling the IA64? :-)

Likely...

> Besides, I'm not sure that I need to explain that any CPU is a limited case;
> I'm primarily trying to define the basic minimal guarantees you can expect
> from using a memory barrier, and what might happen if you don't. It shouldn't
> matter which arch you're dealing with, especially if you're writing a driver.

memory barrier functions have to be targeted to the processor with the
ability to do the widest amount of reordering. This is the Itanium AFAIK.

> I tried to create arch-specific sections for describing arch-specific implicit
> barriers and the extent of the explicit memory barriers on each arch, but the
> i386 section was generating so many exceptions that it looked infeasible to
> describe them; besides, you aren't allowed to rely on such features outside of
> arch code (I count arch-specific drivers as "arch code" for this).

i386 does not fully implement things like write barriers since they have
an implicit ordering of stores.

> > Also the specific barrier functions of various locking elements vary to
> > some extent.
>
> Please elaborate.

F.e. spin_unlock has "release" semantics on IA64. That means that prior
write accesses are visible before the store, read accesses are also
completed before the store. However, the processor may perform later read
and write accesses before the results of the store become visible.
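
Release semantics of this kind can be sketched in portable userspace C using
C11 atomics. This is an analogy under stated assumptions, not how spin_unlock
is implemented on IA64: toy_lock, protected_data, toy_unlock() and
toy_read_if_free() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace lock word and the data it protects. */
static atomic_int toy_lock = 1;		/* 1 = held, 0 = free */
static int protected_data;

/* Unlock with release ordering: accesses before the store may not be
 * reordered after it, but later accesses may still move ahead of it. */
static void toy_unlock(void)
{
	protected_data = 42;
	atomic_store_explicit(&toy_lock, 0, memory_order_release);
}

/* Acquire load: if it observes the release store, it also observes
 * everything written before that store. Returns 1 if the lock was free. */
static int toy_read_if_free(int *out)
{
	assert(out != NULL);
	if (atomic_load_explicit(&toy_lock, memory_order_acquire) != 0)
		return 0;
	*out = protected_data;
	return 1;
}
```

The release store orders only what came before it, which is why such an
unlock on its own is weaker than a full barrier.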

Alan Cox

Mar 8, 2006, 1:00:18 PM

> i386 does not fully implement things like write barriers since they have
> an implicit ordering of stores.

Except when they don't (PPro errata cases, and the explicit support for
this in the IDT Winchip)

David Howells

Mar 8, 2006, 1:40:17 PM

Alan Cox <al...@redhat.com> wrote:

> spin_unlock ensures that local CPU writes before the lock are visible
> to all processors before the lock is dropped but it has no effect on
> I/O ordering. Just a need for clarity.

So I can't use spinlocks in my driver to make sure two different CPUs don't
interfere with each other when trying to communicate with a device because the
spinlocks don't guarantee that I/O operations will stay in effect within the
locking section?

David

Alan Cox

Mar 8, 2006, 1:50:26 PM

On Wed, Mar 08, 2006 at 06:35:07PM +0000, David Howells wrote:
> Alan Cox <al...@redhat.com> wrote:
>
> > spin_unlock ensures that local CPU writes before the lock are visible
> > to all processors before the lock is dropped but it has no effect on
> > I/O ordering. Just a need for clarity.
>
> So I can't use spinlocks in my driver to make sure two different CPUs don't
> interfere with each other when trying to communicate with a device because the
> spinlocks don't guarantee that I/O operations will stay in effect within the
> locking section?

If you have

CPU #0

spin_lock(&foo->lock);
writel(0, &foo->regnum);
writel(1, &foo->data);
spin_unlock(&foo->lock);

CPU #1
spin_lock(&foo->lock);
writel(4, &foo->regnum);
writel(5, &foo->data);
spin_unlock(&foo->lock);

then on some NUMA infrastructures the order may not be as you expect. The
CPU will execute writel 0, writel 1 and the second CPU later will execute
writel 4 writel 5, but the order they hit the PCI bridge may not be the
same order. Usually such things don't matter but in a register windowed
case getting 0/4/1/5 might be rather unfortunate.
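
To see why 0/4/1/5 is unfortunate, consider a toy model of a register-windowed
device in plain C. The device, its register count and all function names are
hypothetical; the sketch only replays the two arrival orders at the device and
does not model CPUs, locks or bridges.

```c
#include <assert.h>

#define NREGS 8

/* Toy register-windowed device: a write to the index register selects
 * which internal register the next data write lands in. */
struct toy_dev {
	unsigned int regnum;
	unsigned int regs[NREGS];
};

static void dev_write_regnum(struct toy_dev *d, unsigned int n)
{
	assert(n < NREGS);
	d->regnum = n;
}

static void dev_write_data(struct toy_dev *d, unsigned int v)
{
	d->regs[d->regnum] = v;
}

/* Writes arrive in the order each CPU intended: 0/1 then 4/5. */
static void arrive_in_order(struct toy_dev *d)
{
	dev_write_regnum(d, 0);
	dev_write_data(d, 1);
	dev_write_regnum(d, 4);
	dev_write_data(d, 5);
}

/* Writes from the two CPUs interleave on the way to the device: 0/4/1/5. */
static void arrive_interleaved(struct toy_dev *d)
{
	dev_write_regnum(d, 0);
	dev_write_regnum(d, 4);
	dev_write_data(d, 1);	/* lands in register 4, not register 0 */
	dev_write_data(d, 5);	/* overwrites register 4 */
}
```

In the interleaved case register 0 is never written and the value 1 is lost,
even though each CPU issued its own pair of writes in the right order.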

See Documentation/DocBook/deviceiobook.tmpl (or its output)

The following case is safe

spin_lock(&foo->lock);
writel(0, &foo->regnum);
reg = readl(&foo->data);
spin_unlock(&foo->lock);

as the read must complete, and it forces the write to complete. The pure write
case used above should be implemented as

spin_lock(&foo->lock);
writel(0, &foo->regnum);
writel(1, &foo->data);
mmiowb();
spin_unlock(&foo->lock);

The mmiowb() ensures that these writels occur before any writel issued by
another CPU that subsequently takes the lock.

Welcome to the wonderful world of NUMA

Alan

David Howells

Mar 8, 2006, 2:10:16 PM

Alan Cox <al...@redhat.com> wrote:

> spin_lock(&foo->lock);
> writel(0, &foo->regnum);

I presume there only needs to be an mmiowb() here if you've got the
appropriate CPU's I/O memory window set up to be weakly ordered.

> writel(1, &foo->data);
> mmiowb();
> spin_unlock(&foo->lock);

David

David Howells

Mar 8, 2006, 2:10:21 PM

Alan Cox <al...@redhat.com> wrote:

> then on some NUMA infrastructures the order may not be as you expect.

Oh, yuck!

Okay... does NUMA guarantee the same for ordinary memory accesses inside the
critical section?

David

Andi Kleen

Mar 8, 2006, 2:10:22 PM

On Wednesday 08 March 2006 19:59, David Howells wrote:
> Alan Cox <al...@redhat.com> wrote:
>
> > then on some NUMA infrastructures the order may not be as you expect.
>
> Oh, yuck!
>
> Okay... does NUMA guarantee the same for ordinary memory accesses inside the
> critical section?

If you use barriers the ordering should be the same on cc/NUMA vs SMP.
Otherwise it wouldn't be "cc"

But it might be quite unfair.

-Andi

Linus Torvalds

Mar 8, 2006, 2:30:32 PM

On Wed, 8 Mar 2006, David Howells wrote:

> Alan Cox <al...@redhat.com> wrote:
>
> > spin_lock(&foo->lock);
> > writel(0, &foo->regnum);
>
> I presume there only needs to be an mmiowb() here if you've got the
> appropriate CPU's I/O memory window set up to be weakly ordered.

Actually, since the different NUMA things may have different paths to the
PCI thing, I don't think even the mmiowb() will really help. It has
nothing to serialize _with_.

It only orders mmio from within _one_ CPU and "path" to the destination.
The IO might be posted somewhere on a PCI bridge, and depending on the
posting rules, the mmiowb() just isn't relevant for IO coming through
another path.

Of course, to get into that deep doo-doo, your IO fabric must be separate
from the memory fabric, and the hardware must be pretty special, I think.

So for example, if you are using an Opteron with its NUMA memory setup
between CPU's over HT links, from an _IO_ standpoint it's not really
anything strange, since it uses the same fabric for memory coherency and
IO coherency, and from an IO ordering standpoint it's just normal SMP.

But if you have a separate IO fabric and basically two different CPU's can
get to one device through two different paths, no amount of write barriers
of any kind will ever help you.

So in the really general case, it's still basically true that the _only_
thing that serializes a MMIO write to a device is a _read_ from that
device, since then the _device_ ends up being the serialization point.

So in the extreme case, you literally have to do a read from the device
before you release the spinlock, if ordering to the device from two
different CPU's matters to you. The IO paths simply may not be
serializable with the normal memory paths, so spinlocks have absolutely
_zero_ ordering capability, and a write barrier on either the normal
memory side or the IO side doesn't affect anything.

Now, I'm by no means claiming that we necessarily get this right in
general, or even very commonly. The undeniable fact is that "big NUMA"
machines need to validate the drivers they use separately. The fact that
it works on a normal PC - and that it's been tested to death there - does
not guarantee much of anything.

The good news, of course, is that you don't use that kind of "big NUMA"
system the same way you'd use a regular desktop SMP. You don't plug in
random devices into it and just expect them to work. I'd hope ;)

Linus

David Howells

Mar 8, 2006, 2:40:25 PM

Linus Torvalds <torv...@osdl.org> wrote:

> Actually, since the different NUMA things may have different paths to the
> PCI thing, I don't think even the mmiowb() will really help. It has
> nothing to serialize _with_.

On NUMA PowerPC, should mmiowb() be a SYNC or an EIEIO instruction then? Those
do inter-component synchronisation.

David

David Howells

Mar 8, 2006, 2:40:35 PM

The attached patch documents the Linux kernel's memory barriers.

I've updated it from the comments I've been given.

Note that the per-arch notes sections are gone because it's clear that there
are so many exceptions, that it's not worth having them.

I've added a list of references to other documents.

I've tried to get rid of the concept of memory accesses appearing on the bus;
what matters is apparent behaviour with respect to other observers in the
system.

I'm not sure that any mention of interrupts vs interrupt disablement should be
retained... it's unclear that there is actually anything that guarantees that
stuff won't leak out of an interrupt-disabled section and into an interrupt
handler. Paul Mackerras says this isn't valid on powerpc, and looking at the
code seems to confirm that, barring implicit enforcement by the CPU.

There's also some uncertainty with respect to spinlocks vs I/O accesses on
NUMA.

Signed-Off-By: David Howells <dhow...@redhat.com>
---
warthog>diffstat -p1 /tmp/mb.diff

Documentation/memory-barriers.txt | 781 ++++++++++++++++++++++++++++++++++++++
1 files changed, 781 insertions(+)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
new file mode 100644

index 0000000..6eeb7e4
--- /dev/null
+++ b/Documentation/memory-barriers.txt
@@ -0,0 +1,781 @@

+ ============================
+ LINUX KERNEL MEMORY BARRIERS
+ ============================
+
+Contents:
+
+ (*) What are memory barriers?
+
+ (*) Where are memory barriers needed?
+
+ - Accessing devices.
+ - Multiprocessor interaction.
+ - Interrupts.
+

+ (*) Explicit kernel compiler barriers.
+
+ (*) Explicit kernel memory barriers.

+
+ (*) Implicit kernel memory barriers.
+
+ - Locking functions.

+ - Interrupt disabling functions.

+ - Miscellaneous functions.
+

+ (*) Inter-CPU locking barrier effects.
+
+ - Locks vs memory accesses.
+ - Locks vs I/O accesses.
+
+ (*) Kernel I/O barrier effects.

+When there's a system with more than one processor, the CPUs in the system may
+be working on the same set of data at the same time. This can cause
+synchronisation problems, and the usual way of dealing with them is to use
+locks - but locks are quite expensive, and so it may be preferable to operate
+without the use of a lock if at all possible. In such a case accesses that
+affect both CPUs may have to be carefully ordered to prevent error.
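
A minimal userspace sketch of such ordered lockless access, using C11 atomics
as a stand-in for the kernel's SMP barriers. struct mailbox, mailbox_post()
and mailbox_take() are hypothetical names; the release store plays roughly the
role of a write barrier before setting the flag, and the acquire load roughly
that of a read barrier after testing it. It assumes a single producer and a
single consumer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical one-slot mailbox shared by one producer and one consumer. */
struct mailbox {
    int payload;        /* plain data, published via 'full' */
    atomic_bool full;   /* true once payload is valid */
};

/* Producer side: write the data first, then publish the flag. */
static bool mailbox_post(struct mailbox *m, int value)
{
    assert(m);
    if (atomic_load_explicit(&m->full, memory_order_acquire))
        return false;   /* previous message not yet taken */

    m->payload = value;
    /* Release: the payload store is ordered before the flag store. */
    atomic_store_explicit(&m->full, true, memory_order_release);
    return true;
}

/* Consumer side: test the flag first, then read the data. */
static bool mailbox_take(struct mailbox *m, int *out)
{
    assert(m && out);
    /* Acquire: the payload load is ordered after the flag load. */
    if (!atomic_load_explicit(&m->full, memory_order_acquire))
        return false;   /* nothing to take */

    *out = m->payload;
    atomic_store_explicit(&m->full, false, memory_order_release);
    return true;
}
```

If the two stores in mailbox_post() could be observed in the opposite order,
the consumer might see the flag set and read a stale payload, which is the
kind of error careful ordering is meant to prevent.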

+SMP memory barriers are normally nothing more than compiler barriers on a
+kernel compiled for a UP system because the CPU orders overlapping accesses
+with respect to itself, and so CPU barriers aren't needed.

+
+
+INTERRUPTS
+----------
+
+A driver may be interrupted by its own interrupt service routine, and thus the
+two may interfere with each other's attempts to control or access the device.
+
+This may be alleviated - at least in part - by disabling interrupts (a form of
+locking), such that the critical operations are all contained within the

+interrupt-disabled section in the driver. Whilst the driver's interrupt

+EXPLICIT KERNEL COMPILER BARRIERS
+=================================

+
+The Linux kernel has an explicit compiler barrier function that prevents the
+compiler from moving the memory accesses either side of it to the other side:
+
+ barrier();
+
+This has no direct effect on the CPU, which may then reorder things however it
+wishes.
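
A minimal sketch of the compiler barrier, assuming GCC or Clang. The barrier()
definition below is the usual empty asm with a "memory" clobber; struct desc
and fill_desc() are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Compiler-only barrier: emits no instruction, but the compiler must not
 * move memory accesses from one side of it to the other. */
#define barrier() __asm__ __volatile__("" : : : "memory")

/* Hypothetical descriptor that some other observer polls via 'valid'. */
struct desc {
    unsigned int addr;
    unsigned int len;
    int valid;
};

static void fill_desc(struct desc *d, unsigned int addr, unsigned int len)
{
    assert(d);

    d->addr = addr;
    d->len = len;

    barrier();      /* the stores above are emitted before the one below */

    d->valid = 1;
}
```

This constrains only the compiler. On a weakly ordered CPU the stores may
still become visible to another observer out of order, which is what the
explicit memory barriers are for.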
+
+
+In addition, accesses to "volatile" memory locations and volatile asm

+statements act as implicit compiler barriers. Note, however, that the use of
+volatile has two negative consequences:
+
+ (1) it causes the generation of poorer code, and
+
+ (2) it can affect serialisation of events in code distant from the declaration
+ (consider a structure defined in a header file that has a volatile member
+ being accessed by the code in a source file).
+
+The Linux coding style therefore strongly favours the use of explicit barriers
+except in small and specific cases. In general, volatile should be avoided.
+
+
+===============================
+EXPLICIT KERNEL MEMORY BARRIERS
+===============================
+

+There is no guarantee that some intervening piece of off-the-CPU hardware[*]
+will not reorder the memory accesses. CPU cache coherency mechanisms should

+propagate the indirect effects of a memory barrier between CPUs.
+

+ [*] For information on bus mastering DMA and coherency please read:
+
+ Documentation/pci.txt
+ Documentation/DMA-mapping.txt
+ Documentation/DMA-API.txt

+
+Note that these are the _minimum_ guarantees. Different architectures may give
+more substantial guarantees, but they may not be relied upon outside of arch
+specific code.
+
+

+There are some more advanced barrier functions:

+
+ (*) set_mb(var, value)
+ (*) set_wmb(var, value)
+
+ These assign the value to the variable and then insert at least a write

+ barrier after it, depending on the function. They aren't guaranteed to
+ insert anything more than a compiler barrier in a UP compilation.
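
As a rough userspace sketch of what set_mb() amounts to, assuming GCC or
Clang: mb() is approximated here with the __sync_synchronize() builtin, which
is an assumption of this example rather than the kernel's per-arch definition.
The state machine (state, event_pending, try_sleep()) is hypothetical.

```c
#include <assert.h>

/* Assumption: a full barrier approximated with a compiler builtin. */
#define mb() __sync_synchronize()

/* Assign, then issue a full barrier after the store. */
#define set_mb(var, value) do { (var) = (value); mb(); } while (0)

enum { RUNNING, SLEEPING };

static int state = RUNNING;
static int event_pending;

/* Returns 1 if the caller should go to sleep, 0 if an event is pending. */
static int try_sleep(void)
{
    assert(state == RUNNING);

    /* The store to 'state' is ordered before the load of 'event_pending',
     * so a waker that sets the event and then checks 'state' cannot miss
     * both updates. */
    set_mb(state, SLEEPING);

    if (event_pending) {
        state = RUNNING;
        return 0;
    }
    return 1;
}
```

A plain assignment followed by a separate test would allow the load to be
satisfied before the store becomes visible, so the barrier has to sit between
the two.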

+
+
+===============================
+IMPLICIT KERNEL MEMORY BARRIERS
+===============================
+
+Some of the other functions in the Linux kernel imply memory barriers, amongst
+which are locking, scheduling and interrupt management functions.
+
+This specification is a _minimum_ guarantee; any particular architecture may
+provide more substantial guarantees, but these may not be relied upon outside
+of arch specific code.
+
+
+LOCKING FUNCTIONS
+-----------------
+

+All the following locking functions imply barriers:

+at all, especially with respect to I/O accesses, unless combined with interrupt
+disabling operations.
+
+See also the section on "Inter-CPU locking barrier effects".

+INTERRUPT DISABLING FUNCTIONS
+-----------------------------
+
+Functions that disable interrupts (LOCK equivalent) and enable interrupts
+(UNLOCK equivalent) will barrier memory and I/O accesses versus memory and I/O
+accesses done in the interrupt handler. This prevents an interrupt routine
+interfering with accesses made in an interrupt-disabled section of code and vice
+versa.
+
+Note that whilst disabling or enabling interrupts acts as a compiler barrier
+under all circumstances, it only acts as a memory barrier with respect to
+interrupts, not with respect to nested sections.

+=================================
+INTER-CPU LOCKING BARRIER EFFECTS
+=================================
+
+On SMP systems locking primitives give a more substantial form of barrier: one
+that does affect memory access ordering on other CPUs, within the context of
+conflict on any particular lock.
+
+
+LOCKS VS MEMORY ACCESSES
+------------------------
+
+Consider the following: the system has a pair of spinlocks (N) and (Q), and
+three CPUs; then should the following sequence of events occur: