Making sense of Memory Barriers

3,976 views
Skip to first unread message

Dain Ironfoot

unread,
Jun 12, 2016, 7:28:14 PM6/12/16
to mechanica...@googlegroups.com
Hi Guys,

I am attempting to understand memory barriers at a level useful for java lock-free programmers.
This level, I feel, is somewhere between learning about volatile from java books and learning about working of Store/Load buffers from an x86 manual.
I spent some time reading a bunch of blogs/cookbooks and have come up with the summary shown below.

LFENCE
====================================================
Name             : LFENCE/Load Barrier/Acquire Fence
Barriers          : LoadLoad + LoadStore
Details           : Given sequence {Load1, LFENCE, Load2, Store1}, the barrier ensures that Load1 can't be moved south and Load2 and Store1 can't be                  moved north of the barrier. Note that Load2 and Store1 can still be reordered.
Buffer Effect    : Causes the contents of the LoadBuffer (pending loads) to be processed for that CPU.
                        This makes program state exposed from other CPUs visible to this CPU before Load2 and Store1 are executed.
Cost on x86    : Either very cheap or a no-op.
Java instruction: Reading a volatile variable, Unsafe.loadFence()
   

SFENCE
====================================================
Name              : SFENCE/Store Barrier/Release Fence
Barriers           : StoreStore + LoadStore
Details            : Given sequence {Load1, Store1, SFENCE, Store2, Load2}, the barrier ensures that Load1 and Store1 can't be moved south and Store2 can't be moved north of the barrier.
                       Note that Load1 and Store1 can still be reordered AND Load2 can be  moved north of the barrier.
Buffer Effect    : Causes the contents of the StoreBuffer flushed to cache for the CPU on which it is issued.
                        This will make program state visible to other CPUs before Store2 and Load1 are executed.
Cost on x86    : Either very cheap or a no-op.
Java instruction: lazySet(), Unsafe.storeFence(), Unsafe.putOrdered*()


MFENCE
====================================================
Name            : MFENCE/Full Barrier/Fence
Barriers         : StoreLoad
Details          : Obtains the effects of the other three barrier. So can serve as a general-purpose barrier.
                      Given sequence {Load1, Store1, MFENCE, Store2, Load2}, the barrier ensures that Load1 and Store1 can't be moved south and Store2 and               Load2 can't be moved north of the barrier.
                      Note that Load1 and Store1 can still be reordered AND Store2 and Load2 can still be reordered.
                 
Buffer Effect   : Causes the contents of the LoadBuffer (pending loads) to be processed for that CPU.
                      AND
                      Causes the contents of the StoreBuffer flushed to cache for the CPU on which it is issued.

Cost on x86    : The most expensive kind.
Java instruction: Writing to a volatile, Unsafe.fullFence(), Using locks
   
My questions are:

a) Is my understanding correct or am I missed something critical?
b) If both SFENCE and MFENCE drains the storeBuffer (invalidates cacheline and waits for acks from other cpus), why is one a no-op and the other a very expensive op?

Thanks

** Edited to make the info easier to read.

Jonathan Yu

unread,
Jun 13, 2016, 5:02:30 PM6/13/16
to mechanica...@googlegroups.com
Hey Dain,

I certainly don't know anything, but just wanted to make sure you've seen Aleksey's posts on this topic:

These relate the memory barriers defined in Intel's manual with the Java memory model.

Good luck!

Jonathan

On Sun, Jun 12, 2016 at 4:28 PM, Dain Ironfoot <vicva...@gmail.com> wrote:
Hi Guys,

I am attempting to understand memory barriers at a level useful for java lock-free programmers.
This level, I feel, is somewhere between learning about volatile from java books and learning about Store/Load buffers from an x86 manual.
I spent some time reading a bunch of blogs/cookbooks and have come up with the summary shown below.
Will you fine Gents/Ladies have a look and point out any omissions/mistakes I have made.

Many thanks in advance.


NameBarriers
Use
Cost on X86
Java Instructions
Load Barrier
or
LFENCE
or 
Acquire Fence
LoadLoad
+
LoadStore
Given sequence

Load1
LoadLoad Barrier
Load2
Store1

The barrier ensures that Load1 can't be moved south and Load2 and Store1 can't be moved north of the barrier. 
Note that Load2 and Store1 can still be reordered.

The barrier also causes the contents of the LoadBuffer (pending loads) to be processed for that CPU. This makes program state exposed from other CPUs visible to this CPU before Load2 and Store1 are executed.
Either very cheap or a no-op.
-Reading a volatile variable
-Unsafe.loadFence()

Store Barrier
or
SFENCE
or 
Release Fence

StoreStore 
+
LoadStore
Given sequence

Load1
Store1
StoreStore Barrier
Store2
Load2

The barrier ensures that Load1 and Store1 can't be moved south and Store2 can't be moved north of the barrier.

Note that Load1 and Store1 can still be reordered AND Load2 can be moved north of the barrier.

Empties the StoreBuffer. No other instructions will start executing on this core, until ALL entries in the SOBs have been processed.

Either very cheap or a no-op.
-lazySet()
-Unsafe.storeFence()
-Unsafe.putOrdered()

Stores are not immediately visible to other CPUs but they are not reordered. 
 
Works when you have a single writer.


Full Barrier
or
MFENCE
or 
Fence


StoreLoad

(Obtains the effects of the other three barrier. So can serve as a general-purpose barrier.)
Load1
Store1
Full Barrier
Store2
Load2

The barrier ensures that Load1 and Store1 can't be moved south and Store2 and Load2 can't be moved north of the barrier.

Note that Load1 and Store1 can still be reordered AND Store2 and Load2 can still be reordered.

Ensures that all stores performed before the barrier are visible to other processors, and that all loads performed after the barrier receive the latest value that is visible at the time of the barrier.
The most expensive kind.
-Writing a volatile variable
-Unsafe.fullFence()
-Using locks



--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Jonathan Yu @jawnsy on LinkedInTwitterGitHubFacebook
“Ever tried. Ever failed. No matter. Try again. Fail again. Fail better.” — Samuel Beckett, Worstward Ho (1983) 

“In an adaptive environment, winning comes from adapting to change by continuously experimenting and identifying new options more quickly and economically than others. The classical strategist's mantra of sustainable competitive advantage becomes one of serial temporary advantage.” — Navigating the Dozens of Different Strategy Options (HBR)

Valery Ovchinnikov

unread,
Jun 17, 2016, 4:51:03 AM6/17/16
to mechanical-sympathy
Hi Dain,

I'm no expert, but to my understanding you have some inaccuracies.
It's not useful to speak about store/load buffers flushing in context of memory barriers.
As far as those barriers are not mapped one-to-one to the processor instructions and communication mechanisms between cores may differ between processors/architectures.
e.g. your table stands that release barrier "Empties the StoreBuffer".
This is not actually correct.

The thing is that we've got different architectures with different hardware memory models out there.
x86 has strong memory ordering, i.e. all stores may be thought as they are release and all reads are acquire.
While ARM or POWER both have weak memory orderings.

The other thing is that barriers can't describe specific cases like IRIW (Independent Reads of Independent Writes).

I'd suggest you reading Peter Sewell's papers (http://www.cl.cam.ac.uk/~pes20/).

The good starting point for Software Engineer is his (et al.) papers on C++11 Memory Model, which can give one an instrument to reason about concurrent programs with weak/relaxed memory orderings.
To better understand IRIW and all that stuff about weak hardware memory models I'd give you a link Alexey Shipilev gave me recently:http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf



Speaking about intuition. 
My personal mental model is that x86 doesn't implement invalidation queues (load buffers). 
Barriers (there's single one actually) obtain exclusive access to the link shared between all cores, so sequential consistency is held.
ARM and POWER is thought as if every core had connection to all the others (no single link with exclusive access).
Thus updates in release/acquire semantics can come to different cores in different order (lwsync).
And when one requires sequential consistency all those links are locked (sync).
BUT those are just parts of my meta-physics and may be terribly incorrect models.

понедельник, 13 июня 2016 г., 2:28:14 UTC+3 пользователь Dain Ironfoot написал:

Gil Tene

unread,
Jun 21, 2016, 1:37:54 PM6/21/16
to mechanical-sympathy
The main tip I give people that try to understand fences, barriers, and memory ordering rules is to completely (and very intentionally) ignore and avoid thinking about any CPU's internal details (like buffers or queues or flushes or such). The simple rule to remember is this: before a CPU ever sees the instructions you think you are running, a compiler (static, JIT, whatever) will have a chance to completely shuffle those instructions around and give them to the cpu in any order it sees fit. So what a CPU actually does with barriers has no functional/correntness implications, as it is the compiler's job to make the CPU do what it needs to. Barriers/fences/memory models are first and foremost compiler concerns, and have both a correctness and (potential and profound) performance implications. The CPU concerns are only secondary, and only effect performance (never correctness). This is critical to understand and get your head around before trying to reason about what fences/barriers/etc. *mean* to program code.

For example, the sort of code mutation we are talking about is not just a simple change in order. It is also a combination of reordering with other optimizations. E.g. a repreated load from the same location in a loop can (and will) float "up above" the loop, execute only once (instead of once per loop iteration) and cache the loaded result in a register, unless barriers/fences/ordering rules prevent it from doing so. Similarly redundant stores to the same location can be folded into a single store, unless rules prevent it.

With that in mind, you can start looking at the various barriers and fence semantics available in various languages (that actually define them) and in various ad-hoc tooling (like e.g. libraries and conventions used in C).

I like to use the simplistic LoadLoad, LoadStore, StoreStore, StoreLoad mental model as starting point for thinking about barriers, as it can be used to model most (not all) barrier semantics. Those are simple to understand:
- LoadStore means "all loads that precede this fence will appear (from the point of view of any other thread) to execute before the next store appears to execute.
- LoadLoad means "all loads that precede this fence will appear (from the point of view of any other thread) to execute before the next load appears to execute.
etc. etc.
The name tells you what the rule is, clear and simple. Nothing more is promised, and nothing less.

I like to think of Acquire and Release (which are somewhat more restrictive) in terms of "what would a lock need to do?" (even if no lock is involved). Thinking in terms of "in a critical section protected by a lock, what sort of code motion would the acquire/release fences associated with the lock acquire and release action allow?" allows me to answer questions about them without looking up a definition or thinking about the more abstract notions that define them. E.g. A lock acquire can allow preceding loads and stores that (in the program) appear "before" the lock acquire to "float forward past the acquire and into the locked region". But it cannot allow loads and stores that (in the program) appear "after" the lock acquire to "float to backwards past the acquire and out of the locked region". So an acquire fence needs to enforce those rules. Similarly, release can allow loads and stores that follow the release to "float back past the release and into the locked region", but cannot allow loads and stores that precede it to "float forward past the release and out of the locked region". 

Since acquire doesn't (necessarily) involve a load or a store, Acquire doesn't quite equate to LoadLoad|LoadStore. It can be when acquiring the lock involves a load followed by LoadLoad|LoadStore, but there are lock implementations where that doesn't actually happen, and uses of Acquire fences with no locking or loads involved. Similarly, since a release doesn't (necessarily) involve a load or a store, it is not quite equivalent to StoreLoad|StoreStore. (But can be when releasing involves a store followed by StoreLoad|StoreStore.

HTH

Avi Kivity

unread,
Jun 22, 2016, 3:53:57 AM6/22/16
to mechanica...@googlegroups.com

100% agree!  In the kernel people ignore this, but only by splattering memory clobbers everywhere, to defeat any compiler reordering or collapsing.  But Gil's approach is much better, and works with languages that have a memory model.

--

Jean-Philippe BEMPEL

unread,
Jun 22, 2016, 4:36:07 AM6/22/16
to mechanical-sympathy
Agreed with Gil.
You do not care about hardware memory barriers. I usually found many blog poss about them and mix it with compiler memory barriers. There is however a clear distinction you should make between them.

For an example of instruction reordering done by compiler:

Cheers

Dain Ironfoot

unread,
Jul 5, 2016, 5:53:43 PM7/5/16
to mechanica...@googlegroups.com
Gil thanks for the wonderfully simple yet clear explanation on the barriers.

May I ask you to please help me understand the related "memory visibility" guarantees that these barriers provide in JMM.
AFAIK, in java, we have the following language level instructions which deals with memory visibility & ordering.

volatile read (Atomicxxx get)
volatile write (Atomicxxx set)
Atomicxxx lazySet (Unsafe.putOrdered*)

Unsafe.loadFence()
Unsafe.storeFence()
Unsafe.fullFence()

Lock acquire
Lock release

final fields in the constructor

Given your model of "LoadLoad", "LoadStore", "StoreStore", "StoreLoad"; is it helpful to map them to these instructions above?
Also, how do you think about the memory visibility that are provided by these instructions?

Many thanks


Gil Tene

unread,
Jul 6, 2016, 12:48:54 PM7/6/16
to mechanica...@googlegroups.com
It's actually fairly hard to precisely map the JMM to barriers.

The job for (some of us) runtime implementors it fairly easy, as there are clear rules we can follow that while overly conservative, are *sufficient* for meeting the memory model requirements. You can find those in Doug Lea's excellent JMM cookbook (http://gee.cs.oswego.edu/dl/jmm/cookbook.html), and you'll see a matrix there that explains the sort of LoadLoad | LoadStore | StoreStore | StoreLoad combos would be sufficient to place between various operations for the memory model to be adhered to. However, it is critical to understand that these are not rules that *you* can rely on as a programmer. The cookbook is written for runtime and compiler implementors and provides an "if you follow this you are safe in your implementation" set, but runtimes are allowed to be more aggressive than the rules specified and still meet the JMM requirements. And they do. So programs that run on those runtimes are not guaranteed the sufficient rules in the cookbook will actually be followed. E.g. lock coarsening is an optimization that is more aggressive than the cookbook rules, but is allowed by the JMM. And so is lock biasing. And so is the ability to eliminate monitor and/or volatile operations on provably-thread-local objects and fields, e.g. when escape analysis can show that no other thread will ever observe a given object.

A specific quality that the JMM has that is "looser" than the notion of fences or barriers is that the rules apply to specific variables and not globally to all memory operations. While fences and barriers are usually described in a global sense (a rule that applies to all pervious/subsequent stores or loads), the rules of the JMM only apply to operations in other threads that interact with the same volatile variable or monitor in question. E.g. with regards to other threads operations on the same volatile variable, a volatile read will appear to have the equivalent of a LoadLoad | LoadSore between the volatile read operation any any subsequent loads or stores (seen in program order in the thread). But this promise does not apply against other threads that do not interact with the same volatile. So e.g. if the volatile can be proven to be thread local, a volatile read has no ordering implications. The same is true for volatile stores, so a volatile store will appear to have a LoadStore|StoreStore fence between any preceding loads and store of the volatile store operation when considered from the point of view of another thread operating on the same volatile field. But it cannot be relied on to create a global StoreStore ro LoadStore fence, or the equivalent of an acquire or release, since it can be optimized away under certain conditions (like e.g. if the field was a member of an an object that was proven to be thread-local via escape analysis and therefore is known to not have any other threads interacts with it). The same caveats apply to monitor enter and monitor exist.

On Tuesday, July 5, 2016 at 2:53:43 PM UTC-7, Dain Ironfoot wrote:
Gil thanks for the wonderfully simple yet clear explanation on the barriers.

May I ask you to please help me understand the related "memory visibility" guarantees that these barriers provide in JMM.
AFAIK, in java, we have the following language level instructions which deals with memory visibility & ordering.

volatile read (Atomicxxx get)
volatile write (Atomicxxx set)
 
Atomicxxx lazySet (Unsafe.putOrdered*)

lazySet is not described or related to in any way in the current (Java SE 8 or prior) JMM. In implementation, it is usually equivalent to a StoreStore fence preceding the set operation. This quality seems to be relied on by a lot of concurrent Java code (including code within the JDK), as it is a relatively cheap barrier.
 

Unsafe.loadFence() 
Unsafe.storeFence()
Unsafe.fullFence()

Similarly not defined in the JMM. And (like everything else in Unsafe) not defined by the Java SE spec. But described in JEP 171 in a way that would make them equivalent to:

Unsafe.loadFence() : LoadLoad | LoadStore
Unsafe.storeFence(): StoreLoad | StoreStore
Unsafe.fullFence(): LoadLoad | LoadStore | StoreLoad | StoreStore
 

Lock acquire
Lock release

I assume you are referring to j.u.c Locks here. According to the j.u.c.locks.Lock JavaDoc, those have the same semantics as monitor enter and monitor exist.
 
final fields in the constructor

This is defined in the JMM. With regards to other threads that may observe the contents of the final field, it can be thought of loosely as having a StoreStore fence between the final field initializer and any subsequent store operation that would make the object that contains the final field visible to other threads (e.g. storing it's reference into a heap object). But keep in mind the same JMM loosening possibility: if the field can be proven to not be visible to other threads (e.g. via escape analysis) then there is no ordering required. So depending on the runtime, it may not exist.

Dain Ironfoot

unread,
Jul 10, 2016, 9:43:51 PM7/10/16
to mechanical-sympathy
Gil, thank you so very much! I think I speak on behalf of a lot of developers that this is pure gold.

Reply all
Reply to author
Forward
0 new messages