Avoiding expensive memory barriers

1,136 views
Skip to first unread message

Avi Kivity

unread,
Feb 15, 2018, 11:29:27 AM2/15/18
to mechanical-sympathy
Ever see mfence (aka full memory barrier, or std::memory_order_seq_cst)
taking the top row in a profile? Here's the complicated story of how we
took it down:


https://www.scylladb.com/2018/02/15/memory-barriers-seastar-linux/

Dan Eloff

unread,
Mar 18, 2018, 2:16:37 PM3/18/18
to mechanica...@googlegroups.com
You don't need memory barriers to implement an SPSC queue for x86. You can do a relaxed store to the queue followed by a release write to producer_idx. As long as consumer begins with an acquire load from producer_idx it is guaranteed to see all stores to the queue memory before producer_idx, according to the happens before ordering. There are no memory barriers on x86 for acquire/release semantics.

The release/acquire semantics have no meaning when used with different memory locations, but if used on producer_idx when synchronizing the consumer, and consumer_idx when synchronizing the producer, it should work.



On Thu, Feb 15, 2018 at 8:29 AM, Avi Kivity <a...@scylladb.com> wrote:
Ever see mfence (aka full memory barrier, or std::memory_order_seq_cst) taking the top row in a profile? Here's the complicated story of how we took it down:


https://www.scylladb.com/2018/02/15/memory-barriers-seastar-linux/


--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsubscribe...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Avi Kivity

unread,
Mar 19, 2018, 4:19:08 AM3/19/18
to mechanica...@googlegroups.com, Dan Eloff

The release write is a memory barrier. It's not an SFENCE or another fancy instruction, but it is a memory barrier from the application writer's point of view.


The C++ code


    x.store(5, std::memory_order_relaxed)


has two effects on x86:

  1. generate a write to x that is a single instruction (e.g. mov $5, x)
  2. prevent preceding writes from being reordered by the compiler (they are implicitly ordered by the processor on x86).
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.

Francesco Nigro

unread,
Mar 19, 2018, 10:23:57 AM3/19/18
to mechanica...@googlegroups.com
HI guys!

Re 

>  2. prevent preceding writes from being reordered by the compiler (they are implicitly ordered by the processor on x86).

 I have had a quite interesting chat with Jeff Preshing (http://preshing.com/) about relaxed guarantees and he pointed me this one to help on it:
https://www.decadent.org.uk/pipermail/cpp-threads/2008-December/001946.html

And quoting himself:
  1. Jeff Preshing

    As I wrote above, "each atomic var has its own modification order"... once a value in that modification is seen, older ones cannot be seen again

     
    12 feb 2017
  2. Jeff Preshing

    Note this guarantee has nothing to do with acquire/release... it's a separate guarantee, that each atomic var being consistent with its own modification order


Cheers,
Franz


Il giorno lunedì 19 marzo 2018 09:19:08 UTC+1, Avi Kivity ha scritto:

The release write is a memory barrier. It's not an SFENCE or another fancy instruction, but it is a memory barrier from the application writer's point of view.


The C++ code


    x.store(5, std::memory_order_relaxed)


has two effects on x86:

  1. generate a write to x that is a single instruction (e.g. mov $5, x)
  2. prevent preceding writes from being reordered by the compiler (they are implicitly ordered by the processor on x86).



On 03/18/2018 08:16 PM, Dan Eloff wrote:
You don't need memory barriers to implement an SPSC queue for x86. You can do a relaxed store to the queue followed by a release write to producer_idx. As long as consumer begins with an acquire load from producer_idx it is guaranteed to see all stores to the queue memory before producer_idx, according to the happens before ordering. There are no memory barriers on x86 for acquire/release semantics.

The release/acquire semantics have no meaning when used with different memory locations, but if used on producer_idx when synchronizing the consumer, and consumer_idx when synchronizing the producer, it should work.


On Thu, Feb 15, 2018 at 8:29 AM, Avi Kivity <a...@scylladb.com> wrote:
Ever see mfence (aka full memory barrier, or std::memory_order_seq_cst) taking the top row in a profile? Here's the complicated story of how we took it down:


https://www.scylladb.com/2018/02/15/memory-barriers-seastar-linux/


--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.
--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

Dan Eloff

unread,
Mar 19, 2018, 10:41:41 AM3/19/18
to mechanica...@googlegroups.com
We're getting a little confused on the terminology. That's a compiler barrier, as it prevents the compiler from reordering certain instructions beyond it (I don't think relaxed prevents any reordering, but release and acquire do.) I know you understand this stuff given your background, I just want to clarify the terminology for the sake of the discussion.

The original post and article discuss real memory barriers like mfence. These prevent the CPU from reordering loads and stores. Which should be unnecessary for SPSC queues on x86 because it gives strong enough guarantees about reordering, in this case, without that.


On Mon, Mar 19, 2018, 1:19 AM Avi Kivity <a...@scylladb.com> wrote:

The release write is a memory barrier. It's not an SFENCE or another fancy instruction, but it is a memory barrier from the application writer's point of view.


The C++ code


    x.store(5, std::memory_order_relaxed)


has two effects on x86:

  1. generate a write to x that is a single instruction (e.g. mov $5, x)
  2. prevent preceding writes from being reordered by the compiler (they are implicitly ordered by the processor on x86).



On 03/18/2018 08:16 PM, Dan Eloff wrote:
You don't need memory barriers to implement an SPSC queue for x86. You can do a relaxed store to the queue followed by a release write to producer_idx. As long as consumer begins with an acquire load from producer_idx it is guaranteed to see all stores to the queue memory before producer_idx, according to the happens before ordering. There are no memory barriers on x86 for acquire/release semantics.

The release/acquire semantics have no meaning when used with different memory locations, but if used on producer_idx when synchronizing the consumer, and consumer_idx when synchronizing the producer, it should work.


On Thu, Feb 15, 2018 at 8:29 AM, Avi Kivity <a...@scylladb.com> wrote:
Ever see mfence (aka full memory barrier, or std::memory_order_seq_cst) taking the top row in a profile? Here's the complicated story of how we took it down:


https://www.scylladb.com/2018/02/15/memory-barriers-seastar-linux/


--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Avi Kivity

unread,
Mar 21, 2018, 1:44:11 PM3/21/18
to mechanica...@googlegroups.com, Dan Eloff

The mfence was not in the SPSC queue implementation, but in the code that allows wakes a potentially sleeping MPSC consumer.


  producer_idx.store(..., memory_order_release)

  if (consumer_is_sleeping) wake_it_up()


an mfence (or a clever trick) is required between these two lines.

Dan Eloff

unread,
Mar 21, 2018, 3:03:03 PM3/21/18
to Avi Kivity, mechanica...@googlegroups.com
Sorry, I understand now. Yes, you're absolutely correct, the mfence (or similar) is required there.

x86 makes strong guarantees about instruction reordering
  • Loads are not reordered with other loads.
  • Stores are not reordered with other stores.
  • Stores are not reordered with older loads.
  • In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility).
  • In a multiprocessor system, stores to the same location have a total order.
  • In a multiprocessor system, locked instructions have a total order.
  • Loads and stores are not reordered with locked instructions.
But that comes with a very important non-guarantee:
  • Loads may be reordered with older stores to different locations

Because producer_idx and consumer_is_sleeping are different variables, the load to consumer_is_sleeping may be reordred by the processor and execute before the store to producer_idx, which could cause the consumer to sleep with work available, and not be woken for it.


On Wed, Mar 21, 2018 at 10:44 AM, Avi Kivity <a...@scylladb.com> wrote:

The mfence was not in the SPSC queue implementation, but in the code that allows wakes a potentially sleeping MPSC consumer.


  producer_idx.store(..., memory_order_release)

  if (consumer_is_sleeping) wake_it_up()


an mfence (or a clever trick) is required between these two lines.


On 03/19/2018 04:41 PM, Dan Eloff wrote:
We're getting a little confused on the terminology. That's a compiler barrier, as it prevents the compiler from reordering certain instructions beyond it (I don't think relaxed prevents any reordering, but release and acquire do.) I know you understand this stuff given your background, I just want to clarify the terminology for the sake of the discussion.

The original post and article discuss real memory barriers like mfence. These prevent the CPU from reordering loads and stores. Which should be unnecessary for SPSC queues on x86 because it gives strong enough guarantees about reordering, in this case, without that.


On Mon, Mar 19, 2018, 1:19 AM Avi Kivity <a...@scylladb.com> wrote:

The release write is a memory barrier. It's not an SFENCE or another fancy instruction, but it is a memory barrier from the application writer's point of view.


The C++ code


    x.store(5, std::memory_order_relaxed)


has two effects on x86:

  1. generate a write to x that is a single instruction (e.g. mov $5, x)
  2. prevent preceding writes from being reordered by the compiler (they are implicitly ordered by the processor on x86).



On 03/18/2018 08:16 PM, Dan Eloff wrote:
You don't need memory barriers to implement an SPSC queue for x86. You can do a relaxed store to the queue followed by a release write to producer_idx. As long as consumer begins with an acquire load from producer_idx it is guaranteed to see all stores to the queue memory before producer_idx, according to the happens before ordering. There are no memory barriers on x86 for acquire/release semantics.

The release/acquire semantics have no meaning when used with different memory locations, but if used on producer_idx when synchronizing the consumer, and consumer_idx when synchronizing the producer, it should work.


On Thu, Feb 15, 2018 at 8:29 AM, Avi Kivity <a...@scylladb.com> wrote:
Ever see mfence (aka full memory barrier, or std::memory_order_seq_cst) taking the top row in a profile? Here's the complicated story of how we took it down:


https://www.scylladb.com/2018/02/15/memory-barriers-seastar-linux/


--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.
--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.
--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

Gil Tene

unread,
Mar 21, 2018, 5:27:54 PM3/21/18
to mechanical-sympathy
Daniel, if what you originally meant is "You don't need memory barriers that incur any costs on x86 to implement an SPSC queue.", instead of the stated "You don't need memory barriers to implement an SPSC queue for x86.", you and Avi may be on the same page. I think.

Barriers are barriers no matter what level you express them in. Arguably, all source-code-stated barriers are "compiler barriers". And from an expression point of view, most barriers are "language memory model barriers" (this is true even in C or MASM when calling implementation-specific things that establish a required (or the more common I-sure-hope-the-compiler-listens-to-me-and-doesn't-mess-with-this) ordering. At the language/source-code levels some barriers might be explicit and some are implicit (e.g. a read from a volatile field in Java includes implicit barrier semantics, but a VarHandle.loadLoadFence() is an explicit barrier). The compiler (or macro assembler, etc.) will adhere to the language semantics barriers in its own choices on ordering and potential elimination of code (before any CPU instructions actually get emitted), and it may translate some of those semantic barriers to machine instructions that enforce them (stated or implicit). The x86 memory model includes all sorts of implicit and "free" barriers. e.g. x86 instructions generally imply load-load, and load-store, and store-store, but not store-load (Only specific x86 instructions establish a store-load ordering). 

On x86, compilers don't need to emit barrier instruction in order to maintain load-load or load-store, or store-store ordering, or to enforce acquire fences (loadLoad and loadStore combined) and release fences (loadStore and storeStore combined). But the compilers themselves still need to know what fences/barriers they need to enforce in their own code-jumbling. E.g. without a source-code-semantics barrier requiring store-store ordering, the compiler may feely reorder any two stores in your code, and emit the stores in that new order, which can easily wreck the correctness of otherwise very-fast-on-x86 SPSC queue code, even on x86.

None of this opines on whether or not SPSC can be done without a store-load barrier so that on x86 no barrier instructions would be needed. I don't know if it can.

On Monday, March 19, 2018 at 7:41:41 AM UTC-7, Daniel Eloff wrote:
We're getting a little confused on the terminology. That's a compiler barrier, as it prevents the compiler from reordering certain instructions beyond it (I don't think relaxed prevents any reordering, but release and acquire do.) I know you understand this stuff given your background, I just want to clarify the terminology for the sake of the discussion.

The original post and article discuss real memory barriers like mfence. These prevent the CPU from reordering loads and stores. Which should be unnecessary for SPSC queues on x86 because it gives strong enough guarantees about reordering, in this case, without that.


On Mon, Mar 19, 2018, 1:19 AM Avi Kivity <a...@scylladb.com> wrote:

The release write is a memory barrier. It's not an SFENCE or another fancy instruction, but it is a memory barrier from the application writer's point of view.


The C++ code


    x.store(5, std::memory_order_relaxed)


has two effects on x86:

  1. generate a write to x that is a single instruction (e.g. mov $5, x)
  2. prevent preceding writes from being reordered by the compiler (they are implicitly ordered by the processor on x86).



On 03/18/2018 08:16 PM, Dan Eloff wrote:
You don't need memory barriers to implement an SPSC queue for x86. You can do a relaxed store to the queue followed by a release write to producer_idx. As long as consumer begins with an acquire load from producer_idx it is guaranteed to see all stores to the queue memory before producer_idx, according to the happens before ordering. There are no memory barriers on x86 for acquire/release semantics.

The release/acquire semantics have no meaning when used with different memory locations, but if used on producer_idx when synchronizing the consumer, and consumer_idx when synchronizing the producer, it should work.


On Thu, Feb 15, 2018 at 8:29 AM, Avi Kivity <a...@scylladb.com> wrote:
Ever see mfence (aka full memory barrier, or std::memory_order_seq_cst) taking the top row in a profile? Here's the complicated story of how we took it down:


https://www.scylladb.com/2018/02/15/memory-barriers-seastar-linux/


--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.
--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

Dan Eloff

unread,
Mar 22, 2018, 12:41:55 AM3/22/18
to mechanica...@googlegroups.com
Hi Gil,

Yes, you summed up what I've been trying to communicate, but much more eloquently. I think we're all on the same page. With any concurrent access across threads you really need some kind of compiler guarantees / barrier around re-ordering and optimizing/altering of instructions. Crossing fingers isn't a good approach, data races are fundamentally unsafe.

Sometimes this requires adding "real" fencing/ordering instructions so the CPU doesn't re-order things, often on x86 this can be avoided because x86 gives very strong guarantees around instruction reordering by default.

So for any kind of concurrent data structure, like an SPSC queue, you're going to need compiler barriers like std atomics offer with release/acquire, relaxed, etc. For the specific case of SPSC queues on x86 this is enough. At least with some common designs no additional fencing instructions are required for the CPU. However, once you add putting a thread to sleep and waking it up, then you do really need additional fencing as Avi demonstrated. Or at least I don't see any desirable alternative.

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsubscribe...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.
--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsubscribe...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.
Reply all
Reply to author
Forward
0 new messages