Why not a FULL BARRIERS memory order besides memory_order_seq_cst?


Nemo Yu

Feb 2, 2016, 8:31:25 AM2/2/16
to std-dis...@isocpp.org
1) memory_order_seq_cst is stronger than full barriers (LoadLoad, StoreStore, StoreLoad, LoadStore); why doesn't C++ provide a full barrier? I don't know of a practical architecture where "sequentially consistent" != full barrier, but let's assume one exists, without loss of generality.

2) I found one possible implementation to achieve a StoreLoad barrier:

fetch_add(&addr, 0, memory_order_release);

How does it work?

3) With respect to 1), when will the following case happen?
  • thread 1 writes: a=1; b=2;
  • thread 2 sees: a=1 then b=2;
  • thread 3 sees: b=1 then a=1;
I would appreciate details about a specific scenario, such as architecture/intrinsics, etc.

4) What's the difference in practice between Acquire/Release fences and Acquire/Release operations (e.g. a load-acquire)? AFAIK the implementation is the same as with memory fences, although they are not equal in the standard.

Thiago Macieira

Feb 2, 2016, 10:02:20 PM2/2/16
to std-dis...@isocpp.org
On Tuesday 02 February 2016 21:31:23 Nemo Yu wrote:
> 1) memory_order_seq_cst is stronger than full barriers(LoadLoad,
> StoreStore, StoreLoad, LoadStore), why C++ doesn't provide a full-barrier?

That's memory_order_acq_rel.

> 2) I found one possible implementation to achieve a StoreLoad barrier:
>
> fetch_add(&addr, 0, memory_order_release);
>
> How does it work?

You need to ask your CPU vendor how they implemented it, assuming they have a
memory order different from a full barrier.

For example, on IA-64, the above could be implemented by a fetchadd4.rel
instruction.

> 3) With respect to 1), when will the following case happen?
>
> - thread 1 writes: a=1; b=2;
> - thread 2 sees: a=1 then b=2;
> - thread 3 sees: b=1 then a=1;
>
> I will appreciate it if you can give a detail about specific scenario, such
> like architecture/intrinsics etc..

Please be more specific. This scenario doesn't make sense because of lack of
information. Please specify:
a) what types are a and b (I assume we're talking about atomic<int>)
b) what you meant by "write". Did you mean store-relaxed, store-release or
store-CST?
c) what the values of a and b were before all of this happened.

If I assume a = b = 0 when everything started, then the case above will never
happen because b is never assigned the value of 1. The *principle* of atomics
is that you can never see an intermediate value, so any observer of b must see
either 0 or 2, never something else.

> 4) What's the difference in practice between Acquire/Release fences and
> Acquire/Release operations(e.g. a load with Acquire/Release)? AFAIK the
> implementation is the same with memory fences, although they are not equal
> in the standard.

On x86, no difference. The LFENCE/SFENCE/MFENCE instructions are not useful on
main memory (cache-backed). They're only used for uncached memory (MMIO), so
compilers do not need to emit them (GCC does anyway).

On most architectures where memory order does matter, instructions either have
an associated order by themselves or there's an extra instruction to do the
fence. Taking the example of IA-64:

* there's ldN and ldN.acq; stN and stN.rel
* there's fetchaddN.acq and fetchaddN.rel, cmpxchg.acq and cmpxchg.rel
* xchgN always has acquire semantics

If you want to implement any order stricter than what the instruction permits,
you insert an mf to force a full barrier.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
PGP/GPG: 0x6EF45358; fingerprint:
E067 918B B660 DBD1 105C 966C 33F5 F005 6EF4 5358

Nemo Yu

Feb 2, 2016, 11:13:39 PM2/2/16
to std-dis...@isocpp.org
> 1) memory_order_seq_cst is stronger than full barriers(LoadLoad,
> StoreStore, StoreLoad, LoadStore), why C++ doesn't provide a full-barrier?

That's memory_order_acq_rel.

Sorry, "full barriers" was not the right term there. By "full barrier" I mean LoadLoad+StoreStore+StoreLoad+LoadStore. acq_rel only provides LoadLoad+StoreStore+LoadStore, so it lacks the StoreLoad needed to complete the "full barrier" I meant in context.

> 2) I found one possible implementation to achieve a StoreLoad barrier:
>
> fetch_add(&addr, 0, memory_order_release);
>
>     How does it work?

> You need to ask your CPU vendor how they implemented it. Assuming they have a
> memory order different from full barrier.


C/C++11 Operation    x86 implementation
Load Seq_Cst         LOCK XADD(0)  // alternative: MFENCE, MOV (from memory)
Store Seq_Cst        MOV (into memory)


> Please be more specific. This scenario doesn't make sense because of lack of
> information. Please specify:

That was a typo. Corrected:

3) With respect to 1), when will the following case happen?
  • a = b = 0; threads 1, 2, and 3 start at the same time.
  • thread 1 writes: a=1; b=1;
  • thread 2 sees: a=1 then b=1;
  • thread 3 sees: b=1 then a=1;
All operations are relaxed. The question is about the expense of memory/cache synchronization.


Thiago Macieira

Feb 3, 2016, 12:08:33 AM2/3/16
to std-dis...@isocpp.org
On Wednesday, 3 February 2016, at 12:13:36 PST, Nemo Yu wrote:
> > Please be more specific. This scenario doesn't make sense because of lack
> > of
> > information. Please specify:
> It is a typo. Correct to this:
>
> 3) With respect to 1), when will the following case happen?
>
> - a=b=0; thread 1, 2, 3 starts at the same time.
> - thread 1 writes: a=1; b=1;
> - thread 2 sees: a=1 then b=1;
> - thread 3 sees: b=1 then a=1;
>
> All operations are relaxed. The question is about the expense of
> memory/caches synchronization.

Please reread your post before sending and check that it is correct.

The case above is the trivial one when the stores have happened before threads
2 and 3 got to read a and b.

Nemo Yu

Feb 3, 2016, 12:27:13 AM2/3/16
to std-dis...@isocpp.org
My expression was unclear; let me illustrate:

int a=0; b=0; // not atomics
Thread 1: a.store(1); b.store(1);
Thread 2: if(b.load()==1) a.store(2);
Thread 3: if(a.load()==2) assert(b.load()==1);


(the example above comes from
http://en.cppreference.com/w/cpp/atomic/memory_order)

which says the assert can fail if loads/stores are not sequentially consistent. It shows the delay in value propagation (when will it happen?). Here is another example:

int a=0; b=0; // not atomics
Thread 1: a.store(1); b.store(1); // all relaxed
Thread 2: if(b.load()==1) assert(a.load()==1); // all relaxed
Thread 3: if(a.load()==1) assert(b.load()==1); // all relaxed

I think both asserts can fail. In other words, when will threads 2 and 3 see the stores in different orders?


Thiago Macieira

Feb 3, 2016, 12:56:04 AM2/3/16
to std-dis...@isocpp.org
On Wednesday, 3 February 2016 13:27:11 PST, Nemo Yu wrote:
> It is an unclear expression, illustrate it:
>
> int a=0; b=0; // not atomics

well, obviously they are atomic. You meant
atomic<int> a{0}, b{0}; // not stored atomically

> Thread 1: a.store(1); b.store(1);
> Thread 2: if(b.load()==1) a.store(2);
> Thread 3: if(a.load()==2) assert(b.load()==1);
>
> (the example above comes from
> http://en.cppreference.com/w/cpp/atomic/memory_order)
>
> which says the assert can fail if Loads/Stores are not sequentially
> consistent. It shows the delay of value updating(when will it happen?).

Correct. It is theoretically possible.

Suppose that a and b are located in different cachelines, possibly even
different pages.

Thread 1 will store 1 in both a and b. First of all, since the stores are
relaxed, they could be reordered by the compiler. Assuming they aren't,
they can still be reordered by the processor or the cache controller.

Thread 2 has observed b to have the value 1, which means its local cache
controller obtained the global cacheline for b. Then it performs a store to a.

Thread 3 observes that store to a, but it's theoretically possible that its
cache controller has not yet obtained the update of b. So it can observe the
initial value of b, that is, zero.

This can happen on architectures where some portions of memory are
"closer" to some processors than others.

> I give another example:
>
> int a=0; b=0; // not atomics
> Thread 1: a.store(1); b.store(1); // all relaxed
> Thread 2: if(b.load()==1) assert(a.load()==1); // all relaxed
> Thread 3: if(a.load()==1) assert(b.load()==1); // all relaxed
>
> I think both asserts can fail. In other word, when will the thread 2 and 3
> see different orders of storing?

Thread 3's assert can fail no matter what memory order you use. Your example
has a race condition.

Thread 2's assert can fail because the stores in thread 1 may have been
reordered, or the loads in thread 2 may have been. You can fix this by making
Thread 1 do a store-release for b and Thread 2 do a load-acquire on b.

Giovanni Piero Deretta

Feb 4, 2016, 9:51:52 AM2/4/16
to ISO C++ Standard - Discussion
On Wednesday, February 3, 2016 at 3:02:20 AM UTC, Thiago Macieira wrote:
[...]
> > 4) What's the difference in practice between Acquire/Release fences and
> > Acquire/Release operations(e.g. a load with Acquire/Release)? AFAIK the
> > implementation is the same with memory fences, although they are not equal
> > in the standard.
>
> On x86, no difference. The LFENCE/SFENCE/MFENCE instructions are not useful on
> main memory (cache-backed). They're only used for uncached memory (MMIO), so
> compilers do not need to emit them (GCC does anyway).

It is actually more complicated. First of all, MFENCE is a genuine full barrier (recently documented to be sequentially consistent), so it can be used to implement atomic_thread_fence(seq_cst) or even sequentially consistent stores. In practice, on actual hardware, MFENCE is a bit slower than a locked exchange, which has the same semantics and is used instead.

L/SFENCE are redundant for normal operations, but they are useful for ordering non-temporal loads/stores even on normal (i.e. cached) memory.

-- gpd