
Under the Hood of C++11 Memory Order


kennetha...@gmail.com

Mar 23, 2015, 8:21:32 PM
For some objectives of mine, I think it's vital to understand what's happening at the hardware level when C++11's memory order is used.

It sounds to me like memory order imposes some specifics on what the compiler generates as far as cache flushing goes, which is *very* important to me. In fact, I'd like to be able to control exactly when the cache is flushed or reloaded, and to prevent memory operations from triggering cache reads.

Can someone tell me more about how C++ 11's new memory order is implemented at a low level? And can someone tell me how to achieve that goal of cache control?

Richard Damon

Mar 24, 2015, 12:15:32 AM
Fundamentally, C++11's new memory order doesn't say anything about a
"cache"; it doesn't mention a cache at all, and doesn't even require
that there BE a cache. The standard doesn't speak in terms of a cache,
but in terms of when changes can't/may/must be visible to another piece
of code.

Practically, to implement the imposed semantics on a processor with a
cache, there will be things that need to be done, but those are fairly
processor- and implementation-specific.

If you really need to know/control the details of what is happening, you
are going to need to get down into assembly and/or details of the
implementation.

kennetha...@gmail.com

Mar 24, 2015, 1:45:17 AM
On Tuesday, March 24, 2015 at 12:15:32 AM UTC-4, Richard Damon wrote:
I'll be on x86/64 and ARM, but that's all: first x86/64, then ARM. ARM is a distant second priority.

Christian Gollwitzer

Mar 24, 2015, 4:06:30 AM
On 24.03.15 at 06:44, kennetha...@gmail.com wrote:
> On Tuesday, March 24, 2015 at 12:15:32 AM UTC-4, Richard Damon wrote:
>> On 3/23/15 8:21 PM, ken***@gmail.com wrote:
>>> Can someone tell me more about how C++ 11's new memory order is
>>> implemented at a low level? And can someone tell me how to achieve
>>> that goal of cache control?
>>
>> If you really need to know/control the details of what is happening, you
>> are going to need to get down into assembly and/or details of the
>> implementation.
>
> I'll be on x86/64 and arm. But that's all; first x86/64 then arm. Arm is a second, distant priority.
>

If you have a good understanding of the corresponding assembly, maybe
just compiling test programs and looking at the assembly output will
help? E.g. with gcc, do g++ -S test.cpp and inspect test.s. Play with
the optimization flags; at least -O1 should eliminate many trivial
function calls into template code.
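For example, a minimal probe file along these lines (the names here are just illustrative) makes it easy to see what each ordering costs:

```cpp
// test.cpp - compile with: g++ -S -O1 test.cpp   then inspect test.s
#include <atomic>

std::atomic<int> flag{0};

// Each function isolates one ordering so its assembly is easy to find.
void store_relaxed() { flag.store(1, std::memory_order_relaxed); }
void store_release() { flag.store(1, std::memory_order_release); }
void store_seq_cst() { flag.store(1, std::memory_order_seq_cst); }

int load_acquire() { return flag.load(std::memory_order_acquire); }
```

On x86/64 with gcc, the relaxed and release stores typically come out as a plain MOV, while the seq_cst store becomes an XCHG (or MOV plus MFENCE), which is exactly the kind of difference -S makes visible.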

Christian

Nobody

Mar 24, 2015, 4:17:25 AM
Memory order is typically unrelated to caching. What it actually does is
to:

a) cause certain operations to be atomic, either by using atomic CPU
instructions (if such an instruction exists for the operation in question)
or by guarding non-atomic instructions or instruction sequences with
mutual exclusion primitives,

b) prevent the compiler from re-ordering instructions in a way that would
be safe for single-threaded code but unsafe for multi-threaded code, and

c) prevent the CPU from re-ordering instructions in a way that would be
safe for single-threaded code but unsafe for multi-threaded code.

Caching only comes into the picture if you have multiple caches which are
not automatically synchronised (e.g. via bus snooping). In that case,
the compiler may need to place atomic variables in uncached memory, or add
explicit flushes.
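Point (a) can be seen with a plain shared counter: a non-atomic increment from several threads loses updates, while an atomic fetch_add compiles to an atomic read-modify-write instruction (LOCK ADD on x86). A sketch, with illustrative names:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Several threads increment one counter. Because fetch_add is an
// atomic read-modify-write, no increments are lost; with a plain
// 'long' this would be a data race and typically come up short.
long atomic_count(int threads, int iters) {
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t)
        workers.emplace_back([&] {
            for (int i = 0; i < iters; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& w : workers) w.join();
    return counter.load();
}
```

atomic_count(4, 100000) reliably returns 400000; even memory_order_relaxed suffices here, because only the atomicity of the increment matters, not its ordering against other memory operations.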

Chris Vine

Mar 24, 2015, 9:24:16 AM
You cannot control the instruction caches or data caches in the way you
hope for, and cache flushing can occur quite independently of memory
ordering between threads, such as by doing memory operations which cause
a cache miss, or by having a context switch.

Even if you use relaxed memory ordering you cannot control the time at
which cores synchronize (if that is what you really mean) because even
with relaxed memory ordering C++11 requires visibility "within a
reasonable amount of time" (§29.3/13), whatever that may be taken to
imply.

You can of course do things which will provoke memory synchronization,
such as by issuing fence instructions or by carrying out a C++11
acquire and release operation (note caches are always coherent anyway,
in x86/64 at least). But to the best of my knowledge there is no
instruction to _prevent_ caches doing their thing.

You might want to read this:

http://mechanical-sympathy.blogspot.co.uk/2013/02/cpu-cache-flushing-fallacy.html

Chris

kennetha...@gmail.com

Mar 24, 2015, 10:49:20 AM
Oh, OK. I had hoped that by making my worker threads locally proximal to one another, caches could be flushed in a coherent way extremely quickly. Say threads one and two share a portion of memory: before that memory goes out, what if another core could do a data-structure-coherent merge of the data? I thought that if I posed the problem so that the work was associative and commutative, then no matter the order of thread precedence, the final result would be correct.

Now it seems like cache control isn't something you directly manage, but something the processor does for you; all you can do is try to get the compiler to select instructions suited to what you want, and hopefully reduce unnecessary processor work as a result.

Please let me know your thoughts :)

Öö Tiib

Mar 24, 2015, 11:53:33 AM
If your algorithm involves several threads sharing and mutating the
same portion of memory, then it is not a scalable algorithm, and if
you want to improve that, the easiest route is to reduce such
sharing.

> Now it seems like cache control isn't something you directly manage,
> but is done by the processor for you, and you can only try to get the
> compiler to select specific instructions that are suited to what you
> want and hopefully reduce unnecessary processor work as a result
> of those operations.
>
> Please let me know your thoughts :)

Profile and use the usual techniques of reducing cache misses (like
improving locality, merging arrays, loop interchange, loop fusion) and
leave the rest up to the compiler and hardware to deal with.
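As an illustration of one of those techniques (loop interchange): summing a row-major matrix column-by-column strides through memory, while row-by-row walks it sequentially. Both loops below compute the same sum; only the access pattern differs:

```cpp
#include <vector>
#include <cstddef>

// Sum a row-major n x n matrix. The cache-friendly version touches
// memory sequentially (stride 1); the unfriendly one jumps n elements
// per access, causing far more cache misses for large n.
double sum_row_major(const std::vector<double>& m, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)        // rows outer: stride-1
        for (std::size_t j = 0; j < n; ++j)
            s += m[i * n + j];
    return s;
}

double sum_col_major(const std::vector<double>& m, std::size_t n) {
    double s = 0.0;
    for (std::size_t j = 0; j < n; ++j)        // columns outer: stride-n
        for (std::size_t i = 0; i < n; ++i)
            s += m[i * n + j];
    return s;
}
```

Since the result is identical, the compiler and hardware are free to make the stride-1 version fast (prefetching, vectorization) in a way they cannot for the strided one.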

Chris Vine

Mar 24, 2015, 7:37:02 PM
On Tue, 24 Mar 2015 07:48:41 -0700 (PDT)
kennetha...@gmail.com wrote:
[snip]
> Oh ok. I had hoped that by trying to make my worker threads locally
> proximal to one another that caches could be flushed in a coherent
> way extremely quickly. Like, say threads one and two share a portion
> of memory. Before that memory goes out, what if another core could do
> a data structure coherent merge of the data? I thought that if I
> postured the problem so that work was associative and commutative,
> then no matter the order of thread precedence, the final result would
> be correct.
>
> Now it seems like cache control isn't something you directly manage,
> but is done by the processor for you, and you can only try to get the
> compiler to select specific instructions that are suited to what you
> want and hopefully reduce unnecessary processor work as a result of
> those operations.
>
> Please let me know your thoughts :)

If you are trying to avoid the costs of memory synchronization by
resorting to tricks with caches you are doomed. It won't work.

The only way out is to design your algorithms to minimize contention,
by avoiding shared data as much as possible. Where some data sharing
is inevitable and is adversely affecting scalability, examine whether
there is a lock free data structure available to do what you want. Then
again, there is lock free and there is lock free: release-acquire
atomics basically come free at the hardware level on x86/64 because
x86/64 is strongly ordered - the costs over relaxed memory ordering
mainly come down to the additional restrictions imposed on compiler
reordering when optimizing. However, sequential consistency is the
default for C++11 atomics and does not come free (on x86/64 an mfence
instruction is needed which imposes synchronization overhead).

(As an aside, in my view, 99% of all uses of atomics only need
release-acquire semantics in a well designed program, and contrary to
what some people say, release-acquire semantics for atomics are
perfectly easy to reason about: you just have to realize that you are
only synchronizing between the two threads carrying out a particular
release (store) and acquire (load) operation, and not enforcing a total
memory order for all reads and writes of atomic variables in the
program.)
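That pairwise release-acquire synchronization can be sketched as a minimal message-passing example (names here are purely illustrative):

```cpp
#include <atomic>
#include <thread>

int payload = 0;                     // plain, non-atomic data
std::atomic<bool> ready{false};

// The producer's release store publishes the plain write; the
// consumer's acquire load that reads 'true' synchronizes-with that
// store, so the subsequent read of payload is guaranteed to see 42.
int message_pass() {
    std::thread producer([] {
        payload = 42;                                   // (1) happens-before (2)
        ready.store(true, std::memory_order_release);   // (2) release
    });
    int seen = 0;
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire))  // (3) pairs with (2)
            ;                                           // spin until published
        seen = payload;                                 // (4) sees (1)
    });
    producer.join();
    consumer.join();
    return seen;
}
```

Note the guarantee runs only between these two threads, through this one store/load pair; a third thread observing `ready` with relaxed ordering gets no ordering guarantee about `payload`, which is exactly the point made above.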

Foremost, profile to see what the problem really is.

Chris

Scott Lurndal

Mar 25, 2015, 1:53:37 PM
kennetha...@gmail.com writes:
>So, I think that it's vital to some objectives of mine in order to
>understand what's happening at the hardware level when C++11's memory
>order is used.
>
>It seems to me that it sounds like memory order is imposing some
>specifics on what the compiler generates as far as cache flushing goes,
>which is *very* important to me. In fact, I'd like to be able to control
>exactly when the cache is flushed or reloaded, and be able to prevent
>memory operations from triggering cache reads.

Good luck. The cache is completely controlled by hardware. You have
no control over it (other than limited flush capabilities) from user-mode
code.

It certainly has nothing to do with C++11 memory ordering, which relates
more to the processor memory model (i.e. when loads can be reordered, or
loads ordered with respect to stores, or stores reordered with respect to
loads).


Vir Campestris

Mar 25, 2015, 5:54:43 PM
On 24/03/2015 23:36, Chris Vine wrote:
> If you are trying to avoid the costs of memory synchronization by
> resorting to tricks with caches you are doomed. It won't work.

Oh, it's worse than that. It might work most of the time...

Andy

Chris M. Thomasson

Mar 29, 2015, 4:31:14 PM
> kennetha...@gmail.com wrote in message
> news:23cba108-ca07-42b7...@googlegroups.com...

> So, I think that it's vital to some objectives of mine in order to
> understand
> what's happening at the hardware level when C++11's memory order is used.


> It seems to me that it sounds like memory order is imposing some specifics
> on what the compiler generates as far as cache flushing goes, which is
> *very*
> important to me. In fact, I'd like to be able to control exactly when the
> cache
> is flushed or reloaded, and be able to prevent memory operations from
> triggering cache reads.


> Can someone tell me more about how C++ 11's new memory order is
> implemented at a low level?

Basically, they are implemented using whatever it takes to achieve the
memory ordering constraints you specify. Just beware of
memory_order_consume... it can be tricky to get right. AFAICT, one of
the main reasons it exists is to support Read-Copy-Update (RCU)...


> And can someone tell me how to achieve that goal of cache control?

The cache control that you want is beyond your control at this level;
actually, forget about cache, and think about visibility. However, you
can design your data structures to be "cache friendly", so to speak.
Basically, try really hard to get around false-sharing.

Something along the lines of:

https://groups.google.com/d/msg/comp.programming.threads/kR2OxyF5IAg/rRgHOIrcNoAJ
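A common way to get around false sharing is to pad or align each thread's hot data onto its own cache line (64 bytes is an assumption that holds on typical x86/64 parts); a sketch with illustrative names:

```cpp
#include <atomic>
#include <thread>

// alignas(64) forces each counter onto its own (assumed 64-byte) cache
// line, so the two threads below never invalidate each other's line
// while hammering their own counter - no false sharing.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

static_assert(sizeof(PaddedCounter) >= 64, "padding to a full line");

static PaddedCounter counters[2];

long run_padded(int iters) {
    std::thread a([&] {
        for (int i = 0; i < iters; ++i)
            counters[0].value.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread b([&] {
        for (int i = 0; i < iters; ++i)
            counters[1].value.fetch_add(1, std::memory_order_relaxed);
    });
    a.join();
    b.join();
    return counters[0].value.load() + counters[1].value.load();
}
```

Without the alignas, both counters would typically land in the same cache line and the hardware would ping-pong ownership of that line between cores on every increment; the result is identical either way, only the cost changes.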


I can go into more detail if you want. But I am a bit time constrained right
now.

Sorry!

;^o
