How to achieve synchronization between threads in ASM

Genom Genom

Jun 6, 2011, 7:52:30 AM
to Scalable Synchronization Algorithms
Hello,

First, my apologies if I'm in the wrong place, but I thought it could
be relevant to ask for help here.

I'm having fun with multithreading (I'm reading Anthony Williams's
book) on Linux (Ubuntu). I've understood the memory model and atomic
operations. The problem is that I would like to code my own atomic
operations, to learn.

For example, I'm compiling this code (C++0x):

==========================
#include <iostream>
#include <thread>
#include <assert.h>

volatile int x = 0;
int _pad1[4096]; // padding for cache: without padding, the assert does not fire.
volatile int y = 0;
int _pad2[4096];
int z = 0;

void store()
{
    x = 1;
    y = 1;
    //__asm __volatile("MFENCE");
}

void load()
{
    // __asm __volatile("MFENCE");
    while (!x)
        ;
    // __asm __volatile("MFENCE");
    if (y == 1)
        z = 1;
}

int main()
{
    std::thread t1(store); // create thread
    std::thread t2(load);

    t1.join();
    t2.join();

    assert(z == 1);
}

==========================

First of all, I made a first version without fences. On x86, loads
seem to be atomic, but I really do not know how to synchronize
values between threads (how to be sure that load() sees the updated y). I
then learned about MFENCE, LFENCE and SFENCE and made a second version
(the fences are commented out above), but it doesn't work either.

* So, the questions: how can I achieve value synchronisation in ASM,
and what is the simplest method? Is using fences a good approach? Can
it be optimized by other methods in ASM?

* Also, why does the padding make the assert fire?

Regards :)

Nicola Bonelli

Jun 6, 2011, 8:48:31 AM
to lock...@googlegroups.com
Hello,

you are missing the synchronizes-with relation between the two threads.
A barrier in store() with at least release semantics (between
the assignments) is required to synchronize with a barrier with at
least acquire semantics in load().
If x is used to check the condition in load(), then you have to
assign y first, issue the fence and then assign x in store().

The barrier also ensures the happens-before relation :-)

According to your example...

The store() could be something like:

    y = 1;
    __asm __volatile("SFENCE");
    x = 1;

and the load():

    do {
        __asm __volatile("LFENCE");
    } while (!x);

    assert(y == 1);
    z = 1;
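
For comparison, the same ordering (payload first, then the flag, with release on the writer side and acquire on the reader side) can be sketched with the C++0x atomics instead of raw fence instructions. This is only a minimal sketch: it assumes x becomes a std::atomic<int>, and run_once() is a hypothetical driver that is not part of the original program.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x(0);  // flag the reader spins on
int y = 0;              // payload, published through x

void store()
{
    y = 1;                                  // write the payload first
    x.store(1, std::memory_order_release);  // then publish the flag
}

int load()
{
    while (!x.load(std::memory_order_acquire))
        ;          // spin until the flag is visible
    return y;      // ordered after the acquire load, so this reads 1
}

// Hypothetical driver: runs both threads once and returns what load() saw.
int run_once()
{
    x.store(0);
    y = 0;
    int seen = 0;
    std::thread t1(store);
    std::thread t2([&seen] { seen = load(); });
    t1.join();
    t2.join();
    return seen;
}
```

Calling run_once() should return 1 every time, because the release store to x synchronizes with the acquire load that reads it.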

My two cents,
Nicola


--
The difference between theory and practice is bigger in practice than in theory.

Anthony Williams

Jun 6, 2011, 9:05:31 AM
to lock...@googlegroups.com
On 06/06/11 12:52, Genom Genom wrote:
> I'm having fun with multithreading (i'm reading Anthony William's
> book) on linux (ubuntu).

Great!

> I've understood the memory model and atomic
> operations. The problem is i would like to code my own atomic
> operation, to learn.

Getting this right is tricky. You really need to understand what the
processor is doing, and what the compiler is doing.

> For example, i'm compiling this code (c++0x) :
>
> ==========================
> #include<iostream>
> #include<thread>
> #include<assert.h>
>
> volatile int x = 0;
> int _pad1[4096]; //padding for cache : without padding, assert do not
> fire.
> volatile int y = 0;
> int _pad2[4096];
> int z = 0;
> void store()
> {
> x = 1;
> y = 1;
> //__asm __volatile("MFENCE");
> }
>
> void load()
> {
> // __asm __volatile("MFENCE");
> while (!x)
> ;;
> // __asm __volatile("MFENCE");
> if (y == 1)
> z = 1;
> }

x is stored first, so even with the fences, just because x is 1, there
is no guarantee that store() has even written a value to y yet. You need
to swap the assignments to x and y, and put the fence between them.

> int main()
> {
> std::thread t1(store); //create thread
> std::thread t2(load);
>
> t1.join();
> t2.join();
>
> assert(z == 1);
> }
>
> ==========================
>
> First of all, i've made a first version without fences. On x86, it
> loads seems to be atomics, but i really do not know how to synchronize
> values between threads (how to be sure y will be updated, in load). I
> then learned about MFENCE, LFENCE, SFENCE and made a second version
> (fences commented) but doesn't work.

On x86 you often don't need fences. However, you need to tell the
compiler what you're doing, so it doesn't optimize the code in a way
that breaks your expectations.

> * So, the questions : How can i achieve value synchronisation in ASM
> and what is the simplest method ? Is the use of fences is great ? Can
> it be optimized by other methods in ASM ?

For the fences to have any effect on the gcc optimizer, they need to be
annotated as affecting "memory".

__asm__ __volatile__ ("mfence" : : : "memory");

You can optimize the code better by doing everything with __asm__
statements.
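
As a rough sketch of what that looks like with gcc-style inline assembly: the helper names compiler_barrier() and full_fence() are made up for illustration, and the non-x86 fallback to gcc's __sync_synchronize() builtin is an assumption added so the sketch compiles elsewhere. It relies on gcc and x86 behaviour for volatile variables, not on the C++0x memory model.

```cpp
#include <cassert>

// Compiler-only barrier: emits no instruction, but the "memory" clobber
// stops gcc from moving memory accesses across it.
inline void compiler_barrier()
{
    __asm__ __volatile__("" : : : "memory");
}

// Full hardware fence. MFENCE exists only on x86/AMD64; on other targets
// this sketch falls back to gcc's __sync_synchronize() (an assumption).
inline void full_fence()
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("mfence" : : : "memory");
#else
    __sync_synchronize();
#endif
}

volatile int x = 0;  // flag
volatile int y = 0;  // payload

void store()
{
    y = 1;         // payload first
    full_fence();  // keep the two stores in this order
    x = 1;         // flag last
}

int load()
{
    while (!x)
        ;          // spin on the flag
    full_fence();  // keep the read of y after the read of x
    return y;      // 1 once the flag has been seen
}
```

With store() running in one thread and load() in another, load() is expected to return 1 on x86. compiler_barrier() is shown only to make the difference visible: it constrains the optimizer but generates no fence instruction.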

> * Also, why padding make the assert fire ?

Without the padding, x and y are on the same cache line, so they will
likely be written back to main memory together (though the compiler/cpu
might flush the cache line after the write of the first variable). With
the padding, they are on separate cache lines, so will be written
separately. Consequently, the assert will fire more often.
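
A less wasteful way to get the same separation than a 4096-int array is to align each variable to a cache line. The sketch below assumes a 64-byte line, which is typical for x86 but not guaranteed; kCacheLine, Shared and line_of() are illustrative names only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size; 64 bytes is typical for x86 but not guaranteed.
constexpr std::size_t kCacheLine = 64;

struct Shared
{
    alignas(kCacheLine) volatile int x;  // starts a new cache line
    alignas(kCacheLine) volatile int y;  // starts the next one
};

Shared shared = {0, 0};

// Index of the (assumed) cache line that holds the given address.
std::uintptr_t line_of(const volatile void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
}

// True when x and y cannot share a cache line of the assumed size.
bool on_separate_lines()
{
    return line_of(&shared.x) != line_of(&shared.y);
}
```

Removing the two alignas specifiers puts x and y next to each other again, which corresponds to the unpadded version of the test.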

If you are serious about learning how to write atomics, read the Intel
CPU manuals, especially volume 3A:
http://www.intel.com/products/processor/manuals/

Also, read the AMD manuals:
http://developer.amd.com/documentation/guides/Pages/default.aspx#manuals

Anthony
--
Author of C++ Concurrency in Action http://www.stdthread.co.uk/book/
just::thread C++0x thread library http://www.stdthread.co.uk
Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk
15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK. Company No. 5478976

Nikos Anastopoulos

Jun 6, 2011, 9:33:14 PM
to lock...@googlegroups.com

> On x86 you often don't need fences. However, you need to tell the
> compiler what you're doing, so it doesn't optimize the code in a way
> that breaks your expectations.

Apart from preventing hardware and compiler optimizations (reordering), fences are supposed to make memory operations globally visible. For example, store fences "guarantee that every store instruction that precedes in program order the SFENCE instruction is globally visible before any store instruction that follows the SFENCE instruction is globally visible" (taken from the x86 instruction set manual). I am not sure whether Intel documents how this is actually implemented in its processors. Does a write-buffer flush occur? And if not, how else can global visibility be guaranteed?

In an interesting article about QPI (http://blogs.oracle.com/dave/entry/qpi_quiescence), Dave Dice writes:
"To allay a common misconception it's worth pointing out that barriers (...) are typically implemented as processor-local operations and *don't cause any distinguished action on the bus or interconnect* and instead simply instruct the processor to ensure that prior stores become visible before subsequent loads (...). That is, they don't force anything to happen -- such as coherence messages on the bus -- that were not already destined to occur. Instead, they simply enforce an order, momentarily reconciling program and memory order. Crucially, at least with current x86 and SPARC implementations, barriers *don't force anything to occur off-processor*. That also means they don't impede or impair scalability."

Don't these excerpts in bold somehow contradict the assumption ("misconception"?) about a write-buffer flush?

Anthony Williams

Jun 7, 2011, 4:21:07 AM
to lock...@googlegroups.com
On 07/06/11 02:33, Nikos Anastopoulos wrote:
>> On x86 you often don't need fences. However, you need to tell the
>> compiler what you're doing, so it doesn't optimize the code in a way
>> that breaks your expectations.
>
> Apart from preventing hardware and compiler optimizations (reordering),
> fences are supposed to make memory operations globally visible. For
> example, store fences "guarantee that every store instruction that
> precedes in program order the SFENCE instruction is globally visible
> before any store instruction that follows the SFENCE instruction is
> globally visible" (taken from x86 instruction set manual). I am not
> sure if Intel documents how this is actually implemented in its
> processors. Does a write-buffer flush occur? And if not, how else global
> visibility can be guaranteed?

SFENCE ensures that the values written before the fence are flushed to
memory before those written afterwards. It doesn't mean the flush has to
happen at a particular time though.

It is particularly useful with "non temporal" stores (MOVNTI and
friends), which otherwise are completely unordered.
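
A hedged sketch of that use case: a buffer is filled with non-temporal stores through the _mm_stream_si32 intrinsic (which maps to MOVNTI), then _mm_sfence() is issued before the flag that publishes the buffer. The names buffer, ready, publish() and consume() are invented for the example, and the plain-store fallback for targets without SSE2 is an assumption added only so the sketch compiles everywhere.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#if defined(__SSE2__)
#include <immintrin.h>  // _mm_stream_si32, _mm_sfence
#endif

const std::size_t kCount = 16;
int buffer[kCount];
std::atomic<bool> ready(false);

void publish()
{
    for (std::size_t i = 0; i < kCount; ++i) {
#if defined(__SSE2__)
        _mm_stream_si32(&buffer[i], static_cast<int>(i));  // MOVNTI
#else
        buffer[i] = static_cast<int>(i);  // plain store on other targets
#endif
    }
#if defined(__SSE2__)
    _mm_sfence();  // order the non-temporal stores before the flag store
#endif
    ready.store(true, std::memory_order_release);
}

int consume()
{
    while (!ready.load(std::memory_order_acquire))
        ;  // wait for the buffer to be published
    int sum = 0;
    for (std::size_t i = 0; i < kCount; ++i)
        sum += buffer[i];
    return sum;  // 0 + 1 + ... + 15 = 120
}
```

Without the SFENCE, the non-temporal stores are not ordered against the later store to ready, so a reader could see the flag before the data.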

Note I said you **often** don't need fences. Sometimes you do. For
ordering of unrelated writes (sequential consistency) you need to either
use a LOCKed instruction (such as XCHG) for your stores, or you need to
use MFENCE.
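
The classic case is two threads that each store to one variable and then load the other. A minimal sketch with C++0x atomics follows; the sequentially consistent stores are what need XCHG or MOV plus MFENCE on x86. The names a, b, r1, r2 and both_saw_zero() are illustrative only.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> a(0), b(0);
int r1 = 0, r2 = 0;

void thread1()
{
    a.store(1, std::memory_order_seq_cst);   // needs XCHG or MOV + MFENCE on x86
    r1 = b.load(std::memory_order_seq_cst);  // must not be satisfied before the store
}

void thread2()
{
    b.store(1, std::memory_order_seq_cst);
    r2 = a.load(std::memory_order_seq_cst);
}

// Runs both threads once. Returns true if both read 0, which
// sequential consistency forbids.
bool both_saw_zero()
{
    a.store(0);
    b.store(0);
    std::thread t1(thread1);
    std::thread t2(thread2);
    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;
}
```

With plain MOV stores and no fence, x86 allows each load to complete while the other thread's store is still in its write buffer, so both threads could read 0. With seq_cst operations both_saw_zero() should never return true.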

> On an interesting article about QPI
> (http://blogs.oracle.com/dave/entry/qpi_quiescence ) Dave Dice writes:
> "To allay a common misconception it's worth pointing out that barriers
> (...) are typically implemented as processor-local operations and *don't
> cause any distinguished action on the bus or interconnect* and instead
> simply instruct the processor to ensure that prior stores become visible
> before subsequent loads (...). That is, they don't force anything to
> happen -- such as coherence messages on the bus -- that were not already
> destined to occur. Instead, they simply enforce an order, momentarily
> reconciling program and memory order. Crucially, at least with current
> x86 and SPARC implementations, barriers *don't force anything to occur
> off-processor*. That also means they don't impede or impair scalability."
>
> Don't these excerpts in bold contradict somehow the assumption
> ("misconception"?) about write-buffer flush?

The write buffer is not necessarily flushed to main memory; fences just
enforce the ORDER in which things are flushed.

Kimo Crossman

Jun 7, 2011, 4:23:04 AM
to lock...@googlegroups.com
I just want to say this discussion is fascinating - thank you everyone. :-)

Nikos Anastopoulos

Jun 7, 2011, 5:41:47 AM
to lock...@googlegroups.com
2011/6/7 Anthony Williams <antho...@gmail.com>

> SFENCE ensures that the values written before the fence are flushed to
> memory before those written afterwards. It doesn't mean the flush has to
> happen at a particular time though.

So, as far as I understand, the coherence protocol is responsible for making writes visible in the order they left the write buffer. Correct?

Anthony Williams

Jun 7, 2011, 5:46:50 AM
to lock...@googlegroups.com
On 07/06/11 10:41, Nikos Anastopoulos wrote:
> 2011/6/7 Anthony Williams <antho...@gmail.com>

Yes, that matches my understanding.

Genom Genom

Jun 9, 2011, 3:58:54 AM
to Scalable Synchronization Algorithms
Thank you everybody! That's exactly what I was looking for!

Nicola Bonelli

Jun 15, 2011, 3:38:04 PM
to lock...@googlegroups.com
Hello,

even if this is a typical case for using acquire/release semantics
(Anthony, correct me if I'm wrong), it turns out that if we go native,
the use of memory fences is not required on AMD64 architectures.

Loads are indeed not reordered with other loads, and stores are not
reordered with other stores.

In this case, we can just use a simple compiler barrier:

__asm __volatile ("" ::: "memory");

that prevents the compiler from reordering instructions. There's a
considerable performance gain in comparison to using the
lfence/sfence pair.

Besides, since a load can be moved ahead of an earlier store, I wonder if the
C++11 <atomic> implementation will be able to determine when it's
possible to use just a compiler barrier (I guess it's hard if not
impossible), or if we have to pay the fee for writing portable code.

The same reasoning applies to std::atomic_thread_fence()...
I'd like to know what Anthony thinks about this.

Ciao,

Anthony Williams

Jun 15, 2011, 4:39:21 PM
to lock...@googlegroups.com
On 15/06/11 20:38, Nicola Bonelli wrote:
> even if this is a typical case for using acquire/release semantic
> (Antony correct me if I'm wrong), it turns out that if we go native,
> the use of memory fence is not required under AMD64 architectures.

That is correct.

> In this case, we can just use a simple compiler barrier:
>
> __asm __volatile ("" ::: "memory")
>
> that prevent the compiler from reordering instructions.

I would want to verify that it does indeed issue the correct
instructions in the correct place, and the optimizer doesn't move the
variable access.

> There's a
> considerable performance gain in comparison to using the pair
> lfence/sfence.

Agreed. LFENCE and SFENCE are only really of use with non-temporal stores.

> Besides, since a load can be moved prior to store, I wonder if the
> c++11<atomic> implementation will be able to determine when it's
> possible to use just compiler barrier (guess it's hard if not
> impossibile), or if we have to pay the fee for writing portable code.

C++0x atomics will tend to use minimal assembly instructions, and should
not insert unnecessary fence instructions.

> The same reasoning applies to std::atomic_thread_fence()...

atomic_thread_fence is just a compiler barrier for everything except
memory_order_seq_cst on x86/AMD64 architectures.
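
As a minimal sketch of the portable form: the flag is accessed with relaxed atomics and the ordering comes from the two fences, which on x86/AMD64 are expected to compile to no instruction at all. The names flag, payload, producer(), consumer() and run_once(), and the value 42, are illustrative only.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> flag(0);
int payload = 0;

void producer()
{
    payload = 42;                                         // plain store
    std::atomic_thread_fence(std::memory_order_release);  // compiler barrier only on x86
    flag.store(1, std::memory_order_relaxed);
}

int consumer()
{
    while (!flag.load(std::memory_order_relaxed))
        ;                                                 // spin on the flag
    std::atomic_thread_fence(std::memory_order_acquire);  // compiler barrier only on x86
    return payload;
}

// Hypothetical driver: runs both threads once and returns what consumer() saw.
int run_once()
{
    flag.store(0);
    payload = 0;
    int seen = 0;
    std::thread t1(producer);
    std::thread t2([&seen] { seen = consumer(); });
    t1.join();
    t2.join();
    return seen;
}
```

The same source stays correct on weakly ordered architectures, where the compiler emits real fence instructions for the two calls.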

Samy Bahra

Jun 15, 2011, 5:04:52 PM
to lock...@googlegroups.com, lock...@googlegroups.com
You will also need to use memory barriers if you are overlapping atomic operations of different widths to the same target range. You can find some information on this in the 3rd volume of the Intel manuals.

I triggered the necessary conditions for this with an implementation of bytelocks on a Core 2 processor.

Sent from my iPhone

Anthony Williams

Jun 15, 2011, 5:14:39 PM
to lock...@googlegroups.com
On 15/06/11 22:04, Samy Bahra wrote:
> You will also need to use memory barriers if you are overlapping
> atomic operations of different widths to the same target range. You
> can find some information on this in the 3rd volume of the Intel
> manuals.

Yes, all accesses to a given memory location must use the same width,
otherwise the internal synchronization doesn't work.

Genom Genom

Jul 25, 2011, 11:49:48 AM
to Scalable Synchronization Algorithms
Hello,

You are saying that this piece of code does not need fences, at least on x86.

Is it because of the rule "reads may be reordered with respect to
writes that come earlier in program order as long as those writes are
to a different memory location"?

Because in this code, the last instruction of the load function is a
store to X and the other thread is spinning on X's memory location.

Am I right, or am I just missing something?

Regards