You are missing the synchronizes-with relation between the two threads.
A barrier in store() with at least release semantics (between the two
assignments) is required to synchronize with a barrier with at least
acquire semantics in load().
If x is used to check the condition in load(), then you have to
assign y first, issue the fence, and then assign x in store().
The barrier also ensures the happens-before relation :-)
According to your example...
The store() could be something like:
y = 1;
__asm __volatile("SFENCE");
x = 1;
and the load():
do {
__asm __volatile("LFENCE");
} while(!x);
assert(y==1);
z=1;
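For comparison, here is a minimal sketch of the same write-y-then-publish-x ordering using C++11 std::atomic with release/acquire instead of hand-written fences. The run_once() helper and the choice to make only x atomic are assumptions for illustration, not part of the original example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// y is the payload, x is the flag that publishes it (names follow the
// example above; making x a std::atomic is an assumption of this sketch).
int y = 0;
std::atomic<int> x{0};

void store() {
    y = 1;                                  // write the payload first
    x.store(1, std::memory_order_release);  // then publish it
}

int load() {
    while (x.load(std::memory_order_acquire) == 0) {
        // spin until the flag has been published
    }
    // The acquire load that read 1 synchronizes with the release store,
    // so the write to y happens-before this read.
    return y;
}

// Hypothetical helper: runs both threads once and returns the value of y
// observed by the reader.
int run_once() {
    y = 0;
    x.store(0);
    int seen = -1;
    std::thread writer(store);
    std::thread reader([&seen] { seen = load(); });
    writer.join();
    reader.join();
    assert(seen == 1);
    return seen;
}
```

Calling run_once() from main() should never trip the assertion, because the reader only leaves the loop after observing the released value of x.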
My two cents,
Nicola
--
The difference between theory and practice is bigger in practice than in theory.
Great!
> I've understood the memory model and atomic
> operations. The problem is i would like to code my own atomic
> operation, to learn.
Getting this right is tricky. You really need to understand what the
processor is doing, and what the compiler is doing.
> For example, i'm compiling this code (c++0x) :
>
> ==========================
> #include<iostream>
> #include<thread>
> #include<assert.h>
>
> volatile int x = 0;
> int _pad1[4096]; //padding for cache : without padding, assert do not
> fire.
> volatile int y = 0;
> int _pad2[4096];
> int z = 0;
> void store()
> {
> x = 1;
> y = 1;
> //__asm __volatile("MFENCE");
> }
>
> void load()
> {
> // __asm __volatile("MFENCE");
> while (!x)
> ;;
> // __asm __volatile("MFENCE");
> if (y == 1)
> z = 1;
> }
x is stored first, so even with the fences, just because x is 1, there
is no guarantee that store() has even written a value to y yet. You need
to swap the assignments to x and y, and put the fence between them.
> int main()
> {
> std::thread t1(store); //create thread
> std::thread t2(load);
>
> t1.join();
> t2.join();
>
> assert(z == 1);
> }
>
> ==========================
>
> First of all, i've made a first version without fences. On x86, it
> loads seems to be atomics, but i really do not know how to synchronize
> values between threads (how to be sure y will be updated, in load). I
> then learned about MFENCE, LFENCE, SFENCE and made a second version
> (fences commented) but doesn't work.
On x86 you often don't need fences. However, you need to tell the
compiler what you're doing, so it doesn't optimize the code in a way
that breaks your expectations.
> * So, the questions : How can i achieve value synchronisation in ASM
> and what is the simplest method ? Is the use of fences is great ? Can
> it be optimized by other methods in ASM ?
For the fences to have any effect on the gcc optimizer, they need to be
annotated as affecting "memory".
__asm__ __volatile__ ("mfence" : : : "memory");
You can optimize the code better by doing everything with __asm__
statements.
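As a rough sketch of the "memory" annotation, the helpers below wrap a compiler-only barrier and a full fence. The helper names are hypothetical, the inline-asm syntax assumes GCC or Clang, and the mfence path assumes an x86/x86-64 target (other targets fall back to the GCC builtin __sync_synchronize()).

```cpp
#include <cassert>

// Compiler-only barrier: emits no instruction. The "memory" clobber tells
// the compiler the asm may read or write any memory, so it must not move
// memory accesses across it or keep values cached in registers over it.
inline void compiler_barrier() {
    __asm__ __volatile__("" : : : "memory");
}

// Full hardware fence plus compiler barrier.
inline void full_fence() {
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("mfence" : : : "memory");
#else
    __sync_synchronize();  // GCC/Clang builtin full barrier
#endif
}

volatile int x = 0;
volatile int y = 0;

// Corrected order from the discussion: write y, barrier, then write x.
void store() {
    y = 1;
    compiler_barrier();  // keeps the compiler from swapping the two stores
    x = 1;
}
```

This only illustrates the barrier syntax; for code that has to be portable and free of data races, std::atomic is the safer route.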
> * Also, why padding make the assert fire ?
Without the padding, x and y are on the same cache line, so they will
likely be written back to main memory together (though the compiler/cpu
might flush the cache line after the write of the first variable). With
the padding, they are on separate cache lines, so will be written
separately. Consequently, the assert will fire more often.
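If the goal is simply to put x and y on separate cache lines, alignas is a tidier alternative to padding arrays. The sketch below assumes 64-byte cache lines, which is common on current x86 processors but not guaranteed; the struct and helper names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines. Check your CPU's documentation.
constexpr std::size_t kAssumedCacheLine = 64;

// Each member starts on its own (assumed) cache line.
struct Separated {
    alignas(kAssumedCacheLine) volatile int x;
    alignas(kAssumedCacheLine) volatile int y;
};

// Distance in bytes between the two members of one object.
std::size_t member_distance(const Separated& s) {
    auto px = reinterpret_cast<std::uintptr_t>(&s.x);
    auto py = reinterpret_cast<std::uintptr_t>(&s.y);
    return static_cast<std::size_t>(py - px);
}
```

With this layout the two variables cannot share a line of the assumed size, so the effect described above no longer depends on how large the padding arrays happen to be.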
If you are serious about learning how to write atomics, read the Intel
CPU manuals, especially volume 3A:
http://www.intel.com/products/processor/manuals/
Also, read the AMD manuals:
http://developer.amd.com/documentation/guides/Pages/default.aspx#manuals
Anthony
--
Author of C++ Concurrency in Action http://www.stdthread.co.uk/book/
just::thread C++0x thread library http://www.stdthread.co.uk
Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk
15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK. Company No. 5478976
> On x86 you often don't need fences. However, you need to tell the
> compiler what you're doing, so it doesn't optimize the code in a way
> that breaks your expectations.
SFENCE ensures that the values written before the fence are flushed to
memory before those written afterwards. It doesn't mean the flush has to
happen at a particular time though.
It is particularly useful with "non temporal" stores (MOVNTI and
friends), which otherwise are completely unordered.
Note I said you **often** don't need fences. Sometimes you do. For
ordering of unrelated writes (sequential consistency) you need to either
use a LOCKed instruction (such as XCHG) for your stores, or you need to
use MFENCE.
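The case that needs the stronger ordering is the classic store-buffering pattern: each thread writes one variable and then reads the other. Below is a small sketch using memory_order_seq_cst, which x86 compilers typically implement with XCHG or MOV followed by MFENCE for the stores. The function names and the round count are arbitrary choices for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> a{0};
std::atomic<int> b{0};

// One round of the store-buffering test. Returns true if both threads
// read 0, an outcome that sequential consistency forbids.
bool both_read_zero() {
    a.store(0);
    b.store(0);
    int r1 = -1;
    int r2 = -1;
    std::thread t1([&] {
        a.store(1, std::memory_order_seq_cst);
        r1 = b.load(std::memory_order_seq_cst);
    });
    std::thread t2([&] {
        b.store(1, std::memory_order_seq_cst);
        r2 = a.load(std::memory_order_seq_cst);
    });
    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;
}

// Repeats the test and counts forbidden outcomes.
int count_violations(int rounds) {
    int violations = 0;
    for (int i = 0; i < rounds; ++i) {
        if (both_read_zero()) {
            ++violations;
        }
    }
    return violations;
}
```

With seq_cst the count must stay at zero. If the stores and loads are weakened to release/acquire, both threads reading 0 becomes a permitted outcome, because a load may then be satisfied before the earlier store leaves the store buffer.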
> On an interesting article about QPI
> (http://blogs.oracle.com/dave/entry/qpi_quiescence ) Dave Dice writes:
> /"To allay a common misconception it's worth pointing out that barriers
> (...) are typically implemented as processor-local operations and *don't
> cause any distinguished action on the bus or interconnect* and instead
> simply instruct the processor to ensure that prior stores become visible
> before subsequent loads (...). That is, they don't force anything to
> happen -- such as coherence messages on the bus -- that were not already
> destined to occur. Instead, they simply enforce an order, momentarily
> reconciling program and memory order. Crucially, at least with current
> x86 and SPARC implementations, barriers *don't force anything to occur
> off-processor*. That also means they don't impede or impair scalability."/
> Don't these excerpts in bold contradict somehow the assumption
> ("misconception"?) about write-buffer flush?
The write buffer is not necessarily flushed to main memory; fences just
enforce the ORDER in which things are flushed.
On 07/06/11 02:33, Nikos Anastopoulos wrote:
On x86 you often don't need fences. However, you need to tell the
compiler what you're doing, so it doesn't optimize the code in a way
that breaks your expectations.
Apart from preventing hardware and compiler optimizations (reordering),
fences are supposed to make memory operations globally visible. For
example, store fences "guarantee that every store instruction that
precedes in program order the SFENCE instruction is globally visible
before any store instruction that follows the SFENCE instruction is
globally visible" (taken from the x86 instruction set manual). I am not
sure if Intel documents how this is actually implemented in its
processors. Does a write-buffer flush occur? And if not, how else can
global visibility be guaranteed?
Yes, that matches my understanding.
Even if this is a typical case for using acquire/release semantics
(Anthony, correct me if I'm wrong), it turns out that if we go native,
memory fences are not required on AMD64 architectures.
Loads are indeed not reordered with other loads, and stores are not
reordered with other stores.
In this case, we can just use a simple compiler barrier:
__asm __volatile ("" ::: "memory")
that prevents the compiler from reordering instructions. There's a
considerable performance gain in comparison to using the
lfence/sfence pair.
Besides, since a load can be moved ahead of an earlier store, I wonder
if the C++11 <atomic> implementation will be able to determine when
it's possible to use just a compiler barrier (I guess it's hard if not
impossible), or if we have to pay the fee for writing portable code.
The same reasoning applies to std::atomic_thread_fence()...
I'd like to know what Anthony thinks about this.
Ciao,
That is correct.
> In this case, we can just use a simple compiler barrier:
>
> __asm __volatile ("" ::: "memory")
>
> that prevents the compiler from reordering instructions.
I would want to verify that it does indeed issue the correct
instructions in the correct place, and the optimizer doesn't move the
variable access.
> There's a
> considerable performance gain in comparison to using the pair
> lfence/sfence.
Agreed. LFENCE and SFENCE are only really of use with non-temporal stores.
> Besides, since a load can be moved prior to store, I wonder if the
> C++11 <atomic> implementation will be able to determine when it's
> possible to use just compiler barrier (guess it's hard if not
> impossible), or if we have to pay the fee for writing portable code.
C++0x atomics will tend to use minimal assembly instructions, and should
not insert unnecessary fence instructions.
> The same reasoning applies to std::atomic_thread_fence()...
atomic_thread_fence is just a compiler barrier for everything except
memory_order_seq_cst on x86/AMD64 architectures.
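A minimal sketch of the fence-based form of the same publish/consume pattern, using std::atomic_thread_fence with relaxed atomic accesses. The variable and function names are hypothetical. On x86/AMD64 the release and acquire fences here are expected to compile to no instruction and act only as compiler barriers, as described above.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                  // ordinary, non-atomic data
std::atomic<bool> ready{false};   // flag, accessed with relaxed ordering

void producer() {
    payload = 42;
    // Release fence: orders the payload write before the flag store below.
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

int consumer() {
    while (!ready.load(std::memory_order_relaxed)) {
        // spin until the flag is set
    }
    // Acquire fence: pairs with the release fence in producer().
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;
}

// Hypothetical helper: runs one producer and one consumer.
int run_fence_example() {
    payload = 0;
    ready.store(false);
    int seen = -1;
    std::thread t1(producer);
    std::thread t2([&seen] { seen = consumer(); });
    t1.join();
    t2.join();
    return seen;
}
```

Only a memory_order_seq_cst fence would be expected to cost a real instruction (such as MFENCE) on these architectures.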
I triggered the necessary conditions for this with an implementation of bytelocks on a Core 2 processor.
Yes, all accesses to a given memory location must use the same width,
otherwise the internal synchronization doesn't work.