The Intel x86/x64 reference guide (8.2.2) states the following:
A) Reads are not reordered with other reads.
B) Writes are not reordered with older reads.
C) Writes to memory are not reordered with other writes
D) Reads may be reordered with older writes to different locations but
not with older writes to the same location.
Based on the above, will the following act as a full memory barrier
(i.e., prevent reordering)?
mov [esp], eax #1 write
mov eax, [esp] #2 read
Seems like based on the above rules:
1) #1 and #2 - no reorder per (D)
2) old reads - no reorder with #2 per (A)
3) old writes - no reorder with #1 per (C)
4) new reads - no reorder with #2 per (A)
5) new writes - no reorder with #1 and #2 per (C) and (B)
Can the above act like the 'mfence' instruction?
I must be wrong about something here or otherwise x86 reordering is
quite limited...
I also heard that mfence is somewhat expensive (compared with not
having it).
Will (B) and (C) allow something like a spinlock release to be
implemented using a simple (non-locked) write on x86?
I think that I saw some article that mentioned this.
Thanks,
Rani
> Will (B) and (C) allow something like spinlock release to be
> implemented using simple (non locked) write on x86?
> I think that I saw some article that mentioned this.
Unless you have to support SMP PPro machines (some of which had a
bug), a spinlock release can be done with a simple write on x86.
DS
Please quote.
> B) Writes are not reordered with older reads.
Please quote.
> C) Writes to memory are not reordered with other writes
Please quote.
> D) Reads may be reordered with older writes to different locations but
> not with older writes to the same location.
>
When a read refers to the same location as an immediately preceding
write, it is usually not performed against memory at all; it is
satisfied from the store buffer. The result is that it executes just
as if it had read the value before the write actually took place...
> Based on the above, will the following act as full memory barrier
> (prevent reordering):
> mov [esp], eax #1 write
> mov eax, [esp] #2 read
...like here.
>
> Seems like based on the above rules:
> 1) #1 and #2 - no reorder per (D)
> 2) old reads - no reorder with #2 per (A)
> 3) old writes - no reorder with #1 per (C)
> 4) new reads - no reorder with #2 per (A)
> 5) new writes - no reorder with #1 and #2 per (C) and (B)
>
> Can the above act like the 'mfence' instruction?
That would be impressive.
> I must be wrong about something here or otherwise x86 reordering is
> quite limited...
> I also heard that mfence is somewhat expensive (compared with not
> having it).
Certainly more expensive than not having it, in terms of execution
time. That holds for most CPU instructions, BTW; maybe some NOPs here
and there actually do speed things up.
When someone says it is expensive, what is usually meant is: if
atomicity is required but memory-barrier semantics isn't, a plain
atomic operation might be the right tool.
>
> Will (B) and (C) allow something like spinlock release to be
> implemented using simple (non locked) write on x86?
> I think that I saw some article that mentioned this.
Maybe. I mean maybe you saw such an article ;)
>
> Thanks,
> Rani
Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3
(3A & 3B): System Programming Guide
(http://www.intel.com/Assets/PDF/manual/325384.pdf)
8.2.2 Memory Ordering in P6 and More Recent Processor Families
> > D) Reads may be reordered with older writes to different locations but
> > not with older writes to the same location.
>
> When a read refers to the same location as an immediately preceding
> write, it is usually not performed against memory at all; it is
> satisfied from the store buffer. The result is that it executes just
> as if it had read the value before the write actually took place...
I see, yet such forwarding should be transparent and hence obey the
above rules...
> > Based on the above, will the following act as full memory barrier
> > (prevent reordering):
> > mov [esp], eax #1 write
> > mov eax, [esp] #2 read
>
> ...like here.
Simpler form:
push eax #1 store
pop eax #2 load
>
> > Seems like based on the above rules:
> > 1) #1 and #2 - no reorder per (D)
> > 2) old reads - no reorder with #2 per (A)
> > 3) old writes - no reorder with #1 per (C)
> > 4) new reads - no reorder with #2 per (A)
> > 5) new writes - no reorder with #1 and #2 per (C) and (B)
>
> > Can the above act like the 'mfence' instruction?
> That would be impressive.
> > I must be wrong about something here or otherwise x86 reordering is
> > quite limited...
I actually got bitten by write-read x86 CPU reordering (missed during
testing, since a misaligned cache line is also required), so I will
try to write a test that shows the write-read reordering case but not
the write-dummy_read-read case (I have 4-, 8- and 16-core machines for
tryout).
Thanks,
Rani
We do target such x86 variants so thanks for the info.
I guess that newer reads being CPU-reordered "into" the lock (before
the releasing write) are fine, hence such an optimization is valid per
the x86 rules.
Thanks,
Rani
Thr 1:
  lock_by_xchg
  read mem1

Thr 2:
  write -> mem1
  write -> spinlock
Isn't that already ordered? For Thr 2, writes are seen in a consistent
order; for Thr 1 there is total ordering anyway.
Otherwise, see 8.2.3.5. Isn't this affecting
> push eax #1 store
> pop eax #2 load
>>> Seems like based on the above rules:
>>> 1) #1 and #2 - no reorder per (D)
>> D) Reads may be reordered with older writes to different locations but
>> not with older writes to the same location.
Here, #2 will happen before #1, though it gets #1's value.
Edek
PS. Sorry for my previous remark; I never write threaded code
targeting a specific CPU family. These guarantees are not general
threading guarantees; see 8.2.5 near the end. I guess you can retarget
whatever you're writing when a new CPU is available.
I was actually bitten by an oversight in which I wrote code that
accidentally assumed that a write-read pair would not be reordered.
The write was to a +4 offset of the read, and in my test cases there
was no cache-line misalignment, hence I missed the deadly race
(resulting in a hang).
By now I have coded the proper fix, but then I started wondering about
x86 having implicit barriers in general.
(FWIW, the related facility itself is a "slim shutdown event", in
which the underlying (win32) event is rarely needed and hence is
created on demand.)
> Otherwise, see 8.2.3.5. Isn't this affecting
>
> > push eax #1 store
> > pop eax #2 load
>
> >>> Seems like based on the above rules:
> >>> 1) #1 and #2 - no reorder per (D)
> >> D) Reads may be reordered with older writes to different locations but
> >> not with older writes to the same location.
>
> Here, #2 will happen before #1, though it gets #1's value.
True, but the rules seem to forbid such reordering from being visible.
> PS. Sorry for my previous remark; I never write threaded code
> targeting a specific CPU family. These guarantees are not general
> threading guarantees; see 8.2.5 near the end. I guess you can retarget
> whatever you're writing when a new CPU is available.
In Windows we also target various architectures, so I personally try
to stay away from CPU-based specializations, subtle as they are.
OTOH, having such target-specific knowledge might be beneficial for
others (e.g. compilers can avoid implicit barriers or leverage them).
Rani
> PS. Sorry for my previous remark; I never write threaded code
> targeting a specific CPU family. These guarantees are not general
> threading guarantees; see 8.2.5 near the end. I guess you can retarget
> whatever you're writing when a new CPU is available.
These are guarantees that Intel promises for all future CPUs
supporting the same instruction set. AMD has made similar promises. So
you only need to retarget if you change instruction sets. I recommend
including memory fence macros; they can expand to nothing at all on
x86 platforms. Not only will that ease portability, but it will make
it much clearer to any reviewer why the code is safe, and it will make
it less likely that someone later modifying the code will break it
because they didn't realize precisely how it worked.
DS
The Intel guide proves me *wrong*, since the ordering rules are not as
straightforward as I thought (Edek was right).
8.2.3.5 Intra-Processor Forwarding Is Allowed

Processor 0          Processor 1
mov [_x], 1          mov [_y], 1
mov r1, [_x]         mov r3, [_y]
mov r2, [_y]         mov r4, [_x]

Initially x = y = 0. r2 = 0 and r4 = 0 is allowed.

The memory-ordering model imposes no constraints on the order in which
the two stores appear to execute by the two processors. This fact
allows processor 0 to see its store before seeing processor 1's, while
processor 1 sees its store before seeing processor 0's. (Each
processor is self consistent.) This allows r2 = 0 and r4 = 0.
In practice, the reordering in this example can arise as a result of
store-buffer forwarding. While a store is temporarily held in a
processor's store buffer, it can satisfy the processor's own loads but
is not visible to (and cannot satisfy) loads by other processors.
</x86 guide>
Now I need to figure out again why the non-locked write works for
spinlock release...
Sorry,
Rani
> Now I need to re-figure why the non locked write works for spinlock
> release...
Because it doesn't matter if other CPUs are delayed in seeing it.
DS
Because preceding reads and writes cannot 'sink' below the write (be
delayed past it). x86 writes have 'release' semantics.
regards,
alexander.
The "Intra-Processor Forwarding" relaxation doesn't seem to be
deducible from the x86 ordering rules, so I started wondering about
the soundness of the rules in general.
Performance-wise the relaxation is what one might expect, since
otherwise the implicit full barrier would significantly limit the
reordering opportunities.
This seems like a case in which the ordering rules are well
established, but implementation-detail leakage slightly "broke"
them...
My "concern" about the non-locked write was that the reads might break
away from it due to "Intra-Processor Forwarding":
bool flag = m_flag;
unlock(); // forbids read reordering
The guide is also explicit about this case (a non-locked store has
release semantics):
Example 8-2. Stores Are Not Reordered with Older Loads
Processor 0          Processor 1
mov r1, [_x]         mov r2, [_y]
mov [_y], 1          mov [_x], 1
Initially x = y = 0. r1 = 1 and r2 = 1 is not allowed.
</guide>
I know that other architectures allow such read-write reordering,
though I'm not sure whether it actually makes them faster (e.g. stores
are rare) and whether x86 has too much (Windows) code at risk to relax
this case (I heard that some x86 CPUs actually have a control bit to
allow read-write reordering).
Thanks,
Rani
Quiz:

Processor 0    Processor 1
mov [_y], 1
mov [_x], 1    mov r1, [_y]
mov [_y], 1    mov r2, [_x]

Initially x = y = 0.
Quiz: r1 = 1 and r2 = 0 is not allowed???
You can see the above inconsistency when combining the following:
1) Stores Are Not Reordered with Other Stores - not allowing the above
2) Intra-Processor Forwarding Is Allowed - allowing the above
1) Example 8-1. Stores Are Not Reordered with Other Stores
Processor 0 Processor 1
mov [ _x], 1 mov r1, [ _y]
mov [ _y], 1 mov r2, [ _x]
Initially x = y = 0. r1 = 1 and r2 = 0 is not allowed
2) Example 8-5. Intra-Processor Forwarding is Allowed
Processor 0 Processor 1
mov [ _x], 1 mov [ _y], 1
mov r1, [ _x] mov r3, [ _y]
mov r2, [ _y] mov r4, [ _x]
Initially x = y = 0. r2 = 0 and r4 = 0 is allowed
Rani
It is allowed.
I would rather put it that on P1, _y (r1) may differ from P0's _y.
That does not break threading code; on processor 1 the difference is
between
a)
1 -> _y
release_spinlock()
r1 <- _y
and
b)
1 -> _y
r1 <- _y
release_spinlock()
In general, if release_spinlock were a "black box" with possible
mfence semantics, this would break stuff (r1 could have the wrong
value).
But AFAIK release_lock only prevents writes from moving down, not
reads from moving up; if you're implementing a simple lock, that is
basic stuff. Am I missing something?
>
> You can see the above inconsistency when combining the following:
> 1) Stores Are Not Reordered with Other Stores - not allowing the above
> 2) Intra-Processor Forwarding Is Allowed - allowing the above
>
> 1) Example 8-1. Stores Are Not Reordered with Other Stores
> Processor 0 Processor 1
> mov [ _x], 1 mov r1, [ _y]
> mov [ _y], 1 mov r2, [ _x]
> Initially x = y = 0. r1 = 1 and r2 = 0 is not allowed
>
> 2) Example 8-5. Intra-Processor Forwarding is Allowed
> Processor 0 Processor 1
> mov [ _x], 1 mov [ _y], 1
> mov r1, [ _x] mov r3, [ _y]
> mov r2, [ _y] mov r4, [ _x]
> Initially x = y = 0. r2 = 0 and r4 = 0 is allowed
>
> Rani
The semantics is clever: there are two aspects to it. One is what the
virtual processor does, and another is how its actions are visible to
other virtual processors and how other processors' actions are
visible to it.
I would note that, in the manual, the memory model's name includes the
forwarding.
Edek
How? Due to forwarding, the read will move up (to the forwarded
write), not down.
>
> The guide is also explicit about this case (non-locked has release
> semantics):
> Example 8-2. Stores Are Not Reordered with Older Loads
> Processor 0 Processor 1
> mov r1, [ _x] mov r2, [ _y]
> mov [ _y], 1 mov [ _x], 1
> Initially x = y = 0. r1 = 1 and r2 = 1 is not allowed
> </guide>
Again, loads can be done way before writes; not the other way
round. Sounds like nice prefetching, and does not break the rules.
>
> I know that other architectures allow such read-write reordering
> though I'm not sure whether it actually makes them faster (e.g. stores
> are rare) and whether x86 has too much (windows) code at risk when
> relaxing this case (I heard that some x86 actually has control bit to
> allow read-write reordering).
Why does this memory model - which seems quite strict to me - help
anything? Why not leave everything to fences? Or is it actually part
of the implementation of fences (e.g. speculative loads and their
processing get discarded when a fence is hit)?
Edek
My brain is officially numb now, as I can't see such trivial
observations.
P0 need not be executing to allow the above...
Rani
FWIW, x86 stores that operate on write-back memory have `release'
semantics...