64 bit atomic reads on x86

RedMercury

unread,

Mar 1, 2006, 2:04:40 AM3/1/06

to

Hi there,

Is it safe to assume that split 32-bit reads of a 64-bit value will be
consistent if the 64-bit value is only written to with a lock
cmpxchg8b?

For other cases, is movq the only way to go?

Thanks!

Dan

RedMercury

unread,

Mar 1, 2006, 2:06:13 AM3/1/06

to

Sorry, other cases meaning aligned atomic 64-bit read.

Sean Kelly

unread,

Mar 1, 2006, 4:29:58 PM3/1/06

to

RedMercury wrote:
> Hi there,
>
> Is it safe to assume that split 32-bit reads of a 64-bit value will be
> consistent if the 64-bit value is only written to with a lock
> cmpxchg8b?

What about this:

thread 1 reads first 32 bits
thread 2 isses 64-bit write
thread 1 reads second 32 bits

Sean

dick.dunbar

unread,

Mar 1, 2006, 11:55:46 PM3/1/06

to

I've seen algorithms in PowerPC 32-bit code reading out the 64-bit
clock.

- Read high 32-bit value
- Read low 32-bit value
- Re-Read high 32-bit value, and compare to original
- Exit if equal, else start again

This is hardware writing the values, not another thread, and it is
special instructions to read the clock, not memory fetches. In 64-bit
mode, there is a single instruction to read the clock.

David Schwartz

unread,

Mar 2, 2006, 12:15:26 AM3/2/06

to

"dick.dunbar" <dick....@gmail.com> wrote in message
news:1141275346.9...@j33g2000cwa.googlegroups.com...

> I've seen algorithms in PowerPC 32-bit code reading out the 64-bit
> clock.
>
> - Read high 32-bit value
> - Read low 32-bit value
> - Re-Read high 32-bit value, and compare to original
> - Exit if equal, else start again

That's fine for a clock or counter, but can burn you for general use.
For example:

1) Starting value is 0, 0
2) You read the high value, 0.
3) Another thread writes 1, 1.
4) You read the low value, 1.
5) Another thread writes 0, 0.
6) You re-read the high value, and get 0
7) They're equal, so you accept 0, 1 as the value.

Except no thread ever wrote a value with its high part different from
its low part.

DS

RedMercury

unread,

Mar 2, 2006, 4:51:16 AM3/2/06

to

So what x86 solutions would there be? The only thing I can think of is
using movq (should be atomic on 64 bit boundaries iirc) and issing an
emms straight afterwards. emms is expensive though, so I've heard..

Joe Seigh

unread,

Mar 2, 2006, 8:11:38 AM3/2/06

to

That's the old Aix logic. I thought they changed it to be more
like Lamport's clock algorithm, i.e. if high parts are equal,
return zero as low part value rather than looping.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Sean Kelly

unread,

Mar 2, 2006, 10:51:21 AM3/2/06

to

The easiest thing would be to use cmpxchg8b for the load as well. Zero
out edx, eax, ecx, and ebx and execute the instruction. The result of
the load portion will be in the appropriate registers, and if the write
is issued it will simply replace zero with zero. You're stuck with a
lock, but at least it's safe. Your method may work as well
however--I've never tried it.

Sean

Joe Seigh

unread,

Mar 2, 2006, 11:02:24 AM3/2/06

to

cmpxchg(8b) always writes into the storage location whether or not
the compare succeeds. So not only is your cache line made
exclusive but dirty as well.

dick.dunbar

unread,

Mar 2, 2006, 1:23:35 PM3/2/06

to

> 6) You re-read the high value, and get 0
> 7) They're equal, so you accept 0, 1 as the value.

Huh?

That is the value at the time it was read. You have an atomic read of
both parts.
The fact that the lower part changed after is irrelevant. On an atomic
clock
at this frequency it would _always_ change.

Lemme guess, you just like to criticize everything because it's easier
than thinking.

Sean Kelly

unread,

Mar 2, 2006, 1:51:41 PM3/2/06

to

Joe Seigh wrote:
> Sean Kelly wrote:
> > RedMercury wrote:
> >
> >>So what x86 solutions would there be? The only thing I can think of is
> >>using movq (should be atomic on 64 bit boundaries iirc) and issing an
> >>emms straight afterwards. emms is expensive though, so I've heard..
> >
> > The easiest thing would be to use cmpxchg8b for the load as well. Zero
> > out edx, eax, ecx, and ebx and execute the instruction. The result of
> > the load portion will be in the appropriate registers, and if the write
> > is issued it will simply replace zero with zero. You're stuck with a
> > lock, but at least it's safe. Your method may work as well
> > however--I've never tried it.
> >
> cmpxchg(8b) always writes into the storage location whether or not
> the compare succeeds. So not only is your cache line made
> exclusive but dirty as well.

Yup. And this is obviously undesirable. Is there a more efficient
method to accomplish the same thing? And what about a 64-bit store?
So far, I haven't had need for either, but now that I'm thinking about
it, I'd like to fill these gaps in an API I've written.

Sean

David Schwartz

unread,

Mar 2, 2006, 2:16:02 PM3/2/06

to

"dick.dunbar" <dick....@gmail.com> wrote in message

news:1141323815.7...@v46g2000cwv.googlegroups.com...

>> 6) You re-read the high value, and get 0
>> 7) They're equal, so you accept 0, 1 as the value.

> Huh?

> That is the value at the time it was read. You have an atomic read of
> both parts.

Huh?

> The fact that the lower part changed after is irrelevant. On an atomic
> clock
> at this frequency it would _always_ change.

The problem is not that the lower part changed after but that *both*
parts changed atomically after.

> Lemme guess, you just like to criticize everything because it's easier
> than thinking.

*Please* re-read my example and refrain from criticizing it until you
understand it.

DS

chris noonan

unread,

Mar 2, 2006, 5:08:40 PM3/2/06

to

No, don't do either of those things! Use the floating point
instructions
FILD and FIST, the variants with 64-bit integer operands. Internally
the FP registers are 80 bits wide (or something), so there is no loss
of precision. Load atomically from the memory location with FILD,
store the datum into some scratch variable with FIST, then you
can process the datum further with any instuctions you like.

I learnt this from Agner Fog's website (http://www.agner.org/assem); he
is authoritative on such matters. I have tried the method and it works.

Regards,
Chris Noonan

dick.dunbar

unread,

Mar 3, 2006, 11:58:08 AM3/3/06

to

> *Please* re-read my example

You responded to my example, which was an atomicly increasing hardware
clock.
What part of a clock needs more explanation to you.
- The upper part might change in roll-over
- The lower part will change constantly
- You have to detect roll-over

I know it's convenient to roll every example into something
you thought about once, but this is not about ABA ..
it's about reading a 64-bit value atomicly, when all you have
is 32-bit instructions to do it.

You just have to detect that the lower part you read
hasn't triggered a higher value change.

No change in higher value ... you have an atomic read.

And, if you do this again, of course you're going to get
a different value.

Message has been deleted

dick.dunbar

unread,

Mar 3, 2006, 12:09:07 PM3/3/06

to

> Use the floating point

Yes, I do this also on powerPC, when dealing with long-long in a 32-bit
world. Floats also allow you to pass back 2 integer values as a
function
return value, by packaging them up as a float.

For fine-grain performance work, this is very effective way to capture
high-precision clock values using mc_func inline functions.

Thanks for the agner reference. http://www.agner.org/assem/

gottlo...@gmail.com

unread,

Mar 3, 2006, 1:12:29 PM3/3/06

to

dick.dunbar wrote:
> > *Please* re-read my example
>
> You responded to my example, which was an atomicly increasing hardware
> clock.
> What part of a clock needs more explanation to you.
> - The upper part might change in roll-over
> - The lower part will change constantly
> - You have to detect roll-over
>
> I know it's convenient to roll every example into something
> you thought about once, but this is not about ABA ..
> it's about reading a 64-bit value atomicly, when all you have
> is 32-bit instructions to do it.
>
> You just have to detect that the lower part you read
> hasn't triggered a higher value change.
>
> No change in higher value ... you have an atomic read.
>

The OP was asking about a general situation. You gave a solution for a
monotonically increasing situation (a clock). DS said, basically,
'that's fine for a clock' (ie he agrees with you for that case), 'but
not a general value', then you ask why he is talking about ABA.

Answer: because the original question was about a general situation,
not a always increasing value such as a clock.

The End.

Sean Kelly

unread,

Mar 3, 2006, 2:52:53 PM3/3/06

to

dick.dunbar wrote:
> > Use the floating point
>
> Yes, I do this also on powerPC, when dealing with long-long in a 32-bit
> world. Floats also allow you to pass back 2 integer values as a
> function
> return value, by packaging them up as a float.
>
> For fine-grain performance work, this is very effective way to capture
> high-precision clock values using mc_func inline functions.

Yup, it's a clever trick. And I suppose if a memory barrier is
desired, cmpxchg8b is still available for loads. However, I can't
think of a way to perform a single-instruction store with a memory
barrier (say I want to give it acquire semantics). Typically, I'd use
lock xchg, but that won't fly for 64-bit values. Any suggestions?

Sean

David Schwartz

unread,

Mar 3, 2006, 11:53:11 PM3/3/06

to

"dick.dunbar" <dick....@gmail.com> wrote in message

news:1141405088....@v46g2000cwv.googlegroups.com...

>> *Please* re-read my example

> You responded to my example, which was an atomicly increasing hardware
> clock.

By explaining that it only works for a clock.

> What part of a clock needs more explanation to you.
> - The upper part might change in roll-over
> - The lower part will change constantly
> - You have to detect roll-over

That's not what I disagreed with. I disagreed with your responding to
the OP with an algorithm that only works for a clock when the OP asked for a
general-purpose algorithm.

> I know it's convenient to roll every example into something
> you thought about once, but this is not about ABA ..
> it's about reading a 64-bit value atomicly, when all you have
> is 32-bit instructions to do it.

Exactly.

> You just have to detect that the lower part you read
> hasn't triggered a higher value change.
>
> No change in higher value ... you have an atomic read.

Not in the OP's case.

> And, if you do this again, of course you're going to get
> a different value.

The problem is that, for the case the OP asked about, you may read a
value that was never written. I presume that is precisely what the OP wanted
to avoid.

DS

google...@coloraura.com

unread,

Mar 4, 2006, 12:11:18 PM3/4/06

to

I think the true problem with his algorithm. lays in this example:
Values is (A,.B)
Thread 1 reads A to get (A, x)
Thread 2 writes (C, D)
Thread 1 reads D to get (A, D) (already a broken value)
Thread 3 writes (A, E)
Thread 1 reads A (for the check). Its the same, so Thread 1 thinks (A,
D) is a valid value.
Whoops. (A, D) was NEVER in the source value.

dick.dunbar

unread,

Mar 5, 2006, 6:49:20 PM3/5/06

to

The original post asked how to do a consistency read of 64-bit value on
32-bit machine.

It did not say why the value was being read, what the data looks like,
nor what consistency really means. Because of the constant nit-picking
on this forum, I was very careful to say the data being read was an
atomicly increasing clock.

Clearly there are other types of 64-bit data that can be read, and
someone is looking for a fast algorithm to do that.

1. Use floating point. It's a single atomic read and write operation
for 64-bits.
The OP said it was written with "lock cmpxchg8b". Will you get
yourself in trouble?
2. Stop being cute; stop hammering the memory bus.
"Use mutex; Be Happy." - Bil Lewis

"Multithreaded Programming with Pthreads" by Lewis is an excellent
complement to Dave's book "Programming with Posix Threads".

The systems I work on are huge, and I seem to spend an inordinate
amount of time diagnosing / fixing application errors that involve
these home-grown tricks. The techniques might have avoided a
bottleneck on small windows machines, but on very large systems, these
things do not scale ... and it is usually the memory subsystem that
takes the punishment.

Finding these scalability implementation problems can also be
problematic when you using "inline" functions. The algorithm doesn't
show up as a single point of contention, but rather the whole
application seems to be running slowly, with no visible problem using
the usual application profiling tools.

If your hardware doesn't support 64-bit atomic read/write, use mutex.
Modern systems have optimized mutex to be very fast. Talking about
problems in the abstract, without knowing the data, the OS, threads
implementation, hardware revision of the cpu's can get you into big
trouble.

David Schwartz

unread,

Mar 6, 2006, 2:27:23 AM3/6/06

to

"dick.dunbar" <dick....@gmail.com> wrote in message

news:1141602560....@u72g2000cwu.googlegroups.com...

> 1. Use floating point. It's a single atomic read and write operation
> for 64-bits.

Are you sure that's correct? I wasn't aware that x86 guaranteed 64-bit
atomic reads/writes of floating point values. The only sources I could find
that claimed it was contained other obvious errors (for example, that all
instructions were atomic) so I don't trust them.

DS

dick.dunbar

unread,

Mar 6, 2006, 2:51:08 PM3/6/06

to

> Are you sure that's correct? I

If you, or anyone else reading this, isn't sure, then don't play games.
Use mutex to get atomic reads.

I am one-hundred-percent sure of my experience, I just haven't had
enough of it.

I would only use the FP trick on a windows platform if the results of
the read
didn't matter to me ... i.e., I'm using it for performance clock reads.

To cover the bases a bit more, I'd probably do a locked fp instruction,
but only after I measured the performance of that and compared it to
using
a mutex.

These questions don't have "guaranteed" answers. It very much depends
on your data, your purpose for using atomic operations, and
whether your algorithms can recover from incorrect reads.

The literature is full of this lore.

For my purposes, I always "take chances" when I'm studying performance,
and never take chances when I release product for customers.

That doesn't prevent mistakes from happening when the machine
architecture
starts to change underneath.

chris noonan

unread,

Mar 7, 2006, 12:39:39 PM3/7/06

to

David Schwartz wrote:

> "dick.dunbar" <dick....@gmail.com> wrote:
> > 1. Use floating point. It's a single atomic read and write operation
> > for 64-bits.
>
> Are you sure that's correct? I wasn't aware that x86 guaranteed 64-bit
> atomic reads/writes of floating point values. The only sources I could find
> that claimed it was contained other obvious errors (for example, that all
> instructions were atomic) so I don't trust them.
>

It's stated in the IA-32 Intel Architecture Software Developer's
Manual Volume 3: System Programming Guide, dated 2002
Section 7.1.1 Guaranteed Atomic Operations
"The Pentium 4, Intel Xeon, and P6 family, and Pentium processors
guarantee that the following additional operations will always be
carried out atomically:
Reading or writing a quadword aligned on a 64-bit boundary.
..."

Regards,
Chris Noonan

Phil Frisbie, Jr.

unread,

Mar 7, 2006, 2:42:47 PM3/7/06

to

chris noonan wrote:

> David Schwartz wrote:
>
> It's stated in the IA-32 Intel Architecture Software Developer's
> Manual Volume 3: System Programming Guide, dated 2002
> Section 7.1.1 Guaranteed Atomic Operations
> "The Pentium 4, Intel Xeon, and P6 family, and Pentium processors
> guarantee that the following additional operations will always be
> carried out atomically:
> Reading or writing a quadword aligned on a 64-bit boundary.
> ..."

Just a little clarification:

The key phrase here is "aligned on a 64-bit boundary". As long as the programmer
has not played with packing directives or is not casting from a byte, short, or
long array, this will be true with any compiler I know of.

--
Phil Frisbie, Jr.
Hawk Software
http://www.hawksoft.com

chris noonan

unread,

Mar 8, 2006, 9:58:30 AM3/8/06

to

Phil Frisbie, Jr. wrote:

> chris noonan wrote:
> > It's stated in the IA-32 Intel Architecture Software Developer's
> > Manual Volume 3: System Programming Guide, dated 2002
> > Section 7.1.1 Guaranteed Atomic Operations
> > "The Pentium 4, Intel Xeon, and P6 family, and Pentium processors
> > guarantee that the following additional operations will always be
> > carried out atomically:
> > Reading or writing a quadword aligned on a 64-bit boundary.
> > ..."
>
> Just a little clarification:
>
> The key phrase here is "aligned on a 64-bit boundary". As long as the programmer
> has not played with packing directives or is not casting from a byte, short, or
> long array, this will be true with any compiler I know of.
>

But go back a few years, to Microsoft Visual C++ 6.0. That compiler
only
aligns procedure stack frames to 32-bits, therefore local variables of
type
double can be misaligned.

In fact the Intel document I quoted has other material, which I
interpret
to mean that all 64-bit accesses are atomic. If for example the datum
straddles a cache line, it is the responsibility of the chipset to make
the
access atomic, coordinated by signals coming from the processor itself.

Regards,
Chris Noonan

chris noonan

unread,

Mar 8, 2006, 2:19:12 PM3/8/06

to

chris noonan wrote:
>straddles a cache line, it is the responsibility of the chipset to make

I meant: straddles a cache line boundary

David Schwartz

unread,

Mar 8, 2006, 5:10:51 PM3/8/06

to

"chris noonan" <use...@leapheap.co.uk> wrote in message
news:1141845552.9...@u72g2000cwu.googlegroups.com...

> chris noonan wrote:
>>straddles a cache line, it is the responsibility of the chipset to make

> I meant: straddles a cache line boundary

I wonder if it really does that. I can see some cases that would require
some very specialized logic just to handle that case. Consider, for example,
if two CPUs get concurrent writes to the same 64-bit value that straddles a
cache-line boundary and each CPU holds exclusively one of the two cache
lines.

DS

chris noonan

unread,

Mar 9, 2006, 2:40:07 PM3/9/06

to

David Schwartz wrote:
> > chris noonan wrote:
> > >straddles a cache line, it is the responsibility of the chipset to make
> > I meant: straddles a cache line boundary
>
> I wonder if it really does that. I can see some cases that would require
> some very specialized logic just to handle that case. Consider, for example,
> if two CPUs get concurrent writes to the same 64-bit value that straddles a
> cache-line boundary and each CPU holds exclusively one of the two cache
> lines.
>

The wording in the Intel document that I quoted earlier is this:
"Accesses to cacheable memory that are split
across bus widths, cache lines, and page boundaries
are not guaranteed to be atomic by the Pentium 4,
Intel Xeon, P6 family, Pentium and Intel486 processors.
The Pentium 4, Intel Xeon, and P6 family processors
provide bus control signals that permit external
memory subsystems to make split accesses atomic;
however nonaligned data accesses will seriously
impact the performance of the processor and should
be avoided."

I will make further enquiries.

Chris

chris noonan

unread,

Mar 14, 2006, 6:35:59 PM3/14/06

to

I have not been able to gather any support for the proposition that
unaligned 64-bit accesses are guaranteed atomic on Intel processor
systems.

On the Intel website I found the "Intel Multiprocessor Specification".
That document specifically warns against assuming that unaligned
accesses are atomic. However it is dated May 1997 (version 1.4),
hence presumably refers to the older processors in the quotation
above (i.e.Pentium plain and 486).

I also looked up an Intel processor pinout ("Intel Pentium 4 Processor
on 90nm processor", dated February 2005) in search of the mysterious
bus control signals alluded to. I could find no suitable candidate
apart from the well-known LOCK#.

Assuming that LOCK# is used, and though my knowledge of this area
is slight, I am able to guess how the concurrent access attempt to a
64-bit datum split across cache lines is resolved. It would follow the
same pattern as a simple locked access to an aligned datum in a
cache line held remotely:

Both processors assert LOCK#; one of them wins the contention; the
loser resiles and is forced to transfer the cache line using the HIT#
and HITM# snoop-stall signals. Meanwhile, the other processors are
unable to interfere with the cache line already held by the winning
processor, because LOCK# prevents them initiating bus transactions.

When I have time, I will check whether unaligned 64-bit accesses are
in fact atomic on a specific multiprocessor machine.

Chris

RedMercury

unread,

Mar 15, 2006, 3:17:30 AM3/15/06

to

Doh!

I was thinking about using x87, but worried that loading the values as
a single might trash them when they go to get stored, with the fpu
performing funny rounding/nan/qnan malarky. I completely forgot about
fild/fistp :)

I guess another alternative is to pad the value out to 16 bytes and use
movps?

chris noonan

unread,

Mar 26, 2006, 2:59:05 PM3/26/06

to

chris noonan wrote:

> chris noonan wrote:
> > The wording in the Intel document that I quoted earlier is this:
> > "Accesses to cacheable memory that are split
> > across bus widths, cache lines, and page boundaries
> > are not guaranteed to be atomic by the Pentium 4,
> > Intel Xeon, P6 family, Pentium and Intel486 processors.
> > The Pentium 4, Intel Xeon, and P6 family processors
> > provide bus control signals that permit external
> > memory subsystems to make split accesses atomic;
> > however nonaligned data accesses will seriously
> > impact the performance of the processor and should
> > be avoided."
> >
> > I will make further enquiries.
> >
> I have not been able to gather any support for the proposition that
> unaligned 64-bit accesses are guaranteed atomic on Intel processor
> systems.

[snip]

> When I have time, I will check whether unaligned 64-bit accesses are
> in fact atomic on a specific multiprocessor machine.
>

I have written a multithreaded program that performs lots of 64-bit
memory accesses in various alignments, and tries to detect
non-atomic accesses. It turns out that DS is absolutely right. I own
an early machine of suitable type (Compaq ProLiant 5000 loaded with
four PentiumPro), and my program reports inconsistencies when the
locations accessed are not 64-bit aligned.

The program (Win32 binary and source) is available for download at:
http://www.leapheap.com/straddle.zip
It is dead easy to use. Run it from a DOS window under Microsoft
Windows NT/2000/XP; it takes a few minutes to complete. There
are two variants, type:
"straddle 64"
to get the behaviour described above,
"straddle 32"
is a sort of control that replaces each 64-bit access with two 32-bit
accesses. You expect inconsistencies in that mode, even
uniprocessor.

I am interested in other peoples' results using the program.

Chris

Alexander Terekhov

unread,

Mar 27, 2006, 2:45:43 PM3/27/06

to

chris noonan wrote:
[...]

> I am interested in other peoples' results using the program.

2 x Opteron 248 MP:

offset: aligned, 800000000 reads consistent, 0 reads inconsistent
offset: 1 byte , 799989998 reads consistent, 10002 reads inconsistent
offset: 2 bytes, 799990141 reads consistent, 9859 reads inconsistent
offset: 3 bytes, 799990434 reads consistent, 9566 reads inconsistent
offset: 4 bytes, 799991878 reads consistent, 8122 reads inconsistent
offset: 5 bytes, 799990444 reads consistent, 9556 reads inconsistent
offset: 6 bytes, 799990084 reads consistent, 9916 reads inconsistent
offset: 7 bytes, 799990257 reads consistent, 9743 reads inconsistent
INCONSISTENCIES DETECTED

regards,
alexander.

chris noonan

unread,

Mar 30, 2006, 2:57:10 PM3/30/06

to

Alexander Terekhov wrote:

Thanks Alex. On my quad machine I get ten times as many
inconsistencies. A factor of three is easily accounted for: there
are three processors that can interfere with a read or write as
opposed to one processor for dual CPU. The remaining factor
of (roughly) three I cannot explain. Maybe the AMD microcode
to handle split accesses is slicker than the Intel microcode.

Chris