Is it safe to assume that split 32-bit reads of a 64-bit value will be
consistent if the 64-bit value is only written to with a lock
cmpxchg8b?
For other cases, is movq the only way to go?
Thanks!
Dan
What about this:
thread 1 reads first 32 bits
thread 2 isses 64-bit write
thread 1 reads second 32 bits
Sean
- Read high 32-bit value
- Read low 32-bit value
- Re-Read high 32-bit value, and compare to original
- Exit if equal, else start again
This is hardware writing the values, not another thread, and it is
special instructions to read the clock, not memory fetches. In 64-bit
mode, there is a single instruction to read the clock.
> I've seen algorithms in PowerPC 32-bit code reading out the 64-bit
> clock.
>
> - Read high 32-bit value
> - Read low 32-bit value
> - Re-Read high 32-bit value, and compare to original
> - Exit if equal, else start again
That's fine for a clock or counter, but can burn you for general use.
For example:
1) Starting value is 0, 0
2) You read the high value, 0.
3) Another thread writes 1, 1.
4) You read the low value, 1.
5) Another thread writes 0, 0.
6) You re-read the high value, and get 0
7) They're equal, so you accept 0, 1 as the value.
Except no thread ever wrote a value with its high part different from
its low part.
DS
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software.
The easiest thing would be to use cmpxchg8b for the load as well. Zero
out edx, eax, ecx, and ebx and execute the instruction. The result of
the load portion will be in the appropriate registers, and if the write
is issued it will simply replace zero with zero. You're stuck with a
lock, but at least it's safe. Your method may work as well
however--I've never tried it.
Sean
That is the value at the time it was read. You have an atomic read of
both parts.
The fact that the lower part changed after is irrelevant. On an atomic
clock
at this frequency it would _always_ change.
Lemme guess, you just like to criticize everything because it's easier
than thinking.
Yup. And this is obviously undesirable. Is there a more efficient
method to accomplish the same thing? And what about a 64-bit store?
So far, I haven't had need for either, but now that I'm thinking about
it, I'd like to fill these gaps in an API I've written.
Sean
>> 6) You re-read the high value, and get 0
>> 7) They're equal, so you accept 0, 1 as the value.
> Huh?
> That is the value at the time it was read. You have an atomic read of
> both parts.
Huh?
> The fact that the lower part changed after is irrelevant. On an atomic
> clock
> at this frequency it would _always_ change.
The problem is not that the lower part changed after but that *both*
parts changed atomically after.
> Lemme guess, you just like to criticize everything because it's easier
> than thinking.
*Please* re-read my example and refrain from criticizing it until you
understand it.
DS
No, don't do either of those things! Use the floating point
instructions
FILD and FIST, the variants with 64-bit integer operands. Internally
the FP registers are 80 bits wide (or something), so there is no loss
of precision. Load atomically from the memory location with FILD,
store the datum into some scratch variable with FIST, then you
can process the datum further with any instuctions you like.
I learnt this from Agner Fog's website (http://www.agner.org/assem); he
is authoritative on such matters. I have tried the method and it works.
Regards,
Chris Noonan
You responded to my example, which was an atomicly increasing hardware
clock.
What part of a clock needs more explanation to you.
- The upper part might change in roll-over
- The lower part will change constantly
- You have to detect roll-over
I know it's convenient to roll every example into something
you thought about once, but this is not about ABA ..
it's about reading a 64-bit value atomicly, when all you have
is 32-bit instructions to do it.
You just have to detect that the lower part you read
hasn't triggered a higher value change.
No change in higher value ... you have an atomic read.
And, if you do this again, of course you're going to get
a different value.
Yes, I do this also on powerPC, when dealing with long-long in a 32-bit
world. Floats also allow you to pass back 2 integer values as a
function
return value, by packaging them up as a float.
For fine-grain performance work, this is very effective way to capture
high-precision clock values using mc_func inline functions.
Thanks for the agner reference. http://www.agner.org/assem/
The OP was asking about a general situation. You gave a solution for a
monotonically increasing situation (a clock). DS said, basically,
'that's fine for a clock' (ie he agrees with you for that case), 'but
not a general value', then you ask why he is talking about ABA.
Answer: because the original question was about a general situation,
not a always increasing value such as a clock.
The End.
Yup, it's a clever trick. And I suppose if a memory barrier is
desired, cmpxchg8b is still available for loads. However, I can't
think of a way to perform a single-instruction store with a memory
barrier (say I want to give it acquire semantics). Typically, I'd use
lock xchg, but that won't fly for 64-bit values. Any suggestions?
Sean
>> *Please* re-read my example
> You responded to my example, which was an atomicly increasing hardware
> clock.
By explaining that it only works for a clock.
> What part of a clock needs more explanation to you.
> - The upper part might change in roll-over
> - The lower part will change constantly
> - You have to detect roll-over
That's not what I disagreed with. I disagreed with your responding to
the OP with an algorithm that only works for a clock when the OP asked for a
general-purpose algorithm.
> I know it's convenient to roll every example into something
> you thought about once, but this is not about ABA ..
> it's about reading a 64-bit value atomicly, when all you have
> is 32-bit instructions to do it.
Exactly.
> You just have to detect that the lower part you read
> hasn't triggered a higher value change.
>
> No change in higher value ... you have an atomic read.
Not in the OP's case.
> And, if you do this again, of course you're going to get
> a different value.
The problem is that, for the case the OP asked about, you may read a
value that was never written. I presume that is precisely what the OP wanted
to avoid.
DS
I think the true problem with his algorithm. lays in this example:
Values is (A,.B)
Thread 1 reads A to get (A, x)
Thread 2 writes (C, D)
Thread 1 reads D to get (A, D) (already a broken value)
Thread 3 writes (A, E)
Thread 1 reads A (for the check). Its the same, so Thread 1 thinks (A,
D) is a valid value.
Whoops. (A, D) was NEVER in the source value.
It did not say why the value was being read, what the data looks like,
nor what consistency really means. Because of the constant nit-picking
on this forum, I was very careful to say the data being read was an
atomicly increasing clock.
Clearly there are other types of 64-bit data that can be read, and
someone is looking for a fast algorithm to do that.
1. Use floating point. It's a single atomic read and write operation
for 64-bits.
The OP said it was written with "lock cmpxchg8b". Will you get
yourself in trouble?
2. Stop being cute; stop hammering the memory bus.
"Use mutex; Be Happy." - Bil Lewis
"Multithreaded Programming with Pthreads" by Lewis is an excellent
complement to Dave's book "Programming with Posix Threads".
The systems I work on are huge, and I seem to spend an inordinate
amount of time diagnosing / fixing application errors that involve
these home-grown tricks. The techniques might have avoided a
bottleneck on small windows machines, but on very large systems, these
things do not scale ... and it is usually the memory subsystem that
takes the punishment.
Finding these scalability implementation problems can also be
problematic when you using "inline" functions. The algorithm doesn't
show up as a single point of contention, but rather the whole
application seems to be running slowly, with no visible problem using
the usual application profiling tools.
If your hardware doesn't support 64-bit atomic read/write, use mutex.
Modern systems have optimized mutex to be very fast. Talking about
problems in the abstract, without knowing the data, the OS, threads
implementation, hardware revision of the cpu's can get you into big
trouble.
> 1. Use floating point. It's a single atomic read and write operation
> for 64-bits.
Are you sure that's correct? I wasn't aware that x86 guaranteed 64-bit
atomic reads/writes of floating point values. The only sources I could find
that claimed it was contained other obvious errors (for example, that all
instructions were atomic) so I don't trust them.
DS
I am one-hundred-percent sure of my experience, I just haven't had
enough of it.
I would only use the FP trick on a windows platform if the results of
the read
didn't matter to me ... i.e., I'm using it for performance clock reads.
To cover the bases a bit more, I'd probably do a locked fp instruction,
but only after I measured the performance of that and compared it to
using
a mutex.
These questions don't have "guaranteed" answers. It very much depends
on your data, your purpose for using atomic operations, and
whether your algorithms can recover from incorrect reads.
The literature is full of this lore.
For my purposes, I always "take chances" when I'm studying performance,
and never take chances when I release product for customers.
That doesn't prevent mistakes from happening when the machine
architecture
starts to change underneath.
Regards,
Chris Noonan
> David Schwartz wrote:
>
> It's stated in the IA-32 Intel Architecture Software Developer's
> Manual Volume 3: System Programming Guide, dated 2002
> Section 7.1.1 Guaranteed Atomic Operations
> "The Pentium 4, Intel Xeon, and P6 family, and Pentium processors
> guarantee that the following additional operations will always be
> carried out atomically:
> Reading or writing a quadword aligned on a 64-bit boundary.
> ..."
Just a little clarification:
The key phrase here is "aligned on a 64-bit boundary". As long as the programmer
has not played with packing directives or is not casting from a byte, short, or
long array, this will be true with any compiler I know of.
--
Phil Frisbie, Jr.
Hawk Software
http://www.hawksoft.com
In fact the Intel document I quoted has other material, which I
interpret
to mean that all 64-bit accesses are atomic. If for example the datum
straddles a cache line, it is the responsibility of the chipset to make
the
access atomic, coordinated by signals coming from the processor itself.
Regards,
Chris Noonan
I meant: straddles a cache line boundary
> chris noonan wrote:
>>straddles a cache line, it is the responsibility of the chipset to make
> I meant: straddles a cache line boundary
I wonder if it really does that. I can see some cases that would require
some very specialized logic just to handle that case. Consider, for example,
if two CPUs get concurrent writes to the same 64-bit value that straddles a
cache-line boundary and each CPU holds exclusively one of the two cache
lines.
DS
I will make further enquiries.
Chris
On the Intel website I found the "Intel Multiprocessor Specification".
That document specifically warns against assuming that unaligned
accesses are atomic. However it is dated May 1997 (version 1.4),
hence presumably refers to the older processors in the quotation
above (i.e.Pentium plain and 486).
I also looked up an Intel processor pinout ("Intel Pentium 4 Processor
on 90nm processor", dated February 2005) in search of the mysterious
bus control signals alluded to. I could find no suitable candidate
apart from the well-known LOCK#.
Assuming that LOCK# is used, and though my knowledge of this area
is slight, I am able to guess how the concurrent access attempt to a
64-bit datum split across cache lines is resolved. It would follow the
same pattern as a simple locked access to an aligned datum in a
cache line held remotely:
Both processors assert LOCK#; one of them wins the contention; the
loser resiles and is forced to transfer the cache line using the HIT#
and HITM# snoop-stall signals. Meanwhile, the other processors are
unable to interfere with the cache line already held by the winning
processor, because LOCK# prevents them initiating bus transactions.
When I have time, I will check whether unaligned 64-bit accesses are
in fact atomic on a specific multiprocessor machine.
Chris
I was thinking about using x87, but worried that loading the values as
a single might trash them when they go to get stored, with the fpu
performing funny rounding/nan/qnan malarky. I completely forgot about
fild/fistp :)
I guess another alternative is to pad the value out to 16 bytes and use
movps?
The program (Win32 binary and source) is available for download at:
http://www.leapheap.com/straddle.zip
It is dead easy to use. Run it from a DOS window under Microsoft
Windows NT/2000/XP; it takes a few minutes to complete. There
are two variants, type:
"straddle 64"
to get the behaviour described above,
"straddle 32"
is a sort of control that replaces each 64-bit access with two 32-bit
accesses. You expect inconsistencies in that mode, even
uniprocessor.
I am interested in other peoples' results using the program.
Chris
2 x Opteron 248 MP:
offset: aligned, 800000000 reads consistent, 0 reads inconsistent
offset: 1 byte , 799989998 reads consistent, 10002 reads inconsistent
offset: 2 bytes, 799990141 reads consistent, 9859 reads inconsistent
offset: 3 bytes, 799990434 reads consistent, 9566 reads inconsistent
offset: 4 bytes, 799991878 reads consistent, 8122 reads inconsistent
offset: 5 bytes, 799990444 reads consistent, 9556 reads inconsistent
offset: 6 bytes, 799990084 reads consistent, 9916 reads inconsistent
offset: 7 bytes, 799990257 reads consistent, 9743 reads inconsistent
INCONSISTENCIES DETECTED
regards,
alexander.
Thanks Alex. On my quad machine I get ten times as many
inconsistencies. A factor of three is easily accounted for: there
are three processors that can interfere with a read or write as
opposed to one processor for dual CPU. The remaining factor
of (roughly) three I cannot explain. Maybe the AMD microcode
to handle split accesses is slicker than the Intel microcode.
Chris