Effectively atomic read of 16 bytes on x86

Eric Grant

unread,

Feb 7, 2012, 7:24:48 PM2/7/12

to lock...@googlegroups.com

Hello,

This question concerns the x86_64 architecture.

Consider a 64-bit pointer plus a 64-bit counter, collectively aligned to 16 bytes. The counter moves monotonically on each update of the pointer. (A classic example is the head of a nonblocking IBM/Treiber stack, though nothing about the question depends on that particular use.) Assume that we modify the pointer/counter pair only in an atomic manner, i.e., using cmpxchg16b.

Given these constraints, it would appear that we can *read* the pointer/counter pair in a manner that is "effectively" atomic, by which I mean we will (1) always read a pair that has been written by one thread or another, and (2) never read a pair that has been written one-half by one thread and one-half by another. Here's the algorithm:

temp; // same integral type as counter
counter = pair->counter;
do
{

temp = counter;
head = pair->pointer;
COMPILER_BARRIER(); // to ensure that pair->pointer is read before pair->counter
counter = pair->counter;

}
while (counter != temp);

Correct? Easier way? Bunk?

Cheers,
Eric

Samy Bahra

unread,

Feb 8, 2012, 2:46:49 AM2/8/12

to lock...@googlegroups.com, lock...@googlegroups.com

Memory model is only guaranteed for same width operations. Things are weaker if you mix widths on a target. You can try using a load barrier.

This has been brought up in an earlier thread. Sorry for brevity.

Sent from a phone

Dmitriy V'jukov

unread,

Feb 8, 2012, 2:59:42 AM2/8/12

to lock...@googlegroups.com

Hi Eric,

I think it is correct. It is effectively a seqlock:
http://en.wikipedia.org/wiki/Seqlock

You may also consider using SSE instructions to do atomic 16 byte loads.

Samy Bahra

unread,

Feb 8, 2012, 3:17:21 AM2/8/12

to lock...@googlegroups.com, lock...@googlegroups.com

It is not.

Sent from a phone

Anthony Williams

unread,

Feb 8, 2012, 3:19:38 AM2/8/12

to lock...@googlegroups.com

On 08/02/12 07:59, Dmitriy V'jukov wrote:
> On Wed, Feb 8, 2012 at 4:24 AM, Eric Grant<ma...@eagrant.com> wrote:

>> temp; // same integral type as counter
>> counter = pair->counter;
>> do
>> {
>> temp = counter;
>> head = pair->pointer;
>> COMPILER_BARRIER(); // to ensure that pair->pointer is read before
>> pair->counter
>> counter = pair->counter;
>> }
>> while (counter != temp);
>>
>> Correct? Easier way? Bunk?
>

> I think it is correct. It is effectively a seqlock:
> http://en.wikipedia.org/wiki/Seqlock

Agreed.

> You may also consider using SSE instructions to do atomic 16 byte loads.

I believe this is incorrect. The Intel manuals do not guarantee that
16-byte SSE instruction are atomic. They only guarantee that aligned
64-bit (or smaller) instructions and instructions with a LOCK prefix are
atomic, and you cannot apply a LOCK prefix to an SSE instruction.

See IA32 Software Dev. Manual Vol 3A, sections 7.1.1 and 7.1.2.2.

It might be that in practice SSE instructions are atomic, but they are
not guaranteed.

Anthony
--
Author of C++ Concurrency in Action http://www.stdthread.co.uk/book/
just::thread C++11 thread library http://www.stdthread.co.uk
Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk
15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK. Company No. 5478976

Dmitriy V'jukov

unread,

Feb 8, 2012, 4:19:52 AM2/8/12

to lock...@googlegroups.com

On Wed, Feb 8, 2012 at 12:19 PM, Anthony Williams <antho...@gmail.com> wrote:
>
>>> temp; // same integral type as counter
>>> counter = pair->counter;
>>> do
>>> {
>>> temp = counter;
>>> head = pair->pointer;
>>> COMPILER_BARRIER(); // to ensure that pair->pointer is read before
>>> pair->counter
>>> counter = pair->counter;
>>> }
>>> while (counter != temp);
>>>
>>> Correct? Easier way? Bunk?
>>
>>
>> I think it is correct. It is effectively a seqlock:
>> http://en.wikipedia.org/wiki/Seqlock
>
>
> Agreed.
>
>
>> You may also consider using SSE instructions to do atomic 16 byte loads.
>
>
> I believe this is incorrect. The Intel manuals do not guarantee that 16-byte
> SSE instruction are atomic. They only guarantee that aligned 64-bit (or
> smaller) instructions and instructions with a LOCK prefix are atomic, and
> you cannot apply a LOCK prefix to an SSE instruction.
>
> See IA32 Software Dev. Manual Vol 3A, sections 7.1.1 and 7.1.2.2.
>
> It might be that in practice SSE instructions are atomic, but they are not
> guaranteed.

I must have brain disorder...

Ephraim Sinowitz

unread,

Feb 8, 2012, 12:34:35 PM2/8/12

to lock...@googlegroups.com

What's your memory model :)

Dmitriy V'jukov

unread,

Feb 8, 2012, 1:00:10 PM2/8/12

to lock...@googlegroups.com

It's a way too relaxed :)

Eric Grant

unread,

Feb 10, 2012, 12:53:25 PM2/10/12

to lock...@googlegroups.com

Thanks, Dmitriy. I agree with your conclusion.

Samy, I'm afraid I don't understand your point about the memory model. If the "x86-TSO" memory model (see http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf) is correct, then the writing thread's execution of the lock cmpxchg16b instruction flushes that thread's store buffer, such that the reading thread always obtains the "current" value of both halves of the pointer/counter pair. No?

Cheers,
Eric

Samy Al Bahra

unread,

Feb 10, 2012, 2:46:47 PM2/10/12

to lock...@googlegroups.com, ma...@eagrant.com

Yes, you are correct with this example. Due to the global ordering of
locked instructions, you are guaranteed to obtain a current snapshot.
However, an important thing to realize is that the model breaks down
if you apply load/store combinations of varying widths on over-lapping
targets or if you mix locked instructions on over-lapping regions of
varying widths.

Quoted from the Intel manual (V3, CH8):
"Software should access semaphores (shared memory used for signalling
between multiple processors) using identical addresses and operand
lengths. For example, if one processor accesses a semaphore using a
word access, other processors should not access the semaphore using a
byte access."

This important detail can be easily forgotten and I've seen it break
applications in the wild.

--
Samy Al Bahra [http://repnop.org]

Dmitriy V'jukov

unread,

Feb 12, 2012, 5:31:02 AM2/12/12

to lock...@googlegroups.com, ma...@eagrant.com

On Fri, Feb 10, 2012 at 10:46 PM, Samy Al Bahra <sba...@repnop.org> wrote:
> Yes, you are correct with this example. Due to the global ordering of
> locked instructions, you are guaranteed to obtain a current snapshot.
> However, an important thing to realize is that the model breaks down
> if you apply load/store combinations of varying widths on over-lapping
> targets or if you mix locked instructions on over-lapping regions of
> varying widths.
>
> Quoted from the Intel manual (V3, CH8):
> "Software should access semaphores (shared memory used for signalling
> between multiple processors) using identical addresses and operand
> lengths. For example, if one processor accesses a semaphore using a
> word access, other processors should not access the semaphore using a
> byte access."
>
> This important detail can be easily forgotten and I've seen it break
> applications in the wild.

May you please describe how exactly it broke the application? Did the
semaphore cross cache line boundary? I was always curious how it can
happen.

Daniel Eloff

unread,

Mar 15, 2012, 3:20:13 PM3/15/12

to lock...@googlegroups.com, ma...@eagrant.com

What was the consensus here as regards to the atomicity of 16 byte reads? I asked a similar question on stackoverflow: http://stackoverflow.com/questions/9726566/atomic-16-byte-read-on-x64-cpus

My opinion is that the write is done atomically with cmpxchg16, so threads can only see the full 16 byte writes. As long as the readers use SSE so that a read doesn't span a context switch, I think it must be an atomic read. Is that correct?

Effectively atomic read of 16 bytes on x86_64 without cmpxchg16b?

Eric Grant

Samy Bahra

Dmitriy V'jukov

Samy Bahra

Anthony Williams

Dmitriy V'jukov

Ephraim Sinowitz

Dmitriy V'jukov

Eric Grant

Samy Al Bahra

Dmitriy V'jukov

Daniel Eloff