Atomic & Lockless: does std::atomic support int128_t completely and locklessly?

2,022 views
Skip to first unread message

Nemo Yu

unread,
Jan 17, 2016, 3:55:09 AM1/17/16
to std-dis...@isocpp.org
jhStart with some very easy examples. A linked list:

struct { void* node; void* next; };

which can be atomic by the std::atomic<int128_t>. Well, another one is a vector:
 
struct { size_t length; void* data};
 
which can also be atomic by the std::atomic<int128_t>. I can't tell you more about the benefit of atomic int128_t, but it does NOT a must for the standard, instead simulated by lock sometimes. Here goes my questions.
 
1) I did some searches in StackOverflow and knew that, only some very old machines not have the 16-byte CAS support, such like ARM 6, old AMDs and Intel M Series. The requirement for architecture is CMPXHCG(x86_64) and LDREX/STREX(RISC). Even, Windows 8.1(or higher) cannot be installed without the atomic int128_t. So is it popular of STL or other libraries to using the atomic int128_t or lock still? "It depends on implementation" does not satisfy me.

2) Before the 16-byte CAS is widely used(actually it is not as usual yet), the RCU(read-copy-update) strategy as well as 8-byte CAS(or lock) is popular(still in linux kernel 4.3, the latest), which should not be the case, in case of we have the 16-byte CAS a long time ago, why?

3) May I conclude that, we can use the 16-byte atomic operation in most cases, if we don't deal with the antiques?

4) May I say that, platform-relative locks are NOT further used now because
we have the better atomic operation?

5) Would the RCU strategy be useful against the atomic operation?

Thank you for reading. It is a long discussion about the atomic operation in C++,
so far as I knew, the Microsoft uses it for many cases already.

Andrey Semashev

unread,
Jan 17, 2016, 5:46:56 AM1/17/16
to std-dis...@isocpp.org
On 2016-01-17 11:55, Nemo Yu wrote:
> jhStart with some very easy examples. A linked list:
>
> struct { void* node; void* next; };
>
>
> which can be atomic by the std::atomic<int128_t>. Well, another one is a
> vector:
>
> struct { size_t length; void* data};
>
> which can also be atomic by the std::atomic<int128_t>. I can't tell you
> more about the benefit of atomic int128_t, but it does NOT a must for
> the standard, instead simulated by lock sometimes.

First, the standard does not define a 128-bit integer type. It's an
extension that some compilers provide and others don't. Second, the
standard does not guarantee lock-less atomicity for types of any size.
This is because of multiple factors, including hardware support and QoI.
It does allow you to detect whether the implementation is lock-less via
macros though. Third, the above structures are unlikely to be used for
lock-free programming as the algorithm would likely be vulnerable to the
ABA problem.

> Here goes my questions.
>
> 1) I did some searches in StackOverflow and knew that, only some very
> old machines not have the 16-byte CAS support, such like ARM 6, old AMDs
> and Intel M Series. The requirement for architecture is CMPXHCG(x86_64)
> and LDREX/STREX(RISC). Even, Windows 8.1(or higher) cannot be installed
> without the atomic int128_t. So is it popular of STL or other libraries
> to using the atomic int128_t or lock still? "It depends on
> implementation" does not satisfy me.

I'm sorry, but I don't think you'll get anything stronger than that.
Obviously, there are architectures that don't support 128-bit atomic
ops, and these architectures must be supported by C++.

If you want to limit yourself to architectures that do support them, you
might as well limit yourself to select language implementations. Check
if your compiler supports 128-bit atomics. Also check Boost.Atomic, it
does support them where possible.

What could be done though is the following:
1. Describe int128_t, uint128_t & co. in the standard. As usual, they
would be optional typedefs.
2. Add macros to detect presence of those types (along with other sized
integer types).
3. Add macros to detect lock-less property of atomic<> for sized types,
like it is done in Boost.Atomic.

> 2) Before the 16-byte CAS is widely used(actually it is not as usual
> yet), the RCU(read-copy-update) strategy as well as 8-byte CAS(or lock)
> is popular(still in linux kernel 4.3, the latest), which should not be
> the case, in case of we have the 16-byte CAS a long time ago, why?

128-bit CAS is considerably more costly than 32-bit or 64-bit ones. In a
lock-free algorithm you often need to perform more than one atomic op,
so the cost quickly starts to pile up to the point where it is cheaper
to use a spin-lock or a different algorithm that involves smaller types.

> 3) May I conclude that, we can use the 16-byte atomic operation in most
> cases, if we don't deal with the antiques?

I don't think 32-bit ARM architectures can be called antiques.

> 4) May I say that, platform-relative locks are NOT further used now because
> we have the better atomic operation?

Not sure what you mean by "platform-relative locks". Mutexes are used
much much more widely than atomics directly, if that's what you mean.

> 5) Would the RCU strategy be useful against the atomic operation?

Again, not sure what you mean.

Nemo Yu

unread,
Jan 17, 2016, 7:01:32 AM1/17/16
to std-dis...@isocpp.org
First, the standard does not define a 128-bit integer type. It's an extension that some compilers provide and others don't.
 
I have to admit I was wrong, where I said a 128-bit integer type, but is beyond the standard. I correct it now: I must say 16-byte CAS instead of 128-bit atomic, and I am not sure if it is one of the standard.

In a lock-free algorithm you often need to perform more than one atomic op, so the cost quickly starts to pile up to the point where it is cheaper to use a spin-lock or a different algorithm that involves smaller types.
 
Other threads will wait if you use spin-lock rather than just a CAS, which is what we are trying to avoid.
 
I don't think 32-bit ARM architectures can be called antiques.

You may understood. If I am not wrong, LDREX/STREX is for MISC but not just ARM, which is a subset of MISC, so I am not calling 32-bit ARMs the antiques, but only ARM 6 and previous ARMs.
 
128-bit CAS is considerably more costly than 32-bit or 64-bit ones.
 
No offense, the cost of 128-bit CAS will not be much greater than 64-bit CAS or 32-bit CAS. You should know about the implementation of CAS, which is the cheapest method of synchronization(locks).

Not sure what you mean by "platform-relative locks".

Each OS implements a lot of "lock"s, such like mutex, critical section, and spin lock. May I conclude that, CAS(spin lock) is the best way for locks.

Again, not sure what you mean.

The read-copy-update(RCU) is a strategy for avoiding costly locks. 



--

--- You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-discussio...@isocpp.org.
To post to this group, send email to std-dis...@isocpp.org.
Visit this group at https://groups.google.com/a/isocpp.org/group/std-discussion/.

Andrey Semashev

unread,
Jan 17, 2016, 7:38:39 AM1/17/16
to std-dis...@isocpp.org
On 2016-01-17 15:00, Nemo Yu wrote:
> First, the standard does not define a 128-bit integer type. It's an
> extension that some compilers provide and others don't.
>
> I have to admit I was wrong, where I said a 128-bit integer type, but is
> beyond the standard. I correct it now: I must say 16-byte CAS instead of
> 128-bit atomic, and I am not sure if it is one of the standard.

Ok, but thers's no difference. The interface for atomic ops in C++ is
atomic<>, which can be instantiated with a type that has size of 16
bytes even in the absence of int128_t. The language cannot guarantee
lock-less operations for it, just as for any other type.

> In a lock-free algorithm you often need to perform more than one
> atomic op, so the cost quickly starts to pile up to the point where
> it is cheaper to use a spin-lock or a different algorithm that
> involves smaller types.
>
> Other threads will wait if you use spin-lock rather than just a CAS,
> which is what we are trying to avoid.

Using CAS does not guarantee a wait-free algorithm.

> I don't think 32-bit ARM architectures can be called antiques.
>
> You may understood. If I am not wrong, LDREX/STREX is for MISC but not
> just ARM, which is a subset of MISC, so I am not calling 32-bit ARMs the
> antiques, but only ARM 6 and previous ARMs.

You were arguing that architectures without support for 128-bit atomics
are antiques and we may safely assume their presence. I gave you an
example of a wide spread architecture that does not have that support.
Clearly you are wrong in your assumptions.

> 128-bit CAS is considerably more costly than 32-bit or 64-bit ones.
>
> No offense, the cost of 128-bit CAS will not be much greater than 64-bit
> CAS or 32-bit CAS.

It will. Have a look at the numbers:

http://www.agner.org/optimize/instruction_tables.pdf

I don't have the numbers for architectures other than x86. In fact, I
can't even remember another architecture that supports 128-bit atomics,
besides things like transactional memory.

> Not sure what you mean by "platform-relative locks".
>
> Each OS implements a lot of "lock"s, such like mutex, critical section,
> and spin lock. May I conclude that, CAS(spin lock) is the best way for
> locks.

No, you can't.

What is "best"? Spin locks have a very niche applicability - when the
protected region of code is very small, like a few instructions. In most
cases you want to protect a larger region of code, and blocking instead
of spinning is often more preferable choice.

> Again, not sure what you mean.
>
> The read-copy-update(RCU) is a strategy for avoiding costly locks.

I'm aware of that. What I didn't understand is your question.

Nemo Yu

unread,
Jan 17, 2016, 8:25:06 AM1/17/16
to std-dis...@isocpp.org
The interface for atomic ops in C++ is atomic<>, which can be instantiated with a type that has size of 16 bytes even in the absence of int128_t

That's it. 
 
 gave you an example of a wide spread architecture that does not have that support. Clearly you are wrong in your assumptions.

I don't think the 32-bit ARM defeats my conclusion,  or I am wrong as I don't know ARM much. The 32-bit ARM 7 does have 16-byte CAS(LDREX/SDREX) if I am not wrong. What I said the antiques are ARM 6 and previous models, but you are talking all 32-bit ARMs, sounds like you think ARM 7 and later models are except for 32-bit.
 
It will. Have a look at the numbers:

Thank you for sharing. I will read the paper and do some searching, and discuss it later.
 
In most cases you want to protect a larger region of code, and blocking instead of spinning is often more preferable choice.

I am thinking about abandoning critical section(memory interrupt) and mutex(stop all other threads), because its high cost......


Andrey Semashev

unread,
Jan 17, 2016, 8:47:54 AM1/17/16
to std-dis...@isocpp.org
On 2016-01-17 16:25, Nemo Yu wrote:
>
> I don't think the 32-bit ARM defeats my conclusion, or I am wrong as I
> don't know ARM much. The 32-bit ARM 7 does have 16-byte CAS(LDREX/SDREX)
> if I am not wrong. What I said the antiques are ARM 6 and previous
> models, but you are talking all 32-bit ARMs, sounds like you think ARM 7
> and later models are except for 32-bit.

ARMv7 and ARMv6K have LDREXD/STREXD (notice the trailing D) which
perform a 64-bit exclusive load/store (or double width exclusive
load/store, hence the D). LDREX/SDREX perform a 32-bit exclusive
load/store. There are no instructions for 128-bit exclusive load/store.

> In most cases you want to protect a larger region of code, and
> blocking instead of spinning is often more preferable choice.
>
> I am thinking about abandoning critical section(memory interrupt) and
> mutex(stop all other threads), because its high cost......

You mean the context swiches? Or terminating a thread? Anyway, I don't
see how anything of that is better with spin locks - quite the contrary.

Nemo Yu

unread,
Jan 17, 2016, 9:17:11 AM1/17/16
to std-dis...@isocpp.org
 There are no instructions for 128-bit exclusive load/store.
 
I am not sure, discuss it later.

 You mean the context swiches? Or terminating a thread? Anyway, I don't see how anything of that is better with spin locks - quite the contrary.

I said abandoning critical section and mutex, but not spin-lock. The first two are pessimistic locks and the other is optimistic lock. If I have the power of 16-byte CAS, I can implement a linked list without pessimistic locks.

Tony V E

unread,
Jan 17, 2016, 10:48:06 AM1/17/16
to Nemo Yu
Many mutexes and critical sections already use spinlocks internally, spinning for a short while before going to the kernel. ‎ In a sense, they are optimistic, not pessimistic. 

Alternatively, if you are going to write lockfree code, I suggest reading/watching many/most/all the links on this page:


A summary of my talks, from one of my slides:


A Guide to Threaded Coding
Lock-free coding is the last thing you want to do.”

1.Stop sharing (Forget what you learned in Kindergarten)
2.Use Locks
3.Measure
4.Measure (your first try at measuring was probably wrong)
5.Change your Algorithm
6.GOTO 1
...
∞.    Lock-free
∞ + 1. Measure again, as lock-free doesn't guarantee faster.


‎As someone said at CppCon "Lock-free programming is hard. Correct lock-free programming is even harder."

(if anyone knows who said that, let me know - I heard it second hand, but want to give credit where credit is due)



Sent from my BlackBerry portable Babbage Device
From: Nemo Yu
Sent: Sunday, January 17, 2016 9:17 AM
Subject: Re: [std-discussion] Atomic & Lockless: does std::atomic support int128_t completely and locklessly?

Thiago Macieira

unread,
Jan 17, 2016, 12:33:50 PM1/17/16
to std-dis...@isocpp.org
On Sunday 17 January 2016 15:38:36 Andrey Semashev wrote:
> I don't have the numbers for architectures other than x86. In fact, I
> can't even remember another architecture that supports 128-bit atomics,
> besides things like transactional memory.

The only LL/SC architecture I know with a double-wide type of instruction is
ARM, with both LDREXD/STREXD on AArch32 and LDXP/STXP in AArch64 mode. The
other LL/SC architectures I have manuals for (MIPS, SPARC, PowerPC) do not
have this type of instruction.

This means those architectures in 32-bit mode must implement a locked
atomic<uint64_t>.

Double-wide atomics are more common on CAS architectures, like x86 and IA-64.
And even for IA-64, the instruction is weird because it only exchanges part of
the data (cmp16xchg8).

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
PGP/GPG: 0x6EF45358; fingerprint:
E067 918B B660 DBD1 105C 966C 33F5 F005 6EF4 5358

Casey Carter

unread,
Jan 17, 2016, 9:50:56 PM1/17/16
to ISO C++ Standard - Discussion
On Sunday, January 17, 2016 at 9:48:06 AM UTC-6, Tony VE wrote:
‎As someone said at CppCon "Lock-free programming is hard. Correct lock-free programming is even harder."

(if anyone knows who said that, let me know - I heard it second hand, but want to give credit where credit is due)

Fedor Pikus (https://youtu.be/lVBvHbJsg5Y?t=249)
 

Anthony Williams

unread,
Jan 18, 2016, 4:12:51 AM1/18/16
to std-dis...@isocpp.org
On 17/01/16 08:55, Nemo Yu wrote:
> 1) I did some searches in StackOverflow and knew that, only some very
> old machines not have the 16-byte CAS support, such like ARM 6, old AMDs
> and Intel M Series. The requirement for architecture is CMPXHCG(x86_64)
> and LDREX/STREX(RISC). Even, Windows 8.1(or higher) cannot be installed
> without the atomic int128_t. So is it popular of STL or other libraries
> to using the atomic int128_t or lock still? "It depends on
> implementation" does not satisfy me.

As Andrey said, there are still platforms supported by C++ which do not
have 16-byte lock-free atomics, even when they have other atomic types
which are lock-free. e.g. 32-bit x86 platforms or 32-bit ARM platforms.

You might think them "antique", but they are still widely used, and
supported.

Whether or not std::atomic<> is lock-free for any given type is purely a
property of the implementation. Some platforms supported by C++ do not
provide any lock-free atomics beyond std::atomic_flag.

> 2) Before the 16-byte CAS is widely used(actually it is not as usual
> yet), the RCU(read-copy-update) strategy as well as 8-byte CAS(or lock)
> is popular(still in linux kernel 4.3, the latest), which should not be
> the case, in case of we have the 16-byte CAS a long time ago, why?

People write code to be supported on the widest possible set of
platforms. Pointer-wide atomics (32-bit on a 32-bit platform, 64-bit on
a 64-bit platform) are far more prevalent than double-pointer-wide
atomics (64-bit on a 32-bit platform, 128-bit on a 64-bit platform), so
portable code is often written to assume the former rather than the latter.

> 3) May I conclude that, we can use the 16-byte atomic operation in most
> cases, if we don't deal with the antiques?

If you have a compiler that targets a platform with 16-byte lock-free
atomics, I would hope that the compiler will provide lock-free
std::atomic<16-byte-pod-struct>. My implementation (for 64-bit x86
platforms) does, assuming the code is running on a CPU that provides
CMPXCHG16B (it checks at program startup).

> 4) May I say that, platform-relative locks are NOT further used now because
> we have the better atomic operation?

In most cases, you can write multithreaded code that uses locks and
atomics in plain C++, without resorting to platform-specific code.
Sometimes, you need platform specifics, but not often.

> 5) Would the RCU strategy be useful against the atomic operation?

RCU is algorithm dependent, so it depends what you were trying to
achieve with your atomic operation.

Anthony
--
Author of C++ Concurrency in Action http://www.stdthread.co.uk/book/
just::thread C++11 thread library http://www.stdthread.co.uk
Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk
15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK. Company No. 5478976

Nemo Yu

unread,
Jan 18, 2016, 11:20:52 PM1/18/16
to std-dis...@isocpp.org
I am aware of and apology for a mistake: it must be a bug in my mind that I made 16-byte and double-pointer-size atomic operations equal(I knew they were not equal, but I thought they were the same when writing email, assuming we were talking about only the 64-bit architecture). So I shouldn't say a 32-bit ARM is antique, but a one without double-pointer-size operationsoperations.

The other LL/SC architectures I have manuals for (MIPS, SPARC, PowerPC) do not
have this type of instruction.
 
Thank you Macieira, your counterexamples help.


    128-bit CAS is considerably more costly than 32-bit or 64-bit ones.

No offense, the cost of 128-bit CAS will not be much greater than 64-bit
CAS or 32-bit CAS.

It will. Have a look at the numbers:

http://www.agner.org/optimize/instruction_tables.pdf

I read the tables, and watched the video

Fedor Pikus (https://youtu.be/lVBvHbJsg5Y?t=249)

 It does change my mind on atomic operations, which is high overhead.

                                                                                                                                                                          

Thank you all. I learned much, though it may not fit the purpose of this mailing list......

Thiago Macieira

unread,
Jan 19, 2016, 12:45:39 PM1/19/16
to std-dis...@isocpp.org
On Tuesday 19 January 2016 12:20:50 Nemo Yu wrote:
> I am aware of and apology for a mistake: it must be a bug in my mind that I
> made 16-byte and double-pointer-size atomic operations equal(I knew they
> were not equal, but I thought they were the same when writing email,
> assuming we were talking about only the 64-bit architecture). So I
> shouldn't say a 32-bit ARM is antique, but a one without
> double-pointer-size operationsoperations.

And that would be incorrect too. ARMv6 supports double-pointer-size atomics,
unlike most other LL/SC architectures.

> > The other LL/SC architectures I have manuals for (MIPS, SPARC, PowerPC) do
> > not
> > have this type of instruction.
>
> Thank you Macieira, your counterexamples help.

And the reason for that is the ABA problem. On LL/SC architectures, you can
detect the ABA situation during the store-conditional operation.

On architectures with pure CAS, you can't. So you use a generation counter in
a second word to ensure that you can't return to A.

PS: I misspoke in the other email. IA-64's instruction is cmp8xchg16, not
cmp16xchg8. The latter would be pointless.
Reply all
Reply to author
Forward
0 new messages