Reasons for inclusion of atomic::fetch_add/sub/and/or/xor()


je...@preshing.com

Mar 24, 2015, 12:56:59 PM
to std-dis...@isocpp.org
Hi,

Technically, it's possible to perform every conceivable low-level atomic operation using only std::atomic<>::compare_exchange_weak(), in a way that is lock-free if compare_exchange_weak() is lock-free.
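For instance, here's a minimal sketch of a fetch_add built that way (cas_fetch_add is just an illustrative name, not a proposal):

#include <atomic>

// fetch_add expressed purely in terms of compare_exchange_weak.
// This stays lock-free whenever compare_exchange_weak is lock-free.
int cas_fetch_add(std::atomic<int> &a, int arg)
{
    int expected = a.load(std::memory_order_relaxed);
    // On failure, expected is reloaded with the current value, so just retry.
    while (!a.compare_exchange_weak(expected, expected + arg))
    {
    }
    return expected; // the value observed immediately before the addition
}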

I'm looking for the complete set of reasons why these other (low-level) member functions exist:

fetch_add
fetch_sub
fetch_and
fetch_or
fetch_xor

The obvious reasons might be:

* Because of x86. These particular operations can be implemented on x86 as a single instruction using x86's LOCK prefix. A compare_exchange_weak loop for these operations would be less efficient on x86. (On Power and ARM, they all get converted to compare_exchange_weak anyway.) 
* To make code more concise. I can see this argument for fetch_add/sub, since those operations seem to appear in a lot of lock-free algorithms. The argument for and/or/xor is less convincing.

I can't find the exact reasoning anywhere on open-std.org. Have I identified the actual reasons?

Are there any other reasons I'm not aware of?

Is there any other architecture besides x86/64 that benefits directly from this library design?

Thanks,
Jeff

Thiago Macieira

Mar 24, 2015, 2:30:27 PM
to std-dis...@isocpp.org
On Tuesday 24 March 2015 09:56:59 je...@preshing.com wrote:
> * Because of x86. These particular operations can be implemented on x86 as
> a single instruction using x86's LOCK prefix. A compare_exchange_weak loop
> for these operations would be less efficient on x86. (On Power and ARM,
> they all get converted to compare_exchange_weak anyway.)
> * To make code more concise. I can see this argument for fetch_add/sub,
> since those operations seem to appear in a lot of lock-free algorithms. The
> argument for and/or/xor is less convincing.
>
> I can't find the exact reasoning anywhere on open-std.org. Have I
> identified the actual reasons?
>
> Are there any other reasons I'm not aware of?
>
> Is there any other architecture besides x86/64 that benefits directly from
> this library design?

IA-64. Any other non-LL/SC architecture too, except that I don't know of any
besides the two IAs.

In any case, reason #1 is enough reason to have those functions, since we're
talking about very, very low-level classes here.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
PGP/GPG: 0x6EF45358; fingerprint:
E067 918B B660 DBD1 105C 966C 33F5 F005 6EF4 5358

Matthew Woehlke

Mar 24, 2015, 3:14:29 PM
to std-dis...@isocpp.org
On 2015-03-24 14:30, Thiago Macieira wrote:
> On Tuesday 24 March 2015 09:56:59 je...@preshing.com wrote:
>> * Because of x86. These particular operations can be implemented on x86 as
>> a single instruction using x86's LOCK prefix. A compare_exchange_weak loop
>> for these operations would be less efficient on x86. (On Power and ARM,
>> they all get converted to compare_exchange_weak anyway.)
>> * To make code more concise. I can see this argument for fetch_add/sub,
>> since those operations seem to appear in a lot of lock-free algorithms. The
>> argument for and/or/xor is less convincing.
>>
>> I can't find the exact reasoning anywhere on open-std.org. Have I
>> identified the actual reasons?
>>
>> Are there any other reasons I'm not aware of?
>>
>> Is there any other architecture besides x86/64 that benefits directly from
>> this library design?
>
> IA-64. Any other non-LL/SC architecture too, except that I don't know of any
> besides the two IAs.

Not that anyone uses it any more :-), but I want to say SPARC has some
low-level operations of this nature besides just CAS. I would imagine that
architectures that have CAS are likely to have other similar operations,
especially as increment and decrement are very common things to do to an
atomic (think of anyone who has designed an architecture specifically to
have very fast atomic reference counting).

> In any case, reason #1 is enough reason to have those functions, since we're
> talking about very, very low-level classes here.

I'll echo that; if you're doing stuff where you need *atomics*, you
probably really, really want the fastest possible atomics, and that
means ADD/INC/DEC, not just CAS, especially as the former are guaranteed
to execute in constant time, while a CAS loop is theoretically not
guaranteed to *ever* complete.

(I'm actually a little surprised, come to think of it, that there are no
INC/DEC wrappers... but maybe no interesting architecture has those that
doesn't also have at least ADD[/SUB].)

--
Matthew

Thiago Macieira

Mar 24, 2015, 4:49:37 PM
to std-dis...@isocpp.org
On Tuesday 24 March 2015 15:14:15 Matthew Woehlke wrote:
> > In any case, reason #1 is enough reason to have those functions, since
> > we're talking about very, very low-level classes here.
>
> I'll echo that; if you're doing stuff where you need *atomics*, you
> probably really, really want the fastest possible atomics, and that
> means ADD/INC/DEC, not just CAS, especially as the former are guaranteed
> to execute in constant time, while a CAS loop is theoretically not
> guaranteed to *ever* complete.
>
> (I'm actually a little surprised, come to think of it, that there are no
> INC/DEC wrappers... but maybe no interesting architecture has those that
> doesn't also have at least ADD[/SUB].)

Going off-topic...

I'm not sure I understand your last paragraph. Why would you need something
specific for inc/dec? The compiler should be allowed to choose the best
instruction, and that may depend on other factors such as the number of bytes
required for aligning further instructions.

#include <atomic>

extern "C" {
        void f(std::atomic<int> &i) { i.fetch_add(1); }
}

Clang 3.6:
f:
        lock
        incl    (%rdi)
        retq

GCC 4.9:
f:
        lock addl       $1, (%rdi)
        ret

ICC 15:
f:
        movl      $1, %ecx
        lock
        addl      %ecx, (%rdi)
        ret

See: http://goo.gl/UWWYws

If you need some more specialised instructions, see <x86intrin.h> and
<immintrin.h> (_bit_scan_forward, _bittest, _rotl/_rotr, etc.)

Back on topic:

The Intel SDM says these operations can also be atomic:
add with carry (ADC)
bit test and complement (BTC)
bit test and reset (BTR)
bit test and set (BTS)
2's complement negation (NEG)
1's complement negation (NOT)
subtract with borrow (SBB)

I don't see much value in the ADC, NEG and SBB instructions being made part of
std::atomic, but the bit ones make sense. Right now, they can only be
implemented as part of a CAS loop.
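To be concrete, a hand-rolled bit-test-and-set written as such a CAS loop might look roughly like this (a sketch, not what any particular compiler emits):

#include <atomic>

// Atomically set a bit and report its previous state -- what LOCK BTS does in
// a single instruction.
bool cas_bit_test_and_set(std::atomic<unsigned> &value, int bit)
{
    unsigned mask = 1u << bit;
    unsigned old = value.load(std::memory_order_relaxed);
    // Retry until no other thread modified the word between the load and the CAS.
    while (!value.compare_exchange_weak(old, old | mask))
    {
    }
    return (old & mask) != 0;
}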

Matthew Woehlke

Mar 24, 2015, 5:21:56 PM
to std-dis...@isocpp.org
On 2015-03-24 16:49, Thiago Macieira wrote:
> On Tuesday 24 March 2015 15:14:15 Matthew Woehlke wrote:
>>> In any case, reason #1 is enough reason to have those functions, since
>>> we're talking about very, very low-level classes here.
>>
>> I'll echo that; if you're doing stuff where you need *atomics*, you
>> probably really, really want the fastest possible atomics, and that
>> means ADD/INC/DEC, not just CAS, especially as the former are guaranteed
>> to execute in constant time, while a CAS loop is theoretically not
>> guaranteed to *ever* complete.
>>
>> (I'm actually a little surprised, come to think of it, that there are no
>> INC/DEC wrappers... but maybe no interesting architecture has those that
>> doesn't also have at least ADD[/SUB].)
>
> Going off-topic...
>
> I'm not sure I understand your last paragraph. Why would you need something
> specific for inc/dec? The compiler should be allowed to choose the best
> instruction, and that may depend on other factors such as the number of bytes
> required for aligning further instructions.

Hmm... I guess if the compiler is clever enough to choose between INC
and a CAS loop if the arch has INC but not ADD, then that works. (I
think I'd thought about this at some point and then forgot again when I
wrote the above :-). Or I was thinking for some reason that wouldn't
work...)

> Back on topic:
>
> The Intel SDM says these operations can also be atomic:
> add with carry (ADC)
> bit test and complement (BTC)
> bit test and reset (BTR)
> bit test and set (BTS)
> 2's complement negation (NEG)
> 1's complement negation (NOT)
> subtract with borrow (SBB)
>
> I don't see much value in the ADC, NEG and SBB instructions being made part of
> std::atomic, but the bit ones make sense. Right now, they can only be
> implemented as part of a CAS loop.

Can't BTR/BTS/BTC be implemented via AND/OR/XOR? Per your argument
above, would you not expect/hope the compiler is clever enough to
transform an AND/OR/XOR into a BTR/BTS/BTC where appropriate? (Or
do I not understand what these do?) Granted that optimization is a bit
more complex than the INC/DEC case, but...

--
Matthew

Tony V E

Mar 24, 2015, 5:50:53 PM
to std-dis...@isocpp.org
On Tue, Mar 24, 2015 at 12:56 PM, <je...@preshing.com> wrote:
> Hi,
>
> Technically, it's possible to perform every conceivable low-level atomic operation using only std::atomic<>::compare_exchange_weak(), in a way that is lock-free if compare_exchange_weak() is lock-free.
>
> I'm looking for the complete set of reasons why these other (low-level) member functions exist:
>
> fetch_add
> fetch_sub
> fetch_and
> fetch_or
> fetch_xor
>
> The obvious reasons might be:
>
> * Because of x86. These particular operations can be implemented on x86 as a single instruction using x86's LOCK prefix. A compare_exchange_weak loop for these operations would be less efficient on x86. (On Power and ARM, they all get converted to compare_exchange_weak anyway.)


I don't have any references to back me up, but I'm pretty sure this was the reason.  And/or portability - basically, if these functions weren't added, some developers would write them in asm instead, to get that missing performance.  Whenever performance is "left on the table", you know some portion of devs will write it themselves instead.  Particularly with atomics, which are all about performance.  This is also why all the memory orderings are available - if we only offered sequential consistency, some would stick to non-portable code that did acquire/release/etc.
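As a rough illustration of the ordering point (not taken from any proposal): a reference count only needs a relaxed increment, and on ARM/Power the difference versus a seq_cst fetch_add is real fences.

#include <atomic>

std::atomic<int> refcount{1};

void add_ref()
{
    // Relaxed is enough for the increment; on x86 this compiles to the same
    // lock add either way, but on ARM/Power seq_cst would add barriers.
    refcount.fetch_add(1, std::memory_order_relaxed);
}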


Tony

 
> * To make code more concise. I can see this argument for fetch_add/sub, since those operations seem to appear in a lot of lock-free algorithms. The argument for and/or/xor is less convincing.
>
> I can't find the exact reasoning anywhere on open-std.org. Have I identified the actual reasons?
>
> Are there any other reasons I'm not aware of?
>
> Is there any other architecture besides x86/64 that benefits directly from this library design?
>
> Thanks,
> Jeff


Thiago Macieira

Mar 24, 2015, 7:26:44 PM
to std-dis...@isocpp.org
On Tuesday 24 March 2015 17:21:38 Matthew Woehlke wrote:
> > The Intel SDM says these operations can also be atomic:
> > add with carry (ADC)
> > bit test and complement (BTC)
> > bit test and reset (BTR)
> > bit test and set (BTS)
> > 2's complement negation (NEG)
> > 1's complement negation (NOT)
> > subtract with borrow (SBB)
> >
> > I don't see much value in the ADC, NEG and SBB instructions being made
> > part of std::atomic, but the bit ones make sense. Right now, they can
> > only be implemented as part of a CAS loop.
>
> Can't BTR/BTS/BTC be implemented via AND/OR/XOR? Per your argument
> above, would you not expect/hope the compiler is clever enough to
>> transform an AND/OR/XOR into a BTR/BTS/BTC where appropriate? (Or
> do I not understand what these do?) Granted that optimization is a bit
> more complex than the INC/DEC case, but...

Ah, I see what you mean. You want to do:

#include <atomic>

bool bittestandset(std::atomic<int> &value, int bit)
{
        const int v = 1 << bit;
        return value.fetch_or(v) & v;
}

bool bittestandreset(std::atomic<int> &value, int bit)
{
        const int v = 1 << bit;
        return value.fetch_and(~v) & v;
}

bool bittestandcomplement(std::atomic<int> &value, int bit)
{
        const int v = 1 << bit;
        return value.fetch_xor(v) & v;
}

Right now, the compilers aren't smart enough to turn the functions above into
the instructions in question; all three compile to CAS loops.

So: should we consider this a simple QoI issue or should we recommend adding
the bit manipulation functions?

Thiago Macieira

Mar 24, 2015, 8:08:12 PM
to std-dis...@isocpp.org
On Tuesday 24 March 2015 13:18:04 Tony V E wrote:
> I don't have any references to back me up, but I'm pretty sure this was the
> reason. And/or portability - basically, if these functions weren't added,
> some developers would write them in asm instead, to get that missing
> performance. Whenever performance is "left on the table", you know some
> portion of devs will write it themselves instead. Particularly with
> atomics, which are all about performance. This is also why all the memory
> orderings are available - if we only offered sequential consistency, some
> would stick to non-portable code that did acquire/release/etc.

Which is probably why we have both strong and weak compare_exchange methods.

Matthew Woehlke

Mar 25, 2015, 10:49:41 AM
to std-dis...@isocpp.org
On 2015-03-24 19:26, Thiago Macieira wrote:
> On Tuesday 24 March 2015 17:21:38 Matthew Woehlke wrote:
>>> The Intel SDM says these operations can also be atomic:
>>> add with carry (ADC)
>>> bit test and complement (BTC)
>>> bit test and reset (BTR)
>>> bit test and set (BTS)
>>> 2's complement negation (NEG)
>>> 1's complement negation (NOT)
>>> subtract with borrow (SBB)
>>>
>>> I don't see much value in the ADC, NEG and SBB instructions being made
>>> part of std::atomic, but the bit ones make sense. Right now, they can
>>> only be implemented as part of a CAS loop.
>>
>> Can't BTR/BTS/BTC be implemented via AND/OR/XOR? Per your argument
>> above, would you not expect/hope the compiler is clever enough to
>> transform an AND/OR/XOR into a BTR/BTS/BTC where appropriate? (Or
>> do I not understand what these do?) Granted that optimization is a bit
>> more complex than the INC/DEC case, but...
>
> Ah, I see what you mean. You want to do:
>
> bool bittestandset(std::atomic<int> &value, int bit)
> {
>         const int v = 1 << bit;
>         return value.fetch_or(v) & v;
> }
> [...]

That's the idea, yes. Not necessarily with the explicit helper function,
though; I was thinking of comparable operations appearing inline at a
point of use. (In particular, '1<<bit' might be written as a literal
constant, e.g. '0x0040', instead.)

Obviously, for BTS/BTR/BTC to be interesting, the specific use of the
return value that you correctly showed is critical. (Or, alternatively,
that the result is not used at all.)
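For example, a made-up call site with the mask as a literal:

#include <atomic>

std::atomic<unsigned> flags{0};

bool claim_slot()
{
    // Set bit 6 and report whether it was already set -- exactly the shape a
    // single LOCK BTS could cover.
    return (flags.fetch_or(0x0040) & 0x0040) != 0;
}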

> Right now, the compilers aren't smart enough to turn the functions above into
> the instructions in question; all three compile to CAS loops.

Really? I was under the impression that x86_64 supports atomic
OR/XOR/AND (and some very cursory research seems to support this). Is
that not the case?

> So: should we consider this a simple QoI issue or should we recommend adding
> the bit manipulation functions?

That's a good question, and I hope you're not asking my opinion :-),
because I don't feel qualified to offer one. (For one, I don't currently
know any case offhand where I would need such operations.) Seems worth
other people, especially ones closer to the problem, weighing in, though.

--
Matthew

Thiago Macieira

Mar 25, 2015, 12:21:13 PM
to std-dis...@isocpp.org
On Wednesday 25 March 2015 10:49:23 Matthew Woehlke wrote:
> > Right now, the compilers aren't smart enough to change the two above into
> > the instructions in question. All three created CAS loops.
>
> Really? I was under the impression that x86_64 supports atomic
> OR/XOR/AND (and some very cursory research seems to support this). Is
> that not the case?

x86 supports atomic AND, OR, XOR, SUB, NOT, NEG, etc., but those instructions
don't return the old value. Even atomic ADD doesn't -- there's a separate
instruction for that, XADD.

je...@preshing.com

Mar 25, 2015, 2:11:22 PM
to std-dis...@isocpp.org
For anyone interested, I tested GCC 4.9.2, Clang 3.4.2 and MSVC 2012 (with optimization).

If the return value from fetch_or() is not needed, they all use LOCK OR. Otherwise, they all fall back to a LOCK CMPXCHG loop.
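Roughly what I compiled (illustrative names, not the exact test file):

#include <atomic>

void set_flag(std::atomic<int> &x)    // result unused -> lock orl
{
    x.fetch_or(0x10);
}

int set_flag_old(std::atomic<int> &x) // result used -> lock cmpxchg loop
{
    return x.fetch_or(0x10);
}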

Myriachan

Mar 27, 2015, 4:11:41 PM
to std-dis...@isocpp.org
On Tuesday, March 24, 2015 at 1:49:37 PM UTC-7, Thiago Macieira wrote:
extern "C" {
        void f(std::atomic<int> &i) { i.fetch_add(1); }
}

Clang 3.6:
f:
        lock
        incl    (%rdi)
        retq

GCC 4.9:
f:
        lock addl       $1, (%rdi)
        ret

ICC 15:
f:
        movl      $1, %ecx
        lock      
        addl      %ecx, (%rdi)
        ret      

See: http://goo.gl/UWWYws


Visual Studio 2015 (/Ox /Os /favor:INTEL64):

        movl      $1, %eax
        lock
        xaddl     %eax, (%ecx)
        ret
