max without ifs

Öö Tiib

unread,

Sep 14, 2020, 8:42:41 AM9/14/20

to

I wanted somehow at weekend to find maximum of two int32_t-s without
branches (like ifs or ? operators) in C++.
Using gcc builtins I was able to write something quite quickly. It
looks like C but it compiles as C++ too.

#include <inttypes.h> // for int32_t related stuff

int32_t mad_max(int32_t a, int32_t b)
{
int32_t d, result;
int32_t o = -1 * __builtin_ssub_overflow( a, b, &d);
__builtin_ssub_overflow( a, (o ^ (d >> 31)) & d, &result);
return result;
}

It has no undefined behaviours I hope and where it does compile there
it seems to work. Demo:
<http://coliru.stacked-crooked.com/a/19eca82b689292e5>

However with raw C++ I am in trouble. Does anyone have some idea
how to do the trick in raw C++?

Melzzzzz

unread,

Sep 14, 2020, 8:49:51 AM9/14/20

to

You have pmaxsd instrunction and I guess intrinsic for x86. No need for
this abomination and much more efficient.

--
current job title: senior software engineer
skills: c++,c,rust,go,nim,haskell...

press any key to continue or any other to quit...
U ničemu ja ne uživam kao u svom statusu INVALIDA -- Zli Zec
Svi smo svedoci - oko 3 godine intenzivne propagande je dovoljno da jedan narod poludi -- Zli Zec
Na divljem zapadu i nije bilo tako puno nasilja, upravo zato jer su svi
bili naoruzani. -- Mladen Gogala

Bonita Montero

unread,

Sep 14, 2020, 8:49:57 AM9/14/20

to

#include <cstdint>

int f( int32_t a, int32_t b )
{
int32_t mask = ~(a < b ? -1 : 0);
return a & mask | b & ~mask;
}

The ternary operation is often substituted by a "sbb regx, regx".
Depends on the compiler you use.

Bonita Montero

unread,

Sep 14, 2020, 8:54:04 AM9/14/20

to

> int f( int32_t a, int32_t b )
> {
> int32_t mask = ~(a return a & mask | b & ~mask;
> }

Or better:

int f( int32_t a, int32_t b )
{

int32_t mask = a < b ? -1 : 0;
return a & ~mask | b & mask;
}

Öö Tiib

unread,

Sep 14, 2020, 9:02:06 AM9/14/20

to

On Monday, 14 September 2020 15:49:51 UTC+3, Melzzzzz wrote:
> On 2020-09-14, Öö Tiib <oot...@hot.ee> wrote:
> > I wanted somehow at weekend to find maximum of two int32_t-s without
> > branches (like ifs or ? operators) in C++.
> > Using gcc builtins I was able to write something quite quickly. It
> > looks like C but it compiles as C++ too.
> >
> > #include <inttypes.h> // for int32_t related stuff
> >
> > int32_t mad_max(int32_t a, int32_t b)
> > {
> > int32_t d, result;
> > int32_t o = -1 * __builtin_ssub_overflow( a, b, &d);
> > __builtin_ssub_overflow( a, (o ^ (d >> 31)) & d, &result);
> > return result;
> > }
> >
> > It has no undefined behaviours I hope and where it does compile there
> > it seems to work. Demo:
> ><http://coliru.stacked-crooked.com/a/19eca82b689292e5>
> >
> > However with raw C++ I am in trouble. Does anyone have some idea
> > how to do the trick in raw C++?
>
> You have pmaxsd instrunction and I guess intrinsic for x86. No need for
> this abomination and much more efficient.

I did try to ask for C++ not assembler nor compiler builtins.

James Kuyper

unread,

Sep 14, 2020, 9:58:56 AM9/14/20

to

On 9/14/20 8:42 AM, Öö Tiib wrote:
> I wanted somehow at weekend to find maximum of two int32_t-s without
> branches (like ifs or ? operators) in C++.
> Using gcc builtins I was able to write something quite quickly. It
> looks like C but it compiles as C++ too.
>
> #include <inttypes.h> // for int32_t related stuff
>
> int32_t mad_max(int32_t a, int32_t b)
> {
> int32_t d, result;
> int32_t o = -1 * __builtin_ssub_overflow( a, b, &d);
> __builtin_ssub_overflow( a, (o ^ (d >> 31)) & d, &result);
> return result;
> }
>
> It has no undefined behaviours I hope and where it does compile there
> it seems to work. Demo:

"undefined behavior" is defined by the C standard as behavior not
defined by the C standard, even if a particular implementation provides
it's own definition for the behavior.

Your code uses the identifier __builtin_ssub_overflow, which is not
defined by the C standard, and is reserved by that standard for all
uses. Declaring or defining that identifier yourself would render the
behavior of your program undefined.

Using it when it has been declared or defined by the implementation
(which is presumably the case on your system) has behavior that depends
entirely upon how the implementation defined __builtin_ssub_overflow().
It could be a syntax error, a constraint violation, or have undefined
behavior - or it might do precisely what you expect it to do.

Using that identifier in this manner when neither your own code nor the
implementation has provided a declaration or definition for it would be
a constraint violation: at least one diagnostic is mandatory, and
producing an executable program is optional. Should an implementation
choose to generate an executable program, and if you should choose to
execute it, the behavior is undefined.

Ben Bacarisse

unread,

Sep 14, 2020, 10:05:50 AM9/14/20

to

嘱 Tiib <oot...@hot.ee> writes:

> I wanted somehow at weekend to find maximum of two int32_t-s without
> branches (like ifs or ? operators) in C++.

You are trying to beat the compiler. Both std::max and the simple code
using ?: are optimised by gcc to minimal branch-free code. You function
comes out more than twice as long.

> Using gcc builtins I was able to write something quite quickly. It
> looks like C but it compiles as C++ too.
>
> #include <inttypes.h> // for int32_t related stuff
>
> int32_t mad_max(int32_t a, int32_t b)
> {
> int32_t d, result;
> int32_t o = -1 * __builtin_ssub_overflow( a, b, &d);
> __builtin_ssub_overflow( a, (o ^ (d >> 31)) & d, &result);
> return result;
> }
>
> It has no undefined behaviours I hope and where it does compile there
> it seems to work.

You have implementation-defined behaviour, though, in the signed shift
when d is negative. And you are assuming that int32_t matches int.

> Demo:
> <http://coliru.stacked-crooked.com/a/19eca82b689292e5>
>
> However with raw C++ I am in trouble. Does anyone have some idea
> how to do the trick in raw C++?

I presume "raw" means no std::max?

--
Ben.

David Brown

unread,

Sep 14, 2020, 10:17:08 AM9/14/20

to

How about:

int32_t mad_max2(int32_t a, int32_t b) {
return a > b : a ? b;
}

Write the source clearly, and let the /compiler/ generate the
appropriate object code. It is likely to do a better job than you in
most cases.

To start with, why do you think avoiding branches is useful?
/Sometimes/ it is - but depending on the target processor, the
predictability of the branch, and the surrounding code, branches can
sometimes be handled in the cpu's decoder units before they even make it
to the execution units. That's zero cost. Let the compiler put in a
branch if that's what suits with the rest of the instructions in the code.

The typical code for mad_max2 for generic x86_64 cpus is:

mad_max2(int, int):
cmp edi, esi
mov eax, esi
cmovge eax, edi
ret

No branches. But since a function like this is likely to be inlined,
the details in practice will depend on the surrounding code.

And don't bother with x86 intrinsics. If you know more about the target
processor, and know what instruction sets it supports, then tell the
compiler about them with "-march" flags. Then it will generate pmaxsd
instructions or whatever suits best.

David Brown

unread,

Sep 14, 2020, 10:22:00 AM9/14/20

to

That is a little less bad than the OP's suggestion, as it is reasonably
comprehensible. But it is still a dog's breakfast.

It's twenty years or more since it has made any sense to write this sort
of gibberish in anything but the most specialised of circumstances.

Öö Tiib

unread,

Sep 14, 2020, 11:12:39 AM9/14/20

to

On Monday, 14 September 2020 17:05:50 UTC+3, Ben Bacarisse wrote:
> 嘱 Tiib <oot...@hot.ee> writes:
>
> > I wanted somehow at weekend to find maximum of two int32_t-s without
> > branches (like ifs or ? operators) in C++.
>
> You are trying to beat the compiler. Both std::max and the simple code
> using ?: are optimised by gcc to minimal branch-free code. You function
> comes out more than twice as long.
>
> > Using gcc builtins I was able to write something quite quickly. It
> > looks like C but it compiles as C++ too.
> >
> > #include <inttypes.h> // for int32_t related stuff
> >
> > int32_t mad_max(int32_t a, int32_t b)
> > {
> > int32_t d, result;
> > int32_t o = -1 * __builtin_ssub_overflow( a, b, &d);
> > __builtin_ssub_overflow( a, (o ^ (d >> 31)) & d, &result);
> > return result;
> > }
> >
> > It has no undefined behaviours I hope and where it does compile there
> > it seems to work.
>
> You have implementation-defined behaviour, though, in the signed shift
> when d is negative. And you are assuming that int32_t matches int.

That implementation-defined is flaw that I asked ideas of how to get
rid of. If that assumption does not hold then the code is ill-formed
so I did not bother to add additional static_asserts doing same into
example that is not what I wanted to achieve anyway.

>
> > Demo:
> > <http://coliru.stacked-crooked.com/a/19eca82b689292e5>
> >
> > However with raw C++ I am in trouble. Does anyone have some idea
> > how to do the trick in raw C++?
>
> I presume "raw" means no std::max?

Yes, since it is unknown if that std::max uses branches or does not.
Actually same is unknown about __builtin_ssub_overflow but gcc docs
say they try their best.

Tim Rentsch

unread,

Sep 14, 2020, 11:27:24 AM9/14/20

to

Neither for() nor if() nor switch() now while() nor ?: is needed
for Turing completeness:

static int
give_a( int a, int b ){
return a;
}

static int
give_b( int a, int b ){
return b;
}

int
no_control_max( int a, int b ){
int (*choose[2])( int, int ) = { give_a, give_b };
return choose[ b > a ]( a, b );
}

Compiles cleanly as both C{90,99,11} and C++{98,03,11,14,17}.

Scott Lurndal

unread,

Sep 14, 2020, 11:35:52 AM9/14/20

to

Given the intel/amd 'CMOVxx' instruction and the ARM "CSET" instruction,
the compilers will generate the most optimal branchless code for a max
check even without optimization.

Öö Tiib

unread,

Sep 14, 2020, 11:46:55 AM9/14/20

to

On Monday, 14 September 2020 17:17:08 UTC+3, David Brown wrote:
> On 14/09/2020 14:42, Öö Tiib wrote:
> > I wanted somehow at weekend to find maximum of two int32_t-s without
> > branches (like ifs or ? operators) in C++.
> > Using gcc builtins I was able to write something quite quickly. It
> > looks like C but it compiles as C++ too.
> >
> > #include <inttypes.h> // for int32_t related stuff
> >
> > int32_t mad_max(int32_t a, int32_t b)
> > {
> > int32_t d, result;
> > int32_t o = -1 * __builtin_ssub_overflow( a, b, &d);
> > __builtin_ssub_overflow( a, (o ^ (d >> 31)) & d, &result);
> > return result;
> > }
> >
> > It has no undefined behaviours I hope and where it does compile there
> > it seems to work. Demo:
> > <http://coliru.stacked-crooked.com/a/19eca82b689292e5>
> >
> > However with raw C++ I am in trouble. Does anyone have some idea
> > how to do the trick in raw C++?
> >
>
> How about:
>
> int32_t mad_max2(int32_t a, int32_t b) {
> return a > b : a ? b;
> }

Thanks, that is same what std::max compiles to. Unfortunately there
is branch on RISC-V (for example) while my convoluted gibberish did
not have branch.

> Write the source clearly, and let the /compiler/ generate the
> appropriate object code. It is likely to do a better job than you in
> most cases.
>
> To start with, why do you think avoiding branches is useful?
> /Sometimes/ it is - but depending on the target processor, the
> predictability of the branch, and the surrounding code, branches can
> sometimes be handled in the cpu's decoder units before they even make it
> to the execution units. That's zero cost. Let the compiler put in a
> branch if that's what suits with the rest of the instructions in the code.
>
> The typical code for mad_max2 for generic x86_64 cpus is:
>
> mad_max2(int, int):
> cmp edi, esi
> mov eax, esi
> cmovge eax, edi
> ret

I know, but when the target isn't x86_x64 there can be branches.

>
> No branches. But since a function like this is likely to be inlined,
> the details in practice will depend on the surrounding code.
>
> And don't bother with x86 intrinsics. If you know more about the target
> processor, and know what instruction sets it supports, then tell the
> compiler about them with "-march" flags. Then it will generate pmaxsd
> instructions or whatever suits best.

That was my hope to get rid of whatever (not exactly x86 but gcc)
specific.

Bonita Montero

unread,

Sep 14, 2020, 11:49:29 AM9/14/20

to

>> int f( int32_t a, int32_t b )
>> {
>> int32_t mask = a > return a & ~mask | b & mask;
>> }

> That is a little less bad than the OP's suggestion, as it is reasonably
> comprehensible. But it is still a dog's breakfast.

It works if the compiler supports this trick.
But support for this is very common.

Öö Tiib

unread,

Sep 14, 2020, 1:01:45 PM9/14/20

to

Yes it works where it works, thanks. David is also correct that what
std::max does (a > b ? a : b) is also often without branches on
such platforms.

Bonita Montero

unread,

Sep 14, 2020, 1:08:40 PM9/14/20

to

>> It works if the compiler supports this trick.
>> But support for this is very common.

> Yes it works where it works, thanks. David is also correct that what
> std::max does (a > b ? a : b) is also often without branches on
> such platforms.

std::max normally maps to "a > b ? a : b" and and this results usually
in a branch where a predicated move isn't possible. The trick I've shown
is common with most compilers and today's compilers detect what's going
on there and further replace it with a conditional move. So you have
both: a conditional move on platforms that support it and a non-branched
version on platforms that usually would use a branch.
And I think the SSE-code would be rather slow as you would have to store
the GPR-register into memory and then from memory to the SSE-register
and then the way back.

David Brown

unread,

Sep 14, 2020, 2:19:47 PM9/14/20

to

I can't imagine a compiler that would recognize the pattern you've given
here and generate decent code, but would fail to give at least as good
results for the simple, clean, obvious "max" expression.

Code like you've given here should be rejected by any code review if it
is not accompanied with detailed test results showing exactly why the
"max" function is performance critical, and demonstrating that this
source code gives significant measurably better results than the simple
version.

Code like the OP's, however, should be rejected outright - with a
suggestion that the programmer re-think his or her priorities in writing
code.

Bonita Montero

unread,

Sep 14, 2020, 2:22:46 PM9/14/20

to

> I can't imagine a compiler that would recognize the pattern you've given
> here and generate decent code, but would fail to give at least as good
> results for the simple, clean, obvious "max" expression.

Usually it results in "cmp regA, regB; sbb regC, regC" if the compiler
doesn't detect the data-flow and further replaces it with a CMOV if
possible. I used this pattern a long time ago and it was supported by
MSVC and gcc.

> Code like you've given here should be rejected by any code review if it
> is not accompanied with detailed test results showing exactly why the
> "max" function is performance critical, and demonstrating that this
> source code gives significant measurably better results than the simple
> version.

Idiocracy ...

David Brown

unread,

Sep 14, 2020, 2:25:28 PM9/14/20

to

On 14/09/2020 17:46, Öö Tiib wrote:
> On Monday, 14 September 2020 17:17:08 UTC+3, David Brown wrote:
>> On 14/09/2020 14:42, Öö Tiib wrote:
>>> I wanted somehow at weekend to find maximum of two int32_t-s without
>>> branches (like ifs or ? operators) in C++.
>>> Using gcc builtins I was able to write something quite quickly. It
>>> looks like C but it compiles as C++ too.
>>>
>>> #include <inttypes.h> // for int32_t related stuff
>>>
>>> int32_t mad_max(int32_t a, int32_t b)
>>> {
>>> int32_t d, result;
>>> int32_t o = -1 * __builtin_ssub_overflow( a, b, &d);
>>> __builtin_ssub_overflow( a, (o ^ (d >> 31)) & d, &result);
>>> return result;
>>> }
>>>
>>> It has no undefined behaviours I hope and where it does compile there
>>> it seems to work. Demo:
>>> <http://coliru.stacked-crooked.com/a/19eca82b689292e5>
>>>
>>> However with raw C++ I am in trouble. Does anyone have some idea
>>> how to do the trick in raw C++?
>>>
>>
>> How about:
>>
>> int32_t mad_max2(int32_t a, int32_t b) {
>> return a > b : a ? b;
>> }
>
> Thanks, that is same what std::max compiles to. Unfortunately there
> is branch on RISC-V (for example) while my convoluted gibberish did
> not have branch.
>

Have you done comprehensive tests showing that your function here is
performance critical to the code, that your gibberish is actually faster
than simple branched code, and that this applies even when used in
/real/ code where the simple version can be used for further
optimisation, unlike your gibberish? I strongly suspect not.

>> Write the source clearly, and let the /compiler/ generate the
>> appropriate object code. It is likely to do a better job than you in
>> most cases.
>>
>> To start with, why do you think avoiding branches is useful?
>> /Sometimes/ it is - but depending on the target processor, the
>> predictability of the branch, and the surrounding code, branches can
>> sometimes be handled in the cpu's decoder units before they even make it
>> to the execution units. That's zero cost. Let the compiler put in a
>> branch if that's what suits with the rest of the instructions in the code.
>>
>> The typical code for mad_max2 for generic x86_64 cpus is:
>>
>> mad_max2(int, int):
>> cmp edi, esi
>> mov eax, esi
>> cmovge eax, edi
>> ret
>
> I know, but when the target isn't x86_x64 there can be branches.
>

So what? On many targets, branches are cheap. On those that are not,
the compiler will avoid branches. And if the target port is relatively
young on gcc (like RISC-V) and not well optimised as yet, then messing
around with this stuff is just bikeshedding.

Öö Tiib

unread,

Sep 14, 2020, 2:52:40 PM9/14/20

to

If you think that I somehow considered it as good and useful example
of programming then it is incorrect. Possibly it is miscommunication of
mine that you have such impression. I only wanted to demonstrate that
the weird code gives correct answers and has no branches on platforms
with overflow flag.

>
> >> Write the source clearly, and let the /compiler/ generate the
> >> appropriate object code. It is likely to do a better job than you in
> >> most cases.
> >>
> >> To start with, why do you think avoiding branches is useful?
> >> /Sometimes/ it is - but depending on the target processor, the
> >> predictability of the branch, and the surrounding code, branches can
> >> sometimes be handled in the cpu's decoder units before they even make it
> >> to the execution units. That's zero cost. Let the compiler put in a
> >> branch if that's what suits with the rest of the instructions in the code.
> >>
> >> The typical code for mad_max2 for generic x86_64 cpus is:
> >>
> >> mad_max2(int, int):
> >> cmp edi, esi
> >> mov eax, esi
> >> cmovge eax, edi
> >> ret
> >
> > I know, but when the target isn't x86_x64 there can be branches.
> >
>
> So what? On many targets, branches are cheap. On those that are not,
> the compiler will avoid branches. And if the target port is relatively
> young on gcc (like RISC-V) and not well optimised as yet, then messing
> around with this stuff is just bikeshedding.

I know. I did try to mention that I did it at weekend. And shared
without desire to spoil anyone's mood, sorry. ;)

David Brown

unread,

Sep 14, 2020, 3:39:58 PM9/14/20

to

Ah, okay. That makes more sense. My apologies if I misunderstood your
intentions.

I have many times seen people write "smart-arse" code, thinking it was a
good idea, when almost invariably it is not. Having had to deal with
such code on occasion, I react strongly against it when I see it. I
must admit I thought it was odd to see it coming from you!

>
>>
>>>> Write the source clearly, and let the /compiler/ generate the
>>>> appropriate object code. It is likely to do a better job than you in
>>>> most cases.
>>>>
>>>> To start with, why do you think avoiding branches is useful?
>>>> /Sometimes/ it is - but depending on the target processor, the
>>>> predictability of the branch, and the surrounding code, branches can
>>>> sometimes be handled in the cpu's decoder units before they even make it
>>>> to the execution units. That's zero cost. Let the compiler put in a
>>>> branch if that's what suits with the rest of the instructions in the code.
>>>>
>>>> The typical code for mad_max2 for generic x86_64 cpus is:
>>>>
>>>> mad_max2(int, int):
>>>> cmp edi, esi
>>>> mov eax, esi
>>>> cmovge eax, edi
>>>> ret
>>>
>>> I know, but when the target isn't x86_x64 there can be branches.
>>>
>>
>> So what? On many targets, branches are cheap. On those that are not,
>> the compiler will avoid branches. And if the target port is relatively
>> young on gcc (like RISC-V) and not well optimised as yet, then messing
>> around with this stuff is just bikeshedding.
>
> I know. I did try to mention that I did it at weekend. And shared
> without desire to spoil anyone's mood, sorry. ;)
>

Fair enough.

One thing that might come out of this would be how gcc handles the
simple max code in gcc for RISC-V. If the code is not optimal (I don't
know the RISC-V well enough to tell) and there is a better branchless
pattern, then it could be worth filing it as a missed optimisation bug
in gcc. It's always useful for them to get common patterns into the
peephole optimiser for each target.

Melzzzzz

unread,

Sep 14, 2020, 9:34:54 PM9/14/20

to

Then don't bother.

Bonita Montero

unread,

Sep 15, 2020, 1:04:52 AM9/15/20

to

> Neither for() nor if() nor switch() now while() nor ?: is needed
> for Turing completeness:
> static int
> give_a( int a, int b ){
> return a;
> }
>
> static int
> give_b( int a, int b ){
> return b;
> }
>
> int
> no_control_max( int a, int b ){
> int (*choose[2])( int, int ) = { give_a, give_b };
> return choose[ b > a ]( a, b );
> }
> Compiles cleanly as both C{90,99,11} and C++{98,03,11,14,17}.

It needs a good compiler to optimize away these stupid
function-calls. Otherwise this code would be rather slow.

Öö Tiib

unread,

Sep 15, 2020, 2:56:15 AM9/15/20

to

Yes, but it usually is not generating branches like I did ask.

Your trick, Bonita, is also good as majority of compilers
(and all popular) optimize that ?: away on targets.
For example on RISC-V:

bonita_max(int, int):
slt a5,a0,a1
subw a5,zero,a5
xor a1,a0,a1
and a1,a1,a5
xor a0,a0,a1
sext.w a0,a0
ret

tim_max(int, int):
sgt a5,a1,a0
slli a4,a5,3
lui a5,%hi(.LANCHOR0)
addi a5,a5,%lo(.LANCHOR0)
add a5,a5,a4
ld t1,0(a5)
jr t1

Interesting trick, jumping to address of called function and letting
it to return for you. Haven't profiled, my own convoluted code most
likely performs worst everywhere. :D

Bonita Montero

unread,

Sep 15, 2020, 2:58:56 AM9/15/20

to

> Interesting trick, jumping to address of called function and letting
> it to return for you. Haven't profiled, my own convoluted code most
> likely performs worst everywhere. :D

Wouldn't work with shadow-stacks which will be introduced
by Intel's control-flow-enforcement technology.

Öö Tiib

unread,

Sep 15, 2020, 3:06:57 AM9/15/20

to

Doesn't matter as I mostly try to make Intel's competitors to look
good. By my philosophy world without monopolies is better for
everybody, Intel included.

Juha Nieminen

unread,

Sep 15, 2020, 3:32:55 AM9/15/20

to

Bonita Montero <Bonita....@gmail.com> wrote:
> #include <cstdint>
>
> int f( int32_t a, int32_t b )
> {
> int32_t mask = ~(a return a & mask | b & ~mask;
> }
>
> The ternary operation is often substituted by a "sbb regx, regx".

Where's the carry flag coming from?

(Btw, if you use #include <cstdint> you have to use std::int32_t,
not int32_t. I don't think that the standard guarantees that
<cstdint> will dump the names into the global namespace.
You have to use <stdint.h> for that.)

Bonita Montero

unread,

Sep 15, 2020, 3:52:00 AM9/15/20

to

>> int f( int32_t a, int32_t b )
>> {
>> int32_t mask = ~(a > return a & mask | b & ~mask;
>> }
>> The ternary operation is often substituted by a "sbb regx, regx".

> Where's the carry flag coming from?

"cmp regA, regB; sbb regMask, regMask"

Tim Rentsch

unread,

Sep 15, 2020, 4:45:57 AM9/15/20

to

Tiib <oot...@hot.ee> writes:

> On Tuesday, 15 September 2020 08:04:52 UTC+3, Bonita Montero wrote:
>
>>> Neither for() nor if() nor switch() now while() nor ?: is needed
>>> for Turing completeness:
>>> static int
>>> give_a( int a, int b ){
>>> return a;
>>> }
>>>
>>> static int
>>> give_b( int a, int b ){
>>> return b;
>>> }
>>>
>>> int
>>> no_control_max( int a, int b ){
>>> int (*choose[2])( int, int ) = { give_a, give_b };
>>> return choose[ b > a ]( a, b );
>>> }
>>> Compiles cleanly as both C{90,99,11} and C++{98,03,11,14,17}.
>>
>> It needs a good compiler to optimize away these stupid
>> function-calls. Otherwise this code would be rather slow.
>
> Yes, but it usually is not generating branches like I did ask.

Do you mind if I ask what is motivating the original question?
Or what the criteria are for a successful answer? The answer
I gave was somewhat tongue-in-cheek since even though it was
not using if() or ?: I was pretty sure it was not what you
were looking for.

I do have a solution that uses only common operators, and no
function call trickery (and of course no library functions), and
thus likely satisfies what you would call "raw" C++, but before I
post something more I want to be sure I know what it is you're
really looking for.

> Your trick, Bonita, is also good as majority of compilers
> (and all popular) optimize that ?: away on targets.
> For example on RISC-V:
>
> bonita_max(int, int):
> slt a5,a0,a1
> subw a5,zero,a5
> xor a1,a0,a1
> and a1,a1,a5
> xor a0,a0,a1
> sext.w a0,a0
> ret
>
> tim_max(int, int):
> sgt a5,a1,a0
> slli a4,a5,3
> lui a5,%hi(.LANCHOR0)
> addi a5,a5,%lo(.LANCHOR0)
> add a5,a5,a4
> ld t1,0(a5)
> jr t1
>
> Interesting trick, jumping to address of called function and letting
> it to return for you.

As tail calls go it's one of the simplest, which makes it one
of the easiest optimizations to do. I'm surprised in a way
that compilers don't do this particular optimization even
at level -O0.

Öö Tiib

unread,

Sep 15, 2020, 9:04:19 AM9/15/20

to

On Tuesday, 15 September 2020 11:45:57 UTC+3, Tim Rentsch wrote:
> Tiib <oot...@hot.ee> writes:
>
> > On Tuesday, 15 September 2020 08:04:52 UTC+3, Bonita Montero wrote:
> >
> >>> Neither for() nor if() nor switch() now while() nor ?: is needed
> >>> for Turing completeness:
> >>> static int
> >>> give_a( int a, int b ){
> >>> return a;
> >>> }
> >>>
> >>> static int
> >>> give_b( int a, int b ){
> >>> return b;
> >>> }
> >>>
> >>> int
> >>> no_control_max( int a, int b ){
> >>> int (*choose[2])( int, int ) = { give_a, give_b };
> >>> return choose[ b > a ]( a, b );
> >>> }
> >>> Compiles cleanly as both C{90,99,11} and C++{98,03,11,14,17}.
> >>
> >> It needs a good compiler to optimize away these stupid
> >> function-calls. Otherwise this code would be rather slow.
> >
> > Yes, but it usually is not generating branches like I did ask.
>
> Do you mind if I ask what is motivating the original question?
> Or what the criteria are for a successful answer? The answer
> I gave was somewhat tongue-in-cheek since even though it was
> not using if() or ?: I was pretty sure it was not what you
> were looking for.

I understood it was tongue-in cheek but it was still interesting
to look what compilers do from it at different platforms, thanks.
Actually weather was bad at weekend and so I was reading various
interesting computation and vectorising algorithm papers (doing
lot of same calculation in parallel). So that silly idea just
randomly popped into mind.

Yes, these usually do with ordinary call. Here was (kind of
calculated) function pointer, may be that somehow affects it.

Marcel Mueller

unread,

Sep 20, 2020, 5:23:08 AM9/20/20

to

Am 15.09.20 um 08:55 schrieb Öö Tiib:

> Interesting trick, jumping to address of called function and letting
> it to return for you.

TCO (Tail Call Optimization) is probably as old as the assembler languages.

Definitely slow on modern hardware are indirect calls (or jumps) with a
calculated target. They usually result in pipeline stalls because they
are difficult to predict in hardware.

Conditional memory access is usually the fastest for min/max if
available on the target hardware.

Marcel

jacobnavia

unread,

Sep 20, 2020, 9:54:35 AM9/20/20

to

Le 14/09/2020 à 17:35, Scott Lurndal a écrit :
> David Brown <david...@hesbynett.no> writes:
>> On 14/09/2020 14:53, Bonita Montero wrote:
>>>> int f( int32_t a, int32_t b )
>>>> {

>>>> Â Â Â Â Â int32_t mask = ~(a >>> Â Â Â Â Â return a & mask | b & ~mask;

>>>> }
>>>
>>> Or better:
>>>
>>> int f( int32_t a, int32_t b )
>>> {

>>> Â Â Â Â int32_t mask = a >> Â Â Â Â return a & ~mask | b & mask;

>>> }
>>
>> That is a little less bad than the OP's suggestion, as it is reasonably
>> comprehensible. But it is still a dog's breakfast.
>>
>> It's twenty years or more since it has made any sense to write this sort
>> of gibberish in anything but the most specialised of circumstances.
>>
>
> Given the intel/amd 'CMOVxx' instruction and the ARM "CSET" instruction,
> the compilers will generate the most optimal branchless code for a max
> check even without optimization.
>

In the arm target gcc will generate this code without optimization:

18 str x0, [sp, 8]
19 str x1, [sp]
20 ldr x0, [sp, 8]
21 ldr w1, [x0]
22 ldr x0, [sp]
23 ldr w0, [x0]
24 cmp w1, w0
25 bge .L2
26 ldr x0, [sp]
27 b .L3
28 .L2:
29 ldr x0, [sp, 8]
30 .L3:

Yes, it generates branches. your assertion above is falsified