if a >= 4 then a = 3, in assembler, of course

DSF

unread,

Aug 31, 2010, 6:30:38 PM8/31/10

to

Hello!

I'm trying to see if there is a way to do the following without the
branch:

cmp edi, 4
jb here
mov edi, 3
here: ...

Must be 386-friendly code.

I'm not really concerned with the branch. I just have a nagging
feeling in the back of my mind that there is a way to do this and I
just can't remember it.

DSF

Niels Fröhling

unread,

Aug 31, 2010, 8:47:51 PM8/31/10

to

You could use the branch-free minimum(), if the branch concerns you. You should
mix it with other instructions for OoO.

mov eax, 3
sub eax, edi ; edi > 3 => eax negative distance
sbb ecx, ecx ; edi > 3 => ecx -1
and eax, ecx ; edi > 3 => eax negative distance
add edi, eax ; edi > 3 => edi += negative distance

Ciao
Niels

Terje Mathisen

unread,

Sep 1, 2010, 3:21:02 AM9/1/10

to

Niels Fröhling wrote:
> DSF wrote:
>> Hello!
>>
>> I'm trying to see if there is a way to do the following without the
>> branch:
>>
>> cmp edi, 4
>> jb here
>> mov edi, 3
>> here: ...
>>
>> Must be 386-friendly code.

Oops! I wrote my reply before I noticed this, see update below:

>>
>> I'm not really concerned with the branch. I just have a nagging
>> feeling in the back of my mind that there is a way to do this and I
>> just can't remember it.
>>
>> DSF
>
> You could use the branch-free minimum(), if the branch concerns you. You
> should mix it with other instructions for OoO.
>
> mov eax, 3
> sub eax, edi ; edi > 3 => eax negative distance
> sbb ecx, ecx ; edi > 3 => ecx -1
> and eax, ecx ; edi > 3 => eax negative distance
> add edi, eax ; edi > 3 => edi += negative distance

As Niels writes, this is definitely doable.

Since you're probably not targeting cpus which are more than 10-15 years
old, you can also use CMOV to get the same effect:

cmp edi,3
mov eax,3
cmova edi,eax ; Conditionally MOVe if EDI was Above 3

This version will normally take 3 cycles, since CMOV internally is split
into two micro-ops.

Update:

OK, with the 386 requirement this becomes harder but it might be
possible to shave a cycle off Niels version:

sub edi,3 ; Will borrow unless edi >= 3
sbb eax,eax ; -1 if (0,1,2) input
and edi,eax ; (0,1,2) unmodified, 3 and up -> zero
add edi,3

I.e. 4 or 5 cycles on a 486, probably 8 or 9 on a 386?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Rod Pemberton

unread,

Sep 1, 2010, 4:10:11 AM9/1/10

to

"Terje Mathisen" <"terje.mathisen at tmsw.no"@giganews.com> wrote in message
news:0gb1l7-...@ntp.tmsw.no...

> > DSF wrote:
> >> Hello!
> >>
> >> I'm trying to see if there is a way to do the following without the
> >> branch:
> >>
> >> cmp edi, 4
> >> jb here
> >> mov edi, 3
> >> here: ...
> >>
> >> Must be 386-friendly code.
>

> OK, with the 386 requirement this becomes harder but it might be
> possible to shave a cycle off Niels version:
>
> sub edi,3 ; Will borrow unless edi >= 3
> sbb eax,eax ; -1 if (0,1,2) input
> and edi,eax ; (0,1,2) unmodified, 3 and up -> zero
> add edi,3
>
> I.e. 4 or 5 cycles on a 486, probably 8 or 9 on a 386?
>

Que? Terje, did I read that correctly? I.e., you said:

3 + "3 and up -> zero" = 3
3 + "(0,1,2) unmodified" = 3, 4, 5

I think that is: "if a>=3, then a=3" and "if a<3, a=a+3". In the subject,
the OP explicitly asked for "if a>=4, then a=3", and implicitly "if a<4,
then a=a" ... Yes?

Rod Pemberton

Terje Mathisen

unread,

Sep 1, 2010, 5:25:37 AM9/1/10

to

No, unless I made a bad mistake and outsmarted myself:

; Input: 0, 1, 2, 3, 4, 5, 6...
sub edi,3 ; -3,-2,-1, 0, 1, 2, 3...
sbb eax,eax ; -1,-1,-1, 0, 0, 0, 0...
and edi,eax ; -3,-2,-1, 0, 0, 0, 0...
add edi,3 ; 0, 1, 2, 3, 3, 3, 3...

That does seem right, doesn't it?

wolfgang kern

unread,

Sep 1, 2010, 7:10:02 AM9/1/10

to

Terje Mathisen posted:
...
___

sub edi,3 ; Will borrow unless edi >= 3
sbb eax,eax ; -1 if (0,1,2) input
and edi,eax ; (0,1,2) unmodified, 3 and up -> zero
add edi,3

___

Good trick and worth to remember.
We should rename sbb eax,eax to "SETc eax" :)

__
wolfgang

Rod Pemberton

unread,

Sep 1, 2010, 4:09:55 PM9/1/10

to

"Terje Mathisen" <"terje.mathisen at tmsw.no"@giganews.com> wrote in message

news:kpi1l7-...@ntp.tmsw.no...

And yet, it gives new meaning to "unmodified"...

I'll remember to trust your code and not your words.

RP

Terje Mathisen

unread,

Sep 2, 2010, 2:54:49 AM9/2/10

to

Rod Pemberton wrote:
> And yet, it gives new meaning to "unmodified"...
>
> I'll remember to trust your code and not your words.

"Use the Source, Luke!"

DSF

unread,

Sep 2, 2010, 3:47:38 AM9/2/10

to

On Wed, 01 Sep 2010 09:21:02 +0200, Terje Mathisen <"terje.mathisen at
tmsw.no"@giganews.com> wrote:

>Niels Fröhling wrote:
>> DSF wrote:
>>> Hello!
>>>
>>> I'm trying to see if there is a way to do the following without the
>>> branch:
>>>
>>> cmp edi, 4
>>> jb here
>>> mov edi, 3
>>> here: ...
>>>
>>> Must be 386-friendly code.
>
>Oops! I wrote my reply before I noticed this, see update below:
>>>
>>> I'm not really concerned with the branch. I just have a nagging
>>> feeling in the back of my mind that there is a way to do this and I
>>> just can't remember it.
>>>
>>> DSF
>>
>> You could use the branch-free minimum(), if the branch concerns you. You
>> should mix it with other instructions for OoO.

I have to remember there are no stupid questions.. What is OoO?

>>
>> mov eax, 3
>> sub eax, edi ; edi > 3 => eax negative distance
>> sbb ecx, ecx ; edi > 3 => ecx -1
>> and eax, ecx ; edi > 3 => eax negative distance
>> add edi, eax ; edi > 3 => edi += negative distance
>
>As Niels writes, this is definitely doable.
>
>Since you're probably not targeting cpus which are more than 10-15 years
>old, you can also use CMOV to get the same effect:
>
> cmp edi,3
> mov eax,3
> cmova edi,eax ; Conditionally MOVe if EDI was Above 3
>
>This version will normally take 3 cycles, since CMOV internally is split
>into two micro-ops.
>

First thing I thought of. But I'm not sure how far CMOV goes back.
This code is in the GPR branch of GPR/SSE2 code and wanted it to be as
widely usable as possible. Perhaps ridiculously so.

>Update:
>
>OK, with the 386 requirement this becomes harder but it might be
>possible to shave a cycle off Niels version:
>
> sub edi,3 ; Will borrow unless edi >= 3
> sbb eax,eax ; -1 if (0,1,2) input
> and edi,eax ; (0,1,2) unmodified, 3 and up -> zero
> add edi,3
>
>I.e. 4 or 5 cycles on a 486, probably 8 or 9 on a 386?

I do like that your code uses one less register. That may be handy.
I started with 6502 code, so when I got to X86 it seemed like an
endless supply of registers...that feeling didn't last too long.

Terje, since you seem to be a (the?) guru here on branch prediction,
does a really short branch have any problems? For example, the code
snippet I started this thread with: (most recent version)

cmp edi, 4
jb lt4
mov edi, 3
lt4:
dec edx

The branch jumps over just one instruction. This seems like the
kind of thing where a processor could execute both possibilities and
just discard the results of the mov instruction if the branch is
taken. Maybe a better way to put it is, what happens if both
possibilities for the branch are in the pipeline?

>
>Terje

DSF

Niels Fröhling

unread,

Sep 2, 2010, 4:40:50 AM9/2/10

to

DSF wrote:

>>> You could use the branch-free minimum(), if the branch concerns you. You
>>> should mix it with other instructions for OoO.
>
> I have to remember there are no stupid questions.. What is OoO?

Out-of-order. It refers to the capacity of a CPU to execute multiple
independent instructions at the same time. 386 doesn't have it, but modern ones
do it.
For example:

>>> mov eax, 3
>>> sub eax, edi ; edi> 3 => eax negative distance
>>> sbb ecx, ecx ; edi> 3 => ecx -1
>>> and eax, ecx ; edi> 3 => eax negative distance
>>> add edi, eax ; edi> 3 => edi += negative distance

>> sub edi,3 ; Will borrow unless edi>= 3

>> sbb eax,eax ; -1 if (0,1,2) input
>> and edi,eax ; (0,1,2) unmodified, 3 and up -> zero
>> add edi,3

can actually be executed in parallel with other code, like this (take care not
to flush the carry/borrow here):

sub edi,3 ; Will borrow unless edi>= 3

lea edx,test

sbb eax,eax ; -1 if (0,1,2) input

inc edx

and edi,eax ; (0,1,2) unmodified, 3 and up -> zero

shr edx,1
add edi,3

Executes probably identical (in cycles) as even an OoO CPU will have to work
serial on the original dependent instruction-chain. I guess others can recommend
real good OoO readings.

If you target _only_ 386, leave it be. If there is a chance it runns on a
modern CPU as well, it's worth it.

Ciao
Niels

Terje Mathisen

unread,

Sep 2, 2010, 5:47:09 AM9/2/10

to

DSF wrote:
> Terje, since you seem to be a (the?) guru here on branch prediction,
> does a really short branch have any problems? For example, the code
> snippet I started this thread with: (most recent version)
>
> cmp edi, 4
> jb lt4
> mov edi, 3
> lt4:
> dec edx
>
> The branch jumps over just one instruction. This seems like the
> kind of thing where a processor could execute both possibilities and
> just discard the results of the mov instruction if the branch is
> taken. Maybe a better way to put it is, what happens if both
> possibilities for the branch are in the pipeline?

Not what you want!

A mispredicted branch, even if it was of zero length (i.e. a NOP), will
take a long time to recover.

Long here means somewhere between 5 and 20 cycles and 20 to 40/60
instruction slots. I.e. a _lot_ of time.

However, if most of your input data is in-range (or out-of-range) so
that the branch will be well predicted (say 95%) the branch will in fact
be faster than any of the branchless versions, including the CMOV.

DSF

unread,

Sep 2, 2010, 5:03:54 PM9/2/10

to

Thank you. I understood what out-of-order is and does, I didn't
understand the acronym "OoO". I did learn something new, however. I
didn't know 386s were not capable of OoO. Good thing to know if I'm
ever writing code for 386 and down. I can write in an orderly manner
and leave it alone instead of scrambling it into a confusing mass with
fewer dependencies.

DSF

Richard Russell

unread,

Sep 2, 2010, 6:42:46 PM9/2/10

to

On 2 Sep, 09:40, Niels Fröhling <spamt...@nospicedham.adsignum.com>
wrote:

> Out-of-order. It refers to the capacity of a CPU to execute multiple
> independent instructions at the same time. 386 doesn't have it, but modern ones
> do it.

I thought I understood OoO, but you've managed to confuse me
thoroughly! What you are calling 'out of order' sounds more like what
I know as 'superscalar'. Take for example the original Pentium; AIUI
that was an *in-order* superscalar processor, meaning that it could
execute multiple instructions simultaneously but only if they were
presented in a suitable sequence. To get the best from such a
processor it is necessary to interleave instructions within the code,
as in the example you quoted.

In contrast, my understanding of 'out of order' is that the *CPU* is
able to reorder the instructions internally, so that the benefit of
multiple simultaneous execution can be obtained (to a degree)
*without* the programmer (or compiler) needing to optimize the
instruction sequence. I thought the first IA-32 CPU with out-of-order
execution was the Pentium Pro.

But I'm not sure any more.....

Richard.
http://www.rtrussell.co.uk/

Niels Fröhling

unread,

Sep 2, 2010, 11:08:23 PM9/2/10

to

Richard Russell wrote:

> I thought I understood OoO, but you've managed to confuse me
> thoroughly! What you are calling 'out of order' sounds more like what
> I know as 'superscalar'. Take for example the original Pentium; AIUI
> that was an *in-order* superscalar processor, meaning that it could
> execute multiple instructions simultaneously but only if they were
> presented in a suitable sequence. To get the best from such a
> processor it is necessary to interleave instructions within the code,
> as in the example you quoted.
>
> In contrast, my understanding of 'out of order' is that the *CPU* is
> able to reorder the instructions internally, so that the benefit of
> multiple simultaneous execution can be obtained (to a degree)
> *without* the programmer (or compiler) needing to optimize the
> instruction sequence. I thought the first IA-32 CPU with out-of-order
> execution was the Pentium Pro.
>
> But I'm not sure any more.....

You're are (and were) right. It's just nobody uses the term super-scalar (last
time I heard it was in the arch-description of the 68060 for me). OoO is cleary
a super-set of super-scalar, in terms of capabilities. So a OoO capable CPU can
do all a SS CPU can do, in terms of instruction-stream operations.
I'm too settled in x86 SSE2+ (habits, damn habbits :^), so maybe I'm not
really exact with the term.

Ciao
Niels

H. Peter Anvin

unread,

Sep 13, 2010, 12:10:44 PM9/13/10

to

On 09/02/2010 08:08 PM, Niels Fröhling wrote:
>>
>> In contrast, my understanding of 'out of order' is that the *CPU* is
>> able to reorder the instructions internally, so that the benefit of
>> multiple simultaneous execution can be obtained (to a degree)
>> *without* the programmer (or compiler) needing to optimize the
>> instruction sequence. I thought the first IA-32 CPU with out-of-order
>> execution was the Pentium Pro.
>>
>> But I'm not sure any more.....
>
> You're are (and were) right. It's just nobody uses the term
> super-scalar (last time I heard it was in the arch-description of the
> 68060 for me). OoO is cleary a super-set of super-scalar, in terms of
> capabilities. So a OoO capable CPU can do all a SS CPU can do, in terms
> of instruction-stream operations.
> I'm too settled in x86 SSE2+ (habits, damn habbits :^), so maybe I'm
> not really exact with the term.
>

People do use the term "superscalar" all the time. It's just rare to
mention it in the context of mainstream CPUs since it is universal these
days. What it means is more than one scalar execution unit that can
execute in parallel. Out-of-order doesn't necessarily mean superscalar,
although in practice noone would go through the trouble of building an
out-of-order CPU without making it superscalar.

-hpa

hopcode

unread,

Nov 7, 2010, 10:15:20 AM11/7/10

to

Hi All,i am new here.
Thank you All for the OoO explanation.
As i can see it sounds like an unsigned saturation
method on custom values.

> ; Input: 0, 1, 2, 3, 4, 5, 6...
> sub edi,3 ; -3,-2,-1, 0, 1, 2, 3...
> sbb eax,eax ; -1,-1,-1, 0, 0, 0, 0...
> and edi,eax ; -3,-2,-1, 0, 0, 0, 0...
> add edi,3 ; 0, 1, 2, 3, 3, 3, 3...
>

And.. it is portable on x64. I will add the gem
on my web-site.

A question: in the following (using eax as register):

sub eax,3
cdq
and eax,edx
add eax,3

if i am not wrong, cdq should break dependencies
on the carry flag, shouldnt it ?

Cheers,
hopcode

--

.::mrk::.
x64 Assembly Lab
http://sites.google.com/site/x64lab