I'm trying to see if there is a way to do the following without the
branch:
cmp edi, 4
jb here
mov edi, 3
here: ...
Must be 386-friendly code.
I'm not really concerned with the branch. I just have a nagging
feeling in the back of my mind that there is a way to do this and I
just can't remember it.
DSF
You could use the branch-free minimum(), if the branch concerns you. You should
mix it with other instructions for OoO.
mov eax, 3
sub eax, edi ; edi > 3 => eax negative distance
sbb ecx, ecx ; edi > 3 => ecx -1
and eax, ecx ; edi > 3 => eax negative distance
add edi, eax ; edi > 3 => edi += negative distance
Ciao
Niels
Oops! I wrote my reply before I noticed this, see update below:
>>
>> I'm not really concerned with the branch. I just have a nagging
>> feeling in the back of my mind that there is a way to do this and I
>> just can't remember it.
>>
>> DSF
>
> You could use the branch-free minimum(), if the branch concerns you. You
> should mix it with other instructions for OoO.
>
> mov eax, 3
> sub eax, edi ; edi > 3 => eax negative distance
> sbb ecx, ecx ; edi > 3 => ecx -1
> and eax, ecx ; edi > 3 => eax negative distance
> add edi, eax ; edi > 3 => edi += negative distance
As Niels writes, this is definitely doable.
Since you're probably not targeting cpus which are more than 10-15 years
old, you can also use CMOV to get the same effect:
cmp edi,3
mov eax,3
cmova edi,eax ; Conditionally MOVe if EDI was Above 3
This version will normally take 3 cycles, since CMOV internally is split
into two micro-ops.
Update:
OK, with the 386 requirement this becomes harder but it might be
possible to shave a cycle off Niels version:
sub edi,3 ; Will borrow unless edi >= 3
sbb eax,eax ; -1 if (0,1,2) input
and edi,eax ; (0,1,2) unmodified, 3 and up -> zero
add edi,3
I.e. 4 or 5 cycles on a 486, probably 8 or 9 on a 386?
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Que? Terje, did I read that correctly? I.e., you said:
3 + "3 and up -> zero" = 3
3 + "(0,1,2) unmodified" = 3, 4, 5
I think that is: "if a>=3, then a=3" and "if a<3, a=a+3". In the subject,
the OP explicitly asked for "if a>=4, then a=3", and implicitly "if a<4,
then a=a" ... Yes?
Rod Pemberton
No, unless I made a bad mistake and outsmarted myself:
; Input: 0, 1, 2, 3, 4, 5, 6...
sub edi,3 ; -3,-2,-1, 0, 1, 2, 3...
sbb eax,eax ; -1,-1,-1, 0, 0, 0, 0...
and edi,eax ; -3,-2,-1, 0, 0, 0, 0...
add edi,3 ; 0, 1, 2, 3, 3, 3, 3...
That does seem right, doesn't it?
Good trick and worth to remember.
We should rename sbb eax,eax to "SETc eax" :)
__
wolfgang
And yet, it gives new meaning to "unmodified"...
I'll remember to trust your code and not your words.
RP
"Use the Source, Luke!"
>Niels Fröhling wrote:
>> DSF wrote:
>>> Hello!
>>>
>>> I'm trying to see if there is a way to do the following without the
>>> branch:
>>>
>>> cmp edi, 4
>>> jb here
>>> mov edi, 3
>>> here: ...
>>>
>>> Must be 386-friendly code.
>
>Oops! I wrote my reply before I noticed this, see update below:
>>>
>>> I'm not really concerned with the branch. I just have a nagging
>>> feeling in the back of my mind that there is a way to do this and I
>>> just can't remember it.
>>>
>>> DSF
>>
>> You could use the branch-free minimum(), if the branch concerns you. You
>> should mix it with other instructions for OoO.
I have to remember there are no stupid questions.. What is OoO?
>>
>> mov eax, 3
>> sub eax, edi ; edi > 3 => eax negative distance
>> sbb ecx, ecx ; edi > 3 => ecx -1
>> and eax, ecx ; edi > 3 => eax negative distance
>> add edi, eax ; edi > 3 => edi += negative distance
>
>As Niels writes, this is definitely doable.
>
>Since you're probably not targeting cpus which are more than 10-15 years
>old, you can also use CMOV to get the same effect:
>
> cmp edi,3
> mov eax,3
> cmova edi,eax ; Conditionally MOVe if EDI was Above 3
>
>This version will normally take 3 cycles, since CMOV internally is split
>into two micro-ops.
>
First thing I thought of. But I'm not sure how far CMOV goes back.
This code is in the GPR branch of GPR/SSE2 code and wanted it to be as
widely usable as possible. Perhaps ridiculously so.
>Update:
>
>OK, with the 386 requirement this becomes harder but it might be
>possible to shave a cycle off Niels version:
>
> sub edi,3 ; Will borrow unless edi >= 3
> sbb eax,eax ; -1 if (0,1,2) input
> and edi,eax ; (0,1,2) unmodified, 3 and up -> zero
> add edi,3
>
>I.e. 4 or 5 cycles on a 486, probably 8 or 9 on a 386?
I do like that your code uses one less register. That may be handy.
I started with 6502 code, so when I got to X86 it seemed like an
endless supply of registers...that feeling didn't last too long.
Terje, since you seem to be a (the?) guru here on branch prediction,
does a really short branch have any problems? For example, the code
snippet I started this thread with: (most recent version)
cmp edi, 4
jb lt4
mov edi, 3
lt4:
dec edx
The branch jumps over just one instruction. This seems like the
kind of thing where a processor could execute both possibilities and
just discard the results of the mov instruction if the branch is
taken. Maybe a better way to put it is, what happens if both
possibilities for the branch are in the pipeline?
>
>Terje
DSF
>>> You could use the branch-free minimum(), if the branch concerns you. You
>>> should mix it with other instructions for OoO.
>
> I have to remember there are no stupid questions.. What is OoO?
Out-of-order. It refers to the capacity of a CPU to execute multiple
independent instructions at the same time. 386 doesn't have it, but modern ones
do it.
For example:
>>> mov eax, 3
>>> sub eax, edi ; edi> 3 => eax negative distance
>>> sbb ecx, ecx ; edi> 3 => ecx -1
>>> and eax, ecx ; edi> 3 => eax negative distance
>>> add edi, eax ; edi> 3 => edi += negative distance
>> sub edi,3 ; Will borrow unless edi>= 3
>> sbb eax,eax ; -1 if (0,1,2) input
>> and edi,eax ; (0,1,2) unmodified, 3 and up -> zero
>> add edi,3
can actually be executed in parallel with other code, like this (take care not
to flush the carry/borrow here):
sub edi,3 ; Will borrow unless edi>= 3
lea edx,test
sbb eax,eax ; -1 if (0,1,2) input
inc edx
and edi,eax ; (0,1,2) unmodified, 3 and up -> zero
shr edx,1
add edi,3
Executes probably identical (in cycles) as even an OoO CPU will have to work
serial on the original dependent instruction-chain. I guess others can recommend
real good OoO readings.
If you target _only_ 386, leave it be. If there is a chance it runns on a
modern CPU as well, it's worth it.
Ciao
Niels
Not what you want!
A mispredicted branch, even if it was of zero length (i.e. a NOP), will
take a long time to recover.
Long here means somewhere between 5 and 20 cycles and 20 to 40/60
instruction slots. I.e. a _lot_ of time.
However, if most of your input data is in-range (or out-of-range) so
that the branch will be well predicted (say 95%) the branch will in fact
be faster than any of the branchless versions, including the CMOV.
Thank you. I understood what out-of-order is and does, I didn't
understand the acronym "OoO". I did learn something new, however. I
didn't know 386s were not capable of OoO. Good thing to know if I'm
ever writing code for 386 and down. I can write in an orderly manner
and leave it alone instead of scrambling it into a confusing mass with
fewer dependencies.
DSF
I thought I understood OoO, but you've managed to confuse me
thoroughly! What you are calling 'out of order' sounds more like what
I know as 'superscalar'. Take for example the original Pentium; AIUI
that was an *in-order* superscalar processor, meaning that it could
execute multiple instructions simultaneously but only if they were
presented in a suitable sequence. To get the best from such a
processor it is necessary to interleave instructions within the code,
as in the example you quoted.
In contrast, my understanding of 'out of order' is that the *CPU* is
able to reorder the instructions internally, so that the benefit of
multiple simultaneous execution can be obtained (to a degree)
*without* the programmer (or compiler) needing to optimize the
instruction sequence. I thought the first IA-32 CPU with out-of-order
execution was the Pentium Pro.
But I'm not sure any more.....
Richard.
http://www.rtrussell.co.uk/
> I thought I understood OoO, but you've managed to confuse me
> thoroughly! What you are calling 'out of order' sounds more like what
> I know as 'superscalar'. Take for example the original Pentium; AIUI
> that was an *in-order* superscalar processor, meaning that it could
> execute multiple instructions simultaneously but only if they were
> presented in a suitable sequence. To get the best from such a
> processor it is necessary to interleave instructions within the code,
> as in the example you quoted.
>
> In contrast, my understanding of 'out of order' is that the *CPU* is
> able to reorder the instructions internally, so that the benefit of
> multiple simultaneous execution can be obtained (to a degree)
> *without* the programmer (or compiler) needing to optimize the
> instruction sequence. I thought the first IA-32 CPU with out-of-order
> execution was the Pentium Pro.
>
> But I'm not sure any more.....
You're are (and were) right. It's just nobody uses the term super-scalar (last
time I heard it was in the arch-description of the 68060 for me). OoO is cleary
a super-set of super-scalar, in terms of capabilities. So a OoO capable CPU can
do all a SS CPU can do, in terms of instruction-stream operations.
I'm too settled in x86 SSE2+ (habits, damn habbits :^), so maybe I'm not
really exact with the term.
Ciao
Niels
People do use the term "superscalar" all the time. It's just rare to
mention it in the context of mainstream CPUs since it is universal these
days. What it means is more than one scalar execution unit that can
execute in parallel. Out-of-order doesn't necessarily mean superscalar,
although in practice noone would go through the trouble of building an
out-of-order CPU without making it superscalar.
-hpa
Hi All,i am new here.
Thank you All for the OoO explanation.
As i can see it sounds like an unsigned saturation
method on custom values.
> ; Input: 0, 1, 2, 3, 4, 5, 6...
> sub edi,3 ; -3,-2,-1, 0, 1, 2, 3...
> sbb eax,eax ; -1,-1,-1, 0, 0, 0, 0...
> and edi,eax ; -3,-2,-1, 0, 0, 0, 0...
> add edi,3 ; 0, 1, 2, 3, 3, 3, 3...
>
And.. it is portable on x64. I will add the gem
on my web-site.
A question: in the following (using eax as register):
sub eax,3
cdq
and eax,edx
add eax,3
if i am not wrong, cdq should break dependencies
on the carry flag, shouldnt it ?
Cheers,
hopcode
--
.::mrk::.
x64 Assembly Lab
http://sites.google.com/site/x64lab