int sgn(int a)
{
if (!a) {
return (0);
} else if (a < 0) {
return (-1);
} else {
return (1);
}
}
The inline assembly code for this receives the argument in EAX, and should
use only EDX and ECX. It should preserve EAX if possible. This is not for a
specific x86 processor, but assume it should be "optimal" for an out-of-order
processor (i.e. one doesn't have to pay attention to instruction pairing and
AGIs). CMOVs are acceptable, but would limit the code to P6 and compatible
machines of course.
The best code I have so far is based on generating a 1 for positive numbers,
a -1 for negative numbers, and then conditionally zeroing the result if the
input was 0. I tried many other approaches, like conditionally incrementing
and decrementing, using multiple CMOVs, etc.
; eax = input a (no CMOV)
xor ecx, ecx ; 0
cdq ; a < 0 ? -1 : 0
cmp eax, 1 ; |a| < 1 ? CY : NC
sbb ecx, 1 ; |a| < 1 ? 0 : 0xffffffff
lea edx, [edx*2+1] ; a < 0 ? -1 : 1
and edx, ecx ; sgn(a)
; eax = input a (with CMOV)
xor ecx, ecx ; 0
cdq ; a < 0 ? -1 : 0
cmp eax, 1 ; |a| < 1 ? CY : NC
lea edx, [edx*2+1] ; a < 0 ? -1 : 1
cmovc edx, ecx ; sgn(a) = |a| < 1 ? 0 : (a < 0 ? -1 : 1)
Since I didn't make the constraints too specific, there are probably
multiple "optimal" solutions. I wouldn't be surprised if Terje has a
3-instruction solution stashed away in his bag-of-tricks :-)
-- Norbert
-----------== Posted via Deja News, The Discussion Network ==----------
http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own
Here's a 3 instruction one for you:
cdq
cmp edx,eax
adc edx,0
;edx now contains sgn(eax)
Is that more optimal? I'm sure it's better than your CMOV version too.
Who is this Terje anyway? =)
- vulture a.k.a. Sean Stanek
Who needs three instructions.....
This is better is as it removes the Pentium unpairable, 2 cycle, cdq. If
you can interleave other instructions, this can execute in 1 cycle
on a Pentium.
sar eax,31 ; eax= -1 or 0 1 U
lea eax,[2*eax+1] ; eax= -1 or 1 1 UV
--
Eric Korpela | An object at rest can never be
kor...@ssl.berkeley.edu | stopped.
<a href="http://sag-www.ssl.berkeley.edu/~korpela">Click for home page.</a>
Well, I guess it can't get any better than that. I tried conditional
increments like you use, but I never managed to hit upon this solution.
I think I am going to retire as a code optimizer :-(
I should clarify this for those who are about to tell me that the start and
the result will always happen in different cycles. By 1 cycles I mean
each instruction can be paired with another so averaged, each
instruction takes 1/2 cycle.
Eric
Unfortunately this does not return 0 if EAX was 0, as initially
required. You'd have to add another two simple instructions, e.g.
cmp eax, 1
sbb ecx, ecx
sar eax, 31
lea eax, [eax * 2 + ecx + 1]
Interleaved with something else, of course.
Regards,
Michael Tippach
This is a fairly well-known sequence, I'd guess it is one of those that
could be located by the (in)famous Superoptimizer.
On many x86 cpus the CDQ is very slow, so it might be better to split it
into a MOV + SAR pair instead:
mov edx,eax
sar edx,31 ; -1 or 0, correct for negative and zero input
cmp edx,eax ; Carry iff 0 < EAX <= 0x7fffffff
adc edx,0 ; +1 if positive
>
> Is that more optimal? I'm sure it's better than your CMOV version too.
>
> Who is this Terje anyway? =)
Just another asm programmer. :-)
Terje
--
- <Terje.M...@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"
Well, that's 4 instructions... Norbert wanted 3. =)
Also, SHR takes 2 cycles, MOV takes 1... that's 3 cycles. That cannot be
pipelined as EDX is used as target in both instructions. CDQ is not very slow.
2 cycles on a 386, 3 on a 486, 2(?) on a pentium. All the rest is the same. The
CDQ one might be faster though. :D
Actually, CDQ works just fine on modern out-of-order x86 processors. You
are correct, it is a 2 cycle instruction on both Pentium and K6. I have
no idea why that is. All it is is a SAR by 31 with EAX as source and EDX
as destination which doesn;t update the flags. Anyhow, I much prefer CDQ
since it's a one byte instruction. If you replace it, you need 2 dependent
instructions in order to get the same semantics (i.e. don't overwrite the
source) and that takes two cycles, too.
Ahh, I didn't catch that. Not enough caffeine today I guess.
IMHO your 4 instruction solution beats a 3 instruction solution
with a CDQ at least on Pentium.
On Pentium SAR takes 1 cycle and is U pairable. This is definitely the
better solution if you can interleave this code with other code. CDQ takes 2
cycles and is not pairable. At worst, the above is 4 cycles at best
(inteleaved) it is 2 cycles. At worst the CDQ solution is 4.5 cycles
(i.e. the U pipe but not the V pipe is in use when the CDQ hits). At best
(interleaved) the CDQ solution is 3 cycles.
MOV + SHR of the same register is not pairable. I suppose you can pair it by
interleaving other code, but that wasn't the task at hand. :P
On a 386, SHR = 3 cycles, MOV=1, CDQ=2. For a lower processor, CDQ is a good
answer. On higher processors, CDQ will likely be downed to 1, and is one
instruction.
> cycles and is not pairable. At worst, the above is 4 cycles at best
> (inteleaved) it is 2 cycles. At worst the CDQ solution is 4.5 cycles
> (i.e. the U pipe but not the V pipe is in use when the CDQ hits). At best
> (interleaved) the CDQ solution is 3 cycles.
MOV+SHR even though you can interleave does not mean those cycles do not count.
They still get their 1/2 value. =)
So, 3 at best.
>I was wondering what the fastest inline code for an integer sgn() function
>could look like. It's defined as follows:
>
>int sgn(int a)
>{
> if (!a) {
> return (0);
> } else if (a < 0) {
> return (-1);
> } else {
> return (1);
> }
>}
>
>The inline assembly code for this receives the argument in EAX, and should
>use only EDX and ECX. It should preserve EAX if possible. This is not for a
>specific x86 processor, but assume it should be "optimal" for an
out-of-order
>processor (i.e. one doesn't have to pay attention to instruction pairing
and
>AGIs). CMOVs are acceptable, but would limit the code to P6 and compatible
>machines of course.
>
>...
CDQ // -1 0 0
EAX NEG // set CF if EAX <> 0
EDX EDX ADC // -1 0 1
Sorry! EAX is not preserved.
Charles Liu
[snip]
>The best code I have so far is based on generating a 1 for positive
numbers,
>a -1 for negative numbers, and then conditionally zeroing the result if the
>input was 0. I tried many other approaches, like conditionally incrementing
>and decrementing, using multiple CMOVs, etc.
[snip]
> ; eax = input a (with CMOV)
>
> xor ecx, ecx ; 0
> cdq ; a < 0 ? -1 : 0
> cmp eax, 1 ; |a| < 1 ? CY : NC
> lea edx, [edx*2+1] ; a < 0 ? -1 : 1
> cmovc edx, ecx ; sgn(a) = |a| < 1 ? 0 : (a < 0 ? -1 : 1)
I'm want to try this code but my assembler (BASM in Delphi) doesn't
recognize the cmovc instruction. I need to know the encoding of
cmovc edx, ecx
so that I can manually insert it with db's (or a dd). I think it is
something like 0x0f4187d2, but this doesn't work. (I have a Pentium II so
the cmov instruction is available)
Thanks,
Roelof
------
remove the obvious from my reply address...
There is one other possible idea, by going back to the definition:
First return 0 or -1 depending upon zero/nonzero input:
cmp eax,1
sbb ebx,ebx
At the same time, check for negative input:
mov edx,eax
sar edx,31
If the input was non-zero but not negative, return +1 instead of -1:
xor edx,ebx
sub edx,ebx
It's obvious that this version will have the same best-case latency of 4
cycles on a PII, and it will be a cycle slower on a Pentium, so the
4-instruction version is better.
Don't worry, I found it; it's 0x0f42d1.
Well, I do have to agree with you on that for now. I suppose there's pipelining
capabilities, etc. CDQ may, in the future, become pipelineable and at 1 clock.
And since Michael Abrash seems to know you, there's no point in arguing any
furthur. ;)
Yeah, it is perfectly conceivable that CDQ will become a regular Px
micro-ops.
>
> And since Michael Abrash seems to know you, there's no point in arguing any
> furthur. ;)
I've known Mike for more than 10 years, but that's absolutely no reason
to not test stuff yourself. I have absolutely no monopoly on good ideas.