Fastest integer sgn() function?

94 views

ju...@my-dejanews.com

Mar 8, 1999, 3:00:00 AM3/8/99
to
I was wondering what the fastest inline code for an integer sgn() function
could look like. It's defined as follows:

int sgn(int a)
{
if (!a) {
return (0);
} else if (a < 0) {
return (-1);
} else {
return (1);
}
}

The inline assembly code for this receives the argument in EAX, and should
use only EDX and ECX. It should preserve EAX if possible. This is not for a
specific x86 processor, but assume it should be "optimal" for an out-of-order
processor (i.e. one doesn't have to pay attention to instruction pairing and
AGIs). CMOVs are acceptable, but would limit the code to P6 and compatible
machines of course.

The best code I have so far is based on generating a 1 for positive numbers,
a -1 for negative numbers, and then conditionally zeroing the result if the
input was 0. I tried many other approaches, like conditionally incrementing
and decrementing, using multiple CMOVs, etc.

; eax = input a (no CMOV)

xor ecx, ecx ; 0
cdq ; a < 0 ? -1 : 0
cmp eax, 1 ; |a| < 1 ? CY : NC
sbb ecx, 1 ; |a| < 1 ? 0 : 0xffffffff
lea edx, [edx*2+1] ; a < 0 ? -1 : 1
and edx, ecx ; sgn(a)

; eax = input a (with CMOV)

xor ecx, ecx ; 0
cdq ; a < 0 ? -1 : 0
cmp eax, 1 ; |a| < 1 ? CY : NC
lea edx, [edx*2+1] ; a < 0 ? -1 : 1
cmovc edx, ecx ; sgn(a) = |a| < 1 ? 0 : (a < 0 ? -1 : 1)

Since I didn't make the constraints too specific, there are probably
multiple "optimal" solutions. I wouldn't be surprised if Terje has a
3-instruction solution stashed away in his bag-of-tricks :-)

-- Norbert

-----------== Posted via Deja News, The Discussion Network ==----------

Srstanek

Mar 8, 1999, 3:00:00 AM3/8/99
to

> Since I didn't make the constraints too specific, there are probably
> multiple "optimal" solutions. I wouldn't be surprised if Terje has a
> 3-instruction solution stashed away in his bag-of-tricks :-)

Here's a 3 instruction one for you:

cdq
cmp edx,eax

;edx now contains sgn(eax)

Is that more optimal? I'm sure it's better than your CMOV version too.

Who is this Terje anyway? =)

- vulture a.k.a. Sean Stanek

Eric J. Korpela

Mar 8, 1999, 3:00:00 AM3/8/99
to
In article <7c0ut5\$srj\$1...@winter.news.rcn.net>,

Srstanek <srst...@aol.com> wrote:
>
>> Since I didn't make the constraints too specific, there are probably
>> multiple "optimal" solutions. I wouldn't be surprised if Terje has a
>> 3-instruction solution stashed away in his bag-of-tricks :-)
>
>
>Here's a 3 instruction one for you:
>
> cdq
> cmp edx,eax
>
>;edx now contains sgn(eax)
>
>Is that more optimal? I'm sure it's better than your CMOV version too.

Who needs three instructions.....

This is better is as it removes the Pentium unpairable, 2 cycle, cdq. If
you can interleave other instructions, this can execute in 1 cycle
on a Pentium.

sar eax,31 ; eax= -1 or 0 1 U
lea eax,[2*eax+1] ; eax= -1 or 1 1 UV

--
Eric Korpela | An object at rest can never be
kor...@ssl.berkeley.edu | stopped.

ju...@my-dejanews.com

Mar 8, 1999, 3:00:00 AM3/8/99
to
In article <7c0ut5\$srj\$1...@winter.news.rcn.net>,

srst...@aol.com (Srstanek) wrote:
>
> > Since I didn't make the constraints too specific, there are probably
> > multiple "optimal" solutions. I wouldn't be surprised if Terje has a
> > 3-instruction solution stashed away in his bag-of-tricks :-)
>
> Here's a 3 instruction one for you:
>
> cdq
> cmp edx,eax
>
> ;edx now contains sgn(eax)
>
> Is that more optimal? I'm sure it's better than your CMOV version too.

Well, I guess it can't get any better than that. I tried conditional
increments like you use, but I never managed to hit upon this solution.
I think I am going to retire as a code optimizer :-(

Eric J. Korpela

Mar 8, 1999, 3:00:00 AM3/8/99
to
In article <7c1600\$87m\$1...@winter.news.rcn.net>,

Eric J. Korpela <kor...@islay.ssl.berkeley.edu> wrote:
>This is better is as it removes the Pentium unpairable, 2 cycle, cdq. If
>you can interleave other instructions, this can execute in 1 cycle
>on a Pentium.

I should clarify this for those who are about to tell me that the start and
the result will always happen in different cycles. By 1 cycles I mean
each instruction can be paired with another so averaged, each
instruction takes 1/2 cycle.

Eric

Michael Tippach

Mar 8, 1999, 3:00:00 AM3/8/99
to
Eric J. Korpela wrote:
>
> In article <7c0ut5\$srj\$1...@winter.news.rcn.net>,
> Srstanek <srst...@aol.com> wrote:
> >
> >> Since I didn't make the constraints too specific, there are probably
> >> multiple "optimal" solutions. I wouldn't be surprised if Terje has a
> >> 3-instruction solution stashed away in his bag-of-tricks :-)
> >
> >
> >Here's a 3 instruction one for you:
> >
> > cdq
> > cmp edx,eax
> >
> >;edx now contains sgn(eax)
> >
> >Is that more optimal? I'm sure it's better than your CMOV version too.
>
> Who needs three instructions.....

>
> This is better is as it removes the Pentium unpairable, 2 cycle, cdq. If
> you can interleave other instructions, this can execute in 1 cycle
> on a Pentium.
>
> sar eax,31 ; eax= -1 or 0 1 U
> lea eax,[2*eax+1] ; eax= -1 or 1 1 UV

Unfortunately this does not return 0 if EAX was 0, as initially
required. You'd have to add another two simple instructions, e.g.

cmp eax, 1
sbb ecx, ecx
sar eax, 31
lea eax, [eax * 2 + ecx + 1]

Interleaved with something else, of course.

Regards,

Michael Tippach

Terje Mathisen

Mar 8, 1999, 3:00:00 AM3/8/99
to
Srstanek wrote:
>
> > Since I didn't make the constraints too specific, there are probably
> > multiple "optimal" solutions. I wouldn't be surprised if Terje has a
> > 3-instruction solution stashed away in his bag-of-tricks :-)
>
> Here's a 3 instruction one for you:
>
> cdq
> cmp edx,eax
>
> ;edx now contains sgn(eax)

This is a fairly well-known sequence, I'd guess it is one of those that
could be located by the (in)famous Superoptimizer.

On many x86 cpus the CDQ is very slow, so it might be better to split it
into a MOV + SAR pair instead:

mov edx,eax
sar edx,31 ; -1 or 0, correct for negative and zero input
cmp edx,eax ; Carry iff 0 < EAX <= 0x7fffffff
adc edx,0 ; +1 if positive

>
> Is that more optimal? I'm sure it's better than your CMOV version too.
>

> Who is this Terje anyway? =)

Just another asm programmer. :-)

Terje
--
- <Terje.M...@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"

Srstanek

Mar 8, 1999, 3:00:00 AM3/8/99
to

> On many x86 cpus the CDQ is very slow, so it might be better to split it
> into a MOV + SAR pair instead:
>
> mov edx,eax
> sar edx,31 ; -1 or 0, correct for negative and zero input
> cmp edx,eax ; Carry iff 0 < EAX <= 0x7fffffff
> adc edx,0 ; +1 if positive

Well, that's 4 instructions... Norbert wanted 3. =)
Also, SHR takes 2 cycles, MOV takes 1... that's 3 cycles. That cannot be
pipelined as EDX is used as target in both instructions. CDQ is not very slow.
2 cycles on a 386, 3 on a 486, 2(?) on a pentium. All the rest is the same. The
CDQ one might be faster though. :D

ju...@my-dejanews.com

Mar 9, 1999, 3:00:00 AM3/9/99
to
In article <7c18e5\$m1n\$1...@winter.news.rcn.net>,

kor...@islay.ssl.berkeley.edu (Eric J. Korpela) wrote:
> In article <7c1600\$87m\$1...@winter.news.rcn.net>,
> Eric J. Korpela <kor...@islay.ssl.berkeley.edu> wrote:
> >This is better is as it removes the Pentium unpairable, 2 cycle, cdq. If
> >you can interleave other instructions, this can execute in 1 cycle
> >on a Pentium.
>
> I should clarify this for those who are about to tell me that the start and
> the result will always happen in different cycles. By 1 cycles I mean
> each instruction can be paired with another so averaged, each
> instruction takes 1/2 cycle.
>
> Eric

Actually, CDQ works just fine on modern out-of-order x86 processors. You
are correct, it is a 2 cycle instruction on both Pentium and K6. I have
no idea why that is. All it is is a SAR by 31 with EAX as source and EDX
as destination which doesn;t update the flags. Anyhow, I much prefer CDQ
since it's a one byte instruction. If you replace it, you need 2 dependent
instructions in order to get the same semantics (i.e. don't overwrite the
source) and that takes two cycles, too.

Eric J. Korpela

Mar 9, 1999, 3:00:00 AM3/9/99
to
In article <7c18e2\$lvu\$1...@winter.news.rcn.net>,

Michael Tippach <Spammer....@wuschel.demon.co.uk> wrote:
>>
>> sar eax,31 ; eax= -1 or 0 1 U
>> lea eax,[2*eax+1] ; eax= -1 or 1 1 UV
>
>Unfortunately this does not return 0 if EAX was 0, as initially
>required.

Ahh, I didn't catch that. Not enough caffeine today I guess.
IMHO your 4 instruction solution beats a 3 instruction solution
with a CDQ at least on Pentium.

Eric J. Korpela

Mar 9, 1999, 3:00:00 AM3/9/99
to
In article <7c1f9l\$4hb\$1...@winter.news.rcn.net>,

On Pentium SAR takes 1 cycle and is U pairable. This is definitely the
better solution if you can interleave this code with other code. CDQ takes 2
cycles and is not pairable. At worst, the above is 4 cycles at best
(inteleaved) it is 2 cycles. At worst the CDQ solution is 4.5 cycles
(i.e. the U pipe but not the V pipe is in use when the CDQ hits). At best
(interleaved) the CDQ solution is 3 cycles.

Srstanek

Mar 9, 1999, 3:00:00 AM3/9/99
to

> On Pentium SAR takes 1 cycle and is U pairable. This is definitely the
> better solution if you can interleave this code with other code. CDQ
> takes 2

MOV + SHR of the same register is not pairable. I suppose you can pair it by
interleaving other code, but that wasn't the task at hand. :P
On a 386, SHR = 3 cycles, MOV=1, CDQ=2. For a lower processor, CDQ is a good
answer. On higher processors, CDQ will likely be downed to 1, and is one
instruction.

> cycles and is not pairable. At worst, the above is 4 cycles at best
> (inteleaved) it is 2 cycles. At worst the CDQ solution is 4.5 cycles
> (i.e. the U pipe but not the V pipe is in use when the CDQ hits). At best
> (interleaved) the CDQ solution is 3 cycles.

MOV+SHR even though you can interleave does not mean those cycles do not count.
They still get their 1/2 value. =)
So, 3 at best.

Charles Liu

Mar 9, 1999, 3:00:00 AM3/9/99
to
juffa wrote:

>I was wondering what the fastest inline code for an integer sgn() function
>could look like. It's defined as follows:
>
>int sgn(int a)
>{
> if (!a) {
> return (0);
> } else if (a < 0) {
> return (-1);
> } else {
> return (1);
> }
>}
>
>The inline assembly code for this receives the argument in EAX, and should
>use only EDX and ECX. It should preserve EAX if possible. This is not for a
>specific x86 processor, but assume it should be "optimal" for an
out-of-order
>processor (i.e. one doesn't have to pay attention to instruction pairing
and
>AGIs). CMOVs are acceptable, but would limit the code to P6 and compatible
>machines of course.
>

>...

CDQ // -1 0 0
EAX NEG // set CF if EAX <> 0
EDX EDX ADC // -1 0 1

Sorry! EAX is not preserved.

Charles Liu

Roelof Engelbrecht

Mar 9, 1999, 3:00:00 AM3/9/99
to
<ju...@my-dejanews.com> wrote in message
news:7c0pen\$s62\$1...@winter.news.rcn.net...

>I was wondering what the fastest inline code for an integer sgn() function
>could look like.

[snip]

>The best code I have so far is based on generating a 1 for positive
numbers,
>a -1 for negative numbers, and then conditionally zeroing the result if the
>input was 0. I tried many other approaches, like conditionally incrementing
>and decrementing, using multiple CMOVs, etc.

[snip]

> ; eax = input a (with CMOV)
>
> xor ecx, ecx ; 0
> cdq ; a < 0 ? -1 : 0
> cmp eax, 1 ; |a| < 1 ? CY : NC
> lea edx, [edx*2+1] ; a < 0 ? -1 : 1
> cmovc edx, ecx ; sgn(a) = |a| < 1 ? 0 : (a < 0 ? -1 : 1)

I'm want to try this code but my assembler (BASM in Delphi) doesn't
recognize the cmovc instruction. I need to know the encoding of

cmovc edx, ecx

so that I can manually insert it with db's (or a dd). I think it is
something like 0x0f4187d2, but this doesn't work. (I have a Pentium II so
the cmov instruction is available)

Thanks,

Roelof
------

Terje Mathisen

Mar 9, 1999, 3:00:00 AM3/9/99
to
Eric J. Korpela wrote:
>
> In article <7c1f9l\$4hb\$1...@winter.news.rcn.net>,
> Srstanek <srst...@aol.com> wrote:
> >
> >> On many x86 cpus the CDQ is very slow, so it might be better to split it
> >> into a MOV + SAR pair instead:
> >>
> >> mov edx,eax
> >> sar edx,31 ; -1 or 0, correct for negative and zero input
> >> cmp edx,eax ; Carry iff 0 < EAX <= 0x7fffffff
> >> adc edx,0 ; +1 if positive
> >
> >Well, that's 4 instructions... Norbert wanted 3. =)
> >Also, SHR takes 2 cycles, MOV takes 1... that's 3 cycles. That cannot be
> >pipelined as EDX is used as target in both instructions. CDQ is not very slow.
> >2 cycles on a 386, 3 on a 486, 2(?) on a pentium. All the rest is the same. The
> >CDQ one might be faster though. :D
>
> On Pentium SAR takes 1 cycle and is U pairable. This is definitely the
> better solution if you can interleave this code with other code. CDQ takes 2
> cycles and is not pairable. At worst, the above is 4 cycles at best
> (inteleaved) it is 2 cycles. At worst the CDQ solution is 4.5 cycles
> (i.e. the U pipe but not the V pipe is in use when the CDQ hits). At best
> (interleaved) the CDQ solution is 3 cycles.

There is one other possible idea, by going back to the definition:

First return 0 or -1 depending upon zero/nonzero input:

cmp eax,1
sbb ebx,ebx

At the same time, check for negative input:

mov edx,eax
sar edx,31

If the input was non-zero but not negative, return +1 instead of -1:

xor edx,ebx
sub edx,ebx

It's obvious that this version will have the same best-case latency of 4
cycles on a PII, and it will be a cycle slower on a Pentium, so the
4-instruction version is better.

Roelof Engelbrecht

Mar 9, 1999, 3:00:00 AM3/9/99
to
Roelof Engelbrecht wrote in message <7c3ctq\$1sl\$4...@winter.news.rcn.net>...
>I want to try this code but my assembler (BASM in Delphi) doesn't

>recognize the cmovc instruction. I need to know the encoding of
>
>cmovc edx, ecx

Don't worry, I found it; it's 0x0f42d1.

Srstanek

Mar 12, 1999, 3:00:00 AM3/12/99
to

>It's obvious that this version will have the same best-case latency of 4
>cycles on a PII, and it will be a cycle slower on a Pentium, so the
>4-instruction version is better.
>
> Terje
>
>- <Terje.M...@hda.hydro.com>

Well, I do have to agree with you on that for now. I suppose there's pipelining
capabilities, etc. CDQ may, in the future, become pipelineable and at 1 clock.

And since Michael Abrash seems to know you, there's no point in arguing any
furthur. ;)

Terje Mathisen

Mar 12, 1999, 3:00:00 AM3/12/99
to
Srstanek wrote:
>
> >It's obvious that this version will have the same best-case latency of 4
> >cycles on a PII, and it will be a cycle slower on a Pentium, so the
> >4-instruction version is better.
> >
> > Terje
> >
> >- <Terje.M...@hda.hydro.com>
>
> Well, I do have to agree with you on that for now. I suppose there's pipelining
> capabilities, etc. CDQ may, in the future, become pipelineable and at 1 clock.

Yeah, it is perfectly conceivable that CDQ will become a regular Px
micro-ops.

>
> And since Michael Abrash seems to know you, there's no point in arguing any
> furthur. ;)

I've known Mike for more than 10 years, but that's absolutely no reason
to not test stuff yourself. I have absolutely no monopoly on good ideas.