94 views

Skip to first unread message

Mar 8, 1999, 3:00:00 AM3/8/99

to

I was wondering what the fastest inline code for an integer sgn() function

could look like. It's defined as follows:

could look like. It's defined as follows:

int sgn(int a)

{

if (!a) {

return (0);

} else if (a < 0) {

return (-1);

} else {

return (1);

}

}

The inline assembly code for this receives the argument in EAX, and should

use only EDX and ECX. It should preserve EAX if possible. This is not for a

specific x86 processor, but assume it should be "optimal" for an out-of-order

processor (i.e. one doesn't have to pay attention to instruction pairing and

AGIs). CMOVs are acceptable, but would limit the code to P6 and compatible

machines of course.

The best code I have so far is based on generating a 1 for positive numbers,

a -1 for negative numbers, and then conditionally zeroing the result if the

input was 0. I tried many other approaches, like conditionally incrementing

and decrementing, using multiple CMOVs, etc.

; eax = input a (no CMOV)

xor ecx, ecx ; 0

cdq ; a < 0 ? -1 : 0

cmp eax, 1 ; |a| < 1 ? CY : NC

sbb ecx, 1 ; |a| < 1 ? 0 : 0xffffffff

lea edx, [edx*2+1] ; a < 0 ? -1 : 1

and edx, ecx ; sgn(a)

; eax = input a (with CMOV)

xor ecx, ecx ; 0

cdq ; a < 0 ? -1 : 0

cmp eax, 1 ; |a| < 1 ? CY : NC

lea edx, [edx*2+1] ; a < 0 ? -1 : 1

cmovc edx, ecx ; sgn(a) = |a| < 1 ? 0 : (a < 0 ? -1 : 1)

Since I didn't make the constraints too specific, there are probably

multiple "optimal" solutions. I wouldn't be surprised if Terje has a

3-instruction solution stashed away in his bag-of-tricks :-)

-- Norbert

-----------== Posted via Deja News, The Discussion Network ==----------

http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own

Mar 8, 1999, 3:00:00 AM3/8/99

to

> Since I didn't make the constraints too specific, there are probably

> multiple "optimal" solutions. I wouldn't be surprised if Terje has a

> 3-instruction solution stashed away in his bag-of-tricks :-)

Here's a 3 instruction one for you:

cdq

cmp edx,eax

adc edx,0

;edx now contains sgn(eax)

Is that more optimal? I'm sure it's better than your CMOV version too.

Who is this Terje anyway? =)

- vulture a.k.a. Sean Stanek

Mar 8, 1999, 3:00:00 AM3/8/99

to

In article <7c0ut5$srj$1...@winter.news.rcn.net>,

Srstanek <srst...@aol.com> wrote:

>

>> Since I didn't make the constraints too specific, there are probably

>> multiple "optimal" solutions. I wouldn't be surprised if Terje has a

>> 3-instruction solution stashed away in his bag-of-tricks :-)

>

>

>Here's a 3 instruction one for you:

>

> cdq

> cmp edx,eax

> adc edx,0

>

>;edx now contains sgn(eax)

>

>Is that more optimal? I'm sure it's better than your CMOV version too.

Srstanek <srst...@aol.com> wrote:

>

>> Since I didn't make the constraints too specific, there are probably

>> multiple "optimal" solutions. I wouldn't be surprised if Terje has a

>> 3-instruction solution stashed away in his bag-of-tricks :-)

>

>

>Here's a 3 instruction one for you:

>

> cdq

> cmp edx,eax

> adc edx,0

>

>;edx now contains sgn(eax)

>

>Is that more optimal? I'm sure it's better than your CMOV version too.

Who needs three instructions.....

This is better is as it removes the Pentium unpairable, 2 cycle, cdq. If

you can interleave other instructions, this can execute in 1 cycle

on a Pentium.

sar eax,31 ; eax= -1 or 0 1 U

lea eax,[2*eax+1] ; eax= -1 or 1 1 UV

--

Eric Korpela | An object at rest can never be

kor...@ssl.berkeley.edu | stopped.

<a href="http://sag-www.ssl.berkeley.edu/~korpela">Click for home page.</a>

Mar 8, 1999, 3:00:00 AM3/8/99

to

In article <7c0ut5$srj$1...@winter.news.rcn.net>,

srst...@aol.com (Srstanek) wrote:

>

> > Since I didn't make the constraints too specific, there are probably

> > multiple "optimal" solutions. I wouldn't be surprised if Terje has a

> > 3-instruction solution stashed away in his bag-of-tricks :-)

>

> Here's a 3 instruction one for you:

>

> cdq

> cmp edx,eax

> adc edx,0

>

> ;edx now contains sgn(eax)

>

> Is that more optimal? I'm sure it's better than your CMOV version too.

srst...@aol.com (Srstanek) wrote:

>

> > Since I didn't make the constraints too specific, there are probably

> > multiple "optimal" solutions. I wouldn't be surprised if Terje has a

> > 3-instruction solution stashed away in his bag-of-tricks :-)

>

> Here's a 3 instruction one for you:

>

> cdq

> cmp edx,eax

> adc edx,0

>

> ;edx now contains sgn(eax)

>

> Is that more optimal? I'm sure it's better than your CMOV version too.

Well, I guess it can't get any better than that. I tried conditional

increments like you use, but I never managed to hit upon this solution.

I think I am going to retire as a code optimizer :-(

Mar 8, 1999, 3:00:00 AM3/8/99

to

In article <7c1600$87m$1...@winter.news.rcn.net>,

Eric J. Korpela <kor...@islay.ssl.berkeley.edu> wrote:

>This is better is as it removes the Pentium unpairable, 2 cycle, cdq. If

>you can interleave other instructions, this can execute in 1 cycle

>on a Pentium.

Eric J. Korpela <kor...@islay.ssl.berkeley.edu> wrote:

>This is better is as it removes the Pentium unpairable, 2 cycle, cdq. If

>you can interleave other instructions, this can execute in 1 cycle

>on a Pentium.

I should clarify this for those who are about to tell me that the start and

the result will always happen in different cycles. By 1 cycles I mean

each instruction can be paired with another so averaged, each

instruction takes 1/2 cycle.

Eric

Mar 8, 1999, 3:00:00 AM3/8/99

to

> Srstanek <srst...@aol.com> wrote:

> >

> >> Since I didn't make the constraints too specific, there are probably

> >> multiple "optimal" solutions. I wouldn't be surprised if Terje has a

> >> 3-instruction solution stashed away in his bag-of-tricks :-)

> >

> >

> >Here's a 3 instruction one for you:

> >

> > cdq

> > cmp edx,eax

> > adc edx,0

> >

> >;edx now contains sgn(eax)

> >

> >Is that more optimal? I'm sure it's better than your CMOV version too.

>

> Who needs three instructions.....> >

> >> Since I didn't make the constraints too specific, there are probably

> >> multiple "optimal" solutions. I wouldn't be surprised if Terje has a

> >> 3-instruction solution stashed away in his bag-of-tricks :-)

> >

> >

> >Here's a 3 instruction one for you:

> >

> > cdq

> > cmp edx,eax

> > adc edx,0

> >

> >;edx now contains sgn(eax)

> >

> >Is that more optimal? I'm sure it's better than your CMOV version too.

>

>

> This is better is as it removes the Pentium unpairable, 2 cycle, cdq. If

> you can interleave other instructions, this can execute in 1 cycle

> on a Pentium.

>

> sar eax,31 ; eax= -1 or 0 1 U

> lea eax,[2*eax+1] ; eax= -1 or 1 1 UV

> lea eax,[2*eax+1] ; eax= -1 or 1 1 UV

Unfortunately this does not return 0 if EAX was 0, as initially

required. You'd have to add another two simple instructions, e.g.

cmp eax, 1

sbb ecx, ecx

sar eax, 31

lea eax, [eax * 2 + ecx + 1]

Interleaved with something else, of course.

Regards,

Michael Tippach

Mar 8, 1999, 3:00:00 AM3/8/99

to

Srstanek wrote:

>

> > Since I didn't make the constraints too specific, there are probably

> > multiple "optimal" solutions. I wouldn't be surprised if Terje has a

> > 3-instruction solution stashed away in his bag-of-tricks :-)

>

> Here's a 3 instruction one for you:

>

> cdq

> cmp edx,eax

> adc edx,0

>

> ;edx now contains sgn(eax)

>

> > Since I didn't make the constraints too specific, there are probably

> > multiple "optimal" solutions. I wouldn't be surprised if Terje has a

> > 3-instruction solution stashed away in his bag-of-tricks :-)

>

> Here's a 3 instruction one for you:

>

> cdq

> cmp edx,eax

> adc edx,0

>

> ;edx now contains sgn(eax)

This is a fairly well-known sequence, I'd guess it is one of those that

could be located by the (in)famous Superoptimizer.

On many x86 cpus the CDQ is very slow, so it might be better to split it

into a MOV + SAR pair instead:

mov edx,eax

sar edx,31 ; -1 or 0, correct for negative and zero input

cmp edx,eax ; Carry iff 0 < EAX <= 0x7fffffff

adc edx,0 ; +1 if positive

>

> Is that more optimal? I'm sure it's better than your CMOV version too.

>

> Who is this Terje anyway? =)

Just another asm programmer. :-)

Terje

--

- <Terje.M...@hda.hydro.com>

Using self-discipline, see http://www.eiffel.com/discipline

"almost all programming can be viewed as an exercise in caching"

Mar 8, 1999, 3:00:00 AM3/8/99

to

> On many x86 cpus the CDQ is very slow, so it might be better to split it

> into a MOV + SAR pair instead:

>

> mov edx,eax

> sar edx,31 ; -1 or 0, correct for negative and zero input

> cmp edx,eax ; Carry iff 0 < EAX <= 0x7fffffff

> adc edx,0 ; +1 if positive

Well, that's 4 instructions... Norbert wanted 3. =)

Also, SHR takes 2 cycles, MOV takes 1... that's 3 cycles. That cannot be

pipelined as EDX is used as target in both instructions. CDQ is not very slow.

2 cycles on a 386, 3 on a 486, 2(?) on a pentium. All the rest is the same. The

CDQ one might be faster though. :D

Mar 9, 1999, 3:00:00 AM3/9/99

to

In article <7c18e5$m1n$1...@winter.news.rcn.net>,

kor...@islay.ssl.berkeley.edu (Eric J. Korpela) wrote:

> In article <7c1600$87m$1...@winter.news.rcn.net>,

> Eric J. Korpela <kor...@islay.ssl.berkeley.edu> wrote:

kor...@islay.ssl.berkeley.edu (Eric J. Korpela) wrote:

> In article <7c1600$87m$1...@winter.news.rcn.net>,

> Eric J. Korpela <kor...@islay.ssl.berkeley.edu> wrote:

> >This is better is as it removes the Pentium unpairable, 2 cycle, cdq. If

> >you can interleave other instructions, this can execute in 1 cycle

> >on a Pentium.

>

> >you can interleave other instructions, this can execute in 1 cycle

> >on a Pentium.

>

> I should clarify this for those who are about to tell me that the start and

> the result will always happen in different cycles. By 1 cycles I mean

> each instruction can be paired with another so averaged, each

> instruction takes 1/2 cycle.

>

> Eric

> the result will always happen in different cycles. By 1 cycles I mean

> each instruction can be paired with another so averaged, each

> instruction takes 1/2 cycle.

>

> Eric

Actually, CDQ works just fine on modern out-of-order x86 processors. You

are correct, it is a 2 cycle instruction on both Pentium and K6. I have

no idea why that is. All it is is a SAR by 31 with EAX as source and EDX

as destination which doesn;t update the flags. Anyhow, I much prefer CDQ

since it's a one byte instruction. If you replace it, you need 2 dependent

instructions in order to get the same semantics (i.e. don't overwrite the

source) and that takes two cycles, too.

Mar 9, 1999, 3:00:00 AM3/9/99

to

In article <7c18e2$lvu$1...@winter.news.rcn.net>,

Michael Tippach <Spammer....@wuschel.demon.co.uk> wrote:

>>

>> sar eax,31 ; eax= -1 or 0 1 U

>> lea eax,[2*eax+1] ; eax= -1 or 1 1 UV

>

>Unfortunately this does not return 0 if EAX was 0, as initially

>required.

Michael Tippach <Spammer....@wuschel.demon.co.uk> wrote:

>>

>> sar eax,31 ; eax= -1 or 0 1 U

>> lea eax,[2*eax+1] ; eax= -1 or 1 1 UV

>

>Unfortunately this does not return 0 if EAX was 0, as initially

>required.

Ahh, I didn't catch that. Not enough caffeine today I guess.

IMHO your 4 instruction solution beats a 3 instruction solution

with a CDQ at least on Pentium.

Mar 9, 1999, 3:00:00 AM3/9/99

to

In article <7c1f9l$4hb$1...@winter.news.rcn.net>,

On Pentium SAR takes 1 cycle and is U pairable. This is definitely the

better solution if you can interleave this code with other code. CDQ takes 2

cycles and is not pairable. At worst, the above is 4 cycles at best

(inteleaved) it is 2 cycles. At worst the CDQ solution is 4.5 cycles

(i.e. the U pipe but not the V pipe is in use when the CDQ hits). At best

(interleaved) the CDQ solution is 3 cycles.

Mar 9, 1999, 3:00:00 AM3/9/99

to

> On Pentium SAR takes 1 cycle and is U pairable. This is definitely the

> better solution if you can interleave this code with other code. CDQ

> takes 2

MOV + SHR of the same register is not pairable. I suppose you can pair it by

interleaving other code, but that wasn't the task at hand. :P

On a 386, SHR = 3 cycles, MOV=1, CDQ=2. For a lower processor, CDQ is a good

answer. On higher processors, CDQ will likely be downed to 1, and is one

instruction.

> cycles and is not pairable. At worst, the above is 4 cycles at best

> (inteleaved) it is 2 cycles. At worst the CDQ solution is 4.5 cycles

> (i.e. the U pipe but not the V pipe is in use when the CDQ hits). At best

> (interleaved) the CDQ solution is 3 cycles.

MOV+SHR even though you can interleave does not mean those cycles do not count.

They still get their 1/2 value. =)

So, 3 at best.

Mar 9, 1999, 3:00:00 AM3/9/99

to

juffa wrote:

>I was wondering what the fastest inline code for an integer sgn() function

>could look like. It's defined as follows:

>

>int sgn(int a)

>{

> if (!a) {

> return (0);

> } else if (a < 0) {

> return (-1);

> } else {

> return (1);

> }

>}

>

>The inline assembly code for this receives the argument in EAX, and should

>use only EDX and ECX. It should preserve EAX if possible. This is not for a

>specific x86 processor, but assume it should be "optimal" for an

out-of-order

>processor (i.e. one doesn't have to pay attention to instruction pairing

and

>AGIs). CMOVs are acceptable, but would limit the code to P6 and compatible

>machines of course.

>

>...

CDQ // -1 0 0

EAX NEG // set CF if EAX <> 0

EDX EDX ADC // -1 0 1

Sorry! EAX is not preserved.

Charles Liu

Mar 9, 1999, 3:00:00 AM3/9/99

to

<ju...@my-dejanews.com> wrote in message

news:7c0pen$s62$1...@winter.news.rcn.net...

>I was wondering what the fastest inline code for an integer sgn() function

>could look like.

news:7c0pen$s62$1...@winter.news.rcn.net...

>I was wondering what the fastest inline code for an integer sgn() function

>could look like.

[snip]

>The best code I have so far is based on generating a 1 for positive

numbers,

>a -1 for negative numbers, and then conditionally zeroing the result if the

>input was 0. I tried many other approaches, like conditionally incrementing

>and decrementing, using multiple CMOVs, etc.

[snip]

> ; eax = input a (with CMOV)

>

> xor ecx, ecx ; 0

> cdq ; a < 0 ? -1 : 0

> cmp eax, 1 ; |a| < 1 ? CY : NC

> lea edx, [edx*2+1] ; a < 0 ? -1 : 1

> cmovc edx, ecx ; sgn(a) = |a| < 1 ? 0 : (a < 0 ? -1 : 1)

I'm want to try this code but my assembler (BASM in Delphi) doesn't

recognize the cmovc instruction. I need to know the encoding of

cmovc edx, ecx

so that I can manually insert it with db's (or a dd). I think it is

something like 0x0f4187d2, but this doesn't work. (I have a Pentium II so

the cmov instruction is available)

Thanks,

Roelof

------

remove the obvious from my reply address...

Mar 9, 1999, 3:00:00 AM3/9/99

to

Eric J. Korpela wrote:

>

> In article <7c1f9l$4hb$1...@winter.news.rcn.net>,

> Srstanek <srst...@aol.com> wrote:

> >

> >> On many x86 cpus the CDQ is very slow, so it might be better to split it

> >> into a MOV + SAR pair instead:

> >>

> >> mov edx,eax

> >> sar edx,31 ; -1 or 0, correct for negative and zero input

> >> cmp edx,eax ; Carry iff 0 < EAX <= 0x7fffffff

> >> adc edx,0 ; +1 if positive

> >

> >Well, that's 4 instructions... Norbert wanted 3. =)

> >Also, SHR takes 2 cycles, MOV takes 1... that's 3 cycles. That cannot be

> >pipelined as EDX is used as target in both instructions. CDQ is not very slow.

> >2 cycles on a 386, 3 on a 486, 2(?) on a pentium. All the rest is the same. The

> >CDQ one might be faster though. :D

>

>

> In article <7c1f9l$4hb$1...@winter.news.rcn.net>,

> Srstanek <srst...@aol.com> wrote:

> >

> >> On many x86 cpus the CDQ is very slow, so it might be better to split it

> >> into a MOV + SAR pair instead:

> >>

> >> mov edx,eax

> >> sar edx,31 ; -1 or 0, correct for negative and zero input

> >> cmp edx,eax ; Carry iff 0 < EAX <= 0x7fffffff

> >> adc edx,0 ; +1 if positive

> >

> >Well, that's 4 instructions... Norbert wanted 3. =)

> >Also, SHR takes 2 cycles, MOV takes 1... that's 3 cycles. That cannot be

> >pipelined as EDX is used as target in both instructions. CDQ is not very slow.

> >2 cycles on a 386, 3 on a 486, 2(?) on a pentium. All the rest is the same. The

> >CDQ one might be faster though. :D

>

> On Pentium SAR takes 1 cycle and is U pairable. This is definitely the

> better solution if you can interleave this code with other code. CDQ takes 2

> better solution if you can interleave this code with other code. CDQ takes 2

> cycles and is not pairable. At worst, the above is 4 cycles at best

> (inteleaved) it is 2 cycles. At worst the CDQ solution is 4.5 cycles

> (i.e. the U pipe but not the V pipe is in use when the CDQ hits). At best

> (interleaved) the CDQ solution is 3 cycles.

> (inteleaved) it is 2 cycles. At worst the CDQ solution is 4.5 cycles

> (i.e. the U pipe but not the V pipe is in use when the CDQ hits). At best

> (interleaved) the CDQ solution is 3 cycles.

There is one other possible idea, by going back to the definition:

First return 0 or -1 depending upon zero/nonzero input:

cmp eax,1

sbb ebx,ebx

At the same time, check for negative input:

mov edx,eax

sar edx,31

If the input was non-zero but not negative, return +1 instead of -1:

xor edx,ebx

sub edx,ebx

It's obvious that this version will have the same best-case latency of 4

cycles on a PII, and it will be a cycle slower on a Pentium, so the

4-instruction version is better.

Mar 9, 1999, 3:00:00 AM3/9/99

to

Roelof Engelbrecht wrote in message <7c3ctq$1sl$4...@winter.news.rcn.net>...

>I want to try this code but my assembler (BASM in Delphi) doesn't

>recognize the cmovc instruction. I need to know the encoding of

>

>cmovc edx, ecx

>I want to try this code but my assembler (BASM in Delphi) doesn't

>recognize the cmovc instruction. I need to know the encoding of

>

>cmovc edx, ecx

Don't worry, I found it; it's 0x0f42d1.

Mar 12, 1999, 3:00:00 AM3/12/99

to

>It's obvious that this version will have the same best-case latency of 4

>cycles on a PII, and it will be a cycle slower on a Pentium, so the

>4-instruction version is better.

>

> Terje

>

Well, I do have to agree with you on that for now. I suppose there's pipelining

capabilities, etc. CDQ may, in the future, become pipelineable and at 1 clock.

And since Michael Abrash seems to know you, there's no point in arguing any

furthur. ;)

Mar 12, 1999, 3:00:00 AM3/12/99

to

Srstanek wrote:

>

> >It's obvious that this version will have the same best-case latency of 4

> >cycles on a PII, and it will be a cycle slower on a Pentium, so the

> >4-instruction version is better.

> >

> > Terje

> >

> >- <Terje.M...@hda.hydro.com>

>

> Well, I do have to agree with you on that for now. I suppose there's pipelining

> capabilities, etc. CDQ may, in the future, become pipelineable and at 1 clock.

>

> >It's obvious that this version will have the same best-case latency of 4

> >cycles on a PII, and it will be a cycle slower on a Pentium, so the

> >4-instruction version is better.

> >

> > Terje

> >

> >- <Terje.M...@hda.hydro.com>

>

> Well, I do have to agree with you on that for now. I suppose there's pipelining

> capabilities, etc. CDQ may, in the future, become pipelineable and at 1 clock.

Yeah, it is perfectly conceivable that CDQ will become a regular Px

micro-ops.

>

> And since Michael Abrash seems to know you, there's no point in arguing any

> furthur. ;)

I've known Mike for more than 10 years, but that's absolutely no reason

to not test stuff yourself. I have absolutely no monopoly on good ideas.

Reply all

Reply to author

Forward

0 new messages

Search

Clear search

Close search

Google apps

Main menu