Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Add with carry in AXP?

0 views
Skip to first unread message

Fat Phil

unread,
Jan 21, 2001, 4:57:52 PM1/21/01
to
I wish to do a 192 bit shift left by 1 bit. g++ produces about 20 instructions, as it thinks that I actually want to shift right by 63 bits occasionally. I want something along the lines of add bot, bot, bot add/carry mid, mid, mid add/carry top, top, top Does the AXP have such an instruction? I can't see it in Bhandarkar, and seem to have lost all other AXP references. (Actually, a piece of inline g++ asm would be the tops!) Greets to those who participated in the 'demon bit twiddle' thread on clax - this is the same application! Phil

James Van Buskirk

unread,
Jan 21, 2001, 9:56:02 PM1/21/01
to

Fat Phil wrote in message <3A6B5B9C...@altavista.com>...

>Does the AXP have such an instruction? I can't see it in Bhandarkar, and
>seem to have lost all other AXP references.
>(Actually, a piece of inline g++ asm would be the tops!)

Kind of hard to do an ADC on a processor that has no CF! Still,
this would take 11 uops on a PentiumPro, so the following code
isn't as arduous as it looks.

# Assume the 192 bits are in $0 (low), $1, and $2 (high.)
srl $0, 63, $3
addq $2, 1, $4
cmovge $1, $2, $4
addq $1, $1, $1

addq $0, $0, $0
addq $1, $3, $1
addq $2, $4, $2

>Greets to those who participated in the 'demon bit twiddle' thread on
>clax - this is the same application!

What approach did you finally decide to use for that operation?

Fat Phil

unread,
Jan 22, 2001, 4:22:09 AM1/22/01
to
James Van Buskirk wrote:
> Fat Phil wrote in message <3A6B5B9C...@altavista.com>...
>
> >Does the AXP have such an instruction? I can't see it in Bhandarkar, and
> >seem to have lost all other AXP references.
> >(Actually, a piece of inline g++ asm would be the tops!)
>
> Kind of hard to do an ADC on a processor that has no CF!

Erk! That's wierd.

> Still,
> this would take 11 uops on a PentiumPro, so the following code
> isn't as arduous as it looks.

It looks like 3uops when properly implemented. I'm sure I've also seen
some processors which had a carry-hijack feature (dunno what they called
it), but it permitted the next instruction to access the carry flag
before it was propogated out to the registers. (I think PPro does this
for test/cjump instruction pairs?) So that should (morally) be <=3
ticks.

I'll send another reply later, when I've analysed the code, and I'll
enclose g++'s attempt too.

Phil

Phil Carmody

unread,
Jan 22, 2001, 7:07:21 AM1/22/01
to
James Van Buskirk wrote:
> Kind of hard to do an ADC on a processor that has no CF! Still,
> this would take 11 uops on a PentiumPro, so the following code
> isn't as arduous as it looks.

Re earlier remark - oops, I forgot that intel's integer register file is
still stuck in the 32bit world, make that 6 operations.


Comments are mine below:

> # Assume the 192 bits are in $0 (low), $1, and $2 (high.)

> srl $0, 63, $3 ; 1: E0 $3 = low>>63
> addq $2, 1, $4 ; E1 $4 = high+1
> cmovge $1, $2, $4 ; 2: E0 $4 = some condition on mid ? high : high + 1
> addq $1, $1, $1 ; E1 $1 = mid<<1
>
> addq $0, $0, $0 ; 3: E0 $0 = low<<1
> addq $1, $3, $1 ; E1 $1 = mid<<1 | low>>63
> addq $2, $4, $2 ; 4: E0 $2 = high<<1 | conditionally 1

I'm confused by the CMOV. Bloody Bhandarkar wastes 328 pages gibbering
on about Sparc, HPPA, Power/PPC etc, and forgets to actually explain
what the AXP does. (Yes, it calls itself the "Complete" "Alpha"
"Architecture" "Reference", if you permit me to permute the words in its
title, which is what I did when I bought it).

Oh for reference G++'s 192 bit add: (it has also has 6 load and stores
of course)
srl $2,63,$4
srl $3,63,$5
addq $1,$1,$1
addq $2,$2,$2
bis $1,$4,$1
bis $2,$5,$2
addq $3,$3,$3
I'd be tempted to interleave the srls and addqs, as srl is E0 only, but
apart from that I think I prefer it to yours, as the >>63s are a bit
more obvious than splitting the 2mid+1 into (mid)+(mid+1). (given the
lack of a carry flag).
What made you do your CMOV route?

> >Greets to those who participated in the 'demon bit twiddle' thread on
> >clax - this is the same application!
>
> What approach did you finally decide to use for that operation?

I prepare each stage with a full mask, and then I perform that stage
repeatedly asking it to temporarily remove a very sparse subset of bits,
recurse from that.

Prepare stage N:
count bits into buckets

Perform stage N:
Condense the 'remove' bits into width(N) buckets
Prepare stage N+1 with our current full set of bits
For each bit position 1..width(N)
If bucket minus optionally removed bit has >=2 bits in it
Perform stage N+1 without the bits in that bit position
If any buckets had 0 or 1 bits
Perform stage N+1 with no extra removed bits.

I know in theory I could recurse the 'delta' operation even more, so
that the prepare stage was prepared using a delta from its previous
preparation, but that adds another level of complexity. My biggest
desire now is to add intelligence to the recursion process to stop it
recursing too far, and to make my leaf-node as efficient as possible
(it's a horrible graph-colouring problem).


You can be sure that there will be more questions as time goes on, the
task has only just begun...

Phil

James Van Buskirk

unread,
Jan 22, 2001, 12:52:10 PM1/22/01
to

Phil Carmody wrote in message <3A6C27EA...@trshp.ntc.nokia.com>...

>Re earlier remark - oops, I forgot that intel's integer register file is
>still stuck in the 32bit world, make that 6 operations.

And each ADC is 2 uops on a PentiumPro.

>> cmovge $1, $2, $4 ; 2: E0 $4 = some condition on mid ? high :
high + 1

The condition is "if(mid >= 0)" i.e. "if sign (high) bit of mid not set."

>I'm confused by the CMOV. Bloody Bhandarkar wastes 328 pages gibbering
>on about Sparc, HPPA, Power/PPC etc, and forgets to actually explain
>what the AXP does.

Well, there are references like alphahb2.pdf and 164hrm.pdf that
you can download.

>Oh for reference G++'s 192 bit add: (it has also has 6 load and stores
>of course)

Why aren't you using Compaq's compilers for this? CVF at least
can handle right shifts of constant length and CMOVxx. The loads
and stores almost sound like you're compiling without any form
of optimization enabled, or else dereferencing pointers.

>What made you do your CMOV route?

Variety is the spice of life. Sometimes you want to keep E0 open
for other things like stores and sometimes you don't want to work
around the 2 clocks latency of CMOVxx. At times you can get the
compiler to do one form well but not the other.

Phil Carmody

unread,
Jan 22, 2001, 1:30:01 PM1/22/01
to
teeheehee, I've been polling a.l.a every 15 minutes today, waiting for a
reply ;-)

James Van Buskirk wrote:
> >> cmovge $1, $2, $4 ; 2: E0 $4 = some condition on mid ? high :
> high + 1
>
> The condition is "if(mid >= 0)" i.e. "if sign (high) bit of mid not set."
>
> >I'm confused by the CMOV. Bloody Bhandarkar wastes 328 pages gibbering
> >on about Sparc, HPPA, Power/PPC etc, and forgets to actually explain
> >what the AXP does.
>
> Well, there are references like alphahb2.pdf and 164hrm.pdf that
> you can download.

I found them once, they've been in my home directory, on a floppy, on my
laptop, and on a machine at home in the last 3 months. Now - they appear
to be nowhere. Will remedy...



> >Oh for reference G++'s 192 bit add: (it has also has 6 load and stores
> >of course)
>
> Why aren't you using Compaq's compilers for this? CVF at least
> can handle right shifts of constant length and CMOVxx. The loads
> and stores almost sound like you're compiling without any form
> of optimization enabled, or else dereferencing pointers.

Ah, I'm on what a RedHat Alpha guru once called a "******* piece of
crap" - namely RH6.0
I'm going to do a total reinstall of Debian when I feel brave (i.e. have
someone to hold my hand).
Consequently I'm using old tools. I'm, using -O2, which I know isn't the
bells and whistles optimisation, but claims to be fairly thorough. I'll
crank up to -O6 when I get home.
When the system is stable, I'll put ccc onto it, as discussions have
hinted that this is a superior compiler.

Alas I'm using a struct containing an array of 3 uint64s. G++ wants to
keep these out of registers wherever possible... grrr! I pass using
const*.
I may optimise it to pass 3 uint64s in the future.


Phil

0 new messages