64/32 bit division ARM Assembly

Martin Ouellet

unread,

Mar 14, 2001, 2:24:29 PM3/14/01

to

Hi,

I am currently trying to make optimized functions for an unsigned [64-bit /
32-bit = 32-bit] and an unsigned [32-bit / 16-bit = 16-bit] division in ARM
Assembly. So far, I've come up with this code for the 32/16=16:

RSB(r1, r1, 0);
MOVS(r1, r1, 15);
ADDS(r0, r1, r0, 0);
RSBCC(r0, r1, r0);

for i := 1 to 15 do begin
ADCS(r0, r1, r0, 1);
RSBCC(r0, r1, r0);
end;

ADC(r0, r0, r0, 0);
MOV(r1, r0, -16);
BIC(r0, r0, r1, 16);

(don't mind the "for" loop, I'm simulating in Delphi :^)

Anyone can help me do the u-64/32=32 division? I've been trying for days.

Thanks,
Mart

Message has been deleted

Cyrille de Brébisson

unread,

Mar 14, 2001, 4:34:24 PM3/14/01

to

Hello,
You can try this for the 32/16 one.
It will execute in roughly

10*log2(r0/r1) cycles

for the 64/32=32, you will have to pass r0 and r1 in 2 registers and do the
comparisons/substractions/shifts on the 2 registers.

Regards, Cyrille

PS: I did not test the function, but it should work:-) You will also have to
code the DivByZero and Overflow part of it depending on your needs

@r0=r0/r1 (uses r0, r1, r2, r3)
cmp r0, #0
moveq pc, lr ; return 0 if r0=0
cmp r1, #0
beq DivByZero
mov r3, #1 ; current tested bit
mov r2, #0 ; result

@shift r3 and r1 until r1 > r0
Loop1
cmp r1, #0x80000000
cmpcc r1, r0
movcc r1, r1, lsl #4
movcc r3, r3, lsl #4
bcc Loop1

Loop2
cmp r1, #0x80000000
cmpcc r1, r0
movcc r1, r1, lsl #1
movcc r3, r3, lsl #1
bcc Loop2

cmp r3, #0x10000
bge Overflow

; this is the core of the function. It's aim is to extract the power of 2 of
r1 from r0 fir each possible power of 2.
;ie: r0=sigma(n=0 to n=15 of r1*2^n).
; so r0/r1=sigma(n=15 to n=0 of if r0>=r1*2^n then r0-=r0-r1*2^n, result=
result+2^n)
; the loop is unrolled for maximum speed
Loop3
cmp r0, r1
subcs r0, r0, r1
orrcs r2, r2, r3
cmp r0, r1, lsr #1
subcs r0, r0, r1, lsr #1
orrcs r2, r2, r3, lsr #1
cmp r0, r1, lsr #2
subcs r0, r0, r1, lsr #2
orrcs r2, r2, r3, lsr #2
cmp r0, r1, lsr #3
subcs r0, r0, r1, lsr #3
orrcs r2, r2, r3, lsr #3
cmp r0, #0
movnes r3, r3, lsr #4
movne r1, r1, lsr #4
bne Loop3

mov r0, r2 ; result in r0
mov pc, lr

"Martin Ouellet" <elm...@total.net> wrote in message
news:98ogd9$56...@NWNEWS.PCT.EDU...

Message has been deleted

Wilco Dijkstra

unread,

Mar 15, 2001, 11:54:13 AM3/15/01

to

Bill Pringlemeir wrote:

> Wilco> That is indeed the optimal version.
>
> Are you sure that computing the inverse and multiplying wouldn't be
> faster?

Yes, that is faster if you have the E extension. I did a version in about
27 cycles (the iterative version takes 40). If the divisor is a constant,
it takes less than 10 cycles.
But calculating the inverse fast and accurate using a small lookup table
is very very tricky...

Wilco

Wilco Dijkstra

unread,

Mar 15, 2001, 5:52:21 AM3/15/01

to

Martin Ouellet wrote:
>
> Hi,
>
> I am currently trying to make optimized functions for an unsigned [64-bit /
> 32-bit = 32-bit] and an unsigned [32-bit / 16-bit = 16-bit] division in ARM
> Assembly. So far, I've come up with this code for the 32/16=16:

> (don't mind the "for" loop, I'm simulating in Delphi :^)

That is indeed the optimal version.

> Anyone can help me do the u-64/32=32 division? I've been trying for days.

The method is very similar to 32/16 division, except that it uses a 64-bit
shift:

RSB r2, r2, #0
ADDS r1, r1, r1
repeat 32 times
ADCS r0, r2, r0, LSL #1
RSBCC r0, r2, r0
ADCS r1, r1, r1
end

The 64-bit divident is in r0 and r1 (high part in r0), the divisor in r2.
After 32 iterations, the quotient is in r1, remainder in r0.

Wilco