I am currently trying to make optimized functions for an unsigned [64-bit /
32-bit = 32-bit] and an unsigned [32-bit / 16-bit = 16-bit] division in ARM
Assembly. So far, I've come up with this code for the 32/16=16:
RSB(r1, r1, 0);
MOVS(r1, r1, 15);
ADDS(r0, r1, r0, 0);
RSBCC(r0, r1, r0);
for i := 1 to 15 do begin
ADCS(r0, r1, r0, 1);
RSBCC(r0, r1, r0);
end;
ADC(r0, r0, r0, 0);
MOV(r1, r0, -16);
BIC(r0, r0, r1, 16);
(don't mind the "for" loop, I'm simulating in Delphi :^)
Anyone can help me do the u-64/32=32 division? I've been trying for days.
Thanks,
Mart
10*log2(r0/r1) cycles
for the 64/32=32, you will have to pass r0 and r1 in 2 registers and do the
comparisons/substractions/shifts on the 2 registers.
Regards, Cyrille
PS: I did not test the function, but it should work:-) You will also have to
code the DivByZero and Overflow part of it depending on your needs
@r0=r0/r1 (uses r0, r1, r2, r3)
cmp r0, #0
moveq pc, lr ; return 0 if r0=0
cmp r1, #0
beq DivByZero
mov r3, #1 ; current tested bit
mov r2, #0 ; result
@shift r3 and r1 until r1 > r0
Loop1
cmp r1, #0x80000000
cmpcc r1, r0
movcc r1, r1, lsl #4
movcc r3, r3, lsl #4
bcc Loop1
Loop2
cmp r1, #0x80000000
cmpcc r1, r0
movcc r1, r1, lsl #1
movcc r3, r3, lsl #1
bcc Loop2
cmp r3, #0x10000
bge Overflow
; this is the core of the function. It's aim is to extract the power of 2 of
r1 from r0 fir each possible power of 2.
;ie: r0=sigma(n=0 to n=15 of r1*2^n).
; so r0/r1=sigma(n=15 to n=0 of if r0>=r1*2^n then r0-=r0-r1*2^n, result=
result+2^n)
; the loop is unrolled for maximum speed
Loop3
cmp r0, r1
subcs r0, r0, r1
orrcs r2, r2, r3
cmp r0, r1, lsr #1
subcs r0, r0, r1, lsr #1
orrcs r2, r2, r3, lsr #1
cmp r0, r1, lsr #2
subcs r0, r0, r1, lsr #2
orrcs r2, r2, r3, lsr #2
cmp r0, r1, lsr #3
subcs r0, r0, r1, lsr #3
orrcs r2, r2, r3, lsr #3
cmp r0, #0
movnes r3, r3, lsr #4
movne r1, r1, lsr #4
bne Loop3
mov r0, r2 ; result in r0
mov pc, lr
"Martin Ouellet" <elm...@total.net> wrote in message
news:98ogd9$56...@NWNEWS.PCT.EDU...
> Wilco> That is indeed the optimal version.
>
> Are you sure that computing the inverse and multiplying wouldn't be
> faster?
Yes, that is faster if you have the E extension. I did a version in about
27 cycles (the iterative version takes 40). If the divisor is a constant,
it takes less than 10 cycles.
But calculating the inverse fast and accurate using a small lookup table
is very very tricky...
Wilco
> (don't mind the "for" loop, I'm simulating in Delphi :^)
That is indeed the optimal version.
> Anyone can help me do the u-64/32=32 division? I've been trying for days.
The method is very similar to 32/16 division, except that it uses a 64-bit
shift:
RSB r2, r2, #0
ADDS r1, r1, r1
repeat 32 times
ADCS r0, r2, r0, LSL #1
RSBCC r0, r2, r0
ADCS r1, r1, r1
end
The 64-bit divident is in r0 and r1 (high part in r0), the divisor in r2.
After 32 iterations, the quotient is in r1, remainder in r0.
Wilco