Square Root

David Chester

unread,

Oct 3, 1998, 3:00:00 AM10/3/98

to

Hi,
Anyone know where I can get hold of a Integer Square root algorithm for ARM assembler.

Oh yeah, and where is the FAQ for this newsgroup kept?
--
David Chester

Peter Teichmann

unread,

Oct 4, 1998, 3:00:00 AM10/4/98

to

In message <2c76468f48%Da...@chescam.demon.co.uk>
David Chester <Da...@chescam.demon.co.uk> wrote:

> Hi,
> Anyone know where I can get hold of a Integer Square root algorithm for ARM assembler.

You could use the following iteration formula:

Y1 = ( Y0 + W / Y0 ) / 2

where Y1 = (N+1)th estimation
Y0 = Nth estimation
W = number you want the square root from

you stop the iteration if Y1=Y0.

care should be taken with the first estimation,if you take 1 it is fast
for small sqrts and slow for big. If you take 65536 it is the other way
round.

ARM code [r0=sqrt(r0)]:

stmfd r13!,{r14}

mov r3,r0 ;r3=W
mov r0,#16384*2 ;r0=start estimation * 2

loop
mov r14,r0,lsr#1
mov r0,r3
mov r1,r14
bl FDIV ;r0=r0/r1 r2=scrapped
add r0,r0,r14
cmp r14,r0,lsr#1
bne loop

mov r0,r0,lsr#1

ldmfd r13!,{pc}^

For the division look at my homepage please:
http://rcswww.urz.tu-dresden.de/~teich-p/adiv1e.html

On a Strongarm 202.8 MHz it takes between 2 and 2.5 seconds with the above
start estimation for 1000000 sqrts (2.5 for sqrt(1) and sqrt(4e9).

--
Peter Teichmann

----------------------------------------------------------------------------
Email: tei...@rcs.urz.tu-dresden.de WWW: rcswww.urz.tu-dresden.de/~teich-p

Mark Olleson

unread,

Oct 4, 1998, 3:00:00 AM10/4/98

to

In message <2c76468f48%Da...@chescam.demon.co.uk>
David Chester <Da...@chescam.demon.co.uk> wrote:

> Hi,
> Anyone know where I can get hold of a Integer Square root algorithm for
> ARM assembler.
>

> Oh yeah, and where is the FAQ for this newsgroup kept?

I'm, fairly sure that this has been asked before on c.s.a. - A search on
DejaNews will probably find it.

The nornal way of calculating square roots is by the use of the
Newton-Raphson theorum which is an iterative approximation.

--
.................................................................
......Make Everything as Simple as possible, but not simpler.....
.................................................................

Mark Olleson, Electronic Engineer
ma...@vapour-trail.demon.co.uk, mjol...@iee.org.uk
A sorcerer is one who by commerce with the Devil has a full intention of
attaining his own ends.

Andreas Joos

unread,

Oct 4, 1998, 3:00:00 AM10/4/98

to

David Chester <Da...@chescam.demon.co.uk> wrote:

DC> Anyone know where I can get hold of a Integer Square root algorithm for
DC> ARM assembler.

This article appeared sometime in usent:

From: Mechanoid <dmal...@argonet.co.uk>
Reply-To: Mechanoid <dmal...@argonet.co.uk>
X-Newsreader: NewsAgent 0.84 for RISC OS

In article <na.41c12647fc.a7...@argonet.co.uk>, G N Dodson
<gerald...@argonet.co.uk> wrote:

> Title says it all really. I don't get on well with the fp op's but need to
> calculate correlation coefficients which involve SQRT's.

Here's mine:

; Optimised 32-bit integer sqrt
; Calculates square root of R0
; Result returned in R0, uses R1 and R2
; Worst case execution time exactly 70 S cycles

.intsqr
MOV r1,#0
CMP r1,r0,LSR #8
BEQ bit8 ; &000000xx numbers

CMP r1,r0,LSR #16
BEQ bit16 ; &0000xxxx numbers

CMP r1,r0,LSR #24
BEQ bit24 ; &00xxxxxx numbers

SUBS r2,r0,#&40000000
MOVCS r0,r2
MOVCS r1,#&10000

ADD r2,r1,#&4000
SUBS r2,r0,r2,LSL#14
MOVCS r0,r2
ADDCS r1,r1,#&8000

ADD r2,r1,#&2000
SUBS r2,r0,r2,LSL#13
MOVCS r0,r2
ADDCS r1,r1,#&4000

.bit24
ADD r2,r1,#&1000
SUBS r2,r0,r2,LSL#12
MOVCS r0,r2
ADDCS r1,r1,#&2000

ADD r2,r1,#&800
SUBS r2,r0,r2,LSL#11
MOVCS r0,r2
ADDCS r1,r1,#&1000

ADD r2,r1,#&400
SUBS r2,r0,r2,LSL#10
MOVCS r0,r2
ADDCS r1,r1,#&800

ADD r2,r1,#&200
SUBS r2,r0,r2,LSL#9
MOVCS r0,r2
ADDCS r1,r1,#&400

.bit16
ADD r2,r1,#&100
SUBS r2,r0,r2,LSL#8
MOVCS r0,r2
ADDCS r1,r1,#&200

ADD r2,r1,#&80
SUBS r2,r0,r2,LSL#7
MOVCS r0,r2
ADDCS r1,r1,#&100

ADD r2,r1,#&40
SUBS r2,r0,r2,LSL#6
MOVCS r0,r2
ADDCS r1,r1,#&80

ADD r2,r1,#&20
SUBS r2,r0,r2,LSL#5
MOVCS r0,r2
ADDCS r1,r1,#&40

.bit8
ADD r2,r1,#&10
SUBS r2,r0,r2,LSL#4
MOVCS r0,r2
ADDCS r1,r1,#&20

ADD r2,r1,#&8
SUBS r2,r0,r2,LSL#3
MOVCS r0,r2
ADDCS r1,r1,#&10

ADD r2,r1,#&4
SUBS r2,r0,r2,LSL#2
MOVCS r0,r2
ADDCS r1,r1,#&8

ADD r2,r1,#&2
SUBS r2,r0,r2,LSL#1
MOVCS r0,r2
ADDCS r1,r1,#&4

ADD r2,r1,#&1
CMP r0,r2
ADDCS r1,r1,#&2

MOV r0,r1,LSR#1 ; result in R0

Optimised by myself, but I dunno where I got the original from.

If by 'integer' you meant a fixed point routine, then I've got one of them
too.

Cheers,

--
Dan "Mech" Maloney.
__ _______ ______ __
/ |/ / __/ ___/ /_/ / # Mechanoid the Haemorroidal Android
/ /|_/ / _// /__/ __ / # Disclaimer: I may have been wrong deliberately
/_/ /_/___/\___/_/ /_/ # mailto:mech...@usa.net

---
* Origin: ArgoNet, but does not reflect its views (2:246/1416.500)

Robert Harley

unread,

Oct 5, 1998, 3:00:00 AM10/5/98

to

David Chester <Da...@chescam.demon.co.uk> writes:

>Anyone know where I can get hold of a Integer Square root algorithm for ARM assembler.

Here's what I use, in C. It returns floor(sqrt(n)):

==============================================================================
typedef unsigned int u32;

u32 isqrt(const u32 n);

/*-- isqrt -----------------------------------------------------------------*/

u32 isqrt(u32 n) {
u32 s, t;

#define sqrtBit(k) \
t = s+(1UL<<(k-1)); t <<= k+1; if (n >= t) { n -= t; s |= 1UL<<k; }

s = 0UL;
#ifdef __alpha
if (n >= 1UL<<62) { n -= 1UL<<62; s = 1UL<<31; }
sqrtBit(30); sqrtBit(29); sqrtBit(28); sqrtBit(27); sqrtBit(26);
sqrtBit(25); sqrtBit(24); sqrtBit(23); sqrtBit(22); sqrtBit(21);
sqrtBit(20); sqrtBit(19); sqrtBit(18); sqrtBit(17); sqrtBit(16);
sqrtBit(15);
#else
if (n >= 1UL<<30) { n -= 1UL<<30; s = 1UL<<15; }
#endif
sqrtBit(14); sqrtBit(13); sqrtBit(12); sqrtBit(11); sqrtBit(10);
sqrtBit(9); sqrtBit(8); sqrtBit(7); sqrtBit(6); sqrtBit(5);
sqrtBit(4); sqrtBit(3); sqrtBit(2); sqrtBit(1);
if (n > s<<1) s |= 1UL;

#undef sqrtBit

return s;
} /* end isqrt */
==============================================================================

I compiled it with a cross-compiler from x86 Linux to ARM, running the
compiler on Alpha Linux emulating an x86 (!):

==============================================================================
>/usr/local/gnuarm/arm-unknown-coff/bin/gcc -O2 -fomit-frame-pointer isqrt.c -o isqrt.s -S
>cat isqrt.s

@ Generated by gcc 2.8.1 for ARM/coff
.text
.align 0
.global _isqrt
_isqrt:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, current_function_anonymous_args = 0
mov r2, #0
cmn r0, #-1073741823
addhi r0, r0, #-1073741824
movhi r2, #32768
.L2:
add r3, r2, #8192
mov r3, r3, asl #15
cmp r0, r3
rsbcs r0, r3, r0
orrcs r2, r2, #16384
.L3:
add r3, r2, #4096
mov r3, r3, asl #14
cmp r0, r3
rsbcs r0, r3, r0
orrcs r2, r2, #8192
.L4:
add r3, r2, #2048
mov r3, r3, asl #13
cmp r0, r3
rsbcs r0, r3, r0
orrcs r2, r2, #4096
.L5:
add r3, r2, #1024
mov r3, r3, asl #12
cmp r0, r3
rsbcs r0, r3, r0
orrcs r2, r2, #2048
.L6:
add r3, r2, #512
mov r3, r3, asl #11
cmp r0, r3
rsbcs r0, r3, r0
orrcs r2, r2, #1024
.L7:
add r3, r2, #256
mov r3, r3, asl #10
cmp r0, r3
rsbcs r0, r3, r0
orrcs r2, r2, #512
.L8:
add r3, r2, #128
mov r3, r3, asl #9
cmp r0, r3
rsbcs r0, r3, r0
orrcs r2, r2, #256
.L9:
add r3, r2, #64
mov r3, r3, asl #8
cmp r0, r3
rsbcs r0, r3, r0
orrcs r2, r2, #128
.L10:
add r3, r2, #32
mov r3, r3, asl #7
cmp r0, r3
rsbcs r0, r3, r0
orrcs r2, r2, #64
.L11:
add r3, r2, #16
mov r3, r3, asl #6
cmp r0, r3
rsbcs r0, r3, r0
orrcs r2, r2, #32
.L12:
add r3, r2, #8
mov r3, r3, asl #5
cmp r0, r3
rsbcs r0, r3, r0
orrcs r2, r2, #16
.L13:
add r3, r2, #4
mov r3, r3, asl #4
cmp r0, r3
rsbcs r0, r3, r0
orrcs r2, r2, #8
.L14:
add r3, r2, #2
mov r3, r3, asl #3
cmp r0, r3
rsbcs r0, r3, r0
orrcs r2, r2, #4
.L15:
add r3, r2, #1
mov r3, r3, asl #2
cmp r0, r3
rsbcs r0, r3, r0
orrcs r2, r2, #2
.L16:
cmp r0, r2, asl #1
movls r0, r2
orrhi r0, r2, #1
mov pc, lr
==============================================================================

You could do the obvious optimisation replacing:

add r3, r2, #8192
mov r3, r3, asl #15
cmp r0, r3
rsbcs r0, r3, r0
orrcs r2, r2, #16384

with:

add r3, r2, #8192
cmp r0, r3, asl #15
subcs r0, r0, r3, asl #15
orrcs r2, r2, #16384

and so on, for 4 cycles per bit.

Bye,
Rob.

Rich Walker

unread,

Oct 5, 1998, 3:00:00 AM10/5/98

to

In message <2c76468f48%Da...@chescam.demon.co.uk>
David Chester <Da...@chescam.demon.co.uk> wrote:

> Hi,

> Anyone know where I can get hold of a Integer Square root
> algorithm for ARM assembler.

I find this piece of C is optimal, in the sense that NorCroft C on the ARM
produces exactly the assembler I'd expect.

static __pure unsigned int isqrt(unsigned int i)
{
int max;
int j;
unsigned int a, c, d, s;
a=0; c=0;
if (i>=0x10000) max=15; else max=7;
s = 1<<(max*2);
j=max;
do {
d = c + (a<<(j+1)) + s;
if (d<=i) { c=d; a |= 1<<j; };
if (d!=i) { s>>=2; j--; }
} while ((j>=0)&&(d!=i));
return a;
}

You may want to remove the __pure, and tell your C compiler some other way
that this is a pure function.

you may also wish to unroll this...

cheers,Rich.

--
Rich Walker: r...@shadow.org.uk (Shadow Robot Project)
http://www.shadow.org.uk 251 Liverpool Road
+44(0)171 700 2487 London N1 1LX
"Sometimes after an electrical storm I see in 5 dimensions"
-- Cornfed Pig, Duckman.

Wilco.D...@arm.com

unread,

Oct 6, 1998, 3:00:00 AM10/6/98

to

In article <2c76468f48%Da...@chescam.demon.co.uk>,

David Chester <Da...@chescam.demon.co.uk> wrote:
> Hi,
> Anyone know where I can get hold of a Integer Square root algorithm for ARM
assembler.

In article 1996/06/20 "3 cycle/bit square root (algorithm & code)" I wrote:

; IN : n 32 bit unsigned integer
; OUT: root = INT (SQRT (n))
; TMP: offset

MOV offset, #3 << 30
MOV root, #1 << 30
[ unroll for i = 0 .. 15
CMP n, root, ROR #2 * i
SUBHS n, n, root, ROR #2 * i
ADC root, offset, root, LSL #1
]
BIC root, root, #3 << 30 ; for rounding add: CMP n, root ADC root, #1

This routine takes 3 * 16 + 3 = 51 cycles for a root of a 32 bit integer.
Of course, there are many other possibilities. The routine can be used with
some extra code which uses early termination if n is less than 32 bit. Also
fixed point versions are possible, or roots of 64 bit numbers. Even looped
versions are no problem: Shift in 8 or 16 bits at a time,
so you need only to unroll 4 or 8 times (overhead 6 cycles).

----

I also found some C code that is optimized for ARM (4 cycle per bit):

typedef unsigned uint32;

#define iter1(N) \
try = root + (1 << (N)); \
if (n >= try << (N)) \
{ n -= try << (N); \
root |= 2 << (N); \
}

uint32 sqrt (uint32 n)
{
uint32 root = 0, try;
iter1 (15); iter1 (14); iter1 (13); iter1 (12);
iter1 (11); iter1 (10); iter1 ( 9); iter1 ( 8);
iter1 ( 7); iter1 ( 6); iter1 ( 5); iter1 ( 4);
iter1 ( 3); iter1 ( 2); iter1 ( 1); iter1 ( 0);
return root >> 1;
}

> Oh yeah, and where is the FAQ for this newsgroup kept?

There isn't one - maybe we need to create one?

Wilco

-----------== Posted via Deja News, The Discussion Network ==----------
http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own

Wilco.D...@arm.com

unread,

Oct 6, 1998, 3:00:00 AM10/6/98

to

In article <48904B3108%r...@shadow.org.uk>,
r...@shadow.org.uk wrote:
> In message <2c76468f48%Da...@chescam.demon.co.uk>

> David Chester <Da...@chescam.demon.co.uk> wrote:
>
> > Hi,
> > Anyone know where I can get hold of a Integer Square root
> > algorithm for ARM assembler.
>

> I find this piece of C is optimal, in the sense that NorCroft C on the ARM
> produces exactly the assembler I'd expect.
>
> static __pure unsigned int isqrt(unsigned int i)
> {
> int max;
> int j;
> unsigned int a, c, d, s;
> a=0; c=0;
> if (i>=0x10000) max=15; else max=7;
> s = 1<<(max*2);
> j=max;
> do {
> d = c + (a<<(j+1)) + s;
> if (d<=i) { c=d; a |= 1<<j; };
> if (d!=i) { s>>=2; j--; }
> } while ((j>=0)&&(d!=i));
> return a;
> }

You may want to replace it with this, it uses the same early
termination, but takes only 9 cycles per bit...

typedef unsigned int uint32;

uint32 isqrt3(uint32 n)
{
uint32 root = 0, bit, trial;
bit = (n >= 0x10000) ? 1<<30 : 1<<14;
do
{
trial = root+bit;
if (n >= trial)
{
n -= trial;
root = trial+bit;
}
root >>= 1;
bit >>= 2;
} while (bit);
return root;

Steven Singer

unread,

Oct 7, 1998, 3:00:00 AM10/7/98

to

Wilco.D...@arm.com wrote:
: In article <2c76468f48%Da...@chescam.demon.co.uk>,
: David Chester <Da...@chescam.demon.co.uk> wrote:
: > Oh yeah, and where is the FAQ for this newsgroup kept?

: There isn't one - maybe we need to create one?

With that in mind, and so that next time this question comes up there
will be a reference point, I've taken every code suggestion in this
thread and edited them together to make a document on integer square
root routines for ARM processors. I've even dug up Wilco's original
article on his 3 cycle/bit square root routine which shows his
derivation.

You can find it at http://www.finesse.demon.co.uk/steven/sqrt.html

This does not mean I'm volunteering to produce a FAQ, but if I get some
positive feedback I may be persuaded to similarly archive other code
snippets.

- Steven

David Chester

unread,

Oct 11, 1998, 3:00:00 AM10/11/98

to

Many thanks people,

setting up a FAQ would probably be a good Idea, maby some pointers to sites with arm code on?
--
David Chester