For one without MUL and DIV (or where the instructions operate on shorter
operands than would be required to form the square in question)?
Thanks for all ideas and thoughts.
I will check TOACP, of course.
The application is a low-cost microcontroller with a 3-axis accelerometer.
To get the magnitude of the acceleration, we'd want to use
sqrt(x^2+y^2+z^2). The squaring operations are cheap (MUL). The addition
is cheap (ADD, ADC). The square root extraction is what I'm not sure how
best to do.
The algorithm that comes to mind is just to do a binary search by squaring
potential solutions and comparing them to the quantity whose square root is
to be extracted.
For example, if the sum of the squares is 16 bits or less, and since the
processor has a 8 x 8 = 16 MUL instruction, it should be possible to iterate
to the solution in no more than 16 square/compare cycles.
But maybe there is a better way?
And it isn't clear what to do if the sum of the squares is a bit larger.
Thanks for all.
--
David T. Ashley (d...@e3ft.com)
http://www.e3ft.com (Consulting Home Page)
http://www.dtashley.com (Personal Home Page)
http://gpl.e3ft.com (GPL Publications and Projects)
There's never such a thing as "the best" way to do something that's been
described so vaguely. Different circumstances render different aspects
important, and thus change what "good" actually means.
> To get the magnitude of the acceleration, we'd want to use
> sqrt(x^2+y^2+z^2). The squaring operations are cheap (MUL). The addition
> is cheap (ADD, ADC). The square root extraction is what I'm not sure how
> best to do.
There's one exception to the above rule: not doing something at all, if
you can get away, is always the best way of doing it. So: do you really
*need* that sqrt() taken on your microcontroller?
Failing that you can still use use good old Newton-Raphson iteration to
locate the zero of (x^2 - input).
x |--> x - (x^2 - input) / (2*x)
= (x + input/x) / 2
David T. Ashley wrote:
> What is the best integer square root algorithm for a processor with MUL and
> DIV (integer multiplication and division) built-in?
Here is the simplest (but not the very fastest) sqrt algorithm for the
CPU with cheap multiplication:
//-------------------------
s16 sqrt_s16(s32 x)
{
s32 tmp_result;
s16 root, tmp_root, tmp_add;
root = 0;
tmp_add = 0x4000;
while(tmp_add)
{
tmp_root = root + tmp_add;
tmp_result = ((s32)tmp_root)*tmp_root;
if(tmp_result <= x) root = tmp_root;
tmp_add >>= 1;
}
return root;
}
//-----------------------
Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
Approximate the square root by the lowest half of the bit
representation
of x^2 + y^2 (or any other easy, convenient approximation), and then
use Newton-Raphson.
>What is the best integer square root algorithm for a processor with MUL and
>DIV (integer multiplication and division) built-in?
>
>For one without MUL and DIV (or where the instructions operate on shorter
>operands than would be required to form the square in question)?
>
>Thanks for all ideas and thoughts.
>
>I will check TOACP, of course.
>
>The application is a low-cost microcontroller with a 3-axis accelerometer.
>To get the magnitude of the acceleration, we'd want to use
>sqrt(x^2+y^2+z^2). The squaring operations are cheap (MUL). The addition
>is cheap (ADD, ADC). The square root extraction is what I'm not sure how
>best to do.
>
>The algorithm that comes to mind is just to do a binary search by squaring
>potential solutions and comparing them to the quantity whose square root is
>to be extracted.
>
>For example, if the sum of the squares is 16 bits or less, and since the
>processor has a 8 x 8 = 16 MUL instruction, it should be possible to iterate
>to the solution in no more than 16 square/compare cycles.
>
>But maybe there is a better way?
Jack Crenshaw had a series of columns on numerical methods in
"Embedded Systems" a few years ago. Has a good book on the subject, as
well. http://www.powells.com/biblio/72-9781929629091-0
(The past tense isn't meant to imply that his current columns aren't
great too... ;-)
As far as the columns, the one on integer square roots is at
<http://pdf.aiaa.org/jaPreview/JGCD/1997/PVJAIMP4026.pdf>
--
Rich Webb Norfolk, VA
David T. Ashley wrote:
> The application is a low-cost microcontroller with a 3-axis accelerometer.
> To get the magnitude of the acceleration, we'd want to use
> sqrt(x^2+y^2+z^2). The squaring operations are cheap (MUL). The addition
> is cheap (ADD, ADC). The square root extraction is what I'm not sure how
> best to do.
How accurate do you want to be?
let X = min(a,b)/max(a,b)
Then you can use a polynomial approximation:
sqrt(a^2 + b^2) ~ max(a,b)*(1 + P(X))
The simplest approximation of the 1st order (somewhat 10% accurate):
sqrt(a^2 + b^2) ~ max(a,b) + 0.318 * min(a,b)
In the 3-dimensional case, you have to do this procedure twice:
sqrt(x^2 + y^2 + z^2) = sqrt(max(x,y,z)^2 + sqrt(mid(x,y,z)^2 +
min(x,y,z)^2))^2
"David T. Ashley" wrote:
> What is the best integer square root algorithm for a processor with MUL and
> DIV (integer multiplication and division) built-in?
>
> For one without MUL and DIV (or where the instructions operate on shorter
> operands than would be required to form the square in question)?
For no MUL and DIV search for Dann Corbit's postings. Dann
posted one a few years ago with N/2 interations. If you can't find it
contact me off line.
The following square root routine was originally done for
the 6805. It is probably not the fastest
The basic loop spits out one bit of result per pass. This
routine takes a 16 bit argument and returns an eight bit
integer and 8 bit fraction.
Walter Banks
============================================================
#pragma option v;
#pragma has MUL;
long SQRT (long l)
/*
Square root V1.0.
Hayward/Banks.
This routine generates a square root to
eight bit resolution taking a 16 bit argument
normalizing the argument and returning a 16 bit
real number that consists of 8 bits integer and
8 bit fractional parts. The MS8bits are the integer
part.
i i i i i i i i . f f f f f f f f
| | | | | | | | | | | | | | | |
128 |32 | 8 4 2 1 .5 | | | | | | .00390625
64 16 .25 | | | | --.0078125
.125- | | ----.015625
.0625-- ------.03125
These routines may be adapted for any purpose. No
warranty is implied or given as to their usability for
any purpose.
(c) Byte Craft Limited
Waterloo, Ontario
Canada N2J 4E4
(519) 888-6911
Walter Banks/Gordon Hayward
*/
{
unsigned int n,j,k;
unsigned long t;
n = 0;
while( (l & 0xc000) == 0)
{
n++;
l <<= 2;
}
for (j = 0, k = 0x80; k != 0; k >>=1)
{
j = j | k;
t = j * j;
if (t > l) j = j & (~k);
}
t = j;
return (t << (8-n));
}
unsigned long result;
unsigned int int_part, frac_part;
void main(void)
{
result = SQRT(0x4000);
int_part = result >> 8;
frac_part = result & 0xff;
}
Actually you can do this in 8 multiplies. This is how I usually
implement
square root - I have probably reinvented it some 15+ years ago for
myself.
You just start with $80, square and compare to the number you want
the root of, and do successive approximation down to bit 0 (i.e.
second time
with $40+result so far, then $20 etc.).
Dimiter
------------------------------------------------------
Dimiter Popoff Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/sets/72157600228621276/
Original message: http://groups.google.com/group/comp.arch.embedded/msg/376b32a1ccd9dd46?dmode=source
> On May 1, 1:06 pm, "David T. Ashley" <d...@e3ft.com> wrote:
>
> > What is the best integer square root algorithm for a processor with MUL and
> > DIV (integer multiplication and division) built-in?
>
> > The application is a low-cost microcontroller with a 3-axis accelerometer.
> > To get the magnitude of the acceleration, we'd want to use
> > sqrt(x^2+y^2+z^2). The squaring operations are cheap (MUL). The addition
> > is cheap (ADD, ADC). The square root extraction is what I'm not sure how
> > best to do.
>
>
> Approximate the square root by the lowest half of the bit
> representation
> of x^2 + y^2 (or any other easy, convenient approximation), and then
> use Newton-Raphson.
If the values of x, y, z don't change too much from sample to sample
you could use the previous value as the starting point for Newton-
Raphson iteration.
For a different application (not involving acceleration or square
roots), I performed only a single Newton-Raphson iteration per sample.
For slow moving inputs, this gave perfect results, and for fast moving
inputs, I got a free low-pass filter :)
(I posted this also from TerraNews, but sometimes those posts take a
few days to show up):
Take a look here:
http://www.azillionmonkeys.com/qed/sqroot.html
A zillion monkeys can't all be wrong.
> What is the best integer square root algorithm for a processor with MUL
> and DIV (integer multiplication and division) built-in?
>
> For one without MUL and DIV (or where the instructions operate on
> shorter operands than would be required to form the square in
> question)?
>
> Thanks for all ideas and thoughts.
>
> I will check TOACP, of course.
>
> The application is a low-cost microcontroller with a 3-axis
> accelerometer. To get the magnitude of the acceleration, we'd want to
> use
> sqrt(x^2+y^2+z^2). The squaring operations are cheap (MUL). The
> addition
> is cheap (ADD, ADC). The square root extraction is what I'm not sure
> how best to do.
>
> The algorithm that comes to mind is just to do a binary search by
> squaring potential solutions and comparing them to the quantity whose
> square root is to be extracted.
>
> For example, if the sum of the squares is 16 bits or less, and since
> the processor has a 8 x 8 = 16 MUL instruction, it should be possible
> to iterate to the solution in no more than 16 square/compare cycles.
>
> But maybe there is a better way?
>
> And it isn't clear what to do if the sum of the squares is a bit
> larger.
Nice and simple method that works quite well.
--
********************************************************************
Paul E. Bennett...............<email://Paul_E....@topmail.co.uk>
Forth based HIDECS Consultancy
Mob: +44 (0)7811-639972
Tel: +44 (0)1235-811095
Going Forth Safely ..... EBA. www.electric-boat-association.org.uk..
********************************************************************
Well my algorithm takes less lines and is a lot easier to read than
those
in HLL which were suggested, but I hope it counts in spite of
that :-).
Here is it excerpted from a post I have made many years ago,
calculates a 16 bit squre root of a 32 bit number.
The example is in CPU32 (or CF) assembly.
*
* calc. the square root of d1.l; return it in d1.l
* destroys d2,d3,d4
*
sqrtd1
moveq #15,d2 counter
clr.l d3 colect result here
set loop,*
bset d2,d3 try this
move.w d3,d4 extra copy result so far
mulu.w d4,d4 square
cmp.l d1,d4 higher?
bls keepbit branch not
bclr d2,d3 else bit back down
keepbit
dbra d2,loop process all bits
move.l d3,d1 update in d1
rts
*
Dimiter
------------------------------------------------------
Dimiter Popoff Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/sets/72157600228621276/
Original message: http://groups.google.com/group/comp.arch.embedded/msg/13325873c2c05689?dmode=source
I have Mr. Crenshaw's book "Math Toolkit for REAL-TIME Programming"
"Square Root" runs from pages 43-87 and he discusses a boatload of
methods.
Unfortunately, I have misplaced the CD. Looking for it now.
First make sure you really need to find the square root. For many uses,
you could just use the calculated square of the magnitude.
A benchmark shows this to be the fastest on modern hardware, but I do
not know about embedded CPUs. I guess they should be tested.
unsigned long isqrt26(unsigned long input)
{
unsigned long s = 0;
unsigned long s21 = 1073741824;
if (s21 <= input)
s += 32768, input -= s21;
s21 = (s << 15) + 268435456;
if (s21 <= input)
s += 16384, input -= s21;
s21 = (s << 14) + 67108864;
if (s21 <= input)
s += 8192, input -= s21;
s21 = (s << 13) + 16777216;
if (s21 <= input)
s += 4096, input -= s21;
s21 = (s << 12) + 4194304;
if (s21 <= input)
s += 2048, input -= s21;
s21 = (s << 11) + 1048576;
if (s21 <= input)
s += 1024, input -= s21;
s21 = (s << 10) + 262144;
if (s21 <= input)
s += 512, input -= s21;
s21 = (s << 9) + 65536;
if (s21 <= input)
s += 256, input -= s21;
s21 = (s << 8) + 16384;
if (s21 <= input)
s += 128, input -= s21;
s21 = (s << 7) + 4096;
if (s21 <= input)
s += 64, input -= s21;
s21 = (s << 6) + 1024;
if (s21 <= input)
s += 32, input -= s21;
s21 = (s << 5) + 256;
if (s21 <= input)
s += 16, input -= s21;
s21 = (s << 4) + 64;
if (s21 <= input)
s += 8, input -= s21;
s21 = (s << 3) + 16;
if (s21 <= input)
s += 4, input -= s21;
s21 = (s << 2) + 4;
if (s21 <= input)
s += 2, input -= s21;
s21 = (s << 1) + 1;
if (s21 <= input)
s++;
return s;
}
#include <limits.h>
// This one is REALLY bad for large inputs.
unsigned long isqrt00(unsigned long x)
{
unsigned long s = 1;
unsigned long remainder;
if (x <= 1)
return x;
remainder = x;
while (remainder >= s) {
remainder -= s;
s += 2;
}
return (s - 1) >> 1;
}
// Integer Square Root function
// Uses bisection to find square root
unsigned long isqrt01(unsigned long x)
{
unsigned long r,
s,
t;
r = 0;
for (t = 0x40000000; t; t >>= 2) {
s = r + t;
if (s <= x) {
x -= s;
r = s + t;
} r >>= 1;
} return (r);
}
#define LSIZE (sizeof(unsigned long)*CHAR_BIT)
/* Jos came up with this one... */
unsigned long isqrt06(unsigned long n)
{
unsigned long l = 0, c = 0, y, k;
for (k = LSIZE >> 1; k--; n <<= 2)
{
l = (l << 2) | (n >> (LSIZE - 2) & 0x03);
y = (c += c) << 1;
if (l > y)
{
c++;
l -= y + 1;
}
}
return c;
}
// Integer Square Root function
// Uses bisection to find square root
unsigned long isqrt03(unsigned long y)
{
unsigned long x = 32768,
xsq = x * x,
oldx;
int i;
for (i = 14; i >= 0; i--)
if (y < xsq) {
oldx = x;
x -= 1 << i;
xsq -= (x + oldx) << i;
} else {
oldx = x;
x += 1 << i;
xsq += (x + oldx) << i;
}
if (y < xsq)
x--;
return (x);
}
unsigned long isqrt04(unsigned long x)
{
if (x == 0)
return 0;
else if (x == 1)
return 1;
else {
unsigned long r = x >> 1;
unsigned long q;
for (;;) {
q = x / r;
if (q >= r)
return r;
else
r = (r + q) >> 1;
}
}
}
// Integer Square Root function
// Uses bisection to find square root
unsigned long isqrt05(unsigned long n)
{
unsigned long h,
p = 0,
q = 1,
r = n;
while (q <= n)
q <<= 2;
while (q != 1) {
q >>= 2;
h = p + q;
p >>= 1;
if (r >= h) {
p += q;
r -= h;
}
}
return p;
}
// Binomial Theorem -- O( 1/2 log2 N )
unsigned long isqrt07(unsigned long N)
{
unsigned long l2,
u,
v,
u2,
v2,
uv2,
n;
if (2 > N)
return (N);
u = N;
l2 = 0;
while (u >>= 1)
l2++;
l2 >>= 1;
u = 1L << l2;
u2 = u << l2;
while (l2--) {
v = 1L << l2;
v2 = v << l2;
uv2 = u << (l2 + 1);
n = u2 + uv2 + v2;
if (n <= N) {
u += v;
u2 = n;
}
}
return (u);
}
// Integer Square Root function
// Uses factoring to find square root
// A 256 entry table used to work out the square root of the 7 or 8
most
// significant bits. A power of 2 used to approximate the rest.
// Based on an 80386 Assembly implementation by Arne Steinarson
unsigned const char sqq_table[] =
{
0, 16, 22, 27, 32, 35, 39, 42, 45, 48, 50, 53, 55, 57, 59, 61, 64,
65,
67, 69, 71, 73, 75, 76, 78, 80, 81, 83, 84, 86, 87, 89, 90, 91,
93, 94,
96, 97, 98, 99, 101, 102, 103, 104, 106, 107, 108, 109, 110, 112,
113,
114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126,
128,
128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140,
141,
142, 143, 144, 144, 145, 146, 147, 148, 149, 150, 150, 151, 152,
153,
154, 155, 155, 156, 157, 158, 159, 160, 160, 161, 162, 163, 163,
164,
165, 166, 167, 167, 168, 169, 170, 170, 171, 172, 173, 173, 174,
175,
176, 176, 177, 178, 178, 179, 180, 181, 181, 182, 183, 183, 184,
185,
185, 186, 187, 187, 188, 189, 189, 190, 191, 192, 192, 193, 193,
194,
195, 195, 196, 197, 197, 198, 199, 199, 200, 201, 201, 202, 203,
203,
204, 204, 205, 206, 206, 207, 208, 208, 209, 209, 210, 211, 211,
212,
212, 213, 214, 214, 215, 215, 216, 217, 217, 218, 218, 219, 219,
220,
221, 221, 222, 222, 223, 224, 224, 225, 225, 226, 226, 227, 227,
228,
229, 229, 230, 230, 231, 231, 232, 232, 233, 234, 234, 235, 235,
236,
236, 237, 237, 238, 238, 239, 240, 240, 241, 241, 242, 242, 243,
243,
244, 244, 245, 245, 246, 246, 247, 247, 248, 248, 249, 249, 250,
250,
251, 251, 252, 252, 253, 253, 254, 254, 255
};
// Really fast, but not real accurate.
unsigned long isqrt08(unsigned long n)
{
if (n >= 0x10000)
if (n >= 0x1000000)
if (n >= 0x10000000)
if (n >= 0x40000000)
return (sqq_table[n >> 24] << 8);
else
return (sqq_table[n >> 22] << 7);
else if (n >= 0x4000000)
return (sqq_table[n >> 20] << 6);
else
return (sqq_table[n >> 18] << 5);
else if (n >= 0x100000)
if (n >= 0x400000)
return (sqq_table[n >> 16] << 4);
else
return (sqq_table[n >> 14] << 3);
else if (n >= 0x40000)
return (sqq_table[n >> 12] << 2);
else
return (sqq_table[n >> 10] << 1);
else if (n >= 0x100)
if (n >= 0x1000)
if (n >= 0x4000)
return (sqq_table[n >> 8]);
else
return (sqq_table[n >> 6] >> 1);
else if (n >= 0x400)
return (sqq_table[n >> 4] >> 2);
else
return (sqq_table[n >> 2] >> 3);
else if (n >= 0x10)
if (n >= 0x40)
return (sqq_table[n] >> 4);
else
return (sqq_table[n << 2] << 5);
else if (n >= 0x4)
return (sqq_table[n >> 4] << 6);
else
return (sqq_table[n >> 6] << 7);
}
extern unsigned long sq_table[];
unsigned long isqrt09(unsigned long n)
{
unsigned long result;
if (n >= 0x10000)
if (n >= 0x10000000)
if (n >= 0x40000000)
result = sq_table[n >> 16];
else
result = sq_table[n >> 14] >> 1;
else if (n >= 0x1000000)
result = sq_table[n >> 12] >> 2;
else
result = sq_table[n >> 8] >> 4;
else
return (sq_table[n] >> 8);
if (((result + 1) * (result + 1)) <= n)
result++; // minor correction for numbers > 2^16
// can be removed to get almost 2 times speed up, but accuracy
will suffer.
// however worst case will be out by 1.
return (result);
}
/* Fast integer square root */
unsigned long isqrt10(unsigned long x)
{
unsigned long squaredbit,
remainder,
root;
if (x == 0)
return 0;
/* Load the binary constant 01 00 00 ... 00, where the number of
zero
* bits to the right of the single one bit is even, and the one
bit is as
* far left as is consistant with that condition.) */
squaredbit = (long) ((((unsigned long) ~0L) >> 1) &
~(((unsigned long) ~0L) >> 2));
/* This portable load replaces the loop that used to be here, and
was
* donated by lega...@xmission.com */
/* Form bits of the answer. */
remainder = x;
root = 0;
while (squaredbit > 0) {
if (remainder >= (squaredbit | root)) {
remainder -= (squaredbit | root);
root >>= 1;
root |= squaredbit;
} else {
root >>= 1;
}
squaredbit >>= 2;
}
return root;
}
#define Ns ((sizeof(unsigned long)) << 2)
unsigned long isqrt11(unsigned long Y)
{
unsigned long C,
D,
X,
X4p1;
int i;
X = C = 0;
for (i = 1; i <= Ns + 1; i++) {
D = (Y >> ((Ns - i) << 1)) & 3;
X <<= 1;
X4p1 = (X << 1) + 1;
C = (C << 2) + D;
if (C >= X4p1) {
C -= X4p1;
X++;
}
}
return (X >> 1) + (X & 1);
}
/*
*
* Find square root of n. This is a Newton's method
approximation,
* converging in 1 iteration per digit or so. Finds floor(sqrt(n) +
0.5).
*/
unsigned long isqrt12(unsigned long n)
{
unsigned long i;
unsigned long k0,
k1,
nn;
for (nn = i = n, k0 = 2; i > 0; i >>= 2, k0 <<= 1) {;
}
nn <<= 2;
for (;;) {
k1 = (nn / k0 + k0) >> 1;
if (((k0 ^ k1) & ~1) == 0)
break;
k0 = k1;
}
return (unsigned long) ((k1 + 1) >> 1);
}
/* #include <math.h> */
/*
* It requires more space to describe this implementation of the
manual
* square root algorithm than it did to code it. The basic idea is
that
* the square root is computed one bit at a time from the high end.
Because
* the original number is 32 bits (unsigned), the root cannot exceed
16 bits
* in length, so we start with the 0x8000 bit.
*
* Let "x" be the value whose root we desire, "t" be the square root
* that we desire, and "s" be a bitmask. A simple way to compute
* the root is to set "s" to 0x8000, and loop doing the following:
*
* t = 0;
* s = 0x8000;
* do {
* if ((t + s) * (t + s) <= x)
* t += s;
* s >>= 1;
* while (s != 0);
*
* The primary disadvantage to this approach is the multiplication.
To
* eliminate this, we begin simplying. First, we observe that
*
* (t + s) * (t + s) == (t * t) + (2 * t * s) + (s * s)
*
* Therefore, if we redefine "x" to be the original argument minus the
* current value of (t * t), we can determine if we should add "s" to
* the root if
*
* (2 * t * s) + (s * s) <= x
*
* If we define a new temporary "nr", we can express this as
*
* t = 0;
* s = 0x8000;
* do {
* nr = (2 * t * s) + (s * s);
* if (nr <= x) {
* x -= nr;
* t += s;
* }
* s >>= 1;
* while (s != 0);
*
* We can improve the performance of this by noting that "s" is always
a
* power of two, so multiplication by "s" is just a shift. Also,
because
* "s" changes in a predictable manner (shifted right after each
iteration)
* we can precompute (0x8000 * t) and (0x8000 * 0x8000) and then
adjust
* them by shifting after each step. First, we let "m" hold the value
* (s * s) and adjust it after each step by shifting right twice. We
* also introduce "r" to hold (2 * t * s) and adjust it after each
step
* by shifting right once. When we update "t" we must also update
"r",
* and we do so by noting that (2 * (old_t + s) * s) is the same as
* (2 * old_t * s) + (2 * s * s). Noting that (s * s) is "m" and that
* (r + 2 * m) == ((r + m) + m) == (nr + m):
*
* t = 0;
* s = 0x8000;
* m = 0x40000000;
* r = 0;
* do {
* nr = r + m;
* if (nr <= x) {
* x -= nr;
* t += s;
* r = nr + m;
* }
* s >>= 1;
* r >>= 1;
* m >>= 2;
* } while (s != 0);
*
* Finally, we note that, if we were using fractional arithmetic,
after
* 16 iterations "s" would be a binary 0.5, so the value of "r" when
* the loop terminates is (2 * t * 0.5) or "t". Because the values in
* "t" and "r" are identical after the loop terminates, and because we
* do not otherwise use "t" explicitly within the loop, we can omit
it.
* When we do so, there is no need for "s" except to terminate the
loop,
* but we observe that "m" will become zero at the same time as "s",
* so we can use it instead.
*
* The result we have at this point is the floor of the square root.
If
* we want to round to the nearest integer, we need to consider
whether
* the remainder in "x" is greater than or equal to the difference
* between ((r + 0.5) * (r + 0.5)) and (r * r). Noting that the
former
* quantity is (r * r + r + 0.25), we want to check if the remainder
is
* greater than or equal to (r + 0.25). Because we are dealing with
* integers, we can't have equality, so we round up if "x" is strictly
* greater than "r":
*
* if (x > r)
* r++;
*/
unsigned long isqrt02(unsigned long x)
{
unsigned long r,
nr,
m;
r = 0;
m = 0x40000000;
do {
nr = r + m;
if (nr <= x) {
x -= nr;
r = nr + m;
}
r >>= 1;
m >>= 2;
} while (m != 0);
if (x > r)
r++;
return (r);
}
unsigned long isqrt14(unsigned long input)
{
unsigned long s;
unsigned long s21;
long j;
s = 0;
for (j = 15; j >= 0; --j) {
s21 = (s << (j + 1)) + (1 << (j + j));
if (s21 <= input) {
s += (1 << j);
input -= s21;
}
}
return s;
}
// Newton's Method -- O( log2 N )
unsigned long isqrt15(unsigned long N)
{
unsigned long n=0,
p=0,
low,
high;
if (2 > N)
return (N);
low = 0;
high = N;
while (high > low + 1) {
n = (high + low) / 2;
p = n * n;
if (N < p)
high = n;
else if (N > p)
low = n;
else
break;
}
return (N == p ? n : low);
}
// Summing terms -- O( sqrt N )
unsigned long isqrt16(unsigned long N)
{
unsigned long n,
u,
v;
if (2 > N)
return (N);
u = 4;
v = 5;
for (n = 1; u <= N; n++) {
u += v;
v += 2;
}
return (n);
}
// Binomial Theorem -- O( 1/2 log2 N )
unsigned long isqrt17(unsigned long N)
{
unsigned long l2,
u,
v,
u2,
v2,
uv2,
n;
if (2 > N)
return (N);
u = N;
l2 = 0;
while (u >>= 1)
l2++;
l2 >>= 1;
u = 1L << l2;
u2 = u << l2;
while (l2--) {
v = 1L << l2;
v2 = v << l2;
uv2 = u << (l2 + 1);
n = u2 + uv2 + v2;
if (n <= N) {
u += v;
u2 = n;
}
}
return (u);
}
// Optimized Binomial Theorem
unsigned long isqrt18(unsigned long N)
{
unsigned long l2,
u,
v,
u2,
n;
if (2 > N)
return (N);
u = N;
l2 = 0;
while (u >>= 2)
l2++;
u = 1L << l2;
v = u;
u2 = u << l2;
while (l2--) {
v >>= 1;
n = (u + u + v) << l2;
n += u2;
if (n <= N) {
u += v;
u2 = n;
}
}
return (u);
}
#define BITSPERLONG 32
#define TOP2BITS(x) ((x & (3L << (BITSPERLONG-2))) >> (BITSPERLONG-2))
/*
ALGORITHM
The calculations are the base-two analogue of the square
root algorithm we all learned in grammar school. Since we're
in base 2, there is only one nontrivial trial multiplier.
Notice that absolutely no multiplications or divisions are
performed.
This means it'll be fast on a wide range of processors.
*/
unsigned long isqrt19(unsigned long x)
{
unsigned long a = 0L; /* accumulator */
unsigned long r = 0L; /* remainder */
unsigned long e = 0L; /* trial product */
int i;
for (i = 0; i < BITSPERLONG / 2; i++) {
r = (r << 2) + TOP2BITS(x);
x <<= 2;
a <<= 1;
e = (a << 1) + 1;
if (r >= e) {
r -= e;
a++;
}
}
return a;
}
/* James C Hu [j...@cs.wustl.edu] sent this one: */
unsigned long isqrt20(unsigned long a)
{
unsigned long x = a;
if (a > 1) {
unsigned long xx;
register int l = 0;
while (x >>= 2)
l++;
x = a / (1 << l);
/* Variation of Newton-Raphson */
while (x > (xx = a / x)) {
int dx = (x - xx) / 2;
if (dx == 0)
x--;
else
x -= dx;
}
}
return x;
}
unsigned long isqrt21(unsigned long x)
{
unsigned long xn,
xs;
if (x <= 1)
return (x);
xn = x / 128; // First guess: Xo = x>>5
if (xn == 0)
xn = 1; // Prevent divide by 0
do {
xs = xn; // Save present Xn
xn = (xn + x / xn) / 2; // Compiler outputs right shifts for /
2^n
} while (xn - xs > 1);
return (xn);
}
unsigned long isqrt22(unsigned long x)
{
unsigned long a = 1; /* Thats a number not a lowercase 'L'
*/
unsigned long b = x;
while (b) {
b >>= 2;
a <<= 1;
} b = a;
a >>= 1;
/* At this point a is the closest power of 2 less than the
resultant
* square root, and b is just above. Run whatever technique you
like from
* here. */
b = (a + b + 1) / 2;
do {
a = x / b;
b = (a + b) / 2;
} while ((a > b) ? ((a - b) > 1) : ((b - a) > 1));
if (a != b) {
a = x / b;
b = (a + b) / 2;
} /* result in B */
return b;
}
/* Integer square root by Halleck's method, with Legalize's speedup */
unsigned long isqrt23(unsigned long x)
{
long squaredbit,
remainder,
root;
if (x < 1)
return 0;
/* Load the binary constant 01 00 00 ... 00, where the number of
zero
* bits to the right of the single one bit is even, and the one
bit is as
* far left as is consistant with that condition.) */
squaredbit = (long) ((((unsigned long) ~0L) >> 1) &
~(((unsigned long) ~0L) >> 2));
/* This portable load replaces the loop that used to be here, and
was
* donated by lega...@xmission.com */
/* Form bits of the answer. */
remainder = x;
root = 0;
while (squaredbit > 0) {
if (remainder >= (squaredbit | root)) {
remainder -= (squaredbit | root);
root >>= 1;
root |= squaredbit;
} else {
root >>= 1;
}
squaredbit >>= 2;
}
return root;
}
#define BITS 32
// BITS is the smallest even integer such that 2^BITS > ULONG_MAX.
// If you port this routine to a computer where 2^(BITS-1) > ULONG_MAX
// (Unisys A-series, for example, where integers would generally be
// stored in the 39-bit mantissa of a 48-bit word), unwind the first
// iteration of the loop, and special-case references to 'half' in
that
// first iteration so as to avoid overflow.
unsigned long isqrt24(unsigned long n)
{
// Compute the integer square root of the integer argument n
// Method is to divide n by x computing the quotient x and
remainder r
// Notice that the divisor x is changing as the quotient x changes
// Instead of shifting the dividend/remainder left, we shift the
// quotient/divisor right. The binary point starts at the extreme
// left, and shifts two bits at a time to the extreme right.
// The residue contains n-x^2. (Within these comments, the ^
operator
// signifies exponentiation rather than exclusive or. Also, the /
// operator returns fractions, rather than truncating, so 1/4
means
// one fourth, not zero.) // Since (x + 1/2)^2 == x^2 + x +
1/4,
// n - (x + 1/2)^2 == (n - x^2) - (x + 1/4)
// Thus, we can increase x by 1/2 if we decrease (n-x^2) by (x
+1/4)
unsigned long residue; // n - x^2
unsigned long root; // x + 1/4
unsigned long half; // 1/2
residue = n; // n - (x = 0)^2, with suitable
alignment
root = 1 << (BITS - 2); // x + 1/4, shifted left BITS places
half = root + root; // 1/2, shifted left BITS places
do {
if (root <= residue) { // Whenever we can,
residue -= root; // decrease (n-x^2) by (x+1/4)
root += half;
} // increase x by 1/2
half >>= 2; // Shift binary point 2 places right
root -= half; // x{+1/2}+1/4 - 1/8 == x{+1/2}+1/8
root >>= 1; // 2x{+1}+1/4, shifted right 2 places
} while (half); // When 1/2 == 0, binary point is at
far
// right
// Omit the following line to truncate instead of rounding
if (root < residue)
++root;
return root; // Guaranteed to be correctly rounded
(or
// truncated)
}
/* Length of root (1/2 len of input) */
#define LEN 16
/* intsqr(n)
** A fixed-point integer square root based on the
** algorithm proposed by John Von Neuman in the first
** draft of the EDVAC report.
** time = 118+(66*LEN)+(18*number_of_1s_in_root)
** = 1174 - 1462 cycles.
** This routine was provided by Mike Albaugh and came from
** John Van Neumann [or prior].
*/
/* The timing was for a 68K assembly tweaked version of the
* compiler output (MEA 30Jan1998)
*/
unsigned long isqrt25(unsigned long n)
{
unsigned long err;
unsigned long trial,
trix2;
unsigned long i;
trial = err = 0;
for (i = 0; i < LEN; i++) {
err = (err << 2) + (n >> 30); /* shift input up 2 bits into
error */
n <<= 2;
trial <<= 1;
trix2 = trial << 1; /* get trial and trial*2 */
if (err > trix2) {
trial++;
err -= (1 + trix2);
}
}
return (trial);
unsigned long isqrt27(unsigned long input)
{
unsigned long s = 0;
unsigned long s21 = 1073741824;
if (s21 <= input)
s |= 32768, input -= s21;
s21 = (s << 15) | 268435456;
if (s21 <= input)
s |= 16384, input -= s21;
s21 = (s << 14) | 67108864;
if (s21 <= input)
s |= 8192, input -= s21;
s21 = (s << 13) | 16777216;
if (s21 <= input)
s |= 4096, input -= s21;
s21 = (s << 12) | 4194304;
if (s21 <= input)
s |= 2048, input -= s21;
s21 = (s << 11) | 1048576;
if (s21 <= input)
s |= 1024, input -= s21;
s21 = (s << 10) | 262144;
if (s21 <= input)
s |= 512, input -= s21;
s21 = (s << 9) | 65536;
if (s21 <= input)
s |= 256, input -= s21;
s21 = (s << 8) | 16384;
if (s21 <= input)
s |= 128, input -= s21;
s21 = (s << 7) | 4096;
if (s21 <= input)
s |= 64, input -= s21;
s21 = (s << 6) | 1024;
if (s21 <= input)
s |= 32, input -= s21;
s21 = (s << 5) | 256;
if (s21 <= input)
s |= 16, input -= s21;
s21 = (s << 4) | 64;
if (s21 <= input)
s |= 8, input -= s21;
s21 = (s << 3) | 16;
if (s21 <= input)
s |= 4, input -= s21;
s21 = (s << 2) | 4;
if (s21 <= input)
s |= 2, input -= s21;
s21 = (s << 1) | 1;
if (s21 <= input)
s++;
return s;
}
unsigned long isqrt28(unsigned long input)
{
unsigned long s = 0;
unsigned long s21 = 1073741824;
if (s21 <= input)
s |= 32768, input -= s21;
s21 = (s << 15) + 268435456;
if (s21 <= input)
s |= 16384, input -= s21;
s21 = (s << 14) + 67108864;
if (s21 <= input)
s |= 8192, input -= s21;
s21 = (s << 13) + 16777216;
if (s21 <= input)
s |= 4096, input -= s21;
s21 = (s << 12) + 4194304;
if (s21 <= input)
s |= 2048, input -= s21;
s21 = (s << 11) + 1048576;
if (s21 <= input)
s |= 1024, input -= s21;
s21 = (s << 10) + 262144;
if (s21 <= input)
s |= 512, input -= s21;
s21 = (s << 9) + 65536;
if (s21 <= input)
s |= 256, input -= s21;
s21 = (s << 8) + 16384;
if (s21 <= input)
s |= 128, input -= s21;
s21 = (s << 7) + 4096;
if (s21 <= input)
s |= 64, input -= s21;
s21 = (s << 6) + 1024;
if (s21 <= input)
s |= 32, input -= s21;
s21 = (s << 5) + 256;
if (s21 <= input)
s |= 16, input -= s21;
s21 = (s << 4) + 64;
if (s21 <= input)
s |= 8, input -= s21;
s21 = (s << 3) + 16;
if (s21 <= input)
s |= 4, input -= s21;
s21 = (s << 2) + 4;
if (s21 <= input)
s |= 2, input -= s21;
s21 = (s << 1) + 1;
if (s21 <= input)
s++;
return s;
}
/* Log base 2 estimator */
int qlog2(unsigned long n)
{
register int i = (n & 0xffff0000) ? 16 : 0;
if ((n >>= i) & 0xff00)
i |= 8, n >>= 8;
if (n & 0xf0)
i |= 4, n >>= 4;
if (n & 0xc)
i |= 2, n >>= 2;
return (i | (n >> 1));
}
unsigned long isqrt29(unsigned long x)
{
unsigned long xn,
xs;
if (x <= 1)
return (x);
xn = x >> (qlog2(x) / 2); // First guess is x shifted by half
its
// highest bit position
do {
xs = xn; // Save present Xn
xn = (xn + x / xn) / 2; // Compiler outputs right shifts for /
2^n
} while (xn - xs > 1);
return (xn);
}
/*
* The simplest integer square root.
* Programmed by Jack W. Crenshaw, Crenshaw Technologies.
*/
unsigned long isqrt30(unsigned long a)
{
unsigned long x = 1;
while (x * x <= a)
++x;
return --x;
}
/* An improved version. Using no multiplications
* it's ideal for small microcontrollers. Feel
* free to change the word length as desired.
* Programmed by Jack W. Crenshaw, Crenshaw Technologies.
*/
unsigned long isqrt31(unsigned long a)
{
unsigned long x = 1;
unsigned long xsqr = 1;
unsigned long delta = 3;
while (xsqr <= a) {
++x;
xsqr += delta;
delta += 2;
}
return --x;
}
/* A non-iterative square root
* This function is the binary version of your
* familiar high-school algorithm. It works in
* fashion similar to the shift-and-subtract
* integer division.
* Programmed by Jack W. Crenshaw, Crenshaw Technologies.
*/
unsigned short isqrt32(unsigned long a)
{
unsigned long rem = 0;
unsigned long root = 0;
int i;
for (i = 0; i < 16; i++) {
root <<= 1;
rem = ((rem << 2) + (a >> 30));
a <<= 2;
root++;
if (root <= rem) {
rem -= root;
root++;
} else
root--;
}
return (unsigned short) (root >> 1);
}
/* From: http://www.azillionmonkeys.com/qed/sqroot.html */
/* Integer Square Root function. Contributors include Arne Steinarson
* for the basic approximation idea, Dann Corbit and Mathew Hendry for
* the first cut at the algorithm, Lawrence Kirby for the
rearrangement,
* improvments and range optimization and Paul Hsieh for the
* round-then-adjust idea.
*/
unsigned short isqrt34(unsigned long x)
{
static const unsigned char sqq_table[] = {
0, 16, 22, 27, 32, 35, 39, 42, 45, 48, 50, 53, 55, 57,
59, 61, 64, 65, 67, 69, 71, 73, 75, 76, 78, 80, 81, 83,
84, 86, 87, 89, 90, 91, 93, 94, 96, 97, 98, 99, 101, 102,
103, 104, 106, 107, 108, 109, 110, 112, 113, 114, 115, 116,
117, 118,
119, 120, 121, 122, 123, 124, 125, 126, 128, 128, 129, 130,
131, 132,
133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144,
144, 145,
146, 147, 148, 149, 150, 150, 151, 152, 153, 154, 155, 155,
156, 157,
158, 159, 160, 160, 161, 162, 163, 163, 164, 165, 166, 167,
167, 168,
169, 170, 170, 171, 172, 173, 173, 174, 175, 176, 176, 177,
178, 178,
179, 180, 181, 181, 182, 183, 183, 184, 185, 185, 186, 187,
187, 188,
189, 189, 190, 191, 192, 192, 193, 193, 194, 195, 195, 196,
197, 197,
198, 199, 199, 200, 201, 201, 202, 203, 203, 204, 204, 205,
206, 206,
207, 208, 208, 209, 209, 210, 211, 211, 212, 212, 213, 214,
214, 215,
215, 216, 217, 217, 218, 218, 219, 219, 220, 221, 221, 222,
222, 223,
224, 224, 225, 225, 226, 226, 227, 227, 228, 229, 229, 230,
230, 231,
231, 232, 232, 233, 234, 234, 235, 235, 236, 236, 237, 237,
238, 238,
239, 240, 240, 241, 241, 242, 242, 243, 243, 244, 244, 245,
245, 246,
246, 247, 247, 248, 248, 249, 249, 250, 250, 251, 251, 252,
252, 253,
253, 254, 254, 255
};
unsigned long xn;
if (x >= 0x10000)
if (x >= 0x1000000)
if (x >= 0x10000000)
if (x >= 0x40000000) {
if (x >= 65535UL * 65535UL)
return 65535;
xn = sqq_table[x >> 24] << 8;
} else
xn = sqq_table[x >> 22] << 7;
else if (x >= 0x4000000)
xn = sqq_table[x >> 20] << 6;
else
xn = sqq_table[x >> 18] << 5;
else {
if (x >= 0x100000)
if (x >= 0x400000)
xn = sqq_table[x >> 16] << 4;
else
xn = sqq_table[x >> 14] << 3;
else if (x >= 0x40000)
xn = sqq_table[x >> 12] << 2;
else
xn = sqq_table[x >> 10] << 1;
goto nr1;
}
else if (x >= 0x100) {
if (x >= 0x1000)
if (x >= 0x4000)
xn = (sqq_table[x >> 8] >> 0) + 1;
else
xn = (sqq_table[x >> 6] >> 1) + 1;
else if (x >= 0x400)
xn = (sqq_table[x >> 4] >> 2) + 1;
else
xn = (sqq_table[x >> 2] >> 3) + 1;
goto adj;
} else
return sqq_table[x] >> 4;
/* Run two iterations of the standard convergence formula */
xn = (xn + 1 + x / xn) / 2;
nr1:
xn = (xn + 1 + x / xn) / 2;
adj:
if (xn * xn > x) /* Correct rounding if necessary */
xn--;
return xn;
}
/* From: http://www.azillionmonkeys.com/qed/sqroot.html */
/* Let us examine an implementation of the second method. By Jim Ulery
*/
unsigned short isqrt36(unsigned long val)
{
unsigned long temp,
g = 0,
b = 0x8000,
bshft = 15;
do {
if (val >= (temp = (((g << 1) + b) << bshft--))) {
g += b;
val -= temp;
}
} while (b >>= 1);
return g;
}
/* From: http://www.azillionmonkeys.com/qed/sqroot.html */
/* Here the division step is replaced with a 1-bit approximation
* of that division. Jim Ulery's complete derivation is given here.
* But there are still 15 branches here that will not be predicted
* very well, as well as loop overhead. So unrolling is the first
* improvement we can make. By Mark Crowne
*/
unsigned short isqrt37(unsigned long val)
{
unsigned int temp,
g = 0;
if (val >= 0x40000000) {
g = 0x8000;
val -= 0x40000000;
}
#define INNER_ISQRT(s) \
temp = (g << (s)) + (1 << ((s) * 2 - 2)); \
if (val >= temp) { \
g += 1 << ((s)-1); \
val -= temp; \
}
INNER_ISQRT(15)
INNER_ISQRT(14)
INNER_ISQRT(13)
INNER_ISQRT(12)
INNER_ISQRT(11)
INNER_ISQRT(10)
INNER_ISQRT(9)
INNER_ISQRT(8)
INNER_ISQRT(7)
INNER_ISQRT(6)
INNER_ISQRT(5)
INNER_ISQRT(4)
INNER_ISQRT(3)
INNER_ISQRT(2)
#undef INNER_ISQRT
temp = g + g + 1;
if (val >= temp)
g++;
return g;
}
#ifdef TEST
#include <stdio.h>
#include <math.h>
int main(void)
{
unsigned long l;
unsigned long m=0;
unsigned long limit = 4000000000;
double answer;
for (l = 0; l < limit; l+= 1000) {
#ifdef TEST_REAL_SLOW_ONES_TOO
m += isqrt00(l);
#endif
m += isqrt01(l);
m += isqrt02(l);
m += isqrt03(l);
#ifdef TEST_REAL_SLOW_ONES_TOO
m += isqrt04(l);
if (l <= 100000000)
m += isqrt05(l); /* fails for large values */
#endif
m += isqrt06(l);
m += isqrt07(l);
m += isqrt08(l);
#ifdef TEST_REAL_SLOW_ONES_TOO
m += isqrt09(l);
#endif
m += isqrt10(l);
m += isqrt11(l);
m += isqrt12(l);
m += isqrt18(l);
m += isqrt14(l);
#if 0
if (l <= 200000000)
m += isqrt15(l);
#endif
#ifdef TEST_REAL_SLOW_ONES_TOO
m += isqrt16(l);
#endif
m += isqrt17(l);
m += isqrt19(l);
m += isqrt20(l);
#ifdef TEST_REAL_SLOW_ONES_TOO
m += isqrt21(l);
#endif
m += isqrt22(l);
m += isqrt23(l);
m += isqrt24(l);
m += isqrt25(l);
m += isqrt26(l);
m += isqrt27(l);
m += isqrt28(l);
m += isqrt29(l);
#ifdef TEST_REAL_SLOW_ONES_TOO
m += isqrt30(l);
m += isqrt31(l);
#endif
m += isqrt32(l);
m += isqrt34(l);
m += isqrt36(l);
}
printf("trying to trick the optimizer=%lu ", m);
#if 0
for (l = 0; l < 100000000; l += 1000000) {
printf("isqrt00(l)=%ld ", isqrt00(l));
printf("isqrt01(l)=%ld ", isqrt01(l));
printf("isqrt02(l)=%ld ", isqrt02(l));
printf("isqrt03(l)=%ld ", isqrt03(l));
printf("isqrt04(l)=%ld ", isqrt04(l));
if (l <= 100000000)
printf("isqrt05(l)=%ld ", isqrt05(l));
printf("isqrt06(l)=%ld ", isqrt06(l));
printf("isqrt07(l)=%ld ", isqrt07(l));
printf("isqrt08(l)=%ld ", isqrt08(l));
printf("isqrt09(l)=%ld\n", isqrt09(l));
printf("isqrt10(l)=%ld ", isqrt10(l));
printf("isqrt11(l)=%ld ", isqrt11(l));
printf("isqrt12(l)=%ld ", isqrt12(l));
printf("isqrt18(l)=%ld ", isqrt18(l));
printf("isqrt14(l)=%ld ", isqrt14(l));
if (l <= 200000000)
printf("isqrt15(l)=%ld ", isqrt15(l));
printf("isqrt16(l)=%ld ", isqrt16(l));
printf("isqrt17(l)=%ld ", isqrt17(l));
printf("isqrt19(l)=%ld\n", isqrt19(l));
printf("isqrt20(l)=%ld ", isqrt20(l));
printf("isqrt21(l)=%ld ", isqrt21(l));
printf("isqrt22(l)=%ld ", isqrt22(l));
printf("isqrt23(l)=%ld ", isqrt23(l));
printf("isqrt24(l)=%ld ", isqrt24(l));
printf("isqrt25(l)=%ld ", isqrt25(l));
printf("isqrt26(l)=%ld ", isqrt26(l));
printf("isqrt27(l)=%ld ", isqrt27(l));
printf("isqrt28(l)=%ld ", isqrt28(l));
printf("isqrt29(l)=%ld\n", isqrt29(l));
printf("isqrt30(l)=%ld\n", isqrt30(l));
printf("isqrt31(l)=%ld\n", isqrt31(l));
printf("isqrt32(l)=%ld\n", isqrt32(l));
printf("isqrt34(l)=%ld\n", isqrt34(l));
printf("isqrt36(l)=%ld\n", isqrt36(l));
printf("Answer is %.2f\n", sqrt((double) l));
}
#endif
for (l = 0; l < 100000000; l += 1000000) {
answer = sqrt((double) l);
#ifdef TEST_REAL_SLOW_ONES_TOO
if ( 1 < fabs(answer - isqrt00(l)))puts("00");
#endif
if ( 1 < fabs(answer - isqrt01(l)))puts("01");
if ( 1 < fabs(answer - isqrt02(l)))puts("02");
if ( 1 < fabs(answer - isqrt03(l)))puts("03");
#ifdef TEST_REAL_SLOW_ONES_TOO
if ( 1 < fabs(answer - isqrt04(l)))puts("04");
#endif
if ( 1 < fabs(answer - isqrt06(l)))puts("06");
if ( 1 < fabs(answer - isqrt07(l)))puts("07");
if ( 1 < fabs(answer - isqrt08(l)))puts("08");
#ifdef TEST_REAL_SLOW_ONES_TOO
if ( 1 < fabs(answer - isqrt09(l)))puts("09");
#endif
if ( 1 < fabs(answer - isqrt10(l)))puts("10");
if ( 1 < fabs(answer - isqrt11(l)))puts("11");
if ( 1 < fabs(answer - isqrt12(l)))puts("12");
if ( 1 < fabs(answer - isqrt18(l)))puts("18");
if ( 1 < fabs(answer - isqrt14(l)))puts("14");
#ifdef TEST_REAL_SLOW_ONES_TOO
if ( 1 < fabs(answer - isqrt15(l)))puts("15");
if ( 1 < fabs(answer - isqrt16(l)))puts("16");
#endif
if ( 1 < fabs(answer - isqrt17(l)))puts("17");
if ( 1 < fabs(answer - isqrt19(l)))puts("19");
if ( 1 < fabs(answer - isqrt20(l)))puts("20");
#ifdef TEST_REAL_SLOW_ONES_TOO
if ( 1 < fabs(answer - isqrt21(l)))puts("21");
#endif
if ( 1 < fabs(answer - isqrt22(l)))puts("22");
if ( 1 < fabs(answer - isqrt23(l)))puts("23");
if ( 1 < fabs(answer - isqrt24(l)))puts("24");
if ( 1 < fabs(answer - isqrt25(l)))puts("25");
if ( 1 < fabs(answer - isqrt26(l)))puts("26");
if ( 1 < fabs(answer - isqrt27(l)))puts("27");
if ( 1 < fabs(answer - isqrt28(l)))puts("28");
if ( 1 < fabs(answer - isqrt29(l)))puts("29");
#ifdef TEST_REAL_SLOW_ONES_TOO
if ( 1 < fabs(answer - isqrt30(l)))puts("30");
if ( 1 < fabs(answer - isqrt31(l)))puts("31");
#endif
if ( 1 < fabs(answer - isqrt32(l)))puts("32");
if ( 1 < fabs(answer - isqrt34(l)))puts("34");
if ( 1 < fabs(answer - isqrt36(l)))puts("36");
printf("Answer is %.2f\n", sqrt((double) l));
}
return 0;
}
#endif
/*
Function Timer % (323) Timer events (323)
isqrt08 0.19 12,000 {Only calculates an approximation}
qlog2 0.27 17,000 {called by isqrt29()}
isqrt26 0.97 60,000
isqrt27 1.03 64,000
isqrt28 1.36 84,000
isqrt03 2.58 160,000
isqrt23 2.60 161,000
isqrt24 2.79 173,000
isqrt34 2.89 179,000
isqrt36 2.89 179,000
isqrt01 2.99 185,000
isqrt10 3.02 187,000
main 3.03 188,000
isqrt25 3.28 203,000
isqrt02 3.36 208,000
isqrt19 3.55 220,000
isqrt18 3.55 220,000
isqrt14 4.13 256,000
isqrt32 4.23 262,000
isqrt06 4.46 276,000
isqrt11 5.23 324,000
isqrt29 5.70 353,000
isqrt07 5.81 360,000
isqrt17 5.86 363,000
isqrt22 6.76 419,000
isqrt12 8.54 529,000
isqrt20 8.93 553,000
*/
Depends on needs. Probably your best bet is Newton approximation:
Rn+1 = 0.5 * (Rn + N/Rn)
and repeat until the error is negligible. Start with R == 1.0.
--
[mail]: Chuck F (cbfalconer at maineline dot net)
[page]: <http://cbfalconer.home.att.net>
Try the download section.
** Posted from http://www.teranews.com **
Take a look here:
http://www.azillionmonkeys.com/qed/sqroot.html
A zillion monkeys can't all be wrong.
Mark Borgerson
Mark Borgerson
Newton-Raphson for isqrt can be done with only integer arithemtic.....
Please show me... I have to take roots of 64 bit integers on a 32 bit
CPU and I currently do the two bit per iteration method. I can do an
educated guess via tables, and the significant bit-doubling of the
Newton-Raphosn method would be great.
However, my attempts at doing reciprocal square-roots in fixed-point
always failed and never reached the exact result.
But maybe i'm just to dumb to do it..
> "David T. Ashley" wrote:
> >
> > What is the best integer square root algorithm for a processor with
> > MUL and DIV (integer multiplication and division) built-in?
>
> Depends on needs. Probably your best bet is Newton approximation:
>
> Rn+1 = 0.5 * (Rn + N/Rn)
>
> and repeat until the error is negligible. Start with R == 1.0.
If it is using integer multiplication and division shouldn't it be
Rn+1 = (Rn + N/Rn) / 2
or even
Rn+1 = (Rn*Rn + N) / (2 * Rn)
But if you need even only one division per calculated root
this will in most if not all cases make the algorithm slower
than mine (which uses just as many mutiplications as there
are bits in the result, I posted an example earlier to this
thread - http://groups.google.com/group/comp.arch.embedded/msg/0d604a471d31350b?dmode=source
)
Dimiter
------------------------------------------------------
Dimiter Popoff Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/sets/72157600228621276/
Original message: http://groups.google.com/group/comp.arch.embedded/msg/56d0896b27ee17b8?dmode=source
> Virgil wrote:
> > ...
> >
> > If it is using integer multiplication and division shouldn't it be
> >
> > Rn+1 = (Rn + N/Rn) / 2
> >
> > or even
> >
> > Rn+1 = (Rn*Rn + N) / (2 * Rn)
>
> But if you need even only one division per calculated root
> this will in most if not all cases make the algorithm slower
> than mine (which uses just as many mutiplications as there
> are bits in the result
The formula that I objected to involved a multiplication by a
non-integer in a supposedly integer only operation.
Admittedly a multiplication by 0.5 is easy enough as a binary right
shift, but that is not what indicated.
> What is the best integer square root algorithm for a processor with MUL and
> DIV (integer multiplication and division) built-in?
>
> For one without MUL and DIV (or where the instructions operate on shorter
> operands than would be required to form the square in question)?
>
> Thanks for all ideas and thoughts.
>
> I will check TOACP, of course.
>
> The application is a low-cost microcontroller with a 3-axis accelerometer.
> To get the magnitude of the acceleration, we'd want to use
> sqrt(x^2+y^2+z^2). The squaring operations are cheap (MUL). The addition
> is cheap (ADD, ADC). The square root extraction is what I'm not sure how
> best to do.
>
> The algorithm that comes to mind is just to do a binary search by squaring
> potential solutions and comparing them to the quantity whose square root is
> to be extracted.
>
> For example, if the sum of the squares is 16 bits or less, and since the
> processor has a 8 x 8 = 16 MUL instruction, it should be possible to iterate
> to the solution in no more than 16 square/compare cycles.
>
> But maybe there is a better way?
>
> And it isn't clear what to do if the sum of the squares is a bit larger.
>
> Thanks for all.
Here's an algorithm for finding r = sqrt(x^2 + y^2).
Assume x > y > 0. Let
x_1 = (4 x^2 + 3)x / (4 x^2 + y^2)
y_1 = y^3/(4 x^2 + y^2).
Then x_1^2 + y_1^2 = x^2 + y^2. If the transformation x,y) \mapsto
(x_1, y_1) is iterated, then the point converges *cubically* to (r, 0).
I read this in a magazine many years ago.
Adapting it to your problem, I would first find r = sqrt(x^2 + y^2), then
s = sqrt(r^2 + z^2).
I would use scaled integer arithmetic, that is, represent x etc. by
j_x/D etc. where j_x is an integer and I choose D to be some convenient
common denominator like 65536.
--
Christopher J. Henrich
chen...@monmouth.com
htp://www.mathinteract.com
> Paul E. Bennett wrote:
>> ...
>> See:-
>>
<http://groups.google.co.uk/group/comp.lang.forth/browse_thread/thread/67130b2beb453d93/51fa6f17a670fe31?hl=en&lnk=st&q=Forth+Square+Root#51fa6f17a670fe31>
>>
>
> Well my algorithm takes less lines and is a lot easier to read than
> those
> in HLL which were suggested, but I hope it counts in spite of
> that :-).
> Here is it excerpted from a post I have made many years ago,
> calculates a 16 bit squre root of a 32 bit number.
> The example is in CPU32 (or CF) assembly.
If you had focused on the first part;
<quote>
A classical Forth puzzle is: What does the following do?
: ???? -1 TUCK DO 2 + DUP +LOOP 2/ ;
It is a standard method of calculating square root by summing
the odd integers. It works for all unsigned integers from 0 to
FFFFFFFF (or FFFF).
</quote>
You would have realised that the definition ???? is actually a sqrt by
the method stated. Even in the more spread out style used by many of us
these days this would only have been 5 lines at most.
Are you claiming that your unreadable 5 line algorithm which uses
summing somehow (I am not able to read that line and from what I
see I can say it is not worth learning how to read it) is in any way
superior to my 12 line example which finishes in a guaranteed
16 times through the loop for any 32 bit integer - and can be
read by any programmer? I might agree it is if the purpose is
to waste CPU and programmer time.
Dimiter
------------------------------------------------------
Dimiter Popoff Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/sets/72157600228621276/
Original message: http://groups.google.com/group/comp.arch.embedded/msg/e1532ed305256729?dmode=source
Which points out that the mathematics is more important than the
coding. Three or four lines of C can easily produce the same
effect.
--
[mail]: Chuck F (cbfalconer at maineline dot net)
[page]: <http://cbfalconer.home.att.net>
Try the download section.
It probably would be a little better to find r = sqrt(x^2 + z^2),
then s = sqrt(r^2 + y^2), if x > y > z. On a processor with fast
multiply and divide relative to comparison and swap operations it
would make no difference, but on small cpu's the extra ops to put
x,y,z in order would cost much less on average than the potentially
avoided iteration steps.
>Here is it excerpted from a post I have made many years ago,
>calculates a 16 bit squre root of a 32 bit number.
> The example is in CPU32 (or CF) assembly.
>
>*
>* calc. the square root of d1.l; return it in d1.l
>* destroys d2,d3,d4
>*
>sqrtd1
> moveq #15,d2 counter
> clr.l d3 colect result here
> set loop,*
> bset d2,d3 try this
> move.w d3,d4 extra copy result so far
> mulu.w d4,d4 square
> cmp.l d1,d4 higher?
> bls keepbit branch not
> bclr d2,d3 else bit back down
>keepbit
> dbra d2,loop process all bits
> move.l d3,d1 update in d1
> rts
I guess this will benefit from the ff1 (Coldfire ISA A+ AFAIK) or
cntlzw (PowerPC) opcodes, which return the position of the first 1
bit, and could be used to setup the loop counter (ff1 / 2).
--
42Bastian
Do not email to bast...@yahoo.com, it's a spam-only account :-)
Use <same-name>@monlynx.de instead !
It will indeed - it is really good to have the ff1 opcode back (as in
the 020,
there was a bfff1 there).
I am a frequent cntlzw user on the PPC, but have not used it on this
algorithm yet (never thought of it, I did it first for the CPU32 - or
was it the
HC11 - where there was no such opcode and have not changed it since).
I may do so next time I dig in that vicinity.
Dimiter
------------------------------------------------------
Dimiter Popoff Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/sets/72157600228621276/
Original message: http://groups.google.com/group/comp.arch.embedded/msg/7805bc4cf65eecab?dmode=source
I am glad someone else said that. as often I look at algorithms and then
determine
what results are needed and if some of the final satges are actually needed!
Too often the mathematicians and straight coders are looking for the perfect
mousetrap where a sufficient one will do. By this I mean
Are our input or output ranges limited anyway
(for _most_ light reading applications negative light is not
possible!)
Do we need to process using the result in a way that a partial result
will
give the same relationship, i.e. could we just use the sum of
squares
or is the data being passed to a PC or higher computational device
better suited to do the intensive calculations.
What is out native (8/16/32/64bit) and software (emulated floating
point) ranges
and limitations on what we are doing.
How often are we doing this (once a day, down to millions a second)
How fast do we have to do this (closely associated with previous and
total system
and application
requirements)
How accurate is the response needed and implications of any inaccuracies
(e.g. image processing will calculating image position results to 10
decimal places
help us draw a circle on a fixed pixel array - integer).
--
Paul Carpenter | pa...@pcserviceselectronics.co.uk
<http://www.pcserviceselectronics.co.uk/> PC Services
<http://www.pcserviceselectronics.co.uk/fonts/> Timing Diagram Font
<http://www.gnuh8.org.uk/> GNU H8 - compiler & Renesas H8/H8S/H8 Tiny
<http://www.badweb.org.uk/> For those web sites you hate
good question.
if you can tolerate +3 lsb or +7 % (that's better) accuracy, you can
convert polynomial to exponent (approx),
to halve exponent,
convert exponent to polynomial (approx).
none MUL_DIV: INC_DEC_SHIFT_ROT only.
best accuracy is at input values = 2^even.
worst accuracy is at input values = 2^odd.
regards
You can use the common binary form of the square root process
and be accurate to 1 LSB for any value. The process involves
only shifts and conditional subtraction, a variation of
division. With some pre- and post-processing, it works for
floating point values as well.