New fast sqrt() approximation

Aleks Jakulin

unread,

Aug 17, 1995, 3:00:00 AM8/17/95

to

There are several extremely fast square root routines around. But I
anyway wondered if it is possible to make a very fast approximation
which would work on 16:16 fixed point numbers, would not interpolate,
would not DIV, would not eat up 1 MB of memory for LUTs, etc.

So I made a funny new system which can be very fast and spend very
little memory. The natural choice is polynomial approximation, where
we use the input variable as the parameter. It is just as precise as
interpolation, but takes a MUL instead of DIV.

So, if you have x, you have to retrieve a and b somehow and then
calculate: y = a * x + b;

Great. But how do we determine a and b? OK, I will not go into details
here. a and b are Taylor polynom coefficients, and are calculated by
means of calculating the first few terms of a Taylor polynom.

Great. But how do we determine a and b *fast* maintaining precision and
not wasting too much memory? Of course a look-up table is the best
choice, but it would be too large...

The answer lies in a filtering function:

If you look at the graph of sqrt() you will see that the a and b should
change much more often for small numbers than for large ones. If you
keep looking for some time, you'll see that you could have a LUT for a &
b pair, but use the integer part of log2(x) or a similar function as the
index.

Anyway, the approximations I am getting for numbers greater than 1 (for
smaller numbers the errors are greater, not because of my method, but
because of relative error with fixed point numbers) are within 1%.

The nice thing about integer part of log2(x) is that it can be
calculated very fast. I used the slow 386+ microcodeprocedure bsr; it
can be done faster with a logarithmic look-up table or with a couple of
discriminative tests (0x0000ffff -> 0x00ff00ff -> ....)

If you mind the MUL, you can create a LUT of pointers onto small
functions each of which approximates or calculates the MUL with shifts
and adds. If you mind precision, you can create another coefficient, and
do two MULs (or approximations) extra, or you can decimate the existing
filtering function.

Also, please ignore the stack operations and the ax-dx shifts in the
end. That's needed for compatibility with BC 3.5 which I used at the
time I fiddled with that. The real routine is the incredible 6 lines of
simple 80386 code between the star comments.

Feedback is always welcome.

Regards,

Alex Jakulin

--------------------------------------------------------------------------
MODEL LARGE
IDEAL
P386

SEGMENT CODE
ASSUME cs:CODE, ds:NOTHING

LABEL Tab
DD 157, 6849270
DD 222, 4843165
DD 314, 3424635
DD 443, 2421583
DD 627, 1712317
DD 887, 1210791
DD 1254, 856159
DD 1774, 605396
DD 2508, 428079
DD 3547, 302698
DD 5017, 214040
DD 7094, 151349
DD 10033, 107020
DD 14189, 75674
DD 20066, 53510
DD 28378, 37837
DD 40132, 26755
DD 56756, 18919
DD 80265, 13377
DD 113512, 9459
DD 160530, 6689
DD 227023, 4730
DD 321060, 3344
DD 454047, 2365
DD 642119, 1672
DD 908093, 1182
DD 1284238, 836
DD 1816187, 591
DD 2568476, 418
DD 3632374, 296
DD 5136952, 209
DD 7264748, 148
DD 10273905, 105

PROC KORE FAR
PUBLIC KORE
push bp
mov bp,sp
mov eax,[bp+6]
; *****
bsr ebx,eax ; bx = # of leftmost 0
; bits in eax
jz ret0 ;) (guess)

lea ebx,[ebx*8] ; bx *= 8 (offset calc)
; lea is faster than shift
; on the new Intels
mul dword cs:[bx+OFFSET Tab+4] ; edx:eax=eax*Tab[bx].a
shrd eax,edx,10h ; fixup for 16:16 fix point
add eax, cs:[bx + OFFSET Tab] ; eax += Tab[bx].b
; *****

shld edx,eax,10h
jmp quit

ret0:
mov dx, ax
quit:
pop bp
ret 4
ENDP KORE

ENDS CODE

END

-----------------------------------------------------------------------------

Terje Mathisen

unread,

Aug 18, 1995, 3:00:00 AM8/18/95

to

al...@fc.hp.com (Aleks Jakulin) wrote:
>There are several extremely fast square root routines around. But I
>anyway wondered if it is possible to make a very fast approximation
>which would work on 16:16 fixed point numbers, would not interpolate,
>would not DIV, would not eat up 1 MB of memory for LUTs, etc.
>
>So I made a funny new system which can be very fast and spend very
>little memory. The natural choice is polynomial approximation, where
>we use the input variable as the parameter. It is just as precise as
>interpolation, but takes a MUL instead of DIV.

[snip]

>time I fiddled with that. The real routine is the incredible 6 lines of
>simple 80386 code between the star comments.

[another big snip]

>; *****
> bsr ebx,eax ; bx = # of leftmost 0
> ; bits in eax
> jz ret0 ;) (guess)
>
> lea ebx,[ebx*8] ; bx *= 8 (offset calc)
> ; lea is faster than shift
> ; on the new Intels
> mul dword cs:[bx+OFFSET Tab+4] ; edx:eax=eax*Tab[bx].a
> shrd eax,edx,10h ; fixup for 16:16 fix point
> add eax, cs:[bx + OFFSET Tab] ; eax += Tab[bx].b
>; *****
>
> shld edx,eax,10h
> jmp quit
>
>ret0:
> mov dx, ax
>quit:

That's really impressive code (assuming that you do get the 1% accuracy
you claim)!

I have just a few notes:

a) BSR is indeed extremely slow, taking 10+3n cycles on a 386, and
as far as I remember, something similar on 486 and Pentium.

b) The code really cries out for 32-bit code, since most of it have
to use 32-bit prefixes when used in a 16-bit segment.

c) LEA EBX, [EBX*8] isn't really any faster than SHL BX,3, because
EBX is modified in the previous cycle, so the address hw used
for LEA will stall.

d) The LEA step can be skipped, since the array indexing can be
included in the table lookups instead.

e) Assuming that the normal path goes through, you'll end up with
a JMP QUIT to skip the MOV DX,AX instruction. A faster way
to skip a short instruction, is to use a dummy TEST operation
to hide the short opcode:

shld edx,eax,10h
test ax,1234h
org $-2 ; Step two bytes back into the imm data

ret0:
mov dx,ax ; Two-byte opcode, contained within the TEST!
;quit: ; Back here again!

f) The lookup tables should really be in the data segment instead to
avoid the CS: segment prefix override.

I must say though, that I'm really impressed by the effort you've
obviously invested in your fix_sqrt() function!

--
-Terje Mathisen (include std disclaimer) <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Peter Gustav Lindstrom

unread,

Aug 18, 1995, 3:00:00 AM8/18/95

to

al...@fc.hp.com (Aleks Jakulin) wrote:
>There are several extremely fast square root routines around. But I
>anyway wondered if it is possible to make a very fast approximation
>which would work on 16:16 fixed point numbers, would not interpolate,
>would not DIV, would not eat up 1 MB of memory for LUTs, etc.

For floating point, your LUT's don't have to be that big. I've got a
piece of code that uses a 16KB LUT (for both floats and doubles) and I
get a maximum error of 0.015% for _all_ positive floating point numbers.
On a MIPS R4400, the actual square root takes about 15 instructions,
while the stack manipulations and test for negative numbers etc. increase
the function size to about 40 instructions. Compared to libm, this
code runs about 4 times faster, while it's about 2-3 times slower than
the floating point coprocessor instruction (which is not supported by
the R2000/3000 processors). For fixed point, your algorithm looks great
if the errors are what you claim.

--
_____________________________________________________________________________
PETER LINDSTROM Graphics Visualization and Usability Center
Ph.D. Student College of Computing
lind...@cc.gatech.edu Georgia Institute of Technology
(404) 853-9393 Atlanta, GA 30332

Aleks Jakulin

unread,

Aug 19, 1995, 3:00:00 AM8/19/95

to

I checked my article again and found out that the look-up table in the
article was an old version and badly done (the result is always
overestimated) and the peak relative error was 2.1%.

I now improved the look-up table and the results are now within 0.85%
for any value greater than 0.0045! For values less than that, the
errors will still be within 1% in most cases.

The greater error for lower values is due to inaccuracy of operations
with fixed point numbers and not my algorithm, which is unchanged.

Also, I considered some of Terje Mathisen's advices, cut the routine
into 5 instructions and increasing the speed by splitting the table into
two parts. Also, you might want to replace the bsr instruction with
something faster, but that's up to you.

-------------------------------------------------------------------
; this is not yet compilable assembler source, so adapt it for your
; own assembler

LABEL TabA
DD 6949350
DD 4913933
DD 3474675
DD 2456966
DD 1737338
DD 1228483
DD 868669
DD 614242
DD 434334
DD 307121
DD 217167
DD 153560
DD 108584
DD 76780
DD 54292
DD 38390
DD 27146
DD 19195
DD 13573
DD 9598
DD 6786
DD 4799
DD 3393
DD 2399
DD 1697
DD 1200
DD 848
DD 600
DD 424
DD 300
DD 212
DD 150
DD 106

LABEL TabB
DD 152
DD 215
DD 304
DD 430
DD 608
DD 860
DD 1217
DD 1721
DD 2433
DD 3441
DD 4867
DD 6883
DD 9734
DD 13765
DD 19467
DD 27531
DD 38934
DD 55061
DD 77868
DD 110122
DD 155736
DD 220244
DD 311472
DD 440488
DD 622945
DD 880977
DD 1245890
DD 1761954
DD 2491779
DD 3523908
DD 4983558
DD 7047816
DD 9967116

; The shorter and faster routine which accepts the 16:16 fixed point
; input parameter in eax, and returns sqrt(eax) in eax:

Sqrt:
bsr ebx,eax
jz zero

mul dword [4*ebx + OFFSET TabA]
shrd eax,edx,10h
add eax, [4*ebx + OFFSET TabB]
zero:
ret
-------------------------------------------------------------------

I also calculated an alternative look-up table where the errors are
bounded by 1.9% (with the same conditions as for 0.85% above), but the
average error is lower. However, this is not the optimal table.

-------------------------------------------------------------------
LABEL TabA
DD 6949350
DD 4913933
DD 3474675
DD 2456966
DD 1737338
DD 1228483
DD 868669
DD 614242
DD 434334
DD 307121
DD 217167
DD 153560
DD 108584
DD 76780
DD 54292
DD 38390
DD 27146
DD 19195
DD 13573
DD 9598
DD 6786
DD 4799
DD 3393
DD 2399
DD 1697
DD 1200
DD 848
DD 600
DD 424
DD 300
DD 212
DD 150
DD 106

LABEL TabB
DD 154
DD 218
DD 309
DD 437
DD 618
DD 874
DD 1236
DD 1748
DD 2472
DD 3496
DD 4944
DD 6992
DD 9888
DD 13983
DD 19775
DD 27967
DD 39551
DD 55933
DD 79101
DD 111866
DD 158203
DD 223732
DD 316405
DD 447465
DD 632811
DD 894929
DD 1265621
DD 1789859
DD 2531243
DD 3579718
DD 5062486
DD 7159436
DD 10124971
-------------------------------------------------------------------

I again apologize for publishing preliminary and unoptimized results,
although I am glad that nobody complained about this before I did. ;-)

Now go and speed up your raytracers, morphing routines, vector
normalizers, distance minimzers and similar slow horrors.

Regards,
Alex

P.s.: For those who want the routine in C: It is usually not possible
to do it on 32-bit CPUs, as the result of multiplication of two 32 bit
values has to be given in 64 bits to maintain precision, and ANSI C does
not support this. If you were looking for a good reason to learn to
program in assembler, this is one.

Bill C. Riemers

unread,

Aug 22, 1995, 3:00:00 AM8/22/95

to

In article <DDGy9...@hpl.hp.com>, Aleks Jakulin <al...@fc.hp.com> wrote:
>So, if you have x, you have to retrieve a and b somehow and then
>calculate: y = a * x + b;

>Great. But how do we determine a and b? OK, I will not go into details
>here. a and b are Taylor polynom coefficients, and are calculated by
>means of calculating the first few terms of a Taylor polynom.

Well, actually no. Because of course you figure out a taylor series
near some value. i.e.

F(x)=F(x0)+F'(x0)*(x-x0)+F''(x0)*(x-x0)^2/2+F'''(x0)*(x-x0)^3/6+...

the aproximation is:

F(x) =~ F(x0)+F'(x0)*(x-x0)

So, you get ideal values near whatever x0 you choise to use, but lousy
as you get further away. In practice you really, want something works
well over some range of values. For example between x0 and x1. In
this case you want the parameters that minimize the maximum error
of your polynomial over this range. So:

G(x)=(F(x)-a*x-b)^2

Find the parameters a & b that minimize the maximum value of G(x). For
simple functions, this can be done analyticaly, but for more complicated
functions, this needs to be done with numerical analysis. Great, that is
why we invented computers to begin with. Something like, MINUITE will find
the answer in no time for any arbitrary function.

Alex, I'm willing to bet you already understood these considerations, but you
where just trying to keep things simple.

Now, the next level of difficulty. Often it is better to break the function
up into sub-regions over our full interval, and use a different approximation
for each one. In this case, again MINUITE can come to our rescue simply by
adding the boundries as a free parameter. Of course for something like
a sqrt() function, one can use analytical tricks to extrapulate further. For
example, if I know the sqrt of 257, then I know the sqrt of 257/4 is the
sqrt of 257 right shifted one bit. (Note: I see you said this same thing
slightly reworded.)

Bill
--
<A HREF= "ftp://physics.purdue.edu/pub/bcr/www/homepage.html" >
<IMG SRC= "ftp://physics.purdue.edu/pub/bcr/www/bcr.gif" >
<ADDRESS><H1> Dr. Bill C. Riemers, b...@physics.purdue.edu </H1></ADDRESS>
<H1> Department of Physics, Purdue University </H1></A>

C. van Rij

unread,

Aug 25, 1995, 3:00:00 AM8/25/95

to

Ok, let me start by pointing out that I haven't had time
to follow this interesting thread, but I wanted to share
this code with you anyway. I wrote it 4 years ago when I
needed a fst root alg. .. it's based on the CORDIC trigono-
metric system. (Well, I read an article about it and
tried to implement it.. I'm not sure I did it right, but
it works, and it's fast.)

.MODEL small
.STACK 100h
.DATA
result dw 0
value dw 65025
.CODE
start: MOV AX,@DATA
MOV DS, AX
MOV AX, value
MOV BX, 0
MOV CX, 8 ; (sizeof value in bits)/2
MOV DX, 0
digit: SHL AX, 1
RCL BX, 1
SHL AX, 1
RCL BX, 1
SHL DX, 1
STC
RCL DX, 1
CMP BX, DX
JL nofit
SUB BX,DX
SHR DX, 1
OR DX,1
JMP cont
nofit: SHR DX, 1
cont: OR AX,BX
LOOPNZ digit
MOV result, DX
MOV AH,4Ch
INT 21h
END

this tries to explain how it works:

; 01010001 = 0
; : : : : :
; 01 : : : 1 * 001 => 1
; 01-: : : 0 * 000 :
; -- : : : :
; 001 : : 1 * 101
; 000-: :< 0 * 100 => 10
; --- : : ::
; 100 : 1 * 1001
; 000-:< 0 * 1000 => 100
; --- : :::
; 10001 1 * 10001 => 1001
; 10001- 0 * 10000
; -----
; 0

I'm still interested in references to information about CORDIC, btw.

Cordially,
Casey R.

Chris Cox

unread,

Aug 25, 1995, 3:00:00 AM8/25/95

to

Well, it might be fast, but it doesn't run on my sparc, or my SGI crimson,
or my decstation, or my powermac, or my .......

PLEASE - this newsgroup is about ALGORITHMS, not platform specific
implimentations.

Chris

In article <DDvD8J....@cs.vu.nl>, cv...@cs.vu.nl (C. van Rij) wrote:

> Ok, let me start by pointing out that I haven't had time
> to follow this interesting thread, but I wanted to share
> this code with you anyway. I wrote it 4 years ago when I
> needed a fst root alg. .. it's based on the CORDIC trigono-
> metric system. (Well, I read an article about it and
> tried to implement it.. I'm not sure I did it right, but
> it works, and it's fast.)
>
> .MODEL small
> .STACK 100h
> .DATA
> result dw 0

8080A source deleted
> INT 21h
> END
>
--
Companies have policies, not opinions.
Thus my opinions cannot be those of my company.

http://www.asi.com/pwrshop.html

Terje Mathisen

unread,

Aug 28, 1995, 3:00:00 AM8/28/95

to

cv...@cs.vu.nl (C. van Rij) wrote:

>digit: SHL AX, 1
> RCL BX, 1
> SHL AX, 1
> RCL BX, 1
> SHL DX, 1
> STC
> RCL DX, 1
> CMP BX, DX
> JL nofit
> SUB BX,DX
> SHR DX, 1
> OR DX,1
> JMP cont
>nofit: SHR DX, 1
>cont: OR AX,BX
> LOOPNZ digit

That's an interesting algorithm!

You can speed it up a bit, esp on 486 and Pentium, by getting rid
of some of the shift operations, and reschedule some others:

digit:
add dx,dx
add ax,ax
adc bx,bx
add dx,dx
inc dx
add ax,ax
adc bx,bx
sub bx,dx
jl nofit ; Should this be JB instead?
shr dx,1
or dx,1
or ax,bx
jz done
dec cx
jnz digit
jmp done

nofit:
add bx,dx
shr dx,1
or ax,bx
jz done
dec cx
jnz digit
done:

There's still a few places in this code where the conditional
jumps end up in the U pipe, so there's room for better scheduling,
but this version should still be at least 4-5 cycles faster
per iteration.

Romesh Prakashpalan

unread,

Sep 2, 1995, 3:00:00 AM9/2/95

to

Terje Mathisen <Terje.M...@hda.hydro.com> wrote:

Hello,
Could someone repost (or even better, email me) the original
article? I seem to have gotten onto the group rather late in the life
cycle of this thread.

Thanks,
Romesh Prakashpalan
Game Developer, Celeris Inc.

/-------------------------------------------------------------------------\
| Romesh Prakashpalan (pra...@earthlink.net) |
| Check out my homepage at: http://www.csun.edu/~hbcsc294 |
| WARNING! My homepage shall move to: http://www.earthlink.net/~prakash |
| shortly! |
+-------------------------------------------------------------------------+
| All you Turbo Pascal programmers, grab my SBDSP library from: |
| x2ftp.oulu.fi in the /pub/msdos/programming/mxlibs directory, most |
| recent filename: sbdsp2b.zip. A library for TP 7, that includes *full* |
| source code, and plays back digitized sound. |
\-------------------------------------------------------------------------/

Romesh Prakashpalan

unread,

Sep 2, 1995, 3:00:00 AM9/2/95

to

Those with brains would realize the worth of the algorithm/function NO
MATTER what language/platform it was written for! Rewrite it in
PowerPC assembly, it ain't hard! (Oops, forgot! with instructions like
eieio :) the PowerPC can become a bit hilarious to code for! One might
spend more time laughing at the choice of instruction mnemonics rather
than coding!). And whose fault is it that you don't have an Intel
machine in the first place? (Just kidding, no more platform flames
please). I'd trade in my SparcStation if I were you though, Sun's
future ain't too good. Buy an SGI or real PowerPCs, not the cheap
PowerMac variety (you are sinning as much for using that OS on that
POWERful CPU as a PC user who uses DOS/WinDOS with their P6).

Romesh Prakashpalan
Game Developer, Celeris Inc.

cc...@asi.com (Chris Cox) wrote:

>Well, it might be fast, but it doesn't run on my sparc, or my SGI crimson,
>or my decstation, or my powermac, or my .......

>PLEASE - this newsgroup is about ALGORITHMS, not platform specific
>implimentations.

>Chris

Chris Cox

unread,

Sep 3, 1995, 3:00:00 AM9/3/95

to

Those with brains would know the difference between an algorithm and an
implementation of an algorithm - as well as the fact that different
implementations will get different speeds on different architectures.

Chris

In article <428n74$b...@mars.earthlink.net>, pra...@earthlink.net wrote:

> Those with brains would realize the worth of the algorithm/function NO
> MATTER what language/platform it was written for! Rewrite it in

> ....ignorant blathering removed....

>
> Romesh Prakashpalan
> Game Developer, Celeris Inc.

--
Companies have policies, not opinions.
Thus my opinions cannot be those of my company.

http://www.asi.com/powershop/pwrshop.html