Optimized reciprocal square root

Vesa Karvonen

Dec 1, 1997

Hi!

I've lately been obsessed with writing an optimized
reciprocal square root routine. I've managed to get
it down to about 50 clock cycles (including estimated
call overhead) on the Pentium. This is not bad
compared to the roughly 100 cycles taken by FDIV+FSQRT.

Since I'm using double precision, it is difficult to
avoid partial memory stalls. Indeed the routine has
three potential partial memory stalls, though one of
them is likely to be masked. However the routine is
otherwise carefully "blended" for Pentium Pro/PII.

First I calculate a quick and dirty estimate to the
reciprocal square root, by using a 32-bit variant of
the technique that I learned from James Van Buskirk
when he posted his parallel square root routine for
the Alpha AXP. Then I use Goldschmidt's iteration
formula for improving the reciprocal square root
estimate.
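
In plain C the estimate boils down to something like this (just a
rough sketch, assuming little-endian IEEE 754 doubles; the real
routine below leaves the low 32 bits of the estimate as
uninitialized stack, here they are simply reused from the argument):

#include <stdint.h>
#include <string.h>

/* Rough sketch of the initial estimate. Only the high 32 bits are
** touched; the low 32 bits are left as the argument's low word.
*/
static double rsqrt_estimate(double x)
{
    uint32_t w[2];
    memcpy(w, &x, sizeof w);            /* w[1] = sign|exponent|high mantissa */
    w[1] = (0xBFCDD6A1u - w[1]) >> 1;   /* subtract from the magic bias,      */
    memcpy(&x, w, sizeof x);            /* which also halves the exponent     */
    return x;                           /* crude approximation of 1/sqrt(x)   */
}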

The routine appears to have a maximum error of about
2 ulps, and it seems able to handle nearly the full
range of possible arguments. I'm using this routine
as a direct replacement for Sqrt(a/x) and a/Sqrt(x);
I'm probably not going to use it to replace a*Sqrt(x).
I'm using compositor techniques (see TC++PL, 3rd ed.,
by Bjarne Stroustrup) to make it convenient to use.
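
For illustration, the two replacements look roughly like this in
plain C (a sketch only: the compositor machinery is left out, the
wrapper names are made up, and the second form relies on the
identity Sqrt(a/x) = a/Sqrt(a*x) for positive a and x):

double __stdcall Inv_Sqrt(double x);    /* the routine given below */

double a_div_sqrt_x(double a, double x) { return a * Inv_Sqrt(x); }     /* a/Sqrt(x) */
double sqrt_a_div_x(double a, double x) { return a * Inv_Sqrt(a * x); } /* Sqrt(a/x) */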

Obviously it would be possible to improve the speed
by calculating more than one reciprocal square root
in parallel, but before I do that, I'd like to find
a good way to calculate a single reciprocal square
root.

I think that it would be possible to use a look-up
table variant of James Van Buskirk's method.
Specifically, the "magic" value, or more appropriately
the bias, would be chosen according to a number of bits
from the argument. I've estimated that a look-up
table of 5 to 6 bits (i.e. 32 to 64 entries) should
be large enough that an iteration of Goldschmidt's
formula could be removed without losing precision. The
speed benefit should be less than 7 clock cycles. Using
a larger look-up table should have mostly negative
effects. I'll probably implement that when I have the
time, but if you know better methods, do let me know.
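
The shape of what I have in mind is roughly the following (only a
sketch: the table contents are left unspecified, and the 5-bit
index shown here is just one possible choice):

#include <stdint.h>
#include <string.h>

/* Look-up table variant of the estimate: the bias is chosen from a
** small table indexed by the lowest exponent bit plus the top 4
** mantissa bits, i.e. 32 subintervals per [1,4) range.
*/
extern const uint32_t bias_table[32];   /* hypothetical, to be worked out */

static double rsqrt_estimate_lut(double x)
{
    uint32_t w[2];
    memcpy(w, &x, sizeof w);            /* little-endian IEEE 754 assumed */
    unsigned i = (w[1] >> 16) & 0x1F;   /* 5-bit subinterval index        */
    w[1] = (bias_table[i] - w[1]) >> 1;
    memcpy(&x, w, sizeof x);
    return x;
}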

So, if you have a faster way to calculate the reciprocal
square root without using a large look-up table, I'd
like to hear from you.

Here is the routine. It has been written for MSVC++
using __stdcall[/__fastcall] calling conventions. The
clock cycle estimates are for Pentium/Pentium MMX.

<-- cut -->
/* Copyright (C) 1997 by Vesa Karvonen. All rights reserved.
**
** Use freely as long as my copyright is retained.
*/

double __stdcall Inv_Sqrt(double x) {
__asm {
; I'm assuming that the argument is aligned to a 64-bit boundary.

mov eax,0BFCDD6A1h ; 1u Constant from James Van Buskirk
mov edx,[esp+8] ; 1v Potential pms.
sub eax,edx ; 2u
push 03FC00000h ; 2v Constant 1.5, aligns stack
shr eax,1 ; 3u
sub edx,000100000h ; 3v =.5*x, biased exponent must > 1
mov [esp+12],edx ; 4u
push eax ; 4v

; The lower 32-bits of the estimate come from uninitialized stack.

fld QWORD PTR [esp-4] ; 5 Potential pms
fmul st,st ; 6-8
fld QWORD PTR [esp-4] ; 7
fxch st(1) ; 7x
fmul QWORD PTR [esp+12] ; 9-11 Potential pms
fld DWORD PTR [esp+4] ; 10
add esp,4 ; 12 Faster on Pro/PII
fsub st,st(1) ; 12-14

fmul st(1),st ; 15-17
fmul st(2),st ; 17-19
fmulp st(1),st ; 19-21
fld DWORD PTR [esp] ; 20
fsub st,st(1) ; 22-24

fmul st(1),st ; 25-27
fmul st(2),st ; 27-29
fmulp st(1),st ; 29-31
fld DWORD PTR [esp] ; 30
fsub st,st(1) ; 32-34

fmul st(1),st ; 35-37
fmul st(2),st ; 37-39
fmulp st(1),st ; 39-41
fld DWORD PTR [esp] ; 40
pop eax ; 41 Should not limit
fsubrp st(1),st ; 42-44 decoding on Pro/PII.
fmulp st(1),st ; 45-47
}
}
<-- cut -->

Vesa Karvonen

Serge Plagnol

Dec 17, 1997

Hello,

I am very interested in your algorithm, and I would be interested in
optimizing it with you and one of my colleagues. Could you be more precise about
the algorithm and the background math, because right now we have a hard time
understanding the computations you are doing, especially the first part with
the constants.

Serge

James Van Buskirk

Dec 22, 1997

Serge Plagnol wrote in message <34a41e9...@news.magicnet.net>...


Since Vesa hasn't responded to your query in c.l.a.x., I'll take a
stab at it.

The idea behind the approximation is that the exponent of the
approximation should be about minus one-half that of the input.
To achieve this, one would ordinarily separate the exponent from
the rest of the number, subtract it from some constant (to account
for the bias in the exponent), and shift the result right by 1, then
reconstitute the exponent back into the original approximation.
The questions arise: what would happen if we didn't separate the
exponent out first and simply carried out the subtraction from
some "magic number" and right shift using the entire IEEE 754
representation of the number? What kind of mathematical function
of the input do we get? The answer is that the output function
is piecewise linear, y = a*x+b, where a = -2.0**(-n) for some
integer n, and b lies within some range defined by n. Given the
constant of the previous posts, the function is:

for 1 <= x <= 2: y = 1.21622504239507-x/4
for 2 <= x <= 3.72980033916057: y = 0.966225042395071-x/8
for 3.72980033916057 <= x <= 4: y = 0.733112521197536-x/16

for x outside the above ranges, one can use the recurrence relation:

y(4*x) = y(x)/2
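
One quick way to see this is to evaluate both forms side by side
(a throwaway C sketch, assuming little-endian IEEE 754 doubles; the
low word of the estimate is zeroed here, so the two columns agree
only up to the bits that the 32-bit constant ignores):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static double magic_estimate(double x)     /* the (magic - bits) >> 1 trick */
{
    uint32_t w[2];
    memcpy(w, &x, sizeof w);
    w[1] = (0xBFCDD6A1u - w[1]) >> 1;
    w[0] = 0;                              /* ignore the low 32 bits */
    memcpy(&x, w, sizeof x);
    return x;
}

static double piecewise(double x)          /* the description above */
{
    if (x < 2.0)              return 1.21622504239507  - x/4.0;
    if (x < 3.72980033916057) return 0.966225042395071 - x/8.0;
    return                           0.733112521197536 - x/16.0;
}

int main(void)
{
    double x;
    for (x = 1.0; x < 4.0; x += 0.25)
        printf("%6.2f  %12.7f  %12.7f\n", x, magic_estimate(x), piecewise(x));
    return 0;
}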

On an Alpha, lookup tables (LUTs) aren't as good an idea as on x86
processors, because Alphas have twice the (per clock) floating
point throughput and about twice the clock rate. This means that
the benefits of LUTs are less and the penalties for thrashing the
cache are greater than on x86. Since this is an x86 NG, perhaps it
would be worthwhile to find some LUT methods that use similar ideas
(i.e. of not extracting the exponent.)

We could, for example, divide up the region 1 <= x < 4 into 2**N
subdomains and then try to find the optimal magic number for each
subdomain. This is not too difficult but the approximation is
not very good near x = 1 nor near x = 4. The reason for this is
that for 1 <= x < 2, the input, x, must have a biased exponent of
3ffh, whereas the output, y, must have a biased exponent of 3feh,
so the slope a = -0.25 because the exponent of the input has decreased by
one and its mantissa has also been shifted right by one. The slope
of y = 1/sqrt(x) is near -0.5, so we get a bad match between slopes.
Similar reasons limit accuracy near x = 4. It takes a 5 bit (32 entry,
128 byte) LUT to reduce the subsequent Newton-Raphson iterations from
four down to three if we want IEEE 754 double precision accuracy.
See proc sqrt_ta5_ below.
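
For reference, the Newton-Raphson step being counted here is just
the usual one for the reciprocal square root (a generic C sketch,
not a transcription of the listings below):

/* One Newton-Raphson refinement of y ~= 1/sqrt(x); each application
** roughly doubles the number of correct bits.
*/
static double nr_step(double x, double y)
{
    return y * (1.5 - 0.5 * x * y * y);
}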

Another possibility is to multiply the output y by an interval-
dependent number. This permits us to find the function y = a*x+b
with no important restrictions on the parameters a and b that minimizes
our figure of merit (which we take to be the maximum of the absolute value
of the relative error after the first NR iteration.) It's actually
completely straightforward in this case to find a and b, hence the
magic number and multiplier, for each subinterval. With a 6 bit
(64 entry pairs, 512 byte) LUT we get close enough that we converge to
double precision accuracy in only two NR iterations. See proc sqrt_ta6_
below.
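
In C the 6 bit variant has roughly the following structure (a sketch
of the idea only, not a transcription of sqrt_ta6_ below; the struct
layout and the names are merely illustrative):

#include <stdint.h>
#include <string.h>

/* Scale-and-bias variant: each of the 64 subintervals of [1, 4) gets
** its own magic bias plus a float scale factor that is applied to the
** result of the subtract-and-shift trick.
*/
struct rsqrt_entry { float scale; uint32_t magic; };
extern const struct rsqrt_entry rsqrt_lut[64];         /* hypothetical */

static double rsqrt_ta6_sketch(double x)
{
    uint32_t w[2];
    memcpy(w, &x, sizeof w);                  /* little-endian IEEE 754 assumed     */
    unsigned i = (w[1] >> 15) & 0x3F;         /* low exponent bit + 5 mantissa bits */
    w[1] = (rsqrt_lut[i].magic - w[1]) >> 1;  /* per-interval bias, halve exponent  */
    memcpy(&x, w, sizeof x);
    return x * rsqrt_lut[i].scale;            /* per-interval scale factor          */
}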

The tables and test functions (in WASM):

.586
_DATA segment para public use32 'DATA'
public sqrt_ta5_, sqrt_ta6_
align 8

s dd 0BFCD1392h,0BFCCC88Dh
dd 0BFCC98C2h,0BFCC80A9h,0BFCC7D33h,0BFCC8C4Eh,0BFCCAB8Ch,0BFCCD94Ch
dd 0BFCD140Fh,0BFCD5A8Bh,0BFCDABA2h,0BFCE065Bh,0BFCE69DCh,0BFCED564h
dd 0BFCF4849h,0BFCFAD2Fh
dd 0BFCF8980h,0BFCEB56Eh,0BFCE07DDh,0BFCD7BCCh,0BFCD0D17h,0BFCCB849h
dd 0BFCC7A78h,0BFCC512Bh,0BFCC3A44h,0BFCC3355h,0BFCC3C9Ch,0BFCC52E6h
dd 0BFCC759Bh,0BFCCA3AAh,0BFCCDC22h,0BFCD1E30h

d dd 3EB0E05Bh,0C000BF81h
dd 3EA903D2h,0C0023F85h,3EA1B814h,0C003BF89h,3E9AEEF4h,0C0053F8Ch
dd 3E949C02h,0C006BF8Fh,3E8EB448h,0C0083F92h,3E892E16h,0C009BF95h
dd 3E8400D7h,0C00B3F98h,3E7E49CDh,0C00CBF9Ah,3E7526E7h,0C00E3F9Dh
dd 3E6C8CC9h,0C00FBF9Fh,3E647081h,0C0113FA1h,3E5CC839h,0C012BFA3h
dd 3E558B10h,0C0143FA5h,3E4EB102h,0C015BFA7h,3E4832C9h,0C0173FA9h
dd 3E4209D0h,0C018BFABh,3E3C3015h,0C01A3FADh,3E36A023h,0C01BBFAEh
dd 3E3154FBh,0C01D3FB0h,3E2C4A0Dh,0C01EBFB1h,3E277B2Bh,0C0203FB3h
dd 3E22E47Dh,0C021BFB4h,3E1E827Dh,0C0233FB6h,3E1A51EAh,0C024BFB7h
dd 3E164FC6h,0C0263FB8h,3E12794Ch,0C027BFB9h,3E0ECBEFh,0C0293FBBh
dd 3E0B454Eh,0C02ABFBCh,3E07E33Ah,0C02C3FBDh,3E04A3A6h,0C02DBFBEh
dd 3E0184AFh,0C02F3FBFh
dd 3F7A2418h,0BFD0BF81h,3F6F05F0h,0BFD23F85h,3F64B48Eh,0BFD3BF89h
dd 3F5B1BE3h,0BFD53F8Ch,3F522A5Ah,0BFD6BF8Fh,3F49D072h,0BFD83F92h
dd 3F42007Ah,0BFD9BF95h,3F3AAE4Bh,0BFDB3F98h,3F33CF19h,0BFDCBF9Ah
dd 3F2D593Fh,0BFDE3F9Dh,3F27441Dh,0BFDFBF9Fh,3F2187F6h,0BFE13FA1h
dd 3F1C1DD5h,0BFE2BFA3h,3F16FF74h,0BFE43FA5h,3F122725h,0BFE5BFA7h
dd 3F0D8FC7h,0BFE73FA9h,3F0934B0h,0BFE8BFABh,3F0511A3h,0BFEA3FADh
dd 3F0122C1h,0BFEBBFAEh,3EFAC906h,0BFED3FB0h,3EF3A760h,0BFEEBFB1h
dd 3EECDAA4h,0BFF03FB3h,3EE65D65h,0BFF1BFB4h,3EE02AA5h,0BFF33FB6h
dd 3EDA3DCCh,0BFF4BFB7h,3ED4929Eh,0BFF63FB8h,3ECF2530h,0BFF7BFB9h
dd 3EC9F1E5h,0BFF93FBBh,3EC4F564h,0BFFABFBCh,3EC02C91h,0BFFC3FBDh
dd 3EBB948Ah,0BFFDBFBEh,3EB72AA2h,0BFFF3FBFh
_DATA ends

_TEXT segment para public use32 'CODE'
; double sqrt_ta5(double *x);
; input: eax contains the address of the double input (x)
; output: ST(0) contains the approximation
; destroys ebx, ecx, edx, [eax]
sqrt_ta5_ proc near
mov ebx, dword ptr +4[eax]
shr ebx, 17
mov ecx, dword ptr +4[eax]
and ebx, 0fh
mov edx, dword ptr s[4*ebx]
sub edx, ecx
shr edx, 1
mov dword ptr +4[eax], edx
fld qword ptr [eax]
ret
sqrt_ta5_ endp

; double sqrt_ta6(double *x);
; input: eax contains the address of the double input (x)
; output: ST(0) contains the approximation
; destroys ebx, ecx, edx, [eax]
sqrt_ta6_ proc near
mov ebx, dword ptr +4[eax] ; Get hi 32 bits of input
shr ebx, 15 ; Move bits 47:53 of input to
; bits 0:5 of ebx
mov ecx, dword ptr +4[eax] ; Get hi 32 bits of input
and ebx, 3fh ; Extract bits 47:53 of input
mov edx, dword ptr (d+4)[8*ebx] ; Get magic # for interval
sub edx, ecx ; Subtract and shift to get
shr edx, 1 ; unscaled approximation
mov dword ptr +4[eax], edx ; Store unscaled approximation
fld qword ptr [eax] ; Load unscaled approx. into FPU
fmul dword ptr d[8*ebx] ; Apply scale factor for interval
ret
sqrt_ta6_ endp
_TEXT ends
end

Instead of three wise men bringing gifts, all you get these days
is one randomly selected guy bringing LUTs!


Vesa Karvonen

Jan 3, 1998

Serge Plagnol wrote:
>Hello,
>
>I am very interested in your algorithm, and I would be interested in
>optimizing it with you and one of my colleagues. Could you be more precise about
>the algorithm and the background math, because right now we have a hard time
>understanding the computations you are doing, especially the first part with
>the constants.
>
>Serge

Hi, I just got back from my x-mas holiday (hence the delay).

James Van Buskirk already explained the initial approximation
method thoroughly. So, I'm not going to get into that at all.

(All of this is from memory, so you _should_ check...)

As was said in my post, the approximation is improved using
Goldschmidt's iteration formula for reciprocal square root

x(n+1) = x(n)*r(n)
y(n+1) = y(n)*r(n)*r(n)
r(n+1) = 1.5 - y(n)*r(n)*r(n) = 1.5 - y(n+1).

If you choose

x(0) = 1.0
y(0) = 0.5*a
r(0) = estim(a)

and estim(a) is close enough to the reciprocal square root
of a, then iterating the formula converges quadratically
(i.e. the number of significant digits approximately doubles
each iteration) to

x(n -> inf) = 1.0/sqrt(a)
y(n -> inf) = 0.5
r(n -> inf) = 1.0

The advantage of using Goldschmidt's iteration formula instead
of the more widely known Newton-Raphson is that Goldschmidt's
formula has one less dependent multiplication, thus allowing a
slightly faster implementation (esp. when FMUL has a long latency!)
The drawback is that Goldschmidt's formula is slightly less stable
(i.e. slightly less precise). Also, even with N-R, it takes some
effort to really get below 1 ulp.
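
Putting the estimate and the iteration together in plain C gives
something like the following (a straight sketch of the math above,
assuming little-endian IEEE 754 doubles; it is not the scheduled
assembly from my earlier post):

#include <stdint.h>
#include <string.h>

/* Goldschmidt iteration for 1/sqrt(a), starting from the 32-bit magic
** constant estimate.  Four iterations plus the final multiply should be
** enough for double precision mathematically; rounding adds a few ulps.
*/
static double inv_sqrt_goldschmidt(double a)
{
    uint32_t w[2];
    double x = 1.0, y = 0.5 * a, r;
    int i;

    memcpy(w, &a, sizeof w);            /* little-endian IEEE 754 assumed */
    w[1] = (0xBFCDD6A1u - w[1]) >> 1;   /* r(0) = estim(a)                */
    memcpy(&r, w, sizeof r);

    for (i = 0; i < 4; i++) {
        x *= r;                         /* x(n+1) = x(n)*r(n)             */
        y *= r * r;                     /* y(n+1) = y(n)*r(n)*r(n)        */
        r = 1.5 - y;                    /* r(n+1) = 1.5 - y(n+1)          */
    }
    return x * r;                       /* x*r -> 1.0/sqrt(a)             */
}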

<--->
Vesa Karvonen
