
Reciprocal of a single precision float


Awsim

Apr 30, 1998, 3:00:00 AM

Hi there,

I was wondering if any of you know a quick way of taking the reciprocal
of a 32 bit floating point. As you know floating point divides are just
toooo expensive.


Thanks,


Asim.

Samuel S. Paik

Apr 30, 1998, 3:00:00 AM

In article <e260xAHd9GA.186@uppubnews03>, "Awsim" <Asim_...@vie.co.uk> wrote:
> I was wondering if any of you know a quick way of taking the reciprocal
>of a 32 bit floating point. As you know floating point divides are just
>toooo expensive.

I'm pretty sure I posted something about this a few weeks ago.
In any case, a very old approach to this is to take apart
the float into the exponent and mantissa and then do a table lookup
on the mantissa. If more precision is desired, you can use a
Newton-Raphson iteration--given a 64K entry table, typically one
iteration will get you very close to full precision for single precision.
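The Newton-Raphson recurrence in question is the standard one for reciprocals, x' = x(2 - a*x); a minimal sketch (the function name is mine, not from the post):

```c
#include <assert.h>
#include <math.h>

/* One Newton-Raphson refinement step for y ~= 1/a:
       y' = y * (2 - a*y)
   The relative error roughly squares each step, so a table seed
   good to ~16 bits reaches nearly full single precision in one step. */
static float recip_step(float a, float y)
{
    return y * (2.0f - a * y);
}
```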

Going back to the table lookup...

The idea is that a floating point number is made up of an exponent
and a fraction (called the mantissa). The number really is

sign * 2^exponent * fraction

where, for a normalized IEEE number, the fraction is in [1,2). Then,

1/(sign * 2^exponent * fraction) = sign * 2^-exponent * 1/fraction.

So truncate the fraction to however many entries you have in your
lookup table (e.g. truncate to 16 bits if you have a 64K entry table),
and you have a new fraction to go with the new exponent (which is just
the negative of the original). Now you can put the new exponent and
new fraction together to produce the new float.

There are gotchas to worry about, like renormalizing the result, but this
should get you started... you'll need to look up the exact format of an
IEEE float, which has 1 sign bit, 8 exponent bits biased by 127, and
23 mantissa bits with a hidden leading 1 bit,
but I'm not going to do all the work for you.
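A minimal sketch of the whole scheme, assuming IEEE-754 single precision and normalized positive inputs; a 256-entry table stands in for the 64K one, `ldexpf` applies the negated exponent instead of reassembling the bits by hand, and all the names are hypothetical:

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

#define TBL_BITS 8                      /* 256 entries; the post suggests up to 16 bits */
static float recip_tbl[1 << TBL_BITS];

/* Entry i approximates 1/m for mantissas m = 1.f in [1,2). */
static void init_recip_table(void)
{
    for (int i = 0; i < (1 << TBL_BITS); i++)
        recip_tbl[i] = 1.0f / (1.0f + (i + 0.5f) / (1 << TBL_BITS));
}

/* Approximate 1/x: negate the exponent, look up the top mantissa
   bits, then one Newton-Raphson step for extra precision.
   Sketch only: assumes a normalized, positive x. */
static float fast_recip(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    int e = (int)((bits >> 23) & 0xFFu) - 127;              /* unbiased exponent */
    uint32_t i = (bits >> (23 - TBL_BITS)) & ((1u << TBL_BITS) - 1);
    float y = ldexpf(recip_tbl[i], -e);                     /* 2^-e * (1/m) */
    return y * (2.0f - x * y);                              /* Newton-Raphson step */
}
```

With the 256-entry table the seed is good to about 9 bits and the Newton step roughly doubles that; the post's 64K table would get close to full single precision.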

Finally, if you are using an Intel x86, you may be better off just putting
the fp unit into "24 bit" mode, which turns fp division into a 17 cycle
operation (which the above will not beat by much, and which allows
you to do other fp operations while the division is running...)

Sam Paik

p.s. the above approach works well for square root and reciprocal
square root as well.

--
Necessity is the Mother of Improvisation. | Samuel S. Paik
Laziness and Greed are the Parents of Innovation| pa...@webnexus.com
A Rebel against the Small World Order | Speak only for self

Arnout van der Kamp

May 1, 1998, 3:00:00 AM

No problem with a lookup table.


Awsim wrote in message ...
>Hi there,
>
> I was wondering if any of you know a quick way of taking the reciprocal
>of a 32 bit floating point. As you know floating point divides are just
>toooo expensive.
>
> Thanks,
>
> Asim.

John Nagle

May 1, 1998, 3:00:00 AM

pa...@webnexus.com (Samuel S. Paik) writes:
>In article <e260xAHd9GA.186@uppubnews03>, "Awsim" <Asim_...@vie.co.uk> wrote:
>> I was wondering if any of you know a quick way of taking the reciprocal
>>of a 32 bit floating point. As you know floating point divides are just
>>toooo expensive.

>I'm pretty sure I posted something about this a few weeks ago.
>
>In any case, a very old approach to this is to take apart
>the float into the exponent and mantissa and then do a table lookup
>on the mantissa. If more precision is desired, you can use a
>Newton-Raphson iteration--given a 64K entry table, typically one
>iteration will get you very close to full precision for single precision.

Anything that takes you outside the FPU will be slower than a
single divide on modern CPUs. Big tables are awful; a cache miss
is likely.

I was down at Intel recently, and someone mentioned that a
3x3 matrix multiply is getting to be cheaper than fetching a stored
3x3 result from the L2 cache. Not sure if they were referring to a
current or future CPU, but it gives you an idea of where things are going.

John Nagle

Richard Cant

May 5, 1998, 3:00:00 AM

In article <nagleEs...@netcom.com>, John Nagle <na...@netcom.com>
writes

> Anything that takes you outside the FPU will be slower than a
>single divide on modern CPUs. Big tables are awful; a cache miss
>is likely.

It's marginal for reciprocal, but for reciprocal square root on a P90 the
lookup + Newton-Raphson approach really is faster. I've tested it
myself.

Which means that Intel got it wrong, because if THEY had implemented this
method on the FPU (as TI did on the C40) it would have been quicker
still.

>
> I was down at Intel recently, and someone mentioned that a
>3x3 matrix multiply is getting to be cheaper than fetching a stored
>3x3 result from the L2 cache. Not sure if they were referring to a
>current or future CPU, but it gives you an idea of where things are going.

Agreed - prepare for a world where computation is cheap and
communication is expensive!
--
Richard Cant

Vesa Karvonen

May 5, 1998, 3:00:00 AM

Richard Cant <ric...@timezznospamzzhigh.demon.co.uk> wrote:
> In article <nagleEs...@netcom.com>, John Nagle <na...@netcom.com>
> writes
> > Anything that takes you outside the FPU will be slower than a
> >single divide on modern CPUs. Big tables are awful; a cache miss
> >is likely.
> Its marginal for reciprocal but for reciprocal square root on a P90 the
> look up+Newton Raphson appproach really is faster. I've tested it
> myself.

If I recall correctly, you had a reciprocal square root routine which
used a 64kb table? It is possible that your routine is, in some cases,
faster on a P90, which has a relatively slow clock rate these days.
However, even on a P90, in production code, with lots of stuff going
through the cache, I find it difficult to believe that you could beat
the 19 cycles that FDIV takes in single precision using such a large
table.

You could make do with a 6-bit index into a 512-byte table, a linear
polynomial evaluation and one iteration of Newton-Raphson. This should
take less than (or about) 20 cycles on a Pentium, and the table fits
much better into the cache. That should be pretty good for square roots
and reciprocal square roots, but still not fast enough for divides.
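That shape can be sketched as follows, with heavy hedging: `frexpf` and a crude linear seed stand in for the 6-bit table lookup and fitted polynomial described above, so this version needs two Newton-Raphson steps instead of one, and every name in it is hypothetical:

```c
#include <assert.h>
#include <math.h>

/* One Newton-Raphson step for y ~= 1/sqrt(a): y' = y*(1.5 - 0.5*a*y*y). */
static float rsqrt_step(float a, float y)
{
    return y * (1.5f - 0.5f * a * y * y);
}

/* Reciprocal square root sketch: split off the exponent, halve it,
   seed 1/sqrt(mantissa) with a rough linear fit, then refine. */
static float fast_rsqrt(float x)
{
    int e;
    float m = frexpf(x, &e);              /* x = m * 2^e, m in [0.5,1) */
    if (e % 2 != 0) { m *= 2.0f; e--; }   /* make the exponent even */
    /* crude linear seed for 1/sqrt(m), m in [0.5,2); a real table
       or fitted polynomial would be far tighter */
    float y = ldexpf(1.56f - 0.4714f * m, -e / 2);
    y = rsqrt_step(x, y);
    return rsqrt_step(x, y);              /* second step pays for the crude seed */
}
```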

> > I was down at Intel recently, and someone mentioned that a
> >3x3 matrix multiply is getting to be cheaper than fetching a stored
> >3x3 result from the L2 cache. Not sure if they were referring to a
> >current or future CPU, but it gives you an idea of where things are going.
> Agreed - prepare for a world where computation is cheap and
> communication is expensive!

An optimized 3x3 * 3x3 matrix multiply takes about 90 cycles (assuming
100% L1 cache hits). On a very fast CPU this is already as fast as or
faster than loading two cache lines from the L2 cache, at about 200 ns
per cache line (depending on bus speed etc.).
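For scale, the multiply in question is 27 multiplies and 18 adds; a straightforward version (not from the thread, name hypothetical):

```c
#include <assert.h>

/* c = a * b for 3x3 matrices: 27 multiplies and 18 adds,
   with all operands assumed resident in registers/L1. */
static void mat3_mul(const float a[3][3], const float b[3][3], float c[3][3])
{
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            c[i][j] = a[i][0] * b[0][j]
                    + a[i][1] * b[1][j]
                    + a[i][2] * b[2][j];
}
```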

<--->
Vesa Karvonen


Richard Cant

May 6, 1998, 3:00:00 AM

In article <6intan$eeo$1...@oravannahka.Helsinki.FI>, Vesa Karvonen
<vkar...@cc.helsinki.fi> writes

>However, even on a P90, in production code, with lots of stuff going
>through the cache, I find it difficult to believe, that you could beat
>the 19 cycles, that FDIV takes in single precision, using such a large
>table.

I was just reporting the result of a quick practical experiment. The
divide code WAS quicker using N-R than the instruction, but it was very
marginal. I'll re-run the experiments on a faster machine and see what
happens. Of course in a real application you need to pay a lot of
attention to cache behaviour etc. The real point here is that if Intel
had chosen to use N-R for divide and sqrt and had built a suitable LUT
on chip (as TI did with the C40) then all our divides and square roots
would be quicker. (Although not if you can overlap divide and multiply
with the existing design.)

--
Richard Cant
