However, there are strong indications that the P4 FPU will actually be
significantly slower than the FPU on today's P-III or Athlon. So if you
are targeting that processor, a return to fixed point might be in order.
--
Paul Hsieh
http://www.pobox.com/~qed/
If you mean more traditional fixed point, where each word has a fixed
binary point at, say, bit 15 -- then fixed point multiplication can be
slower than the equivalent FP multiplication. The reason is that the
fixed-point result has to be readjusted after the operation (via shifts
or byte swaps), and the operation may have to be performed with paired
registers to preserve accuracy.
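In C the difference is roughly this (a minimal sketch, assuming a 16.16
format; the type and function names are just for illustration):

#include <stdint.h>

/* 16.16 fixed-point multiply: the 32x32 product is a 32.32 value, so
   it needs a 64-bit (paired-register) intermediate and a shift to put
   the binary point back at bit 16. */
typedef int32_t fix16_16;

static fix16_16 fixmul(fix16_16 a, fix16_16 b)
{
    int64_t wide = (int64_t)a * (int64_t)b;   /* full-width product    */
    return (fix16_16)(wide >> 16);            /* readjust binary point */
}

The FP multiply, by contrast, needs no readjustment afterwards, which is
where the difference comes from.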
With SSE, many floating point operations are very fast, and with SSE2,
double precision FP will be very fast. However, SSE is not completely
IEEE-compliant FP.
"Chris Chan" <ch...@ghostchip.com> wrote in message
news:F9TP5.78620$x6.15...@news20.bellglobal.com...
And don't forget MMX; I'd imagine that might still be the absolute
fastest way to do arithmetic on numbers in 16-bit fixed-point formats.
And it gets much faster as the integers get narrower; up to 64-wide if
you're only interested in bit operations, and make that 128-wide on the
P4 (spot the man with the cellular-automaton fixation :))
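For the 16-bit case the multiply is pleasant, too; something like this
sketch with the MMX intrinsics (an 8.8 format assumed, untested, names
made up):

#include <mmintrin.h>

/* Four 8.8 fixed-point multiplies at once.  PMULHW/PMULLW give the high
   and low halves of each 16x16 -> 32 product; recombining them with an
   8-bit shift puts the binary point back where it belongs. */
static __m64 fixmul8_8_x4(__m64 a, __m64 b)
{
    __m64 hi = _mm_mulhi_pi16(a, b);           /* bits 16..31 of products */
    __m64 lo = _mm_mullo_pi16(a, b);           /* bits 0..15 of products  */
    return _mm_or_si64(_mm_slli_pi16(hi, 8),   /* (hi << 8) | (lo >> 8)   */
                       _mm_srli_pi16(lo, 8));
}

(The caller still has to _mm_empty() before touching x87 FP again, of
course.)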
> If you mean more traditional fixed point, where each word has a fixed
> binary point at, say, bit 15 -- then fixed point multiplication can be
> slower than the equivalent FP multiplication. The reason is that the
> fixed-point result has to be readjusted after the operation (via shifts
> or byte swaps), and the operation may have to be performed with paired
> registers to preserve accuracy.
MUL does the paired-register multiply anyway, and SHRD does the shift on
EDX:EAX to go with it (though I suppose you usually use a 16.16 format and
just look at EDX). The P3 optimisation guide says little of use, but I
vaguely recall that P6 and Athlon don't bother computing the top 32 bits of
the multiply if you don't look at them, and have fairly nasty latencies to
compute them if you do.
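Spelled out by hand, the pairing looks something like this (a GCC
inline-asm sketch for ia32, untested; in plain C you'd just widen to
64 bits and let the compiler choose the instructions):

#include <stdint.h>

/* 16.16 multiply: IMUL leaves the 64-bit product in EDX:EAX, and SHRD
   then pulls bits 16..47 down into EAX. */
static int32_t fixmul_16_16(int32_t a, int32_t b)
{
    int32_t result;
    __asm__ ("imull %2\n\t"
             "shrdl $16, %%edx, %%eax"
             : "=a" (result)
             : "0" (a), "rm" (b)
             : "edx", "cc");
    return result;
}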
> With SSE, many floating point operations are very fast, and with SSE2,
> double precision FP will be very fast. However, SSE is not completely
> IEEE-compliant FP.
I'd be surprised if there were tasks where you're prepared to use
fixed point, with its complete nonsense in the event of overflow and
unpleasant numerical properties, and where minor deviations from the
IEEE standard in a floating-point implementation mattered in the least.
I wonder vaguely if it's possible to do four-wide parallel 32-bit fixed
arithmetic using SSE2's widened-MMX extensions; I suppose the multiply
would be ugly: you'd have to pull the int<4>s apart, shift using that
weird byte-wide shift and put everything back together. Might still be
faster than the P4 ALU for multiply, though double-pumped add is nice.
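For what it's worth, the ugly multiply might come out something like
this with the SSE2 intrinsics (unsigned 16.16 values assumed, untested,
names made up):

#include <emmintrin.h>

/* Four-wide 16.16 fixed-point multiply (unsigned).  PMULUDQ only
   multiplies the even 32-bit lanes into 64-bit products, so the odd
   lanes need a second multiply; each product is then rescaled by 16
   and the low halves are interleaved back into one register. */
static __m128i fixmul16_16_x4(__m128i a, __m128i b)
{
    __m128i even = _mm_mul_epu32(a, b);                   /* lanes 0, 2 */
    __m128i odd  = _mm_mul_epu32(_mm_srli_epi64(a, 32),
                                 _mm_srli_epi64(b, 32));  /* lanes 1, 3 */
    even = _mm_srli_epi64(even, 16);      /* readjust the binary point */
    odd  = _mm_srli_epi64(odd, 16);
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    return _mm_unpacklo_epi32(even, odd);
}

Two multiplies, four shifts and three shuffles for four results -- ugly
enough, as predicted.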
Grr, P4 systems will be around within a couple of weeks, but even saving
carefully I don't think I'll be able to afford one before the middle of next
year ... I can't see P4/1400, pair of 128M PC800 RIMMs and motherboard
costing much less than £1000, whilst a K7/1000, pair of 128M PC133 DIMMs and
motherboard are well under £500 already. And I'm not optimistic that even
grotesquely gnarly SSE2 assembler can run the P4 twice as fast as twiddly
3dNow assembler can drive the K7.
Tom
Last I checked, SHRD was quite slow on newer Pentia. In some cases, the
breakdown of registers into (e.g.) AL, AX and EAX can be used to effect
a shift.
MUL will do a 32x32 -> 64 multiply; nonetheless, the paired registers do
have to be preserved and carried through any complicated operation, to
preserve precision, since there are no implicit guard bits.
It has been a long time since I've been concerned with this issue, but I
published the fixed point code for the Mandelbrot set in MicroCornucopia,
Sept.-Oct. issue, 1988, pp. 22-29 (magazine now defunct). Back then, 387s
were in short supply and cost ~$500 (like about $800 nowadays), and the
integer fixed point was about 4 or 5 times faster than 387 code. The
basic idea went on to be used in Fractint, but by the time the 486 came
out, the advantage of the integer code was severely eroded, even though
the 486 still had ~10-20 cycle times for most FP ops (not including
loads).
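For anyone who never saw that style of code, the inner loop was
essentially the following (a from-memory sketch in present-day C, not
the MicroCornucopia listing; a 6.26 format is assumed so the escape test
at |z|^2 > 4 still has headroom):

#include <stdint.h>

#define FRAC_BITS 26                      /* 6.26 fixed point */
typedef int32_t fix;

/* One Mandelbrot point: iterate z = z^2 + c until |z|^2 > 4 or the
   iteration limit.  The 32x32 -> 64 products are kept at full width
   and only rescaled once per term, since there are no guard bits to
   hide the rounding in. */
static int mandel(fix cx, fix cy, int max_iter)
{
    fix zx = 0, zy = 0;
    int i;
    for (i = 0; i < max_iter; i++) {
        int64_t xx = (int64_t)zx * zx;
        int64_t yy = (int64_t)zy * zy;
        int64_t xy = (int64_t)zx * zy;
        if (xx + yy > (int64_t)4 << (2 * FRAC_BITS))
            break;                        /* escaped */
        zx = (fix)((xx - yy) >> FRAC_BITS) + cx;
        zy = (fix)((xy + xy) >> FRAC_BITS) + cy;
    }
    return i;
}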
I'd bet the price of a good beer that you don't know the difference
between integer and fixed point arithmetic. Float is almost always
faster than fixed point, but only faster than integer in some nice
special cases where you can make it look like vector arithmetic.
--
bill davidsen <davi...@tmr.com> CTO, TMR Associates, Inc
Make the rules? I don't make the rules. I don't even FOLLOW the rules!