Bonita Montero wrote:
>
> I printed the sums only to prevent the compiler from optimizing away
> the summation. The result is that on my Xeon E3-1240 (Skylake) each
> iteration takes four clock-cycles when "d" is non-denormal. When "d"
> is a denormal, each iteration takes about 150 clock cycles! I'd never
> believe denormals would have such a huge performance-impact if I
> wouldn't have seen the opposite.
Ouch!!!
That is really horrible. :-(
I have worked on implementing fp for the Mill cpu; there is no way you
should allow denormals (on input and/or output) to add more than a cycle
or two to your processing time.
To get to 150 cycles you effectively need a trap & fixup.
According to Mitch Alsup you can handle denormals inline, in hw, with a
total of 6 gate delays, which is a fraction of a cycle on any current
process.
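For anyone who wants to reproduce the measurement quoted above, here is a
minimal sketch in C. I have not seen the original test code, so the
function names (sum_repeated, time_sum) and the use of clock() are my own
assumptions, not the poster's method:

```c
#include <stdio.h>
#include <time.h>

/* Add d to an accumulator n times. The volatile accumulator keeps the
   compiler from folding the loop into a single multiplication. */
static double sum_repeated(double d, long n)
{
    volatile double sum = 0.0;
    for (long i = 0; i < n; i++)
        sum += d;
    return sum;
}

/* Hypothetical helper: CPU seconds spent in sum_repeated(d, n). */
static double time_sum(double d, long n)
{
    clock_t t0 = clock();
    double s = sum_repeated(d, n);
    clock_t t1 = clock();
    (void)s;  /* result is only needed to keep the work alive */
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Compare time_sum(1.0, n) against time_sum(4.9e-324, n) for a large n; the
second argument is the smallest double denormal, so every add stays in
the denormal range. The ratio will depend on the cpu and on whether
flush-to-zero is enabled.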
>
> And what about GPUs? I suppose they don't support denormals.
> Is this right?
>
Usually so, yeah.
The easiest is to treat denormals as zero, in which case you can do all
your special-case handling with a very small lookup table based on the
exponent field only:
00.0 -> Zero
00.1 to ff.e -> Normal
ff.f -> Inf or NaN
If you want/need to handle NaNs you still need to look at the mantissa
for maximal exponents, but you can do that in parallel with the normal
processing anyway, with plenty of time to spare.
Handling denorms, however, requires a scan for the first non-zero
mantissa bit, a shift to normalize, and an adjustment of the (internal)
exponent, so this could easily take several cycles unless you are smart.
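The scan, shift and exponent adjustment can be sketched in C as follows,
again for single precision. The struct and the name unpack_finite are
hypothetical, and the bit-by-bit loop is the naive version; hw would use
a leading-zero counter instead:

```c
#include <stdint.h>

/* Significand with the leading 1 at bit 23 (0 for zero), plus the
   unbiased exponent: value = (mant / 2^23) * 2^exp. */
struct unpacked { uint32_t mant; int exp; };

/* Unpack the magnitude of a finite float bit pattern (exponent field
   != 0xff), normalizing denormal inputs. The sign bit is ignored. */
static struct unpacked unpack_finite(uint32_t bits)
{
    struct unpacked u;
    uint32_t frac = bits & 0x7fffffu;
    uint32_t e = (bits >> 23) & 0xffu;

    if (e != 0) {               /* normal: restore the hidden bit */
        u.mant = frac | 0x800000u;
        u.exp = (int)e - 127;
        return u;
    }
    if (frac == 0) {            /* true zero */
        u.mant = 0;
        u.exp = 0;
        return u;
    }
    /* Denormal: shift left until the first set bit reaches bit 23,
       lowering the exponent by one for every position shifted. */
    int shift = 0;
    while ((frac & 0x800000u) == 0) {
        frac <<= 1;
        shift++;
    }
    u.mant = frac;
    u.exp = -126 - shift;
    return u;
}
```

The smallest denormal (bit pattern 0x00000001) comes out as mant =
0x800000, exp = -149, i.e. 1.0 * 2^-149.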
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"