
SSE "denormals are zeros"


Bonita Montero

Jun 17, 2016, 12:11:29 PM
Does anyone know what the "denormals are zeros" flag of the
x86 MXCSR is good for?
Or more precisely: I know what it does, but I don't know why
it should make sense to consider denormal values as zeros.

--
http://facebook.com/bonita.montero/

Robert Wessel

Jun 17, 2016, 1:33:23 PM
On Fri, 17 Jun 2016 18:11:28 +0200, Bonita Montero
<Bonita....@gmail.com> wrote:

>Does anyone know what the "denormals are zeros" flag of the
>x86 MXCSR is good for?
>Or more precisely: I know what it does, but I don't know why
>it should make sense to consider denormal values as zeros.


Mainly performance - denormals tend to be slow (although less so on
recent x86s). Some codes do things like converge to zero, but end up
passing through the denormal range first - just skipping that range can
sometimes be a considerable performance improvement. There are some
downsides to disabling gradual underflow, but in practice, in many cases
where you get denormals you're on your way to zero anyway, and in most
cases the advantages of gradual underflow are very small.
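
A minimal sketch of that converge-to-zero pattern, with an arbitrary
halving loop that lands in the denormal range on its way to zero:

#include <cfloat>  // DBL_MIN: the smallest *normal* double
#include <cstdio>

int main()
{
    double x = 1.0;
    int halvings = 0;

    // Keep halving until x drops below the smallest normal double;
    // at that point x is a denormal (with FTZ it would be exact zero).
    while (x >= DBL_MIN)
    {
        x *= 0.5;
        ++halvings;
    }
    printf("denormal after %d halvings: %g\n", halvings, x);
    return 0;
}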

Robert Wessel

Jun 17, 2016, 1:36:48 PM
And x86 implements two somewhat different options: flush-to-zero (FTZ),
which flushes denormal results of an operation to zero, and
denormals-are-zero (DAZ), which treats denormal inputs as zero; they
also differ in how the exception flags are handled.
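
Both bits live in the MXCSR register and can be flipped from C++ with
the usual SSE intrinsics; a minimal sketch, assuming <xmmintrin.h> and
<pmmintrin.h> are available:

#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE, _MM_FLUSH_ZERO_ON
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE, _MM_DENORMALS_ZERO_ON

void enable_ftz_daz()
{
    // FTZ: denormal results are flushed to zero instead of being produced.
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

    // DAZ: denormal inputs are treated as zero before the operation
    // (the MXCSR.DAZ bit is missing on the very oldest SSE CPUs).
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}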

Bonita Montero

Jun 17, 2016, 2:02:21 PM
On 17.06.2016 at 19:33, Robert Wessel wrote:

> Mainly performance - denormals tend to be slow (although less so on
> recent x86s). ...

I also asked about this on Stack Overflow, and someone gave me a link
to an interesting Intel article on the subject:
https://software.intel.com/en-us/node/513376

So I wrote a little program to check the performance impact of
denormals. Here it is:

#include <windows.h>
#include <intrin.h>
#include <iostream>

using namespace std;

// Type-punning union: stepping the integer member walks through
// consecutive double bit patterns.
union DBL
{
    DWORDLONG dwlValue;
    double    value;
};

int main()
{
    DWORDLONG dwlTicks;
    DBL       d;
    double    sum;

    // Bit patterns 1 .. 1e8 have a zero exponent field, i.e. the
    // first loop sums denormals.
    dwlTicks = __rdtsc();

    for( d.dwlValue = 0, sum = 0.0;
         d.dwlValue < 100000000; d.dwlValue++ )
        sum += d.value;

    dwlTicks = __rdtsc() - dwlTicks;
    cout << sum << endl;
    cout << dwlTicks / 100000000.0 << endl;  // ticks per iteration

    // 0x0010000000000000 is the smallest normal double (exponent
    // field 1), so the second loop sums only normal values.
    dwlTicks = __rdtsc();

    for( d.dwlValue = 0x0010000000000000u, sum = 0.0;
         d.dwlValue < (0x0010000000000000u + 100000000); d.dwlValue++ )
        sum += d.value;

    dwlTicks = __rdtsc() - dwlTicks;
    cout << sum << endl;
    cout << dwlTicks / 100000000.0 << endl;  // ticks per iteration

    return 0;
}

I printed the sums only to prevent the compiler from optimizing away
the summation. The result is that on my Xeon E3-1240 (Skylake), each
iteration takes four clock cycles when "d" is non-denormal. When "d"
is a denormal, each iteration takes about 150 clock cycles! I'd never
have believed denormals could have such a huge performance impact if
I hadn't seen it myself.

And what about GPUs? I suppose they don't support denormals.
Is this right?

--
http://facebook.com/bonita.montero/

Terje Mathisen

Jun 17, 2016, 8:07:09 PM
Bonita Montero wrote:
>
> I printed the sums only to prevent the compiler from optimizing away
> the summation. The result is that on my Xeon E3-1240 (Skylake), each
> iteration takes four clock cycles when "d" is non-denormal. When "d"
> is a denormal, each iteration takes about 150 clock cycles! I'd never
> have believed denormals could have such a huge performance impact if
> I hadn't seen it myself.

Ouch!!!

That is really horrible. :-(

I have worked on implementing fp for the Mill cpu; there is no way you
should allow denormals (on input and/or output) to add more than a
cycle or two to your processing time.

To get to 150 cycles you effectively need a trap & fixup.

According to Mitch Alsup you can handle denormals inline, in hw, with a
total of 6 gate delays, which is a fraction of a cycle on any current
process.
>
> And what about GPUs? I suppose they don't support denormals.
> Is this right?
>
Usually so, yeah.

The easiest approach is to treat denormals as zero, in which case you
can do all your special-case handling with a very small lookup table
based on the exponent field only:

exponent all zeros (zero or denormal) -> Zero
anything in between                   -> Normal
exponent all ones                     -> Inf or NaN

If you want/need to handle NaNs you still need to look at the mantissa
when the exponent is all ones, but you can do that in parallel with the
normal processing anyway, with plenty of time to spare.
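
A minimal sketch of that exponent-only lookup, here for IEEE single
precision (8-bit exponent field); the names are illustrative:

#include <cstdint>
#include <cstring>

enum Class { FpZero, FpNormal, FpInfOrNan };

Class classify(float f)
{
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // bit-exact view of the float
    uint32_t exp = (bits >> 23) & 0xFF;   // isolate the exponent field

    if (exp == 0x00) return FpZero;       // true zero *and* denormals
    if (exp == 0xFF) return FpInfOrNan;   // mantissa then picks Inf vs NaN
    return FpNormal;                      // everything in between
}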

Handling denorms, however, requires a scan for the first non-zero
mantissa bit, a shift to normalize, and an adjustment of the (internal)
exponent, so this could easily take several cycles unless you are smart
about it.
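
As a rough sketch of that work in software, for a single-precision
denormal (assuming C++20's std::countl_zero; the helper is
illustrative, not Mill code):

#include <bit>
#include <cstdint>

struct Unpacked { int exp; uint32_t frac; };  // internal unbiased form

// Normalize a nonzero 23-bit denormal mantissa field into a hidden-one
// representation; hardware does the scan with a priority encoder.
Unpacked normalize_denormal(uint32_t mantissa)
{
    int fieldZeros = std::countl_zero(mantissa) - 9;  // zeros inside the 23-bit field
    int shift = fieldZeros + 1;        // move the leading bit to the hidden-one slot
    uint32_t sig = mantissa << shift;  // leading bit now at bit 23
    int exp = (1 - 127) - shift;       // denormal exponent is 1-bias; compensate shift
    return { exp, sig & 0x7FFFFF };    // drop the now-implicit leading one
}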

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"