Quad-precision BLIS?


Jack Poulson

Feb 27, 2015, 2:42:43 PM
to blis-d...@googlegroups.com, Matthew Knepley, rvdg, Field G. Van Zee
Dear BLIS developers,

Many state-of-the-art solvers for sparse linear and least-squares
problems create intermediate matrices whose effective condition
numbers are potentially the square of the condition number of the
original matrix.

Furthermore, there are many techniques where, if the solution is desired
to n digits of accuracy, then the floating-point precision needs to
support 2*n digits of accuracy. With double precision, this implies that
at most eight digits of accuracy could be achieved with such techniques
(sparse Interior Point Methods often have this property).

For this reason, it is of use for a library to support quad-precision
factorizations, matrix-vector multiplication, and triangular solves.
Libraries such as PETSc currently create custom modifications of the
netlib implementation of the BLAS, but in no way could these
modifications be expected to be high-performance.

How much effort do you think would be required for BLIS to support
quad-precision with a cost within a factor of 5-10 of double-precision?
I would be willing to dedicate a significant portion of my energy to
helping to support quad-precision Interior Point Methods in Elemental,
and this would be a critical component of such an effort.

Jack


Field G. Van Zee

Feb 27, 2015, 2:49:21 PM
to Jack Poulson, blis-d...@googlegroups.com, Matthew Knepley, rvdg
Jack,

I always assumed people would not generally be interested in quad precision
until it was widely supported in hardware (and in mainstream languages and
compilers). Since I have given literally zero attention to the issues
involved, I can't even hazard a rough guess at the effort that would be required.

Hopefully someone else can chime in.

Field

Jeff Hammond

Feb 27, 2015, 2:58:08 PM
to Field G. Van Zee, Jack Poulson, blis-d...@googlegroups.com, Matthew Knepley, rvdg
Given that the biggest difference in performance between Netlib and
BLIS/MKL/etc. is due to data access rather than FPU utilization, I do
not think hardware support is required for this to be of interest for
BLIS.

If BLIS were to emulate the QP computation in software while keeping all
the cache-friendly access patterns, which are independent of the precise
width of a floating-point number, I would expect Jack's estimate of 5-10x
to be pretty accurate.

Perhaps the first step is for Field, Robert, et al. to write a paper
about how to implement QGEMM efficiently in BLIS.

Best,

Jeff



--
Jeff Hammond
jeff.s...@gmail.com
http://jeffhammond.github.io/

Jack Poulson

Feb 27, 2015, 3:00:04 PM
to Jeff Hammond, Field G. Van Zee, blis-d...@googlegroups.com, Matthew Knepley, rvdg
I agree with Jeff that properly handling the memory accesses is the
difficult part (what I would hope BLIS could handle).

Ideally it would be straightforward to swap out the microkernels for
quad-precision; an initial efficient implementation (relative to
modified netlib) would be a great impetus for pushing further.

Jack

Devin Matthews

Feb 27, 2015, 3:54:05 PM
to blis-d...@googlegroups.com
As far as compiler support goes, this is what I have been able to dig up:

GCC and maybe Intel:

Builtin __float128 and __complex128 types (since GCC 4.6), with library support through -lquadmath. For Intel, I also see references to a _Quad type, but it seems this may only be in older versions (Jeff?).

XL on BG/Q:

The "native" option seems to be -qldbl128 along with long double on PPC. There is also the -qfloat=gcclongdouble option for compatibility with GCC's __float128.

Otherwise, there are multiprecision libraries such as GMP and MPFR, although I imagine those would be even slower. Or, you could write the microkernel in Fortran and use real*16 :). You could also do explicit software double-double arithmetic.

Devin

Jeff Hammond

Feb 27, 2015, 5:45:16 PM
to Devin Matthews, blis-d...@googlegroups.com
Do you all think there is a big difference between quad and double
double? Reading
http://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format#Double-double_arithmetic,
it seems that there is a huge difference in the exponent range. Does
this matter to your applications, Jack?

On Fri, Feb 27, 2015 at 12:54 PM, Devin Matthews
<devinam...@gmail.com> wrote:
> As far as compiler support, this is what I have been able to dig up:
>
> GCC and maybe Intel:
>
> Builtin __float128 and __complex128 types (since 4.6), library support
> through -lquadmath. For Intel, I also see references to a _Quad type, but it
> seems like maybe this is only older versions (Jeff?)?

I can't imagine they would deprecate this, but I am not an expert.

Given that real128 is part of Fortran 2008
(http://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format#Computer-language_support),
I guess that there will eventually be widespread (Fortran) compiler
support.

> XL on BG/Q:
>
> The "native" option seems to be -qldbl128 along with long double on PPC.
> There is also the -qfloat=gcclongdouble option for compatibility with GCC's
> __float128.
>
> Otherwise, there are multiprecision libraries such as gmp and mpfr, although
> I imagine that would be even slower. Or, you could write the microkernel in
> FORTRAN and use real*16 :). You could also do explicit software
> double-double arithmetic.

GMP, MPFR, or Boost::multiprecision are likely to be so much more
expensive than native arithmetic that there is little point in
optimizing data access, but I would not mind if someone does the
experiment and proves me wrong. I think Eigen might have support for
MPFR or something like that, if you're curious about testing
something.

It would be interesting to know if one can replace the float type in
BLIS with a typedef or macro and then build the generic implementation
using one of the aforementioned non-standard C extended precision
types. It might be just a matter of tedious search-and-replace.
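As a rough illustration of that idea (the names here are hypothetical, not actual BLIS identifiers), the element type of a reference microkernel could be a build-time macro, e.g. -Dblis_elem_t=__float128:

```c
/* Hypothetical precision-generic reference gemm microkernel: the element
 * type is swapped via a macro rather than hard-coded, e.g.
 *   gcc -Dblis_elem_t=__float128 ...
 * Defaults to double when no macro is given. */
#ifndef blis_elem_t
#define blis_elem_t double
#endif

/* C (mr x nr, strides rs_c/cs_c) += A (mr x k, packed) * B (k x nr, packed) */
void gemm_ukr_ref(int mr, int nr, int k,
                  const blis_elem_t *a, const blis_elem_t *b,
                  blis_elem_t *c, int rs_c, int cs_c)
{
    for (int p = 0; p < k; ++p)
        for (int i = 0; i < mr; ++i)
            for (int j = 0; j < nr; ++j)
                c[i*rs_c + j*cs_c] += a[p*mr + i] * b[p*nr + j];
}
```

This only covers the arithmetic; the packing routines and scalar macros Field mentions elsewhere in this thread would need the same treatment.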

While I'm sure it will cause Robert's hair to stand on end, it's not a
terrible idea to consider creating a Fortran 2008 microkernel using
ISO_C_BINDING. However, this won't help in the case of extended
precision, because I don't know that there is any assurance of
interoperability of non-standard types across this interface. At
best, it would be implementation-specific.

Best,

Jeff

Devin Matthews

Feb 27, 2015, 5:50:27 PM
to Jeff Hammond, blis-d...@googlegroups.com
If what one wants is "double precision but with more digits" then it
seems that double-double would be sufficient, and perhaps more efficient.

Devin

Field G. Van Zee

Feb 27, 2015, 6:48:25 PM
to blis-d...@googlegroups.com


On 02/27/15 16:44, Jeff Hammond wrote:
>
> GMP, MPFR or Boost:multiprecision are likely to be so much more
> expensive than native arithmetic that there is little point to
> optimizing for data access, but I would not mind if someone does the
> experiment and proves me wrong. I think Eigen might have support for
> MPFR or something like that, if you're curious about testing
> something.
>
> It would be interesting to know if one can replace the float type in
> BLIS with a typedef or macro and then build the generic implementation
> using one of the aforementioned non-standard C extended precision
> types. It might be just a matter of tedious search-and-replace.

Even if we had "long double" and "long double complex" types supported by
hardware and in C, it would be a nontrivial task to enable use of these types
in BLIS, because BLIS is not just a library; it is effectively a cpp-based
language. Most of the work would be in defining new scalar-level arithmetic
macros for those quad-precision types.

Now, if hardware/language support is not available, this latter task becomes
much more difficult.

Field

Jeff Hammond

Feb 27, 2015, 6:54:13 PM
to Field G. Van Zee, blis-d...@googlegroups.com
If the Intel C compiler supports all the arithmetic ops for _Quad,
isn't this trivial?

Any Fortran 2008 compiler will support the standard array of
arithmetic ops for 128b real, so maybe Devin and I should work on that
Fortran ukernel :-)

Jeff


Marat Dukhan

Feb 27, 2015, 7:14:35 PM
to Jack Poulson, blis-d...@googlegroups.com, Matthew Knepley, rvdg, Field G. Van Zee
Would it be sufficient to compute only the intermediate results in extended precision, i.e. leave inputs and outputs in double-precision, but accumulate intermediate dot products in double-double?

Regards,
Marat



Jed Brown

Feb 27, 2015, 7:23:45 PM
to Jeff Hammond, Field G. Van Zee, Jack Poulson, blis-d...@googlegroups.com, Matthew Knepley, rvdg
Jeff Hammond <jeff.s...@gmail.com> writes:

> Given that the biggest difference in performance between Netlib and
> BLIS/MKL/etc. is due to data access rather than FPU utilization, I do
> not think hardware support is required for this to be of interest for
> BLIS.

The arithmetic gets slower by a factor significantly larger than 2 (the
difference in data size), so is this still the case? Has someone
evaluated the __float128 netlib BLAS used by PETSc and concluded that it
is not limited by the arithmetic performance of __float128? For which
operations? This isn't to say that a BLIS microkernel wouldn't be
worthwhile, but how much can you expect to gain and for which
operations?

Jack Poulson

Feb 27, 2015, 7:46:20 PM
to Marat Dukhan, blis-d...@googlegroups.com, Matthew Knepley, rvdg, Field G. Van Zee
Hi Marat,

It depends upon which component you're talking about. In the cases I'm
thinking of, the top-level application (in my case, say, a quadratic
program) would only need double-precision inputs and outputs, but
several steps of the algorithm involve setting up a linear system
which is solved using an unpivoted LDL^T factorization, with a large,
but bounded, number of digits of accuracy expected to be lost. The
idea would be to factor in either double or quad precision, but to
perform iterative refinement in quad precision.

Iteratively refining in higher precision has been standard for a while,
especially for sparse-direct methods.
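As a toy sketch of that refinement loop (a hypothetical 2x2 example, with Cramer's rule standing in for the real factorization, and GCC's __float128 used for the residual):

```c
#include <math.h>

/* Solve a 2x2 system in double via Cramer's rule; this stands in for the
 * LDL^T or LU factorization that a real solver would reuse. */
static void solve2(const double A[2][2], const double b[2], double x[2])
{
    double det = A[0][0]*A[1][1] - A[0][1]*A[1][0];
    x[0] = (b[0]*A[1][1] - b[1]*A[0][1]) / det;
    x[1] = (A[0][0]*b[1] - A[1][0]*b[0]) / det;
}

/* One step of iterative refinement: the residual b - A*x is accumulated
 * in __float128, then rounded back for the double-precision correction
 * solve. Requires a compiler with __float128 support (e.g. GCC). */
static void refine2(const double A[2][2], const double b[2], double x[2])
{
    double rd[2], dx[2];
    for (int i = 0; i < 2; ++i) {
        __float128 r = (__float128)b[i]
                     - (__float128)A[i][0]*x[0]
                     - (__float128)A[i][1]*x[1];
        rd[i] = (double)r;   /* residual rounded to working precision */
    }
    solve2(A, rd, dx);       /* correction solved with the cheap factorization */
    x[0] += dx[0];
    x[1] += dx[1];
}
```

The point of the pattern is that only the residual computation pays the quad-precision cost; the factorization is reused at working precision each iteration.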

Jack

Matthias

Feb 28, 2015, 12:17:49 PM
to blis-d...@googlegroups.com, kne...@gmail.com, rv...@cs.utexas.edu, fi...@cs.utexas.edu, pou...@stanford.edu
Hi Jack, hi everybody,

I was playing around with different options for quad precision some time ago. These references might be of interest to you:
http://www.netlib.org/xblas/
http://crd-legacy.lbl.gov/~xiaoye/xblas.ps.gz
http://mplapack.sourceforge.net/

My experience was mainly with Intel's _Quad, GNU's __float128 (libquadmath), and MPFR. Going down this list, it gets significantly slower (around a factor of 2 for Intel vs. GNU). For _Quad vs. __float128 it is mostly a matter of substituting the data type. If you need constants like FLT128_MIN etc., these also have to be defined properly. I'd say a factor of 10x slowdown seems realistic.

Best,
Matthias

Jack Poulson

Feb 28, 2015, 3:00:16 PM
to Matthias, blis-d...@googlegroups.com, kne...@gmail.com, rv...@cs.utexas.edu, fi...@cs.utexas.edu
Thanks Matthias!

As a point of reference, Barry Smith was kind enough to send me an
example of timing PETSc's quad-precision support relative to running
with double-precision. By slightly modifying his approach, I was able to
see roughly a 68x difference in performance in sparse-direct LU
factorization between netlib double-precision and netlib quad-precision
using GCC 4.6.8.

Thus, given that users typically use tuned double-precision versions
rather than netlib, the current practical runtime difference could
easily be expected to exceed 100x. A tuned quad-precision BLAS and
LAPACK could significantly shrink this performance gap.

Jack

Jeff Hammond

Feb 28, 2015, 9:18:40 PM
to Jack Poulson, Matthias, blis-d...@googlegroups.com, kne...@gmail.com, rv...@cs.utexas.edu, fi...@cs.utexas.edu
It strikes me that we could make this extended-precision BLIS business
happen if we crowd-source it among the people responding on this
thread.

Hacking a branch of https://github.com/flame/blis might add stress to
Field's life if he's monitoring every commit. Perhaps we should use a
fork like https://github.com/flame/blis-ep so that we can create tons
of Github issues without giving Field a heart attack :-)

Thoughts?

As I am a member of the FLAME org, I can handle pull requests if
people want to work that way. Or Jack can fork into the Elemental org
since he seems to be the most obvious consumer of this.

In any case, I am going to write a bunch of simple tests of the
various extended precision types for BLAS-like kernels so I have data
on _Quad vs. __float128 vs. REAL*16 etc. beyond what is quoted in
these emails.

It would be great if people could push their experiments out as Github
projects as well.

Best,

Jeff



Jeff Hammond

Mar 1, 2015, 12:32:47 AM
to Devin Matthews, blis-d...@googlegroups.com
Devin:

I just tested and Intel C 15 doesn't appear to support _Quad but does
support __float128.

Best,

Jeff


Jack Poulson

Mar 1, 2015, 12:45:07 AM
to Jeff Hammond, blis-d...@googlegroups.com
Do you know if there is full compatibility with quadmath?

Jack

Jeff Hammond

Mar 1, 2015, 1:25:36 AM
to Jack Poulson, blis-d...@googlegroups.com
No idea. Wouldn't surprise me, because GCC compatibility is often a goal. But without testing, we know nothing. Is there a quadmath test suite somewhere?

Jeff

Sent from my iPhone

Jed Brown

Mar 1, 2015, 10:08:42 AM
to Jack Poulson, Matthias, blis-d...@googlegroups.com, kne...@gmail.com, rv...@cs.utexas.edu, fi...@cs.utexas.edu
Jack Poulson <pou...@stanford.edu> writes:
> Thus, given that users typically used tuned double-precision versions
> rather than netlib, the current practical runtime difference could
> easily be expected to be more than 100x. Having a tuned quad-precision
> BLAS and LAPACK API could significantly shrink this performance gap.

What are you basing this on?

Is the netlib quad precision variant limited by data motion or
arithmetic? Is __float128 vectorizing?

Jack Poulson

Mar 1, 2015, 11:07:57 AM
to Jed Brown, Matthias, blis-d...@googlegroups.com, kne...@gmail.com, rv...@cs.utexas.edu, fi...@cs.utexas.edu
Which claim are you questioning? That an optimized version of a
quad-precision BLAS could be significantly faster than an unoptimized one?

With regard to data motion and arithmetic: there is no reason that both
the data motion and arithmetic can't be improved, as the current
software implementations of the quad-precision operations could likely
be efficiently combined (and I'm sure that there is literature to this
effect).

Jack


Field G. Van Zee

Mar 1, 2015, 1:04:32 PM
to blis-d...@googlegroups.com


On 02/28/15 20:18, Jeff Hammond wrote:
> It strikes me that we could make this extended-precision BLIS business
> happen if we crowd-source it among the people responding on this
> thread.

Please be aware that there are parts of BLIS that I always intended to clean
up BEFORE moving to precisions beyond double. I never anticipated that there
might be a desire to go beyond double without widespread hardware and language
support, hence why I haven't prioritized those cleanups.

Bottom line: I chose to make supporting new BLIS datatypes relatively hard so
that expressing computation over those types would be easy.

You've been warned.

>
> Hacking a branch of https://github.com/flame/blis might add stress to
> Field's life if he's monitoring every commit. Perhaps we should use a
> fork like https://github.com/flame/blis-ep so that we can create tons
> of Github issues without giving Field a heart attack :-)

Your consideration for my health and well-being is sincerely appreciated.

Jack Poulson

Mar 1, 2015, 2:36:53 PM
to Jeff Hammond, blis-d...@googlegroups.com, Elemental, mark.h...@gmail.com, Jed Brown
Hi Jeff,

I'm not sure about an official test-suite, but being able to
successfully link against the math functions described in libquadmath's
documentation [1] would be a good start.

It might be of interest that I added support into Elemental's build
system, scalar type handling, and MPI wrappers for libquadmath over the
last two days [2,3,4,5].

As a nice side effect, the MPI wrappers now take more than a thousand
fewer lines to implement.

With that said, Field's last remark has me a bit worried about using
BLIS for this purpose in the near future.

Mark Hoemmen has suggested using QD (Hida et al.) [6], and there
is a nice talk [7] on extending these ideas to GPUs showing about a
factor of 13 difference in performance with dgemm (Jed, hopefully this
settles your question?).

Jack

[1]
https://gcc.gnu.org/onlinedocs/libquadmath/Math-Library-Routines.html#Math-Library-Routines

[2]
https://github.com/elemental/Elemental/commit/7adec372bac11819948519c6d9640b1daf4c7977

[3]
https://github.com/elemental/Elemental/commit/e5a9cb3d4f8136372858554717c1ed50f64c21f8

[4]
https://github.com/elemental/Elemental/commit/9afb78e8621e71061372f4fb5d7f24c83db8c25a

[5]
https://github.com/elemental/Elemental/commit/65589bb71412659631aad74759414deec3b21859

[6]
http://web.mit.edu/tabbott/Public/quaddouble-debian/qd-2.3.4-old/docs/qd.pdf

[7]
http://icl.cs.utk.edu/newsletter/presentations/2012/Mukunoki-Quadruple-Precision-BLAS-Subroutines-on-GPUs-2012-02-03.pdf

Jeff Hammond

Mar 1, 2015, 4:53:18 PM
to Jack Poulson, blis-d...@googlegroups.com, Elemental, mark.h...@gmail.com, Jed Brown


> On Mar 1, 2015, at 11:36 AM, Jack Poulson <pou...@stanford.edu> wrote:
>
> Hi Jeff,
>
> I'm not sure about an official test-suite, but being able to
> successfully link against the math functions described in libquadmath's
> documentation [1] would be a good start.
>

As Jed will surely tell you, successful linking means nothing :-)

I doubt Intel C would fail to comply if they appear to implement this feature. But I can ask to be sure.

> It might be of interest that I added support into Elemental's build
> system, scalar type handling, and MPI wrappers for libquadmath over the
> last two days [2,3,4,5].

I've been meaning to add this to MPICH. Will try to make time.

> As a nice side effect, it now takes more than a thousand less lines to
> implement the MPI wrappers.
>

Cool. You support reductions with user-defined ops, right? BigMPI has some macros you can borrow in the event you haven't done it already.

> With that said, Field's last remark has me a bit worried about using
> BLIS for this purpose in the near future.

Well, we could just start with the PETSc implementation and optimize from there until BLIS comes around.

Jeff

Jed Brown

Mar 1, 2015, 5:27:46 PM
to Jack Poulson, Matthias, blis-d...@googlegroups.com, kne...@gmail.com, rv...@cs.utexas.edu, fi...@cs.utexas.edu
Jack Poulson <pou...@stanford.edu> writes:

> Which claim are you questioning? That an optimized version of a
> quad-precision BLAS could be significantly faster than an unoptimized one?

Yes, that optimizing data motion could have large returns. If data
motion is not a bottleneck, optimizing it (via BLIS, etc) may provide
negligible returns.

Also, so long as we rely on the compiler for __float128, we don't get to
write assembly microkernels. So even if there is opportunity to speed
it up by interleaving and vectorizing at that level, we're not going to
get there easily.

> With regard to data motion and arithmetic: there is no reason that both
> the data motion and arithmetic can't be improved, as the current
> software implementations of the quad-precision operations could likely
> be efficiently combined (and I'm sure that there is literature to this
> effect).

"combined" how? Using the compiler's __float128 support or writing
custom assembly microkernels?

Jed Brown

Mar 1, 2015, 5:43:44 PM
to Jack Poulson, Jeff Hammond, blis-d...@googlegroups.com, Elemental, mark.h...@gmail.com
Jack Poulson <pou...@stanford.edu> writes:
> Mark Hoemmen has suggested the usage of QD (Hida et al.) [6], and there
> is a nice talk (by [7] on extending these ideas to GPUs showing about a
> factor of 13 difference in performance with dgemm (Jed, hopefully this
> settles your question?).

It's not comparing the same thing, so no, it doesn't answer my question.
It's an anecdote of something related, but not constructive about what
can be expected for quad (vs. DD, if relevant) on CPUs or what needs to
happen to get there.

Mark Hoemmen

Mar 1, 2015, 7:05:46 PM
to Jed Brown, Jack Poulson, Jeff Hammond, blis-d...@googlegroups.com, Elemental
Jack Poulson <pou...@stanford.edu> writes:
> Mark Hoemmen has suggested the usage of QD (Hida et al.) [6] ...

More specifically: I would like to refresh QD (at least the dd_real
part), and add the necessary syntax to make it work on GPUs as well.
I would make this available through the Kokkos project somehow, just
as we've done with complex arithmetic. I plan to use this for
mixed-precision research, and to make it available through Trilinos'
sparse linear algebra. However, I think I can do this in a way that
would not require dependence on Kokkos. (It certainly would not
require dependence on Trilinos. We already have a way to distribute
Kokkos without Trilinos and without requiring use of Trilinos' CMake
build system.)

I would like to do this anyway, so I was thinking of teaming up with
y'all. We also plan to write a Kokkos interface to the BLAS and other
Kokkos-y things, but that will take a lot longer, and it would require
buy-in to Kokkos of course. QD implies buy-in to C++ (it's a C++
library), but I would like to do this in a way that would be useful to
people who don't want to use Kokkos. Please give me advice on how to
do that :-)

mfh

Devin Matthews

Mar 1, 2015, 10:34:47 PM
to mark.h...@gmail.com, Jed Brown, Jack Poulson, Jeff Hammond, blis-d...@googlegroups.com, Elemental
I wrote a quick test to compare various floating-point types, including
the aforementioned QD library
(http://crd-legacy.lbl.gov/~dhbailey/mpdist/). The test code is attached
(two files to prevent icpc from optimizing the whole thing away for
double...). I get:

[dmatthews@OpenSUSE ~]$g++ -o test.g++ -O3 -ffast-math -march=native
-I/home/dmatthews/opt/qd/include test.cxx test2.cxx
-L/home/dmatthews/opt/qd/lib64 -lqd -lquadmath
[dmatthews@OpenSUSE ~]$./test.g++
53 0.637766 3
64 0.64571 3
104 0.641988 3
113 10.6141 3
[dmatthews@OpenSUSE ~]$icpc -o test.icpc -O3 -xHost
-I/home/dmatthews/opt/qd/include test.cxx test2.cxx
-L/home/dmatthews/opt/qd/lib64 -lqd -lquadmath
[dmatthews@OpenSUSE ~]$./test.icpc
53 0.439146 3
64 0.449886 3
104 3.72204 3
113 5.42434 3

Here the output columns are the number of mantissa bits, the time for
100,000,000 iterations in seconds, and the result rounded to double, for the
types double, long double, dd_real, and __float128. I can't explain the
dd_real result for g++; perhaps someone else can comment?

Devin Matthews
test.cxx
test2.cxx

Devin Matthews

Mar 1, 2015, 10:36:52 PM
to blis-d...@googlegroups.com
Re-send from google-blessed address.


-------- Original Message --------
Subject: Re: [blis-discuss] Re: Quad-precision BLIS?
Date: Sun, 01 Mar 2015 21:34:43 -0600
From: Devin Matthews <dmat...@utexas.edu>
To: mark.h...@gmail.com, Jed Brown <jedb...@mcs.anl.gov>
CC: Jack Poulson <pou...@stanford.edu>, Jeff Hammond <jeff.s...@gmail.com>, "blis-d...@googlegroups.com" <blis-d...@googlegroups.com>, Elemental <d...@libelemental.org>
test.cxx
test2.cxx

Marat Dukhan

Mar 2, 2015, 12:21:54 AM
to Devin Matthews, mark.h...@gmail.com, Jed Brown, Jack Poulson, Jeff Hammond, blis-d...@googlegroups.com, Elemental
-ffast-math breaks double-double arithmetic; you should compile without this option.
If you target an FMA-enabled CPU, also add -ffp-contract=off and -D'QD_FMS(a,b,c)=__builtin_fma(a,b,-c)' (the quotes keep the shell from choking on the parentheses).

Regards,
Marat


Jeff Hammond

Mar 2, 2015, 12:39:39 AM
to mark.h...@gmail.com, Jed Brown, Jack Poulson, blis-d...@googlegroups.com, Elemental
On Sun, Mar 1, 2015 at 4:05 PM, Mark Hoemmen <mark.h...@gmail.com> wrote:
> Jack Poulson <pou...@stanford.edu> writes:
>> Mark Hoemmen has suggested the usage of QD (Hida et al.) [6] ...
>
> More specifically: I would like to refresh QD (at least the dd_real
> part), and add the necessary syntax to make it work on GPUs as well.
> I would make this available through the Kokkos project somehow, just
> as we've done with complex arithmetic. I plan to use this for
> mixed-precision research, and to make it available through Trilinos'
> sparse linear algebra. However, I think I can do this in a way that
> would not require dependence on Kokkos. (It certainly would not
> require dependence on Trilinos. We already have a way to distribute
> Kokkos without Trilinos and without requiring use of Trilinos' CMake
> build system.)

How about without C++? I guess 2/3 ain't bad ;-)

> I would like to do this anyway, so I was thinking of teaming up with
> y'all. We also plan to write a Kokkos interface to the BLAS and other
> Kokkos-y things, but that will take a lot longer, and it would require
> buy-in to Kokkos of course. QD implies buy-in to C++ (it's a C++
> library), but I would like to do this in a way that would be useful to
> people who don't want to use Kokkos. Please give me advice on how to
> do that :-)

Is it possible to write an extern "C" interface? I think the PETSc
folks are more likely to use Fortran REAL*16 than C++.

<less serious stuff starts here>

Since you are willing to tolerate wacko build systems (e.g. CMake) and
cuckoo languages (C++), why don't you just roll all of
Boost::multiprecision (which includes GMP and MPFR back-ends) into
Trilinos and bring about the end of days through software nihilism?

If C++11 + Trilinos + CMake + Boost (especially Jam) aren't the four
bowls of the software apocalypse...

Jeff

PS Do not feed the trolls ;-)

Mark Hoemmen

Mar 2, 2015, 10:07:22 AM
to Jeff Hammond, Jed Brown, Jack Poulson, blis-d...@googlegroups.com, Elemental
On Sun, Mar 1, 2015 at 10:39 PM, Jeff Hammond <jeff.s...@gmail.com> wrote:
> On Sun, Mar 1, 2015 at 4:05 PM, Mark Hoemmen <mark.h...@gmail.com> wrote:
>> I would like to do this anyway, so I was thinking of teaming up with
>> y'all. We also plan to write a Kokkos interface to the BLAS and other
>> Kokkos-y things, but that will take a lot longer, and it would require
>> buy-in to Kokkos of course. QD implies buy-in to C++ (it's a C++
>> library), but I would like to do this in a way that would be useful to
>> people who don't want to use Kokkos. Please give me advice on how to
>> do that :-)
>
> Is it possible to write an extern "C" interface? I think the PETSc
> folks are more likely to use Fortran REAL*16 than C++.

I suspect so as well. I would also much rather use REAL*16 or
__float128. (I've already gone through the tedious exercise of making
parts of std::complex work with Kokkos.) Would you happen to know how
widespread support for __float128 or analogous types is? If I get
GCC, Clang, and the Intel compiler, that's already plenty good enough
for me.

I suspect that QD already has C bindings, though I'll have to check.

> <less serious stuff starts here>

I appreciate the tag ;-)

> Since you are willing to tolerate wacko build systems (e.g. CMake) and
> cuckoo languages (C++), why don't you just roll all of
> Boost::multiprecision (which include GMP and MPFR back-ends) into
> Trilinos and bring about the end of days through software nihilism?

We actually used to have Tpetra hooked up with ARPREC, though I
haven't seen anyone try using it in the nearly five years I've been
hacking on Trilinos. No one was crazy enough to hook up MPFR. All
ARPREC instances have the same run-time size (you can set a global to
change it). With MPFR, each number instance may have a different
run-time size. That makes setting up MPI communication super fun!

Kokkos gives us a better mechanism for managing arrays of types with
run-time sizes. We use it for types for stochastic PDE
discretizations, but it could be adapted to any type with a run-time
but constant size. It's a pain but getting easier, and it gives us a
way to fiddle with data layout too (you can either make all the
coefficients of the same polynomial contiguous, or make the
corresponding coefficients of different polynomials contiguous, just
by changing a template parameter -- try not to gaze too deeply into
the madness underneath or it will suck out your living soul).

We don't just use CMake; we use a custom build system built out of
CMake. The madness is complete ;-P

> If C++11 + Trilinos + CMake + Boost (especially Jam) aren't the four
> bowls of the software apocalypse...

It's seven bowls. I'll take thick darkness (fifth bowl) over bjam,
but will gladly skip the frog-demons and hailstone-quake!

On the other hand, how many complete implementations of Fortran 2008
are there? ;-P

> PS Do not feed the trolls ;-)

Ah, but it's so entertaining!

mfh

Jack Poulson

Mar 2, 2015, 11:21:20 AM
to Jed Brown, Jeff Hammond, blis-d...@googlegroups.com, Elemental, mark.h...@gmail.com
Okay, I see you are (justifiably) wanting to argue the technical point
that the 11-bit exponent and 2*53=106-bit significand of double-double
are smaller than the 15-bit exponent and 113-bit significand of IEEE 754
quad precision.

Further, a discussion on Intel's customer support forum [1] implies that
they implement IEEE 754 quad-precision in software entirely with integer
operations, as Tim Prince said the following:

"""
I didn't say there is no need for quad precision. All widely used
Fortran compilers have it, for example, with software implementation.
Performance deficiency of current quad precision is due as much to lack
of vectorizability as lack of single hardware instruction implementation.
My point was that no matter how much hardware precision you have, you
still need a higher precision range reduction algorithm to support trig
functions on your new high precision.
If the market demand were seen, no doubt someone would study the
feasibility of vector quad precision on future 256- and 512-bit register
platforms
"""

So, yes, Tim is arguing that IEEE 754 quad-precision cannot be easily
vectorized. However, the 106 significand bits of double-double could
easily be increased up to, or past, those of IEEE 754 quad-precision by
adding another double or two (quad-double, with 4*53=212-bit
significands, seems to be common [2], but I would assume that
triple-double, with 3*53=159-bit significands, would also be possible).
The only remaining challenge would be deciding what to do about the four
missing exponent bits.
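For concreteness, the double-double primitives in question are just
short sequences of ordinary double operations (error-free transforms);
a minimal sketch, assuming round-to-nearest IEEE doubles and no
-ffast-math (which would optimize the error terms away):

```c
typedef struct { double hi, lo; } dd;

/* Knuth's two-sum: returns (s, e) with s + e == a + b exactly. */
static dd two_sum(double a, double b) {
    double s = a + b;
    double v = s - a;
    double e = (a - (s - v)) + (b - v);
    return (dd){ s, e };
}

/* "Sloppy" double-double add: ~11 double ops, branch-free. */
static dd dd_add(dd x, dd y) {
    dd s = two_sum(x.hi, y.hi);
    return two_sum(s.hi, s.lo + x.lo + y.lo);
}
```

(The exact product needed for a DD multiply is similarly cheap wherever
hardware FMA is available: p = a*b; err = fma(a, b, -p).)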

Jack

[1] https://software.intel.com/en-us/forums/topic/283646
[2]
http://web.mit.edu/tabbott/Public/quaddouble-debian/qd-2.3.4-old/docs/qd.pdf

signature.asc

Mark Hoemmen

unread,
Mar 2, 2015, 11:34:14 AM3/2/15
to Jack Poulson, Jed Brown, Jeff Hammond, blis-d...@googlegroups.com, Elemental
On Mon, Mar 2, 2015 at 9:21 AM, Jack Poulson <pou...@stanford.edu> wrote:
> Further, a discussion on Intel's customer support forum [1] implies that
> they implement IEEE 754 quad-precision in software entirely with integer
> operations, as Tim Prince said the following:
>
> """
> I didn't say there is no need for quad precision. All widely used
> Fortran compilers have it, for example, with software implementation.
> Performance deficiency of current quad precision is due as much to lack
> of vectorizability as lack of single hardware instruction implementation.
> ...
> """

One could vectorize across multiple quad-precision arithmetic operations.

mfh

Jed Brown

unread,
Mar 2, 2015, 11:35:39 AM3/2/15
to mark.h...@gmail.com, Jack Poulson, Jeff Hammond, blis-d...@googlegroups.com, Elemental
Mark Hoemmen <mark.h...@gmail.com> writes:
> One could vectorize across multiple quad-precision arithmetic operations.

This was the intent of my question.
signature.asc

Jed Brown

unread,
Mar 2, 2015, 11:41:21 AM3/2/15
to Jack Poulson, Jeff Hammond, blis-d...@googlegroups.com, Elemental, mark.h...@gmail.com
Jack Poulson <pou...@stanford.edu> writes:

> Okay, I see you are (justifiably) wanting to argue the technical point
> that the 11-bit exponent and 2*53=106-bit significand of double-double
> are smaller than the 15-bit exponent and 113-bit significand of IEEE 754
> quad-precision.

I'm actually not going to split hairs about whether the difference
between DD and Quad is necessary for your application. Rather, I object
to claims that some observed performance using DD versus __float128 on a
different architecture points at what is necessary to improve
performance or how much potential remains for optimization of __float128
BLAS. It may be that data motion isn't a bottleneck in either case, or
is only a bottleneck in one case. If we want to be quantitative about
potential for optimization, let's pick some experiments that answer our
questions rather than speculating ad nauseam.
signature.asc

Mark Hoemmen

unread,
Mar 2, 2015, 11:42:45 AM3/2/15
to Jed Brown, Jack Poulson, Jeff Hammond, blis-d...@googlegroups.com, Elemental
Just for clarity: The QD library would have to be rewritten to expose
vectorization across multiple operations. This would also only work
in QD's "unsafe" mode (which neglects correct handling of e.g., NaN),
due to branches in "safe" mode.
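To illustrate what "vectorizing across multiple operations" might look
like: an array-at-a-time, struct-of-arrays DD add whose loop body is
straight-line code in the "unsafe" style, which an auto-vectorizer can
at least in principle handle. A sketch only (again assuming no
-ffast-math), not QD's actual API:

```c
#include <stddef.h>

/* Struct-of-arrays double-double add: r = a + b, elementwise.
 * Branch-free "sloppy" variant (no NaN/Inf handling), so the inner
 * loop contains no control flow to block vectorization. */
void dd_add_n(size_t n,
              const double *ahi, const double *alo,
              const double *bhi, const double *blo,
              double *rhi, double *rlo) {
    for (size_t i = 0; i < n; ++i) {
        double s = ahi[i] + bhi[i];          /* two-sum */
        double v = s - ahi[i];
        double e = (ahi[i] - (s - v)) + (bhi[i] - v);
        e += alo[i] + blo[i];                /* fold in low parts */
        double hi = s + e;                   /* renormalize */
        rlo[i] = e - (hi - s);
        rhi[i] = hi;
    }
}
```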

mfh

Jeff Hammond

unread,
Mar 2, 2015, 11:45:57 AM3/2/15
to mark.h...@gmail.com, Jed Brown, Jack Poulson, blis-d...@googlegroups.com, Elemental
On Mon, Mar 2, 2015 at 7:07 AM, Mark Hoemmen <mark.h...@gmail.com> wrote:
> On Sun, Mar 1, 2015 at 10:39 PM, Jeff Hammond <jeff.s...@gmail.com> wrote:
>> On Sun, Mar 1, 2015 at 4:05 PM, Mark Hoemmen <mark.h...@gmail.com> wrote:
>>> I would like to do this anyway, so I was thinking of teaming up with
>>> y'all. We also plan to write a Kokkos interface to the BLAS and other
>>> Kokkos-y things, but that will take a lot longer, and it would require
>>> buy-in to Kokkos of course. QD implies buy-in to C++ (it's a C++
>>> library), but I would like to do this in a way that would be useful to
>>> people who don't want to use Kokkos. Please give me advice on how to
>>> do that :-)
>>
>> Is it possible to write an extern "C" interface? I think the PETSc
>> folks are more likely to use Fortran REAL*16 than C++.
>
> I suspect so as well. I would also much rather use REAL*16 or
> __float128. (I've already gone through the tedious exercise of making
> parts of std::complex work with Kokkos.) Would you happen to know how
> widespread support for __float128 or analogous types is? If I get
> GCC, Clang, and the Intel compiler, that's already plenty good enough
> for me.

IBM XLC++ moved to Clang's front-end in December for little-endian
POWER (i.e. OpenPOWER)
[http://www.ibm.com/support/docview.wss?uid=swg27007322&aid=1,http://www-01.ibm.com/support/docview.wss?uid=swg27043628&aid=1],
but you will have to investigate if that means __float128 is supported
the same as Clang/LLVM. I can't find anything either way.

https://github.com/FFTW/fftw3/issues/21 suggests that PGI lacks
support for this feature, but again, you'll have to investigate
yourself to see what is supported in more recent versions and for
various architectures.

> I suspect that QD already has C bindings, though I'll have to check.
>
>> <less serious stuff starts here>
>
> I appreciate the tag ;-)
>
>> Since you are willing to tolerate wacko build systems (e.g. CMake) and
>> cuckoo languages (C++), why don't you just roll all of
>> Boost::multiprecision (which include GMP and MPFR back-ends) into
>> Trilinos and bring about the end of days through software nihilism?
>
> We actually used to have Tpetra hooked up with ARPREC, though I
> haven't seen anyone try using this for the nearly five years I've been
> hacking Trilinos. No one was crazy enough to hook up MPFR. All
> ARPREC instances have the same run-time size (you can set a global to
> change it). With MPFR, each number instance may have a different
> run-time size. That makes setting up MPI communication super fun!

I have investigated MPFR support in MPI a few times before and it
seems to be feasible, at least if efficiency isn't critical (i.e. a
memcpy or two of the entire type is tolerable). A more efficient
solution may be feasible using their API, but if all else fails, one
could assume the implementation and make ABI-dependent MPI hooks.

Again, why not use Boost::multiprecision instead of MPFR directly?
I've played with both and find the Boost stuff quite nice, modulo the
C++ insanity it requires :-)

> On the other hand, how many complete implementations of Fortran 2008
> are there? ;-P

Given that float128 is equivalent to REAL*16, which appears to be
widely supported, I don't see any issue with this feature of Fortran
2008. If you want the answer for ALL of Fortran 2008, you'll have to
ask someone with grey or no hair ;-)

Jeff

Jeff Hammond

unread,
Mar 2, 2015, 11:46:41 AM3/2/15
to mark.h...@gmail.com, Jed Brown, Jack Poulson, blis-d...@googlegroups.com, Elemental
The Intel compiler has some stuff that allows vectorizing across
function calls...

Jeff

Jed Brown

unread,
Mar 2, 2015, 11:58:29 AM3/2/15
to Jeff Hammond, mark.h...@gmail.com, Jack Poulson, blis-d...@googlegroups.com, Elemental
Jeff Hammond <jeff.s...@gmail.com> writes:

> The Intel compiler has some stuff that allows vectorizing across
> function calls...

When QD_INLINE is defined, qd_inline.h is included, meaning that stupid
cross-module inlining trickery is not needed (for normal operations like
ADD and MUL). The question is which compilers (if any) actually
vectorize these operations. Same question for __float128.
signature.asc

Jeff Hammond

unread,
Mar 2, 2015, 12:03:23 PM3/2/15
to Jed Brown, mark.h...@gmail.com, Jack Poulson, blis-d...@googlegroups.com, Elemental
If someone writes a simple benchmark that I can share with the
compiler team, I will be happy to drive this topic internally. I am
not willing to give them all of QD and file a ticket titled "can you
make this faster?"...
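Something like the following would do as a starting point: a
self-contained __float128 AXPY kernel (a GCC extension also accepted by
recent Intel compilers), compiled with e.g. -O3 -fopt-info-vec to see
whether the loop vectorizes:

```c
#include <stddef.h>

/* Minimal kernel to hand to a compiler team: does a loop over
 * software-emulated __float128 operations vectorize?  Plain
 * __float128 arithmetic needs no -lquadmath; that library is only
 * required for things like printing via quadmath_snprintf. */
void qaxpy(size_t n, __float128 a,
           const __float128 *restrict x, __float128 *restrict y) {
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```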

Jeff

Mark Hoemmen

unread,
Mar 2, 2015, 1:11:08 PM3/2/15
to Jeff Hammond, Jed Brown, Jack Poulson, blis-d...@googlegroups.com, Elemental
On Mon, Mar 2, 2015 at 9:45 AM, Jeff Hammond <jeff.s...@gmail.com> wrote:
> I have investigated MPFR support in MPI a few times before and it
> seems to be feasible, at least if efficiency isn't critical (i.e. a
> memcpy or two of the entire type is tolerable). A more efficient
> solution may be feasible using their API, but if all else fails, one
> could assume the implementation and make ABI-dependent MPI hooks.

Ah, but the receiving process needs to know how much data to expect in
the message, which means it needs to know the run-time precision.
That means an extra communication round ("count per instance"), and it
also would break our current communication abstractions. I've written
way too many words about this in silly unpublished tech reports ;-P
I'm just trolling you, dear ;-P

mfh

Jeff Hammond

unread,
Mar 2, 2015, 1:24:41 PM3/2/15
to mark.h...@gmail.com, Jed Brown, Jack Poulson, blis-d...@googlegroups.com, Elemental
On Mon, Mar 2, 2015 at 10:10 AM, Mark Hoemmen <mark.h...@gmail.com> wrote:
> On Mon, Mar 2, 2015 at 9:45 AM, Jeff Hammond <jeff.s...@gmail.com> wrote:
>> I have investigated MPFR support in MPI a few times before and it
>> seems to be feasible, at least if efficiency isn't critical (i.e. a
>> memcpy or two of the entire type is tolerable). A more efficient
>> solution may be feasible using their API, but if all else fails, one
>> could assume the implementation and make ABI-dependent MPI hooks.
>
> Ah, but the receiving process needs to know how much data to expect in
> the message, which means it needs to know the run-time precision.
> That means an extra communication round ("count per instance"), and it
> also would break our current communication abstractions. I've written
> way too many words about this in silly unpublished tech reports ;-P

Just require the precision to be set symmetrically.

Jack Poulson

unread,
Mar 2, 2015, 10:54:58 PM3/2/15
to d...@libelemental.org, Jed Brown, blis-d...@googlegroups.com
I've been without computer all day in a panel; thankfully we had a few
breaks for me to approve emails to d...@libelemental.org.

After doing some more reading, your argument seems less pedantic than I
originally assumed: we can't extrapolate from the GPU experiments [1,2]
showing that, even though a DD multiply-add requires ~20
double-precision operations, the performance penalty is *less* than
this for GEMV (often a factor of 2) and for GEMM (often a factor of
13), since the arithmetic intensity is higher. (In 2002, the
performance difference appears to have been even smaller on CPUs [3].)

It appears that one of the most up-to-date packages for double-double
gemm, MPACK [4], is slow enough to surprise researchers (see the
bottom-left of pg. 6 of [2] for the 200x difference claim). The author
of MPACK seems to couch a similar claim in [4]:

"""
Therefore, GPU version is 260 times faster than the reference
implementation, and 60 times faster than naive OpenMP accelerated version.
"""

The authors are thus (like many predecessors) comparing optimized GPU
implementations to unoptimized CPU implementations and claiming victory.

I downloaded the most recent version of the library and, I kid you not,
this is the *entire* "optimized" version of MPACK's NN GEMM:

"""
#include <mblas_dd.h>
#ifdef _OPENMP
#include <omp.h>
#endif

void Rgemm_NN_omp(mpackint m, mpackint n, mpackint k, dd_real alpha,
    dd_real *A, mpackint lda, dd_real *B, mpackint ldb, dd_real beta,
    dd_real *C, mpackint ldc)
{
    mpackint i, j, l;
    dd_real temp;

    //Form C := alpha*A*B + beta*C.
    for (j = 0; j < n; j++) {
        if (beta == 0.0) {
            for (i = 0; i < m; i++) {
                C[i + j * ldc] = 0.0;
            }
        } else if (beta != 1.0) {
            for (i = 0; i < m; i++) {
                C[i + j * ldc] = beta * C[i + j * ldc];
            }
        }
    }
    //main loop
#ifdef _OPENMP
#pragma omp parallel for private(i, j, l, temp)
#endif
    for (j = 0; j < n; j++) {
        for (l = 0; l < k; l++) {
            temp = alpha * B[l + j * ldb];
            for (i = 0; i < m; i++) {
                C[i + j * ldc] += temp * A[i + l * lda];
            }
        }
    }
    return;
}
"""

Jack

[1] See pp-5-6 of
http://suchix.kek.jp/mpcomp/20131120-sc13/20131120BoF-takahashi.pdf

[2]
http://www.dcs.warwick.ac.uk/~sdh/pmbs10/pmbs10/Workshop_Programme_files/fastgemm.pdf

[3] http://crd-legacy.lbl.gov/~xiaoye/p152-s_li.pdf

[4]
http://www.computer.org/csdl/proceedings/icnc/2012/4893/00/4893a068-abs.html

signature.asc

Jeff Hammond

unread,
Mar 2, 2015, 11:00:22 PM3/2/15
to Jack Poulson, d...@libelemental.org, Jed Brown, blis-d...@googlegroups.com
Hey, at least they followed Franchetti and Püschel in using 1D memory
instead of [][] indexing of an array of arrays :-)

I won't have time to finish my performance tests for at least a week,
but I intend to write something that is legitimately optimized for the
extended precision types of interest to me.

Best,

Jeff
> --
> You received this message because you are subscribed to the Google Groups "blis-discuss" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to blis-discuss...@googlegroups.com.
> To post to this group, send email to blis-d...@googlegroups.com.
> Visit this group at http://groups.google.com/group/blis-discuss.
> For more options, visit https://groups.google.com/d/optout.



Matthew Knepley

unread,
Mar 3, 2015, 5:18:48 AM3/3/15
to Jeff Hammond, Jack Poulson, d...@libelemental.org, Jed Brown, blis-d...@googlegroups.com
And they also used short variable names, essential for fast code. They messed
up on "alpha" and "beta", however, which should be "a" and "b".

  Matt



--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

Jeff Hammond

unread,
Mar 4, 2015, 1:34:34 PM3/4/15
to Devin Matthews, blis-d...@googlegroups.com
Intel 15 supports _Quad after all; one just needs the right flags,
which I didn't find documented anywhere.

Jeff

========================================================
You need to use -Qoption,cpp,--extended_float_type option...See below:

$ cat test.c
void test() {
_Quad xx;
}
$ icc test.c -c
test.c(2): error: identifier "_Quad" is undefined
_Quad xx;
^

compilation aborted for test.c (code 2)
$ icc test.c -c -Qoption,cpp,--extended_float_type
========================================================

On Sat, Feb 28, 2015 at 9:32 PM, Jeff Hammond <jeff.s...@gmail.com> wrote:
> Devin:
>
> I just tested and Intel C 15 doesn't appear to support _Quad but does
> support __float128.
>
> Best,
>
> Jeff
>
> On Fri, Feb 27, 2015 at 12:54 PM, Devin Matthews
> <devinam...@gmail.com> wrote:
>> As far as compiler support, this is what I have been able to dig up:
>>
>> GCC and maybe Intel:
>>
>> Builtin __float128 and __complex128 types (since 4.6), library support
>> through -lquadmath. For Intel, I also see references to a _Quad type, but it
>> seems like maybe this is only older versions (Jeff?)?
>>
>> XL on BG/Q:
>>
>> The "native" option seems to be -qldbl128 along with long double on PPC.
>> There is also the -qfloat=gcclongdouble option for compatibility with GCC's
>> __float128.
>>
>> Otherwise, there are multiprecision libraries such as gmp and mpfr, although
>> I imagine that would be even slower. Or, you could write the microkernel in
>> FORTRAN and use real*16 :). You could also do explicit software
>> double-double arithmetic.
>>
>> Devin
>>
>>
>> On 2/27/15, 2:00 PM, Jack Poulson wrote:
>>
>> I agree with Jeff that properly handling the memory accesses is the
>> difficult part (what I would hope BLIS could handle).
>>
>> Ideally it would be straightforward to swap out the microkernels for
>> quad-precision; an initial efficient implementation (relative to
>> modified netlib) would be a great impetus for pushing further.
>>
>> Jack
>>
>> On 02/27/2015 11:57 AM, Jeff Hammond wrote:
>>
>> Given that the biggest difference in performance between Netlib and
>> BLIS/MKL/etc. is due to data access rather than FPU utilization, I do
>> not think hardware support is required for this to be of interest for
>> BLIS.
>>
>> If BLIS were to emulate the QP computation in software and have all
>> the cache-friendly access that's independent of the precise width of a
>> floating point number, I would expect Jack's estimate of 5-10x to be
>> pretty accurate.
>>
>> Perhaps the first step is for Field, Robert, et al. to write a paper
>> about how to implement QGEMM efficiently in BLIS.
>>
>> Best,
>>
>> Jeff
>>
>> On Fri, Feb 27, 2015 at 11:49 AM, Field G. Van Zee <fi...@cs.utexas.edu>
>> wrote:
>>
>> Jack,
>>
>> I always assumed people would not be generally interested in quad precision
>> until it was widely supported in hardware (and mainstream languages and
>> compilers). Since I have given literally zero attention to the issues
>> involved, I can't even hazard a rough guess for the effort that would be
>> required.
>>
>> Hopefully someone else can chime in.
>>
>> Field
>>
>>
>> On 02/27/15 13:42, Jack Poulson wrote:
>>
>> Dear BLIS developers,
>>
>> Many of the state-of-the-art solvers for the sparse linear and least
>> squares problems create intermediate matrices whose effective condition
>> numbers are potentially the square of the condition number of the
>> original matrix.
>>
>> Furthermore, there are many techniques where, if the solution is desired
>> to n digits of accuracy, then the floating-point precision needs to
>> support 2*n digits of accuracy. With double-precision, this implies that
>> at most eight digits of accuracy could be achieved with such techniques
>> (sparse Interior Point Methods often satisfy this property).
>>
>> For this reason, it is of use for a library to support quad-precision
>> factorizations, matrix-vector multiplication, and triangular solves.
>> Libraries such as PETSc currently create custom modifications of the
>> netlib implementation of the BLAS, but in no way could these
>> modifications be expected to be high-performance.
>>
>> How much effort do you think would be required for BLIS to support
>> quad-precision with a cost within a factor of 5-10 of double-precision?
>> I would be willing to dedicate a significant portion of my energy to
>> helping to support quad-precision Interior Point Methods in Elemental,
>> and this would be a critical component of such an effort.
>>
>> Jack

Jed Brown

unread,
Mar 4, 2015, 2:32:11 PM3/4/15
to Jeff Hammond, Devin Matthews, blis-d...@googlegroups.com
Jeff Hammond <jeff.s...@gmail.com> writes:

> Intel 15 supports _Quad after all; one just needs the right flags,
> which I didn't find documented anywhere.
>
> Jeff
>
> ========================================================
> You need to use -Qoption,cpp,--extended_float_type option...See below:

So it's deprecated and shouldn't be used for new code?
signature.asc

Jeff Hammond

unread,
Mar 4, 2015, 2:34:34 PM3/4/15
to Jed Brown, Devin Matthews, blis-d...@googlegroups.com
I did not say that. I said that the feature isn't available with the
default flags.

Jeff

Jed Brown

unread,
Mar 4, 2015, 2:54:20 PM3/4/15
to Jeff Hammond, Devin Matthews, blis-d...@googlegroups.com
Jeff Hammond <jeff.s...@gmail.com> writes:
> I did not say that. I said that the feature isn't available with the
> default flags.

You said it's not even documented. That sounds like ninja deprecation
to me. If you disagree, please cite one document indicating that it is
not deprecated, and one reason why people should use _Quad instead of
__float128 in new code.
signature.asc

Jeff Hammond

unread,
Mar 4, 2015, 4:27:42 PM3/4/15
to Jed Brown, Devin Matthews, blis-d...@googlegroups.com
I didn't say it wasn't documented. I said I couldn't find it
documented. My attempts to find the documentation were extremely
limited. Intel's website isn't indexed by Google as well as GCC's,
for whatever reason.

There is no evidence that it is deprecated. I merely said that a
non-ISO C feature requires a special flag to activate. Your
extrapolation from this to ninja deprecation is at best tenuous.

In any case, I think it makes sense to target __float128. It is of
course supported by GCC and seems to be supported by Intel 13+. I am
still waiting to hear exactly what this support means. In particular,
I have asked the compiler team if it is fully ABI compatible with GCC.
I submitted this query just this morning.

jeff

Devin Matthews

unread,
Mar 4, 2015, 4:42:46 PM3/4/15
to blis-d...@googlegroups.com
As far as I can tell from mailing list and bug reports from 2012-2014,
clang/llvm does not support __float128 (I see that GCC 4.7 and Boost
1.56 add patches to explicitly work around that lack of support).

Devin Matthews

Jeff Hammond

unread,
Apr 15, 2015, 4:33:05 PM4/15/15
to Jack Poulson, Matthias, blis-d...@googlegroups.com, kne...@gmail.com, rv...@cs.utexas.edu, fi...@cs.utexas.edu
> Thus, given that users typically used tuned double-precision versions
> rather than netlib, the current practical runtime difference could
> easily be expected to be more than 100x. Having a tuned quad-precision
> BLAS and LAPACK API could significantly shrink this performance gap.

So I have investigated this a bit and observe that QP is ~50x slower
than DP because of the cost of the emulated floating-point operations.
There may be ways to reduce this, but the effort is non-trivial and
requires changing the software emulation that the compiler emits.

Since the key optimizations in BLAS focus on memory and SIMD, I'm not
sure how much they matter for QP. But I will nonetheless be a
scientist and perform the experiment to know for sure.

Jeff

Marat Dukhan

unread,
Apr 15, 2015, 4:41:50 PM4/15/15
to Jeff Hammond, Jack Poulson, Matthias, blis-d...@googlegroups.com, Matthew Knepley, Robert van de Geijn, Field Van Zee
I have some plots too!

Here is the latency of quad (libquadmath) vs double-double vs double on Broadwell. Note that unlike quad-precision, double-double computations can benefit from SIMD and ILP.

Regards,
Marat

add-latency.pdf
mul-latency.pdf

Marat Dukhan

unread,
Apr 15, 2015, 4:46:43 PM4/15/15
to Jeff Hammond, Jack Poulson, Matthias, blis-d...@googlegroups.com, Matthew Knepley, Robert van de Geijn, Field Van Zee
And some additional data on double-double performance on different microarchitectures.

Regards,
Marat
ddadd-latency.pdf
ddadd-throughput.pdf
ddmul-latency.pdf
ddmul-throughput.pdf

Jeff Hammond

unread,
Apr 15, 2015, 4:53:57 PM4/15/15
to Marat Dukhan, Jack Poulson, Matthias, blis-d...@googlegroups.com, Matthew Knepley, Robert van de Geijn, Field Van Zee
Are your tests online somewhere?

I am interested in testing on other platforms.

Jeff 

Sent from my iPhone

Marat Dukhan

unread,
Apr 15, 2015, 4:55:54 PM4/15/15
to Jeff Hammond, Jack Poulson, Matthias, blis-d...@googlegroups.com, Matthew Knepley, Robert van de Geijn, Field Van Zee
Yes, albeit not in public repo. PM me your BitBucket username and I'll grant you access.

Regards,
Marat