Some comments about the "C++ API for BLAS and LAPACK" proposal

Vincent Picaud

Aug 23, 2017, 9:57:22 AM
to SLATE User
First of all, thank you for the work and the initiative. The lack of a
standard C++ API for these two libraries is a real concern.

A/ Some comments

I agree with the report and the chosen approach. In particular, I
think one must stick to basic function templates to define the API and
not introduce too much C++ "meta-programming" at this stage.

1/ Blaze library

I regret that the Blaze library (https://bitbucket.org/blaze-lib/blaze)
is not cited in "2.3/ Programming Language C++". IMHO, Blaze's
performance and code quality would be worth a citation. Its Expression
Templates approach is also well detailed in two companion publications
(cited at the end of the HTML page).

2/ On non-const reference of a temporary

On page 16 of the report I read: "Because C++ can’t take a non-const
reference of a temporary, the output submatrix of each call must be a
local variable". I think this is misleading. Maybe this was true when
the Elemental lib was created, but it is no longer the case. With
C++11 rvalue references
(http://en.cppreference.com/w/cpp/language/reference)
you can write code like:

class Matrix {
  public:
      Matrix view();   // returns a temporary view object (an rvalue)
};

void El::Herk(..., Matrix&& matrix) { ... }   // overload taking an rvalue reference

//-------------

Matrix m;
El::Herk(..., m.view());   // OK: the temporary binds directly to Matrix&&

without having to use a local variable to store m.view(), as you wrote
in the report.

3/ Error management

Concerning the design, IMHO there is still room for improvement.

First, I think there is a benefit in splitting "errors" into two categories.

See https://isocpp.org/wiki/faq/exceptions#how-exceptions and https://isocpp.org/wiki/faq/exceptions#why-not-exceptions for instance

3.1/ Errors that are the caller's direct responsibility

For instance, calling a routine with a negative size. I think this kind of error must result in a precondition/assertion failure. The caller has to fix their code!

3.2/ Run-time numerical errors, integer overflow errors, ...

In those cases, exceptions make more sense. For instance, if a matrix is too ill-conditioned, throw an exception.

However, I think the proposed implementation (p. 33) can be problematic to process on the caller side.

Take for instance:
throw_if_(n,std::numeric_limits < blas_int >::max ());

And imagine a user calling Lapack using your API:

try {
  lapack_subroutine(...);
}
catch (const Error& e) {

  // Now, how to identify the problem?

  if (std::string(e.what()) == "std::numeric_limits < blas_int >::max()") {

    // ...

  }

}

I think this is problematic for at least two reasons:

a/ The string "std::numeric_limits < blas_int >::max()" can change.
For instance, if you use an automatic code formatting tool, the thrown
string might become "std::numeric_limits<blas_int>::max()".

b/ For an error like a NaN matrix element or a "pivot too small", it
would be nice to be able to retrieve which row/column is concerned.
With the current implementation, this extra information is lost. One
needs a mechanism to store this information in the thrown exception
(see
http://www.boost.org/doc/libs/1_64_0/libs/exception/doc/boost-exception.html
for instance).
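
For example, a minimal sketch (the class name and fields are my own, not
from the report) of an exception type that carries the offending
row/column, so the caller can recover the information by type rather
than by string matching:

#include <stdexcept>
#include <string>

// Hypothetical exception carrying structured error information.
class PivotError : public std::runtime_error {
  public:
    PivotError(int row, int col)
        : std::runtime_error("pivot too small at (" + std::to_string(row)
                             + ", " + std::to_string(col) + ")"),
          row_(row), col_(col) {}
    int row() const { return row_; }
    int col() const { return col_; }
  private:
    int row_, col_;
};

// Caller side: catch by type and use the structured data.
// try { ... } catch (const PivotError& e) { recover(e.row(), e.col()); }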

B/ Fix minor problems of the current Blas/Lapack libs?

The main reason for my post is that, as a Blas/Lapack C++ user, I am
regularly annoyed by some problems; I will try to give some examples
below. Some of them are minor, and maybe this effort is an opportunity
to fix them.

1/ About NoTrans, Trans, ConjTrans... (p. 27)

1.1/ NoTrans, Trans, ConjTrans are not closed under composition

"Conjugation" is missing and this is a barrier to develop general code
involving operator composition. For instance, I want to be able to
write:

Conj = Trans( TransConj )

Also BLAS does not define operations like:

y=a.conjugate(M).x+b.y

1.2/ Better names, for instance "NoTrans" = "Identity"?

IMHO, NoTrans, as you mention it, is not a good name. Why not simply use "Identity" and take the opportunity now to define "Op" with the readable names Identity, Transpose, TransConjugate, Conjugate?

IMHO this would improve code readability.

In summary, maybe define:

enum class Op : char { Identity = 'N', Transpose = 'T', TransConjugate = 'C', Conjugate = '?' /* character still to be defined */ };

For "Conjugate" operation which is not currently supported you
can throw an "not_implemented_yet" exception.
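
As an illustration (my own sketch, not part of the proposal), with a
Conjugate member the four operations form a group isomorphic to
(Z/2Z)^2, and composition becomes a total function:

// Encode each op as two bits: (conjugate?, transpose?).
inline int op_bits(Op o) {
    switch (o) {
        case Op::Identity:  return 0b00;
        case Op::Transpose: return 0b01;
        case Op::Conjugate: return 0b10;
        default:            return 0b11;  // Op::TransConjugate
    }
}

// Composition is xor on the encodings.
inline Op compose(Op f, Op g) {
    switch (op_bits(f) ^ op_bits(g)) {
        case 0b00: return Op::Identity;
        case 0b01: return Op::Transpose;
        case 0b10: return Op::Conjugate;
        default:   return Op::TransConjugate;
    }
}

// e.g., compose(Op::Transpose, Op::TransConjugate) == Op::Conjugate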

2/ There are some inconsistencies in the Blas API

2.1/ For instance, consider:
- TRMV computes x <- op(A).x
- TRMM computes X <- alpha.op(A).X

The "alpha" scalar is present or absent depending on whether the operand is a matrix or a vector. Why not simply define the same pattern,

x <- alpha.op(A).x

for both cases?

2.2/ I would also love to have a general subroutine of the form:

C <- alpha.Op(A).B + beta.C

It exists for general and symmetric/Hermitian A, but not for triangular A.

However, when A is triangular, we are not always interested in the in-place matrix product that overwrites B (the TRMM subroutines).

This lack of API uniformity is a real problem when we try to write generic code.
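
For instance, the uniform pattern could look like this (a hypothetical
signature of my own, modeled on symm/hemm and reusing the proposal's
enums; nothing here is from the report):

// Hypothetical: gemm-like update with triangular A, leaving B untouched.
// For side == Side::Left, computes C <- alpha * op(A) * B + beta * C.
template <typename T>
void trgemm(Side side, Uplo uplo, Op transA, Diag diag,
            int64_t m, int64_t n,
            T alpha, const T* A, int64_t lda,
                     const T* B, int64_t ldb,
            T beta,        T* C, int64_t ldc);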

3/ Vocabulary: n and m

In Europe, n is the number of rows and m the number of columns; I think in the USA it is the reverse.

The US convention (m = number of rows, n = number of columns) is certainly more logical: for instance, in M_{i,j} of an m x n matrix, the letter pairs (i,j) and (m,n) follow the same alphabetical order.

However, this is a little disturbing for a European reader. Maybe using I_size and J_size for the row and column sizes removes this problem, and everyone is happy.

4/ Concerning Lapack/Lapacke

I was disappointed by Lapacke. It claims to also handle the RowMajor
format, but under the hood it simply copies the matrix into the
ColMajor format (as you wrote in your report). This has a performance
impact and is not really fair.

Personally, when I code something using Lapack, I stick to the
ColMajor mode for this reason, even if I use the LAPACKE interface.

Do you think that, with time, we will have real RowMajor support in
Lapack (without temporary copies)? Is it one of your goals?


C/ Evolution of Blas/Lapack? Blis/Flame-like approaches

As a user, I really like the approach taken by Blis
(https://github.com/flame/blis). It supports the "Conjugate" operator
(my point 1.1/) and general "strides" in both directions.

In this context, a matrix is defined with two "strides" (not only one "leading dimension"):

double* data,
int I_dim, int J_dim,
int I_stride, int J_stride

With that, you get generic in-place transposition and generic 2D
views. I think this is an incredible advantage in some applications
involving general tensors or multidimensional datasets.
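
To make this concrete, here is a minimal sketch (my own naming, not
Blis's actual API) of such a two-stride view:

#include <cstdint>

// A matrix view described by two strides: element (i, j) lives at
//   data[i * I_stride + j * J_stride].
struct StridedView {
    double* data;
    int64_t I_dim, J_dim;        // number of rows / columns
    int64_t I_stride, J_stride;  // memory step when i / j grows by one

    double& operator()(int64_t i, int64_t j) const {
        return data[i * I_stride + j * J_stride];
    }

    // Transposition swaps dimensions and strides; no data is moved.
    StridedView transposed() const {
        return { data, J_dim, I_dim, J_stride, I_stride };
    }
};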

I understand that this is out of scope and not supported by
Blas/Lapack.

Do you think you will go in this direction?

If so, the RowMajor and ColMajor enums are not relevant anymore:

ColMajor <=> (I_stride == 1)
RowMajor <=> (J_stride == 1)

Maybe it is time to define a generic interface:

template <typename T>
void gemv(const Op transA,
          const int64_t A_I_size, const int64_t A_J_size,
          const T alpha,
          const T* A, const int64_t A_I_stride, const int64_t A_J_stride,
          const T* x, const int64_t x_stride,
          const T beta,
          T* y, const int64_t y_stride);

and throw a "not_implemented_yet" exception for general strides (i.e., when (A_I_stride != 1) && (A_J_stride != 1)).
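
With this signature (hypothetical, as above), the classic layouts are
just special stride patterns:

// Column-major m x n matrix A with leading dimension lda:
//   gemv(Op::Identity, m, n, alpha,
//        A, /*A_I_stride=*/1, /*A_J_stride=*/lda, x, 1, beta, y, 1);
//
// Row-major m x n matrix A with leading dimension lda:
//   gemv(Op::Identity, m, n, alpha,
//        A, /*A_I_stride=*/lda, /*A_J_stride=*/1, x, 1, beta, y, 1);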
     
Again, thank you for the work and for encouraging such open discussions.
Good luck and success with this project!
Best,
Vincent Picaud

Jakub Kurzak

Aug 23, 2017, 10:54:24 AM
to Vincent Picaud, SLATE User
Vincent,
This is a pile of good comments.
Allow us some time to process your input.
Looks like Blaze is a good project to mention in the report.
I just found out we missed LATL (https://github.com/langou/latl/).
We will publish a revision at some point.
Jakub



Vincent Picaud

Aug 23, 2017, 12:38:16 PM
to SLATE User, picaud....@gmail.com
Jakub,

Thank you for the positive feedback.
I am happy if I can contribute some interesting comments.

I did not know about LATL; I will have a look.

Best,
Vincent

sl...@christoph-conrads.name

Aug 29, 2017, 9:33:33 PM
to SLATE User
> 4/ Concerning Lapack/Lapacke
>
> I was disappointed by Lapacke. It claims to also handle the RowMajor
> format, but under the hood it simply copies the matrix into the
> ColMajor format (as you wrote in your report). This has a performance
> impact and is not really fair.
>
> Personally, when I code something using Lapack, I stick to the
> ColMajor mode for this reason, even if I use the LAPACKE interface.
>
> Do you think that, with time, we will have real RowMajor support in
> Lapack (without temporary copies)? Is it one of your goals?

BLAS and LAPACK are low-level libraries for linear algebra algorithms, and I do
not see the need to support row-major matrix representations: from a linear
algebra perspective, one is simply dealing with a transposed matrix, and matrix
computations involving the matrix transpose are already supported by BLAS and
LAPACK.

There are many easy-to-use interfaces to the functionality provided by BLAS and
LAPACK, e.g., NumPy, SciPy, R, or Matlab. Hence, if one starts to use BLAS or
LAPACK directly, then I would assume that speed and/or control over memory
consumption are a priority. LAPACK*E* gives you neither. If the BLAS and LAPACK
C++ interface wants to provide the convenience of handling matrices in
row-major format automatically, then the interface would have to examine and
modify the object dimensions and transposition flags for all functions for
optimal performance. This is a major implementation effort (think of the flags
needed to compute the SVD or eigenvalues of non-symmetric matrices) for an
interface that is used mostly by people familiar with linear algebra.

I am against the C++ interface handling matrices in row-major format. The user
should deal with it.

Jakub Kurzak

Aug 29, 2017, 10:15:14 PM
to Christoph Conrads, SLATE User
I am for supporting both column-major and row-major in BLAS++, because it's quick and easy and has already been done in CBLAS.
I am not a fan of supporting it at all in LAPACK++, because there is no quick and easy way to do it.
This is meant as an API for LAPACK, which does not support row-major; therefore, no row-major in LAPACK++.
I would offer the user a transposition function, which can be called explicitly.
After some talks with other ICL'ers, this is also my preferred way of checking for NaNs (credit to Piotr Luszczek).
I think convenience mechanisms that have a serious performance penalty simply don't belong here.
Jakub



Vincent Picaud

Aug 30, 2017, 2:40:07 AM
to SLATE User, sl...@christoph-conrads.name
I am OK with the previous mails, Christoph and Jakub.

To be precise, my opinion is:

  • Lapack -> only column-major support: 100% OK with that.
  • Extra functions for NaN checks/transposition: 100% OK with that.
  • I still have some hesitations concerning row-major support for Blas:
    this functionality nearly comes for free, and the job is already done in CBLAS.

If I have to give an answer, though, I think Christoph is right.

Those who use Blas/Lapack want performance, and most of the high-level wrappers I know (Julia, R, ...) use the col-major format.

-> Hence it is maybe not worth investing coding effort in row-major support.

Vincent

Mark Gates

Aug 30, 2017, 7:08:52 AM
to sl...@christoph-conrads.name, SLATE User
On Aug 29, 2017, at 9:33 PM, sl...@christoph-conrads.name wrote:

>> 4/ Concerning Lapack/Lapacke
>>
>> I was disappointed by Lapacke. It claims to also handle the RowMajor
>> format, but under the hood it simply copies the matrix into the
>> ColMajor format (as you wrote in your report). This has a performance
>> impact and is not really fair.
>>
>> Personally, when I code something using Lapack, I stick to the
>> ColMajor mode for this reason, even if I use the LAPACKE interface.
>>
>> Do you think that, with time, we will have real RowMajor support in
>> Lapack (without temporary copies)? Is it one of your goals?

> BLAS and LAPACK are low-level libraries for linear algebra algorithms, and I do
> not see the need to support row-major matrix representations: from a linear
> algebra perspective, one is simply dealing with a transposed matrix, and matrix
> computations involving the matrix transpose are already supported by BLAS and
> LAPACK.

For BLAS, you can see the initial implementation in the blaspp folder of bitbucket.org/icl/slate. The effort isn’t large, and can be copied from CBLAS, but the logic is a bit less obvious than simply flipping the transpose flag. For instance:

- gemm flips both transpose flags, swaps m <=> n, A <=> B, and lda <=> ldb. It's a bit confusing for an end user to figure out that doing C = A*B becomes C^T = B^T * A^T.

- symm and hemm flip side and uplo, swap m <=> n.

- syrk, syr2k, herk, and her2k flip uplo and trans.

- gemv flips trans and swaps m <=> n. In complex for ConjTrans, because there is no Conj option in BLAS, only ConjTrans, it conjugates alpha, beta, x, y_in, and y_out, at O(m) + O(n) overhead for an O(mn) function when doing row-major.

- geru swaps m <=> n, x <=> y.

- ger swaps m <=> n, x <=> y. In complex, it conjugates y_in and calls geru, at O(n) overhead for an O(mn) function when doing row-major.

- symv and hemv flip uplo. In complex, hemv also conjugates things the same way as gemv.

- syr and her flip uplo. In complex, her also conjugates x, at O(n) overhead for an O(n^2) function when doing row-major.

Checking lda, ldb, etc. also changes.

In short, the effort for BLAS isn’t large, but involves different transformations for different functions. For Level 3 BLAS, there is no performance overhead. For some Level 2 BLAS functions, there is a low-order O(n) overhead to correctly deal with Conj in complex.
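
For instance, one way to realize the gemm transformation (a sketch
following the CBLAS-style swap and assuming a column-major blas::gemm
as in the proposal; the exact BLAS++ internals may differ):

// Row-major C = alpha*op(A)*op(B) + beta*C via the identity
//   C^T = alpha * op(B)^T * op(A)^T + beta * C^T.
// Each row-major array, read column-major, is already the transpose, so
// it suffices to swap A <=> B (with their flags and leading dimensions)
// and m <=> n.
if (layout == Layout::RowMajor) {
    blas::gemm(Layout::ColMajor, transB, transA,
               n, m, k,
               alpha, B, ldb,
                      A, lda,
               beta,  C, ldc);
}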

There are quite a few libraries that use row-major: NumPy by default, Boost::uBLAS by default, and MTL by default; Eigen is col-major by default but also supports row-major. I’m sure other libraries and applications do too. Hence, it seems like a valuable feature to figure out all the details once, instead of forcing each user or library to re-implement them, potentially poorly, by physically transposing matrices.


> There are many easy-to-use interfaces to the functionality provided by BLAS and
> LAPACK, e.g., NumPy, SciPy, R, or Matlab. Hence, if one starts to use BLAS or
> LAPACK directly, then I would assume that speed and/or control over memory
> consumption are a priority. LAPACK*E* gives you neither. If the BLAS and LAPACK
> C++ interface wants to provide the convenience of handling matrices in
> row-major format automatically, then the interface would have to examine and
> modify the object dimensions and transposition flags for all functions for
> optimal performance. This is a major implementation effort (think of the flags
> needed to compute the SVD or eigenvalues of non-symmetric matrices) for an
> interface that is used mostly by people familiar with linear algebra.
>
> I am against the C++ interface handling matrices in row-major format. The user
> should deal with it.

I agree regarding LAPACKE. We always use col-major and the LAPACKE _work interfaces to avoid unnecessary overheads.

For LAPACK, supporting row-major is much more complicated due to the sheer size of the library. For some functions (e.g., LU), in a wrapper there really is no better option than physically transposing the matrix in memory. Hence, we are not proposing to support row-major.

(Although actually the SVD is probably relatively easy since if A = U Sigma V^H then A^H = V Sigma U^H. Just swap dimensions and U, V. But I haven’t implemented that to verify. Many real symmetric / complex Hermitian routines are also probably easy by just flipping uplo.)

  -mark

Jakub Kurzak

Aug 30, 2017, 8:43:49 AM
to Mark Gates, Christoph Conrads, SLATE User
I think that figuring it out across all of LAPACK is a monumental task.
We're not signing up for that.
We're actually planning to tackle the task mostly through autogeneration, as was done in the past for the C API.
Consider that Intel took a pass at it and just stuck a transposition in there.
Jakub



Vincent Picaud

Aug 30, 2017, 9:48:59 AM
to SLATE User, sl...@christoph-conrads.name

> For BLAS, you can see the initial implementation in the blaspp folder of
> bitbucket.org/icl/slate.

Thanks! I will have a look.

> The effort isn’t large, and can be copied from CBLAS, but the logic is a bit
> less obvious than simply flipping the transpose flag. For instance:

> - gemm flips both transpose flags, swaps m <=> n, A <=> B, and lda <=> ldb. It's a bit confusing for an end user to figure out that doing C = A*B becomes C^T = B^T * A^T.
>
> - symm and hemm flip side and uplo, swap m <=> n.
>
> - syrk, syr2k, herk, and her2k flip uplo and trans.
>
> - gemv flips trans and swaps m <=> n. In complex for ConjTrans, because there is no Conj option in BLAS, only ConjTrans, it conjugates alpha, beta, x, y_in, and y_out, at O(m) + O(n) overhead for an O(mn) function when doing row-major.

I still think it would be great to have such a "Conjugate" flag.


> - geru swaps m <=> n, x <=> y.
>
> - ger swaps m <=> n, x <=> y. In complex, it conjugates y_in and calls geru, at O(n) overhead for an O(mn) function when doing row-major.
>
> - symv and hemv flip uplo. In complex, hemv also conjugates things the same way as gemv.
>
> - syr and her flip uplo. In complex, her also conjugates x, at O(n) overhead for an O(n^2) function when doing row-major.
>
> Checking lda, ldb, etc. also changes.
>
> In short, the effort for BLAS isn’t large, but involves different transformations for different functions. For Level 3 BLAS, there is no performance overhead. For some Level 2 BLAS functions, there is a low-order O(n) overhead to correctly deal with Conj in complex.

Great!

Best,
Vincent
 