Computational cost of creating a libflame object.


Erling D. Andersen

Nov 7, 2018, 10:13:10 AM
to libflame-discuss
Hi

I am using potrf and friends from MKL within a sparse Cholesky, so I end up doing tons of smallish potrf, trsm, and similar calls.

Now it seems that if I want to use libflame instead of LAPACK, then for each potrf I have to create a temporary libflame object and delete it after the potrf is run.

Let us assume the size of the matrix is n. The cost of creating the libflame object is then O(1), right?

Erling

Field G. Van Zee

Nov 7, 2018, 7:47:00 PM
to libflame...@googlegroups.com


On 11/7/18 9:13 AM, Erling D. Andersen wrote:
> Hi
>
> I am using potrf and friends from MKL within a sparse Cholesky, so I end up doing tons of smallish potrf, trsm, and similar calls.

Hi Erling,

How small is smallish? Please be advised that libflame will not give
good performance for very small problems. (I consider "small" to be
anything less than 100, though others may draw the line in a slightly
different place.)

>
> Now it seems that if I want to use libflame instead of LAPACK, then for each potrf I have to create a temporary libflame object and delete it after the potrf is run.

I'm not sure I follow. What do you mean by "I have to create a temporary
libflame object"?

libflame provides the ?potrf_() APIs, so you don't have to do anything
beyond what you would normally do for LAPACK.
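
For example, a call like the following compiles and runs unchanged whether you link against a reference LAPACK or against libflame (a minimal sketch; the matrix contents are made up, and I'm assuming a build with the LAPACK compatibility layer enabled):

/* Sketch: Cholesky via the standard LAPACK-style interface.
   dpotrf_() is provided by libflame's LAPACK compatibility layer,
   so no libflame-specific objects appear in user code. */
#include <stdio.h>

void dpotrf_( char* uplo, int* n, double* a, int* lda, int* info );

int main( void )
{
    int    n = 3, lda = 3, info;
    /* Column-major symmetric positive definite matrix (made-up data). */
    double a[ 9 ] = { 4.0, 2.0, 2.0,
                      2.0, 5.0, 1.0,
                      2.0, 1.0, 6.0 };

    dpotrf_( "L", &n, a, &lda, &info );
    printf( "dpotrf info = %d\n", info );
    return 0;
}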

>
> Let us assume the size of the matrix is n. The cost of creating the libflame object is then O(1), right?

It sounds like you're concerned about the behind-the-scenes management
of objects/metadata for the purposes of using libflame's internal
Cholesky implementation.

If so, then the answer is "yes". Constant order cost.
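
For what it's worth, even if you drove the native object API yourself, the usual pattern wraps your existing buffer without copying it. A sketch (assuming FLA_Init() has already been called, and using the standard object helper routines):

#include "FLAME.h"

/* Sketch: factor an existing n-by-n column-major buffer through the
   native object API. Creating and attaching the object only sets up
   a small header (O(1) work) and never copies the matrix data. */
void chol_via_flame( double* buf, int n )
{
    FLA_Obj A;

    FLA_Obj_create_without_buffer( FLA_DOUBLE, n, n, &A );
    FLA_Obj_attach_buffer( buf, 1, n, &A );   /* rs = 1, cs = n */

    FLA_Chol( FLA_LOWER_TRIANGULAR, A );

    FLA_Obj_free_without_buffer( &A );
}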

Field

Erling D. Andersen

Nov 9, 2018, 3:41:33 AM
to libflame-discuss


Thanks. That answers my questions.

Erling D. Andersen

Nov 9, 2018, 4:14:11 AM
to libflame-discuss
Background info; no reply expected.

http://www.numerical.rl.ac.uk/people/j_scott/publications/2010/HoggReidScot2010_sisc.pdf

gives a good description of how to build a sparse Cholesky.

I had fooled myself into thinking potrf is important in this case. It is not; it is of course gemm, syrk, and trsm that are important.

However, note that you tend to work on matrices of a size that is a small multiple of 64, e.g. 256. Hence if you do A*B, then A would have 256 columns and an arbitrary number of rows. Since we are working with dense data inside a sparse matrix, the number of columns is quite small, e.g. 8.

Hence, overhead is important. For instance, I found I should inline axpy and dot rather than use the MKL ones, because the call overhead kills any advantage in my case.
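
To illustrate what I mean (a sketch; the fixed length 8 is just for illustration):

/* Sketch: hand-inlined dot and axpy for tiny fixed-length vectors.
   At this size the BLAS function-call and dispatch overhead can
   exceed the cost of the arithmetic itself. */
static inline double dot8( const double* x, const double* y )
{
    double s = 0.0;
    for ( int i = 0; i < 8; ++i )
        s += x[ i ] * y[ i ];
    return s;
}

static inline void axpy8( double alpha, const double* x, double* y )
{
    for ( int i = 0; i < 8; ++i )
        y[ i ] += alpha * x[ i ];
}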

Erling

Field G. Van Zee

Nov 9, 2018, 12:40:01 PM
to libflame...@googlegroups.com


On 11/9/18 03:14, Erling D. Andersen wrote:
> Background info; no reply expected.
>
> http://www.numerical.rl.ac.uk/people/j_scott/publications/2010/HoggReidScot2010_sisc.pdf
>
> gives a good description of how to build a sparse Cholesky.
>
> I had fooled myself into thinking potrf is important in this case. It is not; it is of course gemm, syrk, and trsm that are important.
>
> However, note that you tend to work on matrices of a size that is a small multiple of 64, e.g. 256. Hence if you do A*B, then A would have 256 columns and an arbitrary number of rows. Since we are working with dense data inside a sparse matrix, the number of columns is quite small, e.g. 8.

It sounds like you are saying that your (relatively) small matrix
multiplications use dimensions m = n = 256 with k = 8. Indeed, that
value of k is too small to achieve high performance with most
implementations of gemm/herk/syrk, since those implementations target
larger rank-k updates.

There are projects that are investigating how to get as much performance
as possible (even if short of "high") when one dimension is very small,
such as k = 8 in your case. So there is hope, but such logic is not yet
integrated into BLIS, which is our dense linear algebra library for
BLAS-like computations. BLIS provides BLAS APIs for backwards
compatibility, but we encourage users to check out our two other APIs
(one BLAS-like and one object-based), each of which unlocks *much* more
functionality than is available through the BLAS interfaces. BLIS
performance is very competitive on many modern architectures for single
and multithreaded execution, often outperforming OpenBLAS.
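
To give a flavor of the BLAS-like typed API, a rank-8 update with your dimensions would look roughly like this (a sketch; I'm assuming column-major storage, so row strides are 1 and column strides are the leading dimensions):

#include "blis.h"

/* Sketch: C := C + A*B with m = n = 256, k = 8 via BLIS's typed API.
   Each operand carries an explicit (row stride, column stride) pair,
   so layouts other than column-major work as well. */
void rank8_update( double* a, double* b, double* c )
{
    dim_t  m = 256, n = 256, k = 8;
    double alpha = 1.0, beta = 1.0;

    bli_dgemm( BLIS_NO_TRANSPOSE, BLIS_NO_TRANSPOSE,
               m, n, k,
               &alpha, a, 1, m,    /* A is m x k */
                       b, 1, k,    /* B is k x n */
               &beta,  c, 1, m );  /* C is m x n */
}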

I encourage you to check out our project if you haven't already [1].

[1] https://github.com/flame/blis

> Hence, overhead is important. For instance, I found I should inline axpy and dot rather than use the MKL ones, because the call overhead kills any advantage in my case.

Intel has vast resources which it can invest in projects such as MKL, so
if MKL does not perform well at your problem dimensions, it suggests to
me that the problem is indeed quite difficult.

In any case, thanks for your feedback. Please keep in touch with us.

Field