Implementing a fast matrix “sandwich” product

97 views
Skip to first unread message

Jerry Mao

unread,
Aug 12, 2021, 2:27:28 AM8/12/21
to blis-devel
Hello!

I am an intern working with QuantCo on soon-to-be-open-source software for tabular matrices. One of the key pieces of functionality we are working on is a matrix “sandwich” product, and we are hoping to use BLIS to optimize our implementation.

More precisely, we want to compute, for an NxM matrix A and an N-dimensional vector x, the value of A^T * diag(x) * A, for large N. It is similar to computing A^T A as already supported by BLIS, but with weights from x.

We think implementing a fast sandwich product could be a cool demonstration of BLIS capabilities, and we hope it is an extension that the open-source community will also find useful.

I have been reading the BLIS documentation and am still trying to find my way around the codebase. My understanding is that BLIS decomposes multiplication tasks into blocks that can be solved by the micro-kernel.

I was wondering if you might have any advice about where in the code I should get started? And also, if you might have any suggestions on how to approach implementing this?

Thanks!

- Jerry Mao

Field G. Van Zee

unread,
Aug 12, 2021, 2:48:53 PM8/12/21
to blis-...@googlegroups.com
Hi Jerry,

Some of the people in our community have tossed around the idea of
implementing this sandwich operation (or one very similar to it). There
may even be a prototype of it somewhere out there.

I think I could implement this operation fairly quickly. Maybe a week,
or less. But I'm currently working on a much longer-term project and
haven't been able to indulge in such tasks lately.

I expect others who have looked at the operation more closely will chime in.

Field
> --
> You received this message because you are subscribed to the Google
> Groups "blis-devel" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to blis-devel+...@googlegroups.com
> <mailto:blis-devel+...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/blis-devel/f3cbe87e-f1ee-4248-849d-f69464792a19n%40googlegroups.com
> <https://groups.google.com/d/msgid/blis-devel/f3cbe87e-f1ee-4248-849d-f69464792a19n%40googlegroups.com?utm_medium=email&utm_source=footer>.

Jeff Diamond

unread,
Aug 12, 2021, 2:53:40 PM8/12/21
to blis-...@googlegroups.com
Classic operator fusion, the #1 most requested blis feature for
post_BLAS functionality. :)
Oh, if only we could just get more people working on the BLIS APIs -
manpower is so limited.

Matthews, Devin

unread,
Aug 12, 2021, 3:01:54 PM8/12/21
to blis-...@googlegroups.com

Dear Jerry,

 

I have implemented similar operations C = A*D*B, C = D*A*B, etc. (with D a diagonal matrix) in TBLIS (https://github.com/devinamatthews/tblis). We're currently working on moving some of that infrastructure into BLIS so that we can easily implement the same types of operations.

 

(The technical answer: the way to implement this is to incorporate scaling by the weights x into the "packing kernel". This is an operation which copies blocks of A into a special optimized format. Adding the FLOPs to do the weighting there is essentially free, but requires modifying the packing kernel which is not currently easy to do.)

 

Thanks,

Devin Matthews

 

From: "blis-...@googlegroups.com" <blis-...@googlegroups.com> on behalf of "Field G. Van Zee" <fi...@cs.utexas.edu>
Date: Thursday, August 12, 2021 at 1:49 PM
To: "blis-...@googlegroups.com" <blis-...@googlegroups.com>
Subject: Re: [blis-devel] Implementing a fast matrix “sandwich” product

 

[EXTERNAL SENDER]

To unsubscribe from this group and stop receiving emails from it, send an email to blis-devel+...@googlegroups.com.

 

Jeff Hammond

unread,
Aug 12, 2021, 3:06:34 PM8/12/21
to Field G. Van Zee, blis-discuss
This is funny, because I've been calling the "tengy" kernel of NWChem [1] a matrix club sandwich for years, and it does something different.  But I defer to the mathematicians who name things.  In my case, BLIS is unlikely to be useful, because there is no matrix product happening.


Jeff

To unsubscribe from this group and stop receiving emails from it, send an email to blis-devel+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/blis-devel/71ac5cf1-6158-2bbc-8da0-0b47b77d5dfb%40cs.utexas.edu.


--

Jerry Mao

unread,
Aug 13, 2021, 8:42:17 AM8/13/21
to blis-devel
Thanks so much for the information and advice! Thanks also to Devin for the link to TBLIS, I was able to give it a try today and will experiment with it further.

We've been thinking about how to incorporate the sandwich product into BLIS, and we were also drawn to the idea of scaling during the packing step. To achieve this, one way we are considering is to somehow allow the user to inject a function pointer, which modifies the way that packing behaves. We feel like this would be better than, say, passing the vector as an additional parameter and potentially causing a breaking change. Does this seem like the correct approach for adding the functionality to BLIS?

Thanks,
- Jerry

Robert van de Geijn 2

unread,
Aug 13, 2021, 10:06:16 AM8/13/21
to blis-...@googlegroups.com

You may want to look at https://github.com/ChenhanYu/hmlp for inspiration.

Jeff Diamond

unread,
Aug 13, 2021, 10:58:10 AM8/13/21
to blis-...@googlegroups.com
Hi Jerry. 

We haven't yet had to overload packing routines.  The best options I see would be (1) use a sandbox implementation, where you control the call to the packing buffers, or (2), in theory, you can modify which function pointers in the array that's used to call packing.  This might be possible during the configuration call, but I'm not sure what order the packing functions are initialized.  You'd need to restore them when you're finished.  Since this would remain the same for an entire high level call, it seems like it'd be reasonably thread safe to do.

I'm glad you brought this up, because we've been thinking about an official blis operator extension API, and packing should definitely be part of that.  One idea that's been brought up by our user base is similar to what you suggest - have one API that just lets the user directly pass in the functions he wants blis to use for the given call.

- Jeff



On 8/13/21 7:42 AM, Jerry Mao wrote:

Jerry Mao

unread,
Aug 27, 2021, 8:19:52 AM8/27/21
to blis-devel
Hello,

Thanks so much for the help and advice! Over the last week Field has helped me with implementing the sandwich product in the BLIS sandbox, and I've also added support for parallelism in the 4th-loop to improve performance in our use case. The implementation works by applying the FLOPs during the packing step and I've just opened a PR here; it's been very exciting to learn about the BLIS framework and to be able to build a feature upon it!

Thanks again for the help!

Best,
- Jerry

Matthews, Devin

unread,
Aug 27, 2021, 7:56:21 PM8/27/21
to blis-devel

Hi Jerry,

 

I'm glad that you've had success with the "sandbox" approach and Field's help. I mentioned that we're working towards an interface to make it easier to implement extended operations like this using the "standard" BLIS framework (i.e. not a sandbox). I put together a mock-up of what your C += A*D*A^T operation might look like, see the attached source file. This example won't compile and run currently but should in the near future. We would love to have your input on the mock-up and especially any areas that still seem a bit rough or confusing. It's only 165 lines with lots of comments so hopefully this represents a simpler and more streamlined approach to your and others' problems.

 

Devin

 

From: "blis-...@googlegroups.com" <blis-...@googlegroups.com> on behalf of Jerry Mao <jerr...@quantco.com>
Date: Friday, August 27, 2021 at 7:19 AM
To: "blis-...@googlegroups.com" <blis-...@googlegroups.com>
Subject: Re: [blis-devel] Implementing a fast matrix “sandwich” product

 

[EXTERNAL SENDER]

Hello,

syrk_diagonal_example.c
Reply all
Reply to author
Forward
0 new messages