BLIS Batch API?


Minh Quan HO

Aug 8, 2021, 2:15:47 PM
to blis-devel

Jeff Hammond

Aug 8, 2021, 3:43:50 PM
to Minh Quan HO, blis-devel
Batched BLAS on CPUs does not have as much value as you might think.  Smart folks in Intel Labs will tell you that a simply written (threaded) loop over GEMM operations will do as well as any batched API, assuming one uses a good implementation of small-matrix BLAS like https://github.com/hfp/libxsmm.  There is some benefit to nontemporal stores on x86.
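To make that concrete, here is a minimal sketch of the threaded-loop approach (standard CBLAS calls; the function name and array-of-pointers batch layout are just one possible convention; compile with OpenMP enabled):

    /* A plain OpenMP loop over independent small GEMMs: the "simply
       written (threaded) loop" described above.  Each iteration is a
       standard CBLAS call; link against any small-matrix-optimized BLAS. */
    #include <cblas.h>

    void gemm_loop_batch(int batch, int m, int n, int k,
                         const double **A, const double **B, double **C)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < batch; ++i)
            /* C[i] = 1.0 * A[i] * B[i] + 0.0 * C[i], all row-major */
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        m, n, k,
                        1.0, A[i], k,
                        B[i], n,
                        0.0, C[i], n);
    }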

ARM CPUs should be simpler, since they have less trouble driving memory and do not rely on wide SIMD (A64FX being an obvious exception here) to saturate compute.

There are probably some limits where a true batched API helps, but I'm not aware of any real world use cases for CPUs.

Jeff


Minh Quan HO

Aug 8, 2021, 4:26:32 PM
to Jeff Hammond, blis-devel
Indeed, I'm asking not for CPUs but for accelerators (DSPs, GPGPUs, etc.), on which Host-Device communication is expensive (I'm working on an OpenCL-based offloading BLIS).

And not all vendors may be able to produce an optimized small-matrix library like Intel did. If BLIS could reuse its (vendor-optimized) micro-kernels and map them onto batched small matrices (via some magic packing routines), bingo.
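Something like this hypothetical sketch of the idea (the microkernel function-pointer type below is a simplified stand-in, not BLIS's actual microkernel interface, and the small matrices are assumed pre-packed into micropanels):

    /* Hypothetical sketch: if each small A is pre-packed into an MR x k
       micropanel and each small B into a k x NR micropanel, a batched
       routine could drive the existing vendor-optimized microkernel
       directly, one small problem per call. */
    typedef void (*dgemm_ukr_t)(int k,
                                const double *alpha,
                                const double *a_packed,   /* MR x k  */
                                const double *b_packed,   /* k  x NR */
                                const double *beta,
                                double *c, int rs_c, int cs_c);

    void batched_ukr_gemm(dgemm_ukr_t ukr, int batch, int k,
                          const double *alpha, const double *beta,
                          const double **Ap, const double **Bp,
                          double **C, int rs_c, int cs_c)
    {
        for (int i = 0; i < batch; ++i)
            ukr(k, alpha, Ap[i], Bp[i], beta, C[i], rs_c, cs_c);
    }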

Jeff Diamond

Aug 8, 2021, 4:37:20 PM
to blis-...@googlegroups.com
Hi Minh.  Presumably, then, you'd also be asking for BLIS kernels not written in assembly language?

Minh Quan HO

Aug 8, 2021, 4:50:56 PM
to Jeff Diamond, blis-devel
Hi Jeff,
I'm not sure I understand your question correctly. But my presumption
is that any vendor should at least optimize BLIS's ukernels in
assembly to get the best performance on big matrices. Writing another
library for small matrices, though, may be less affordable, typically
for small companies with fewer people and a new architecture. So how
can BLIS evolve to help them?

Jeff Diamond

Aug 8, 2021, 5:33:14 PM
to Minh Quan HO, blis-devel
Hi Minh.  When you mentioned offloading kernels to the GPU using OpenCL,
I didn't know whether OpenCL supported custom assembly language, or
whether GPUs accepted assembly language at all.  (I guess technically
AMD ones would, I just don't know the details, and I suppose you might
treat PTX like assembly, though I'm super skeptical, because if there's
one thing compilers don't do, it's register blocking.)

The small-matrix stuff would take more of an investment, given the
additional kernels.  I personally think it's too much for a single
small company to handle, but the hope is that, through the network of
users, well-known ISAs could be supported.

- Jeff

Jeff Diamond

Aug 8, 2021, 5:41:20 PM
to blis-...@googlegroups.com
One thing worth cautioning you about is that most of the magic of BLIS
isn't the kernels - it's the framework - meaning the loops, the
parallelism, the traversal patterns through the data - the Goto
algorithm.  And the framework is optimized around the on-chip memory
system of a CPU.  It's likely that to get good performance out of a GPU,
you'd use a different "framework" to deal with far higher latencies and
far lower amounts of cache per thread - the same theoretical framework,
but very different traversal.  On an FPGA, things might look different
yet again.

It's all about the notion that matrix multiply is a memory movement
problem, not a compute problem.
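For readers new to that structure, here is a compact reference sketch of the Goto-style loop nest (a simplification, not BLIS source: block sizes are illustrative placeholders, the packing is a plain row-major copy rather than BLIS's MR/NR-strided micropanels, and the "microkernel" is plain C rather than assembly):

    /* Simplified Goto-style GEMM: five loops around a small "microkernel",
       with A and B packed so each loop level reuses data from the matching
       cache level.  Computes C (m x n) += A (m x k) * B (k x n), row-major. */
    #include <stdlib.h>
    #include <string.h>

    enum { MC = 64, NC = 256, KC = 128, MR = 4, NR = 4 };
    #define MIN(a,b) ((a) < (b) ? (a) : (b))

    void gemm_goto(int m, int n, int k,
                   const double *A, const double *B, double *C)
    {
        double *Ap = malloc(sizeof(double) * MC * KC);   /* packed A block */
        double *Bp = malloc(sizeof(double) * KC * NC);   /* packed B panel */

        for (int jc = 0; jc < n; jc += NC) {             /* 5th loop: NC   */
            int nc = MIN(NC, n - jc);
            for (int pc = 0; pc < k; pc += KC) {         /* 4th loop: KC   */
                int kc = MIN(KC, k - pc);
                for (int p = 0; p < kc; ++p)             /* pack kc x nc B */
                    memcpy(&Bp[p * nc], &B[(pc + p) * n + jc],
                           sizeof(double) * nc);
                for (int ic = 0; ic < m; ic += MC) {     /* 3rd loop: MC   */
                    int mc = MIN(MC, m - ic);
                    for (int i = 0; i < mc; ++i)         /* pack mc x kc A */
                        memcpy(&Ap[i * kc], &A[(ic + i) * k + pc],
                               sizeof(double) * kc);
                    for (int jr = 0; jr < nc; jr += NR)       /* 2nd: NR   */
                        for (int ir = 0; ir < mc; ir += MR) { /* 1st: MR   */
                            int mr = MIN(MR, mc - ir);
                            int nr = MIN(NR, nc - jr);
                            /* "microkernel": mr x nr rank-kc update of C */
                            for (int i = 0; i < mr; ++i)
                                for (int j = 0; j < nr; ++j) {
                                    double s = 0.0;
                                    for (int p = 0; p < kc; ++p)
                                        s += Ap[(ir + i) * kc + p]
                                           * Bp[p * nc + (jr + j)];
                                    C[(ic + ir + i) * n + (jc + jr + j)] += s;
                                }
                        }
                }
            }
        }
        free(Ap);
        free(Bp);
    }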



Minh Quan HO

Aug 8, 2021, 6:13:43 PM
to Jeff Diamond, blis-devel
Maybe I'm going off-topic, but I'll try to explain my context: I
managed to turn BLIS into such a "framework" for Host+Accelerator: a
BLIS-Host library and a BLIS-Device runtime/library.

The POC was done on my company's accelerator (Kalray MPPA, somewhat
similar to TI C6x), with OpenCL plus extensions. The BLIS-Host
(x86/aarch64) communicates with the BLIS-Device (through an OpenCL
backend in BLIS) to send/receive data and commands. The result is a
BLIS-Host library doing implicit offloading onto the Device (like
NVIDIA cuBLAS).

I then get quite good efficiency on big matrices, but not on small
ones, due to obvious sequential overheads (Amdahl's law). That's why
I'm looking at a batched API.

Minh Quan HO

Aug 8, 2021, 6:27:31 PM
to Jeff Diamond, blis-devel
EDIT: by "implicit offloading" I mean a drop-in BLAS library to turn
any CPU-only BLAS into offloading (input matrices in Host memory).
cuBLAS API seems to require on-device pointers.
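A minimal sketch of the drop-in idea, to make it concrete (it assumes the reference cblas.h declarations; blis_device_dgemm is a hypothetical internal entry point, not a real BLIS symbol):

    /* Drop-in offloading BLAS: the library itself defines the standard
       CBLAS symbol, so an unmodified application linking against it is
       transparently offloaded.  A, B and C are ordinary host pointers;
       the runtime moves data to the device and back. */
    #include <cblas.h>

    extern void blis_device_dgemm(enum CBLAS_ORDER order,         /* hypothetical */
                                  enum CBLAS_TRANSPOSE transa,
                                  enum CBLAS_TRANSPOSE transb,
                                  int m, int n, int k,
                                  double alpha, const double *A, int lda,
                                  const double *B, int ldb,
                                  double beta, double *C, int ldc);

    void cblas_dgemm(const enum CBLAS_ORDER order,
                     const enum CBLAS_TRANSPOSE transa,
                     const enum CBLAS_TRANSPOSE transb,
                     const int m, const int n, const int k,
                     const double alpha, const double *A, const int lda,
                     const double *B, const int ldb,
                     const double beta, double *C, const int ldc)
    {
        blis_device_dgemm(order, transa, transb, m, n, k,
                          alpha, A, lda, B, ldb, beta, C, ldc);
    }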

Jeff Diamond

Aug 8, 2021, 6:29:25 PM
to blis-...@googlegroups.com
And of course, if your matrices are so small they fit entirely into an
L1 cache or an FPGA SRAM buffer, then you don't need a framework at all
- you just need enough bandwidth. :)

Jeff Diamond

Aug 8, 2021, 6:34:14 PM
to Minh Quan HO, blis-devel
Thanks - context always helps.  OK, if I understand, you're using the
BLIS framework to traverse the problem on the CPU and then send chunks
of work to the GPU? Very interesting - what size chunks of work are you
dispatching to the GPU from BLIS?

Minh Quan HO

Aug 8, 2021, 6:47:20 PM
to Jeff Diamond, blis-devel
For the moment: chunks of M x NC (the second chunk is sent
asynchronously while the first chunk is computed), i.e. the 5th loop
is offloaded. It is the BLIS-Device that rolls out the whole Goto
algorithm on each chunk; the BLIS-Host is essentially an empty shell,
sending data and waiting for results.

The traversal scheme on the Host is still very simple today, and if M
or N is small, only some cores of the Device are used for computation;
the others idle.
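For the curious, a rough host-side sketch of that send-ahead scheme (names and sizes are illustrative; it assumes an out-of-order OpenCL command queue, or equivalently separate transfer/compute queues, so the write of chunk i+1 can overlap the kernel on chunk i, with events enforcing the per-buffer dependencies; error checks omitted):

    #include <CL/cl.h>
    #include <stddef.h>

    void offload_chunks(cl_command_queue q, cl_kernel gemm_chunk,
                        cl_mem dev_buf[2], const char *host_data,
                        size_t chunk_bytes, int nchunks)
    {
        cl_event xfer, done[2] = { NULL, NULL };

        for (int i = 0; i < nchunks; ++i) {
            int cur = i & 1;
            /* Non-blocking write of chunk i into its ping-pong buffer,
               gated on the previous kernel that used this buffer. */
            clEnqueueWriteBuffer(q, dev_buf[cur], CL_FALSE, 0, chunk_bytes,
                                 host_data + (size_t)i * chunk_bytes,
                                 done[cur] ? 1 : 0,
                                 done[cur] ? &done[cur] : NULL, &xfer);
            if (done[cur]) clReleaseEvent(done[cur]);

            clSetKernelArg(gemm_chunk, 0, sizeof(cl_mem), &dev_buf[cur]);
            size_t gws = 1;   /* placeholder global work size */
            /* The kernel waits only on its own transfer, so the other
               buffer's transfer can proceed while it computes. */
            clEnqueueNDRangeKernel(q, gemm_chunk, 1, NULL, &gws, NULL,
                                   1, &xfer, &done[cur]);
            clReleaseEvent(xfer);
        }
        for (int b = 0; b < 2; ++b)
            if (done[b]) clReleaseEvent(done[b]);
        clFinish(q);
    }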

Jeff Diamond

Aug 8, 2021, 7:30:18 PM
to Minh Quan HO, blis-devel
OK, so it sounds like, for now, essentially all of the multiplication is
done on the GPU - but if the chunk is small, it doesn't have to be that
sophisticated.

Minh Quan HO

Aug 9, 2021, 11:48:38 AM
to blis-devel
It depends. If an application has thousands or millions of small GEMMs, it could still be interesting to see whether we can use accelerators instead of the CPU, given the total flops.

And yes, data movement must be handled wisely, too.

Jeff Diamond

Aug 9, 2021, 1:35:20 PM
to blis-...@googlegroups.com
Thanks again for the description.  There are a lot of folks trying to get BLIS to dispatch to GPUs, but that's typically for huge matrices.  So your focus on lots of small matrices is pretty interesting.  Hopefully you'll let us know what you find. :)

-- 
  Jeff Diamond

Minh Quan HO

Aug 9, 2021, 3:36:37 PM
to Jeff Diamond, blis-devel
Thanks Jeff for the discussion too. If you know people who are trying to make BLIS hybrid, feel free to send them my contact.

Jeff Hammond

Aug 10, 2021, 2:48:50 AM
to Jeff Diamond, blis-discuss
On GPUs, batched BLAS allows GEMM operations to hit peak performance, e.g. https://developer.nvidia.com/blog/cublas-strided-batched-matrix-multiply/.  My argument against batched APIs is strictly in the context of CPUs, because I didn't realize someone was porting BLIS to GPUs :-)
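For reference, a host-side sketch of the strided-batched call described in that post (handle setup and error checks omitted; note that dA, dB and dC must be device pointers, which is exactly the constraint mentioned earlier in this thread):

    /* Sketch of a cuBLAS strided-batched GEMM: "batch" copies of
       C = alpha*A*B + beta*C, with consecutive matrices separated by a
       fixed stride in device memory (here, one full matrix apart). */
    #include <cublas_v2.h>

    void strided_batch_dgemm(cublasHandle_t h, int batch,
                             int m, int n, int k,
                             const double *dA, const double *dB, double *dC)
    {
        const double alpha = 1.0, beta = 0.0;
        /* Column-major, no transposition */
        cublasDgemmStridedBatched(h, CUBLAS_OP_N, CUBLAS_OP_N,
                                  m, n, k,
                                  &alpha,
                                  dA, m, (long long)m * k,
                                  dB, k, (long long)k * n,
                                  &beta,
                                  dC, m, (long long)m * n,
                                  batch);
    }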

Hammond

Minh Quan HO

Aug 10, 2021, 3:20:26 AM
to Jeff Hammond, Jeff Diamond, blis-discuss
Not a problem. In case someone is interested, I gave a presentation at
BLIS Retreat 2020:
https://www.cs.utexas.edu/users/flame/BLISRetreat2020/Minh.html
(There was a typo in the author list: Stepan Nassyr from Juelich
Supercomputing Centre was listed, but this work was Kalray-only.)

Minh Quan

Jeff Diamond

Aug 10, 2021, 12:42:13 PM
to Minh Quan HO, Jeff Hammond, blis-discuss
Thanks to both of you.  This is really useful info. :)