Implementing convolution product


kal...@gmail.com

unread,
Aug 4, 2018, 4:16:44 PM
to blis-discuss
Hey folks,

I'm investigating ways to leverage BLIS GEMM to implement efficient convolution for neural network inference. At its core, the challenge is the same as in GEMM: pushing as many multiplications as possible through the CPU.

A typical case is the 2D convolution.

Example:
* input is a 400x300 image with 3 channels (RGB) (400x300x3)
* the filter is a 5x5 square, operating on the 3 input channels and extracting 128 features (5x5x3x128)
* result is 400x300x128 (disregarding boundary issues); each "pixel" is now 128 float values deep

Number of multiplications: 400x300x3x5x5x128 = about 1.15 billion.

One easy way to improve on the naive approach is to generate a "megamatrix" A and use GEMM (see the sketch after this list):
* A is 400x300 rows high and 5x5x3 columns wide: one row per pixel in the result. This matrix is highly redundant; each input value appears 5x5 times (once for each output pixel it contributes to).
* B is just the filter rearranged: 128 columns, one per output feature, with 5x5x3 rows.
* GEMM computes all values in one single pass, and the required result is one stride manipulation away: each row is a "pixel", and each row has 128 columns, one per feature.
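
To make the layout concrete, here is a minimal C sketch of the megamatrix construction (often called im2col) followed by the single GEMM call via BLIS's typed API. The helper names are mine, not BLIS's, and boundary handling is "valid" (unpadded) for brevity; a real version would pad or clamp:

/* Sketch of the "megamatrix" (im2col) approach, assuming a
 * row-major H x W x C input (channel-minor). Illustrative only. */
#include <blis.h>

/* in:  H x W x C input image
 * A:   (Ho*Wo) x (KH*KW*C) output, one row per output pixel   */
static void im2col(const float* in, float* A,
                   int H, int W, int C, int KH, int KW)
{
    const int Ho = H - KH + 1, Wo = W - KW + 1;
    for (int y = 0; y < Ho; ++y)
    for (int x = 0; x < Wo; ++x)
    {
        float* row = A + ((size_t)y * Wo + x) * (KH * KW * C);
        for (int ky = 0; ky < KH; ++ky)
        for (int kx = 0; kx < KW; ++kx)
        for (int c  = 0; c  < C;  ++c)
            *row++ = in[((size_t)(y + ky) * W + (x + kx)) * C + c];
    }
}

/* One GEMM then computes every output pixel at once:
 *   Cmat (Ho*Wo x F)  =  A (Ho*Wo x KH*KW*C)  *  B (KH*KW*C x F) */
static void conv_gemm(const float* A, const float* B, float* Cmat,
                      dim_t m, dim_t n, dim_t k)
{
    float alpha = 1.0f, beta = 0.0f;
    bli_sgemm( BLIS_NO_TRANSPOSE, BLIS_NO_TRANSPOSE, m, n, k,
               &alpha,
               (float*)A, k, 1,   /* row-major A: rs = k, cs = 1 */
               (float*)B, n, 1,   /* row-major B: rs = n, cs = 1 */
               &beta,
               Cmat, n, 1 );
}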

Despite the size of the "megamatrix" A, this is a significant improvement over the naive six-nested-loops approach.

I'm looking into smarter ways though:
* the filter is usually constant, so B should be converted into packed panels once and for all;
* creating the megamatrix, knowing that GEMM will immediately repack it into panels anyway, is an even bigger waste.

So I'm considering doing the panelization on my side. One option is to ask BLIS for MR and NR, build panels of the right size, and then call the microkernel myself (see the sketch below). Another option is to call GEMM and bet that everything degenerates nicely since I'm handing it panels of the right size...
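
For the first option, BLIS does expose its blocksizes through the context query API. A minimal sketch, assuming the bli_gks_query_cntx() / bli_cntx_get_blksz_def_dt() interface (the microkernel call itself is omitted, since its packing and edge-case conventions are more involved):

/* Query BLIS for the single-precision register blocksizes MR/NR
 * (and the k-dimension cache blocksize KC) on this hardware.   */
#include <stdio.h>
#include <blis.h>

int main(void)
{
    bli_init();
    cntx_t* cntx = bli_gks_query_cntx();  /* context for the detected arch */

    dim_t mr = bli_cntx_get_blksz_def_dt( BLIS_FLOAT, BLIS_MR, cntx );
    dim_t nr = bli_cntx_get_blksz_def_dt( BLIS_FLOAT, BLIS_NR, cntx );
    dim_t kc = bli_cntx_get_blksz_def_dt( BLIS_FLOAT, BLIS_KC, cntx );

    printf( "sgemm: MR = %d, NR = %d, KC = %d\n",
            (int)mr, (int)nr, (int)kc );

    bli_finalize();
    return 0;
}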

What do you folks think?

Jeff Hammond

unread,
Aug 4, 2018, 4:26:42 PM
to kal...@gmail.com, blis-discuss
You might look at how https://libxsmm.readthedocs.io/en/latest/libxsmm_dl/ does things and then port that design over to BLIS. 

Jeff

Sent from my iPhone

Field G. Van Zee

unread,
Aug 5, 2018, 8:54:26 PM
to blis-d...@googlegroups.com
kalizoy,

Thank you for your interest in BLIS. Also, I appreciate your very
detailed problem description. (There is no such thing as too many
details, imo.)

I'm not well versed in convolutions, but based on the way you describe
your problem, it sounds like you are proposing casting your convolutions
as one large matrix multiplication where A is 120000x75 and B is 75x128.
(Please correct me if I'm off.) I can't speak to whether this is a
"good" strategy compared to other convolution approaches, but let's
assume for a moment it is.

Since your m dimension is overwhelmingly large compared to your small n
and k dimensions, it seems to make sense to only parallelize m. If you
use BLIS for parallelism, you do NOT want to let BLIS parallelize the
problem on your behalf, as it may end up allocating some ways of
parallelism over n or k--in other words, you do NOT want to set
BLIS_NUM_THREADS. Instead, you should explicitly specify the loop you
wish to parallelize; there is only one that makes sense: the IC loop.

You are welcome to parallelize this in your application, but before you
go to such effort, you should at least try out BLIS's internal
multithreading and see how it performs. To use multithreaded BLIS,
please read the Multithreading.md document for complete details [1].

But the basic gist is that you configure with:

--enable-threading=[openmp|pthreads]

and then compile the library. When it's time to run your BLIS-linked
executable, try focusing all your parallelism in the IC loop. If you
have eight cores, try:

export BLIS_IC_NT=8
./application

If you configure with 'openmp' and are using the GNU implementation, you
may also want to set the GOMP_CPU_AFFINITY variable prior to running
your program:

export GOMP_CPU_AFFINITY="0 1 2 3 4 5 6 7"

It may not matter since you are only parallelizing one dimension/loop.
(I can't remember.) But it shouldn't hurt.

Your n and k dimensions are small enough that your per-core performance
will suffer somewhat, but there's probably not much you can do about
that. (You may still get good parallelism.)

Please let us know how it works out.

Field

PS: What hardware microarchitecture are you using? Is it
haswell/broadwell/skylake? I just realized there is a minor optimization
bug when the storage format of your output matrix C does not match the
output preference of the microkernel. It could be easy to end up
parallelizing the wrong dimension inadvertently. I'll look into it more
closely in the coming days and commit a fix.

[1] https://github.com/flame/blis/blob/master/docs/Multithreading.md

Devin Matthews

unread,
Aug 5, 2018, 10:21:24 PM
to blis-discuss
You might also think about using my TBLIS package (github.com/devinamatthews/tblis) to frame this as a tensor contraction problem. I believe it is possible to set up the tensor strides such that the exact same mathematical operation is computed, but the data can be left in its original format, saving the time spent forming the "megamatrix". I'd be happy to discuss this in more detail if you're interested.
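
To illustrate the stride idea (this is the concept, not TBLIS's actual API): the megamatrix is an overlapping, copy-free view of the original buffer, because the image index and the filter index share strides:

/* For a row-major H x W x C image, megamatrix element
 *   A[(y*Wo + x)][(ky*KW + kx)*C + c]
 * is exactly in[(y+ky)*W*C + (x+kx)*C + c]. So A is a 5-index
 * view (y, x, ky, kx, c) of the ORIGINAL buffer with strides
 * (W*C, C, W*C, C, 1): y/ky and x/kx share strides, which is
 * what makes the view overlapping and copy-free.               */
#include <stddef.h>

static inline float megamatrix_at(const float* in, int W, int C,
                                  int y, int x, int ky, int kx, int c)
{
    const ptrdiff_t sy = (ptrdiff_t)W * C;  /* stride of y (and ky) */
    const ptrdiff_t sx = C;                 /* stride of x (and kx) */
    return in[ (y + ky)*sy + (x + kx)*sx + c ];
}

A stride-aware contraction engine can then run the same blocked GEMM algorithm over this view, packing panels directly from the image rather than from a copied megamatrix.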

Of course, depending on your filter sizes and precision requirements, FFT or Winograd convolution is going to require fewer FLOPs...

Devin Matthews
