Optimization primitives (parallelism, tiling, memory layout reordering)


ma...@numforge.co

Apr 26, 2019, 6:16:17 AM4/26/19
to MLIR
Hey there,

I wrote my own deep learning framework from scratch (https://github.com/mratsim/Arraymancer) in a little-known language called Nim.

I'm currently refactoring my backend to make more use of SIMD and memory-locality optimizations (like tiling).
After writing generic high-performance kernels from scratch, for example a generic GEMM for floats and integers (any type with + and *),
I decided to write my own compiler to avoid hand-optimising complex layers like GEMM, convolutions, bilinear, or locally-connected layers/unshared convolutions.
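As a rough illustration of what I mean by "generic" (sketched in C++ here rather than Nim; this is not the actual Arraymancer kernel):

```cpp
#include <cstddef>
#include <vector>

// Generic GEMM: C += A * B for any element type supporting + and *.
// A is M x K, B is K x N, C is M x N, all row-major.
template <typename T>
void gemm(const std::vector<T>& A, const std::vector<T>& B,
          std::vector<T>& C, std::size_t M, std::size_t N, std::size_t K) {
  for (std::size_t i = 0; i < M; ++i)
    for (std::size_t k = 0; k < K; ++k) {
      T a = A[i * K + k];                  // hoisted; reused across j
      for (std::size_t j = 0; j < N; ++j)  // unit stride over B and C rows
        C[i * N + j] = C[i * N + j] + a * B[k * N + j];
    }
}
```

The i-k-j loop order keeps the inner loop at unit stride, but everything beyond that (tiling, packing, vectorization) is exactly what I'd rather not hand-write per type and per layer.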

I'm currently evaluating what to emit (Nim/C code, LLVM IR or another IR) and maybe MLIR is a good fit.

What I'm concerned about is whether MLIR would allow me to control:
  - what to parallelize and when (does a tensor dimension provide enough parallelism opportunities?),
  - the data layout, to do loop tiling or blocking and eventually memory layout reordering (commonly called "packing" for GEMM).
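To make the "packing" part concrete, here is a toy sketch (my own illustration; real BLIS/Cutlass-style packing also interleaves panels to match the SIMD micro-kernel, which this omits):

```cpp
#include <cstddef>
#include <vector>

// Copy a kc x nc block of a row-major matrix (leading dimension ld)
// into a contiguous buffer, so the inner kernel later reads it with
// unit stride regardless of the original matrix width.
template <typename T>
std::vector<T> packBlock(const T* src, std::size_t ld,
                         std::size_t kc, std::size_t nc) {
  std::vector<T> packed(kc * nc);
  for (std::size_t k = 0; k < kc; ++k)
    for (std::size_t j = 0; j < nc; ++j)
      packed[k * nc + j] = src[k * ld + j];
  return packed;
}
```

Controlling when and where such copies happen (and amortizing them across the loop nest) is precisely the kind of knob I'd like exposed.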

I.e. I would like to know more about those stated goals:

* Ability to host high-performance-computing-style loop optimizations across kernels (fusion, loop interchange, tiling, etc) and to transform memory layouts of data.
* Code generation "lowering" transformations such as DMA insertion, explicit cache management, memory tiling, and vectorization for 1D and 2D register architectures.

I would also like to compare performance with BLAS, cuBLAS, Nvidia Cutlass, MKL-DNN and Halide.

When I look into the Linalg4 example and the accompanying transforms, it seems like tiling is not done at the IR level.
Will there be higher-level optimization primitives for that in the future?

* For example, Halide allows the user to directly "annotate" the AST with tiling.
* Nvidia Cutlass provides tiling iterator primitives.

Kind regards,
Mamy

n...@google.com

Apr 27, 2019, 4:52:54 PM4/27/19
to MLIR


On Friday, April 26, 2019 at 6:16:17 AM UTC-4, ma...@numforge.co wrote:
> Hey there,
>
> I wrote my own deep learning framework from scratch (https://github.com/mratsim/Arraymancer) in a little-known language called Nim.

Hi Mamy,

Looks cool, I did not know of Nim until now.
The syntax is very nice and minimalistic.
 

> I'm currently refactoring my backend to make more use of SIMD and memory-locality optimizations (like tiling).
> After writing generic high-performance kernels from scratch, for example a generic GEMM for floats and integers (any type with + and *),
> I decided to write my own compiler to avoid hand-optimising complex layers like GEMM, convolutions, bilinear, or locally-connected layers/unshared convolutions.
>
> I'm currently evaluating what to emit (Nim/C code, LLVM IR or another IR) and maybe MLIR is a good fit.
>
> What I'm concerned about is whether MLIR would allow me to control:
>   - what to parallelize and when (does a tensor dimension provide enough parallelism opportunities?),
>   - the data layout, to do loop tiling or blocking and eventually memory layout reordering (commonly called "packing" for GEMM).
>
> I.e. I would like to know more about those stated goals:
>
> * Ability to host high-performance-computing-style loop optimizations across kernels (fusion, loop interchange, tiling, etc) and to transform memory layouts of data.
> * Code generation "lowering" transformations such as DMA insertion, explicit cache management, memory tiling, and vectorization for 1D and 2D register architectures.

MLIR is still very young; most of the transformations you describe are available as passes on the affine dialect.

Today they are only accessible as traditional passes (i.e. applied at module or function granularity, driven by heuristics, not finely controlled, etc). The heuristics and cost models are still very naive, so I wouldn't expect high performance to come out of them (except in some limited internal cases).
 
An alternative that MLIR has started exploring is more declarative transforms (à la Halide).
It essentially boils down to a tradeoff on which "IR handles" to maintain.

In a pass world, one only works at the function boundary and needs to inspect the IR to determine where to apply a transformation.
In a declarative world, one has functions that transform the IR in a very localized fashion and give back handles on which one can continue applying transformations. One current example is the last two `tile` functions in https://github.com/tensorflow/mlir/blob/master/include/mlir/Transforms/LoopUtils.h
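To illustrate the handle idea with a toy model (one I'm making up here; this is not MLIR's actual data structure or API), a tiling function can rewrite a loop in place and hand back a pointer to the new intra-tile loop, so the next transformation composes without re-scanning any IR:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Toy "IR": a loop is a range plus nested child loops.
struct Loop {
  std::size_t lo, hi, step;
  std::vector<std::unique_ptr<Loop>> body;
};

// Declarative tile: turn `l` into a tile loop stepping by tileSize,
// move its old body under a fresh intra-tile loop, and return a handle
// to that inner loop for further chained transformations.
Loop* tile(Loop& l, std::size_t tileSize) {
  auto inner = std::make_unique<Loop>();
  inner->lo = 0;
  inner->hi = tileSize;
  inner->step = l.step;
  inner->body = std::move(l.body);
  Loop* handle = inner.get();
  l.step = tileSize;
  l.body.clear();  // moved-from; cleared for clarity
  l.body.push_back(std::move(inner));
  return handle;
}
```

Usage would look like `Loop* inner = tile(l, 32); tile(*inner, 8);` to get two levels of tiling, which is the kind of chaining the `tile` functions linked above aim at.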

While MLIR aims at providing such finer-grained control, it is not there yet.
Also, there is no concrete model of parallelism at the moment, beyond dependence analysis giving you some information (isLoopParallel, exercised in this test: https://github.com/tensorflow/mlir/blob/master/lib/Analysis/TestParallelismDetection.cpp).


> I would also like to compare performance with BLAS, cuBLAS, Nvidia Cutlass, MKL-DNN and Halide.

MLIR is still very young; we are building functional paths and not yet chasing performance gains (especially not against libraries that we want to be able to simply call when it makes sense).
 

> When I look into the Linalg4 example and the accompanying transforms, it seems like tiling is not done at the IR level.
> Will there be higher-level optimization primitives for that in the future?
 
I imagine "tiling at the IR level" means tiling loops and producing more loops? In that case, the tiling transforms pointed at above do exactly that.
The Linalg part of the tutorial takes an alternative approach: it shows how MLIR supports building something new, i.e. types and ops that support mixed loop + library-call compilation.
The type of tiling you have there is either:
1. loop tiling (i.e. lower the library call to loops and then use mlir::tile to tile them): https://github.com/tensorflow/mlir/blob/master/examples/Linalg/Linalg4/lib/Transforms.cpp#L42
2. data tiling (i.e. tiling the views, which results in tiled loops over library calls): https://github.com/tensorflow/mlir/blob/master/examples/Linalg/Linalg4/lib/Transforms.cpp#L180
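In the loop-tiling sense, the transform just splits each loop into a tile loop and an intra-tile loop over the same iteration space. In plain C++ (my illustration, assuming the problem size divides the tile size evenly):

```cpp
#include <cstddef>
#include <vector>

// Untiled reduction over an N x N row-major matrix.
double sumUntiled(const std::vector<double>& A, std::size_t N) {
  double s = 0;
  for (std::size_t i = 0; i < N; ++i)
    for (std::size_t j = 0; j < N; ++j)
      s += A[i * N + j];
  return s;
}

// Tiled by T: each loop becomes a tile loop plus an intra-tile loop.
// Same iterations, different order -- better locality on large data.
double sumTiled(const std::vector<double>& A, std::size_t N, std::size_t T) {
  double s = 0;
  for (std::size_t ii = 0; ii < N; ii += T)
    for (std::size_t jj = 0; jj < N; jj += T)
      for (std::size_t i = ii; i < ii + T; ++i)
        for (std::size_t j = jj; j < jj + T; ++j)
          s += A[i * N + j];
  return s;
}
```

A reduction is order-insensitive here, so both versions compute the same result; for loops with dependences, legality checks would be needed before tiling (which, as noted below, mlir::tile does not do for you).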

In any case, the tutorial is for demonstration purposes: MLIR provides the infrastructure, and one can build things like Linalg relatively easily on top of it.
mlir::tile should give you a declarative way of applying tiling (with no legality or safety checks, though).
We want to expose more of the underlying transformations in a declarative fashion; the first ones will probably be fusion and vectorization, but this is not yet planned concretely.


Cheers,

N


> Kind regards,
> Mamy

Mamy Ratsimbazafy

Apr 28, 2019, 5:24:04 AM4/28/19
to n...@google.com, MLIR
Thank you very much!
