Hey there,
I'm currently refactoring my backend to better exploit SIMD and memory-locality optimizations (like tiling).
I decided to write my own compiler to avoid hand-optimising complex layers like GEMM, convolutions, bilinear layers, or locally-connected layers (unshared convolutions).
I'm currently evaluating what to emit (Nim/C code, LLVM IR, or another IR), and maybe MLIR is a good fit.
What I'm concerned about is whether MLIR would allow me to control:
- what to parallelize and when (i.e. whether a tensor dimension provides enough parallelism opportunities),
- the data layout, so I can do loop tiling/blocking and eventually memory-layout reordering (commonly called "packing" for GEMM).
That is, I would like to know more about these stated goals:
* Ability to host high-performance-computing-style loop optimizations across kernels (fusion, loop interchange, tiling, etc) and to transform memory layouts of data.
* Code generation "lowering" transformations such as DMA insertion, explicit cache management, memory tiling, and vectorization for 1D and 2D register architectures.
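To make the two transformations I care about concrete, here is a minimal plain-C sketch (all names and the `TILE` size are mine, purely illustrative, not from any MLIR API) of tiling the GEMM loop nest and packing a panel of B into a contiguous scratch buffer before the inner loops consume it. The `ii` loop is the kind of dimension one would consider parallelizing:

```c
#include <stddef.h>
#include <string.h>

#define TILE 4  /* illustrative tile size; a real kernel derives this from cache/register sizes */

/* Naive reference: C = A * B, row-major, (M x K) * (K x N). */
static void gemm_naive(int M, int N, int K,
                       const float *A, const float *B, float *C) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            for (int k = 0; k < K; k++)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}

/* Tiled + packed version: iterate over TILE x TILE blocks, and "pack"
 * the needed block of B into a contiguous scratch buffer so the
 * innermost loops read it with unit stride. C must be zero-initialized;
 * dimensions are assumed to be multiples of TILE to keep the sketch short. */
static void gemm_tiled_packed(int M, int N, int K,
                              const float *A, const float *B, float *C) {
    float Bpack[TILE * TILE];  /* packed block of B */
    for (int jj = 0; jj < N; jj += TILE)
        for (int kk = 0; kk < K; kk += TILE) {
            /* pack B[kk..kk+TILE, jj..jj+TILE] into contiguous row-major storage */
            for (int k = 0; k < TILE; k++)
                memcpy(&Bpack[k * TILE], &B[(kk + k) * N + jj],
                       TILE * sizeof(float));
            for (int ii = 0; ii < M; ii += TILE)  /* candidate loop to parallelize */
                for (int i = 0; i < TILE; i++)
                    for (int j = 0; j < TILE; j++) {
                        float acc = 0.0f;
                        for (int k = 0; k < TILE; k++)
                            acc += A[(ii + i) * K + kk + k] * Bpack[k * TILE + j];
                        C[(ii + i) * N + jj + j] += acc;
                    }
        }
}
```

The question is essentially whether MLIR lets me express and steer this kind of rewrite at the IR level instead of writing it by hand per layer.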
When I look into LinAlg and the accompanying transforms, it seems like tiling is not done at the IR level.
Will there be higher-level optimization primitives for that in the future?
* For example, Halide allows the user to directly "annotate" the AST with tiling:
* Nvidia Cutlass provides tiling iterator primitives:
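For context, a Halide-style schedule such as `tile(x, y, xo, yo, xi, yi, 8, 8)` rewrites a 2-D loop nest into the shape below. This is a plain-C sketch with hypothetical names and illustrative tile sizes, just to show the loop structure such primitives produce:

```c
#define W 16
#define H 16
#define TX 8  /* tile extents; assumed to divide W and H to keep the sketch short */
#define TY 8

/* Untiled: visit every (x, y) point of the iteration space once. */
static void fill_untiled(int out[H][W]) {
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            out[y][x] = x + y * W;
}

/* Tiled: the same iteration space, split into outer tile loops (xo, yo)
 * and inner intra-tile loops (xi, yi) -- the loop structure a
 * tile(x, y, xo, yo, xi, yi, TX, TY) schedule corresponds to. */
static void fill_tiled(int out[H][W]) {
    for (int yo = 0; yo < H; yo += TY)
        for (int xo = 0; xo < W; xo += TX)
            for (int yi = 0; yi < TY; yi++)
                for (int xi = 0; xi < TX; xi++) {
                    int x = xo + xi, y = yo + yi;
                    out[y][x] = x + y * W;
                }
}
```

What I'm after is this kind of user-directed control over the loop structure, but applied at the IR level rather than by hand.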
Kind regards,
Mamy