RFC: Strided MemRef Proposal (a.k.a. Views)


Nicolas Vasilache

Sep 9, 2019, 4:20:38 PM
to MLIR
Hello everyone, this RFC addresses a subset of the more general discussion on tensor and memref layouts.

This RFC is for adding an optional stride to MemRefType as a way to:
1. unify existing codegen strategies (Affine and Linalg dialects) by using the same standard type (i.e. !linalg.view is replaced everywhere by a strided memref);
2. provide a standard type that is closed under a set of interesting operations and passes function boundaries in a generic way (i.e. no need to clone and specialize a new function for each small change in layout);
3. provide an intuitive entry point to interfacing with external library calls operating on strided array objects. Such libraries often perform dynamic dispatch to adapt the strategy to problem sizes, alignment and strides between elements (see for example the primitive interface to BLAS that MLIR targets from the Linalg dialect).

This is also related to the presentation by Intel in one of the past open design meetings (see Aug 22nd 2019 recording and a primer on strides in memory formats).

The current proposal prioritizes point 1. above by replacing the !linalg.view type with a strided memref type. This consists of optional stride information on the memref type that can be either static or dynamic. In the dynamic case, the information is opaque to the affine and linalg dialects and only materializes when lowering to LLVM IR for the purpose of address calculation.

Proposed syntax
An optional stride-list is added to the memref-type 
``` {.ebnf}
memref-type ::= `memref` `<` dimension-list-ranked tensor-memref-element-type
                (`,` semi-affine-map-composition)? (`,` memory-space)?
                (`,` stride-list)? `>`

semi-affine-map-composition ::= (semi-affine-map `,` )* semi-affine-map
memory-space ::= integer-literal
stride-list ::= `[` integer (`,` integer)* `]`
```

A stride specification is a list of integer values that are either non-negative (static case) or `-1` (dynamic case). Strides encode the distance in memory between successive entries along a particular dimension. For example, `memref<42x16xf32, [1, 64]>` specifies a view into a non-contiguous memory region of `42` by `16` `f32` elements in which the distance between two consecutive elements along the outer dimension is `sizeof(f32)` bytes and the distance between two consecutive elements along the inner dimension is `64 * sizeof(f32)` bytes. This corresponds to a column-major "view" of the non-contiguous memory region.
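
For illustration only, here is a minimal C++ sketch (not part of the proposal) of the element offsets this type implies; offsets are in `f32` elements relative to the view's base:
```
#include <cstdint>
#include <cstdio>

// Offset (in f32 elements) of element (i, j) of memref<42x16xf32, [1, 64]>:
// stride 1 along the outer (size-42) dimension, stride 64 along the inner one.
int64_t offsetOf(int64_t i, int64_t j) { return i * 1 + j * 64; }

int main() {
  std::printf("%lld\n", static_cast<long long>(offsetOf(1, 0))); // 1: outer neighbors are adjacent
  std::printf("%lld\n", static_cast<long long>(offsetOf(0, 1))); // 64: inner neighbors are 64 elements apart
  return 0;
}
```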

At this time, strided memrefs do not allow an affine map composition layout to be specified: the two properties are mutually exclusive.

Creation Ops
Strided memrefs represent a view abstraction over preallocated data and thus cannot be allocated directly. They are constructed with special ops operating on preallocated data. It is illegal to dealloc a strided memref. The first users of strided memrefs will be the Linalg dialect operations that operate on !linalg.view types. A useful property is that the type is closed under these transformations (generalized transpose, slice, subview, view).

Aliasing
The specification of strides must not alias internally: given an n-D strided memref, indices `(i1, ..., in)` and `(j1, ..., jn)` may not refer to the same memory address unless `i1 == j1, ..., in == jn` (for example, `memref<2x2xf32, [1, 1]>` would map both `(0, 1)` and `(1, 0)` to offset `1` and is therefore invalid). It is undefined behavior to create strided memrefs that alias internally.
Aliasing between different strided memrefs is the subject of ongoing investigation. One possible path is to provide aliasing sets at the function boundary and view creation ops with well-defined semantics and runtime guarantees (e.g. view intersection, difference, split, concat).

Lowering to LLVM
Strided memrefs correspond conceptually to the following templated C++ struct:
```
template <typename Elem, size_t Rank>
struct StridedMemRef {    // named here purely for exposition
  Elem *ptr;              // base pointer into the underlying buffer
  int64_t offset;         // offset (in elements) of the view's first element
  int64_t sizes[Rank];    // view sizes along each dimension
  int64_t strides[Rank];  // distances (in elements) between consecutive entries
};
```
The linearization procedure for address calculation for strided memrefs is the same as for linalg views: 
`base_offset + SUM_i index_i * stride_i`.
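
As a sketch (assuming the `StridedMemRef` struct shown above and element-granularity strides), this could be written as:
```
#include <cstddef>
#include <cstdint>

// Illustrative only: computes the address of element `indices` of an n-D
// strided memref descriptor, following base_offset + SUM_i index_i * stride_i.
template <typename Elem, size_t Rank>
Elem *addressOf(const StridedMemRef<Elem, Rank> &m,
                const int64_t (&indices)[Rank]) {
  int64_t linear = m.offset;              // base_offset
  for (size_t i = 0; i < Rank; ++i)
    linear += indices[i] * m.strides[i];  // index_i * stride_i
  return m.ptr + linear;
}
```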

Implementation implications
This will result in a number of simplifications on the Linalg side: the Linalg-specific DimOp, LoadOp, and StoreOp can disappear, and !linalg.view will be replaced by strided memref. This will also make it easier for Linalg to lower to Affine.

Uday Bondhugula

Sep 10, 2019, 12:15:52 AM
to MLIR

Just wanted to point out that strides along memref dimensions can already be represented, and work all the way through lowering to LLVM IR and execution, with the current memref AffineMap layout. Just use the stride as the scaling factor in your affine map. The normalizeMemRefs utility normalizes that to an identity layout (by recomputing the shape and remapping the indices), so the lowering to LLVM already works transparently. Here's an example with stride two.

func @stride() {
  %A = alloc() : memref<64xf32, (d0) -> (2*d0)>
  affine.for %i = 0 to 64 {
    %v = affine.load %A[%i] : memref<64xf32, (d0) -> (2 * d0)>
  }
  return
}


Run this to normalize (lower the non-identity layout); notice that (d0) -> (2 * d0) now disappears from the layout map, and the access becomes 2 * IV. 

$ mlir-opt -simplify-affine-structures stride.mlir
#map0 = (d0) -> (d0 * 2)
#map1 = () -> (0)
#map2 = () -> (64)


module {
  func @stride() {
    %0 = alloc() : memref<127xf32>
    affine.for %arg0 = 0 to 64 {
      %1 = affine.load %0[%arg0 * 2] : memref<127xf32>
    }
    return
  }
}


Lower to LLVM

$ bin/mlir-opt -simplify-affine-structures  /tmp/stride.mlir -lower-affine -lower-to-cfg -lower-to-llvm


module {
  func @stride() {
    %0 = llvm.mlir.constant(127 : index) : !llvm.i64
    %1 = llvm.mlir.constant(4 : index) : !llvm.i64
    %2 = llvm.mul %0, %1 : !llvm.i64
    %3 = llvm.call @malloc(%2) : (!llvm.i64) -> !llvm<"i8*">
    %4 = llvm.bitcast %3 : !llvm<"i8*"> to !llvm<"float*">
    %5 = llvm.mlir.constant(0 : index) : !llvm.i64
    %6 = llvm.mlir.constant(64 : index) : !llvm.i64
    %7 = llvm.mlir.constant(1 : index) : !llvm.i64
    llvm.br ^bb1(%5 : !llvm.i64)
  ^bb1(%8: !llvm.i64): // 2 preds: ^bb0, ^bb2
    %9 = llvm.icmp "slt" %8, %6 : !llvm.i64
    llvm.cond_br %9, ^bb2, ^bb3
  ^bb2: // pred: ^bb1
    %10 = llvm.mlir.constant(2 : index) : !llvm.i64
    %11 = llvm.mul %8, %10 : !llvm.i64
    %12 = llvm.mlir.constant(127 : index) : !llvm.i64
    %13 = llvm.getelementptr %4[%11] : (!llvm<"float*">, !llvm.i64) -> !llvm<"float*">
    %14 = llvm.load %13 {alignment = 16 : i32} : !llvm<"float*">
    %15 = llvm.add %8, %7 : !llvm.i64
    llvm.br ^bb1(%15 : !llvm.i64)
  ^bb3: // pred: ^bb1
    llvm.return
  }
  func @malloc(!llvm.i64) -> !llvm<"i8*">
}

Execute it:

bin/mlir-opt -simplify-affine-structures  /tmp/stride.mlir -lower-affine -lower-to-cfg -lower-to-llvm | mlir-cpu-runner -e stride -O3

Strides are just one special case of affine maps; if one wants to extract them from the map, that's easy as well. And if your goal for now is just to specify them and generate code, it already works.

~ Uday

Alex Zinenko

Sep 10, 2019, 9:01:34 AM
to MLIR
Thanks for the proposal!  I appreciate the unification effort, and this looks like a step in the right direction given the tensor+memref layout discussions.  I have several questions:

1. If affine map composition and stride list are mutually exclusive, is it possible to just alternate them in the type signature to make it syntactically impossible to mix?
Like in memref<42x16xf32, [1, 64], /*memory space=*/2>
It does not seem difficult to parse.
More conceptually, do you intend for stride list and affine maps to eventually mix, or are strides more of an alternative layout representation for memrefs (tiled layout proposed for tensors and MKL can be other supported layout kinds).

2. (syntax nitpick): is it possible to use `?` instead of -1 for the stride list for dynamic sizes; `?` better reflects the fact that the stride is unknown, while `-1` can be interpreted as going in the backward direction.

3. The proposed lowering to LLVM seems to differ from what we currently have, in particular we only store _dynamic_ size whereas what you propose looks like storing _all_ sizes.  What is the rationale for that?



--
-- Alex

Alex Zinenko

Sep 10, 2019, 9:09:07 AM
to MLIR
How would you represent memref<2x5xf32, [3, 17]> ?

Independently of that, current implementation of Linalg transformation relies on stride information to be fully dynamic, which is not possible with affine maps.


Uday Bondhugula

Sep 10, 2019, 9:54:54 AM
to MLIR


On Tuesday, September 10, 2019 at 6:39:07 PM UTC+5:30, Alex Zinenko wrote:
How would you represent memref<2x5xf32, [3, 17]> ?

Are 3 and 17 the strides along the two respective dimensions? Why is this special? This should just work. The normalizeMemRefs utility will turn this memref into memref<4x29xf32> with an identity layout, and a load on the original one will be rewritten to:

              affine.load %0[%arg0 * 3, %arg1 * 7] : memref<4x29xf32>



Independently of that, current implementation of Linalg transformation relies on stride information to be fully dynamic, which is not possible with affine maps.

Do you mean strides that are SSA values (and unknown at compile time)? This is just represented using semi-affine maps and normalized to identity maps. If your strides are (s0, s1) along the two dimensions, you'll have:

memref < 128 x 128 x f32, (d0, d1) -> (s0 * d0, s1 * d1) >

The normalizeMemRefs utility can easily be extended to deal with such semi-affine maps: you'll get s0 * %i and s1 * %j for the access functions (composition works with semi-affine maps), and the alloc dimension sizes will be scaled similarly. So, I don't see any difficulty handling semi-affine maps that represent strided things. The utility can assert out with anything more complex that is semi-affine. I think the advantage of this approach is that it leaves a lot of room to explore more complex layouts, and you could incrementally implement more and more on demand: e.g., tiled layouts where the tile sizes just bind to SSA values. Tiled layouts with constant tile sizes already work.

~ Uday
 


Alex Zinenko

Sep 10, 2019, 12:35:01 PM
to Uday Bondhugula, MLIR
On Tue, Sep 10, 2019 at 3:54 PM 'Uday Bondhugula' via MLIR <ml...@tensorflow.org> wrote:


On Tuesday, September 10, 2019 at 6:39:07 PM UTC+5:30, Alex Zinenko wrote:
How would you represent memref<2x5xf32, [3, 17]> ?

Are 3 and 17 the strides along the two respective dimensions? Why is this special? This should just work. The normlizeMemRefs utility will turn this memref into memref<4x29xf32> with an identity layout, and a load on the original one will be rewritten to:

              affine.load %0[%arg0 * 3, %arg1 * 7] : memref<4x29xf32>

I don't quite get where the 4x29 comes from.

I suspect there might be a terminology confusion.  Stride in this proposal doesn't mean the same thing as in the affine world.  It is the number of _linearized_ elements you need to step over to get to the next element along the given dimension.  So stride [3, 17] is not equivalent to (i, j) -> (3*i, 17*j).  Stride [51, 17] would have been equivalent to that.  In this particular case, the linearized offsets are as follows:
[0,0] -> 0
[0,1] -> 17
[0,2] -> 34
[0,3] -> 51
[0,4] -> 68
[1,0] -> 3
[1,1] -> 20
[1,2] -> 37
[1,3] -> 54
[1,4] -> 71
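
(A tiny C++ snippet, purely illustrative, that reproduces this table:)
```
#include <cstdio>

int main() {
  // Shape 2x5 with element strides [3, 17]: offset(i, j) = 3 * i + 17 * j.
  for (int i = 0; i < 2; ++i)
    for (int j = 0; j < 5; ++j)
      std::printf("[%d,%d] -> %d\n", i, j, 3 * i + 17 * j);
  return 0;
}
```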
 



Independently of that, current implementation of Linalg transformation relies on stride information to be fully dynamic, which is not possible with affine maps.

Do you mean strides that are SSA values (and unknown at compile time)? This is just represented using semi-affine maps and normalized to identity maps. If your strides are (s0, s1) along the two dimensions, you'll have:

memref < 128 x 128 x f32, (d0, d1) -> (s0 * d0, s1 * d1) >

It seems to rather be memref<4096 x f32, (d0, d1) -> (s0*d0 + s1*d1)> for the stride model in question, which leads to premature linearization of accesses.
 


--
-- Alex

Mehdi Amini

Sep 10, 2019, 1:34:22 PM
to Alex Zinenko, MLIR
On Tue, Sep 10, 2019 at 6:01 AM 'Alex Zinenko' via MLIR <ml...@tensorflow.org> wrote:
Thanks for the proposal!  I appreciate the unification effort, and this looks like a step in the right direction given the tensor+memref layout discussions.  I have several questions:

1. If affine map composition and stride list are mutually exclusive, is it possible to just alternate them in the type signature to make it syntactically impossible to mix?
Like in memref<42x16xf32, [1, 64], /*memory space=*/2>
It does not seem difficult to parse.
More conceptually, do you intend for stride list and affine maps to eventually mix, or are strides more of an alternative layout representation for memrefs (tiled layout proposed for tensors and MKL can be other supported layout kinds).

2. (syntax nitpick): is it possible to use `?` instead of -1 for the stride list for dynamic sizes; `?` better reflects the fact that the stride is unknown, while `-1` can be interpreted as going in the backward direction.

3. The proposed lowering to LLVM seems to differ from what we currently have, in particular we only store _dynamic_ size whereas what you propose looks like storing _all_ sizes.  What is the rationale for that?

What about function boundaries? If you don't store all memrefs of a given rank the same way, you need to make new copies across function boundaries, I guess?
Inside a local scope, nothing prevents constant-propagating and compressing unused fields out of a locally used struct.

-- 
Mehdi

 

Alex Zinenko

Sep 10, 2019, 5:02:19 PM
to MLIR
On Tue, Sep 10, 2019 at 7:34 PM Mehdi Amini <ami...@google.com> wrote:


On Tue, Sep 10, 2019 at 6:01 AM 'Alex Zinenko' via MLIR <ml...@tensorflow.org> wrote:
Thanks for the proposal!  I appreciate the unification effort, and this looks like a step in the right direction given the tensor+memref layout discussions.  I have several questions:

1. If affine map composition and stride list are mutually exclusive, is it possible to just alternate them in the type signature to make it syntactically impossible to mix?
Like in memref<42x16xf32, [1, 64], /*memory space=*/2>
It does not seem difficult to parse.
More conceptually, do you intend for stride list and affine maps to eventually mix, or are strides more of an alternative layout representation for memrefs (tiled layout proposed for tensors and MKL can be other supported layout kinds).

2. (syntax nitpick): is it possible to use `?` instead of -1 for the stride list for dynamic sizes; `?` better reflects the fact that the stride is unknown, while `-1` can be interpreted as going in the backward direction.

3. The proposed lowering to LLVM seems to differ from what we currently have, in particular we only store _dynamic_ size whereas what you propose looks like storing _all_ sizes.  What is the rationale for that?

What about function boundaries? If you don't store all memref for a given rank the same way you need to make new copies across function boundaries I guess?
Inside a local scope, nothing prevent from constant-propagating and compressing a locally used struct from unused fields.


I don't think this is anyhow different from the sizes.  If the stride is static, it is present in the type itself.  If the stride is dynamic, we need a field in the struct to store that.
On concrete examples, `memref<42x?xf32>` is currently equivalent to `{float*, i64}` where the integer stores the dynamic size and 42 always remains a constant.  What this seems to propose (unless I am misreading the section), is to change that to be `{float*, i64, i64}` with integers containing _both_ sizes, regardless of them being static or dynamic in the type.  On the function boundary, you are not allowed to pass a memref<42x?xf32> where a memref<?x?xf32> is expected and vice versa, you need an exact match.
For strides, we can use exactly the same kind of encoding.  `memref<2x2xf32, [-1,1]>` will have one field for the dynamic stride, represented by -1 in the type and `memref<2x2xf32, [2,1]>` will not have any. This will pass function boundaries just as well as the current one for dynamic sizes does.  There is not necessarily a link between sizes being dynamic and strides being dynamic.
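
To make the two encodings concrete, here is a hedged C++ sketch of the descriptor layouts being compared (the struct names are invented for the example):
```
#include <cstdint>

// Current lowering of memref<42x?xf32>: only the dynamic extent is stored;
// the static 42 lives in the type, not in the descriptor.
struct CurrentDescriptor {
  float *ptr;
  int64_t size1;  // the '?' dimension
};

// Reading of the proposal: all extents are stored, whether static or dynamic.
struct ProposedDescriptor {
  float *ptr;
  int64_t size0;  // 42, duplicated from the type
  int64_t size1;  // the '?' dimension
};
```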

In fact, I originally proposed to store all sizes when going to LLVM.  We decided against it with the following arguments: (1) it creates a hazard of inconsistency between the static value visible in the type and the dynamic value stored; although we can declare such situation as undefined behavior, we did not want to have undefined behaviors at that point; (2) it makes it difficult to interface with C functions, without storing dynamic sizes, a statically-shaped memref is a bare pointer; (3) it creates storage overhead, and we did not (still don't) have CSE on the LLVM dialect.


--
-- Alex

Uday Bondhugula

Sep 10, 2019, 11:18:31 PM
to MLIR


On Tuesday, September 10, 2019 at 10:05:01 PM UTC+5:30, Alex Zinenko wrote:


On Tue, Sep 10, 2019 at 3:54 PM 'Uday Bondhugula' via MLIR <ml...@tensorflow.org> wrote:


On Tuesday, September 10, 2019 at 6:39:07 PM UTC+5:30, Alex Zinenko wrote:
How would you represent memref<2x5xf32, [3, 17]> ?

Are 3 and 17 the strides along the two respective dimensions? Why is this special? This should just work. The normlizeMemRefs utility will turn this memref into memref<4x29xf32> with an identity layout, and a load on the original one will be rewritten to:

              affine.load %0[%arg0 * 3, %arg1 * 7] : memref<4x29xf32>

I don't quite get where does 4x29 come from.

I suspect there might be a terminology confusion.  Stride in this proposal doesn't mean the same thing as in affine world.  It is a number of _linearized_ elements you need to step over to get to the next element along the given dimension.  So stride [3, 17] is not equivalent to (i, j) -> (3*i, 17*j).  Stride [51, 17] would have been equivalent to that.  In this particular case, the linearized offsets are as follows:
[0,0] -> 0
[0,1] -> 17
[0,2] -> 34
[0,3] -> 51
[0,4] -> 68
[1,0] -> 3
[1,1] -> 20
[1,2] -> 37
[1,3] -> 54
[1,4] -> 71
 



Independently of that, current implementation of Linalg transformation relies on stride information to be fully dynamic, which is not possible with affine maps.

Do you mean strides that are SSA values (and unknown at compile time)? This is just represented using semi-affine maps and normalized to identity maps. If your strides are (s0, s1) along the two dimensions, you'll have:

memref < 128 x 128 x f32, (d0, d1) -> (s0 * d0, s1 * d1) >

It seems to rather be memref<4096 x f32, (d0, d1) -> (s0*d0 + s1*d1)> for the stride model in question, which leads to premature linearization of accesses.


memref< 4096 x f32, (d0, d1) -> (s0*d0 + s1*d1)> won't be a valid memref since the map here is 2-d while the memref has a 1-d shape. I guess you meant:

memref <128 x 128 x f32, (d0, d1) [s0, s1] -> (s0 * d0 + s1 * d1) >

The accesses themselves aren't linearized until one normalizes the memref to an identity layout, i.e., you'd still have:

%A[%i, %j]  : memref<128 x 128 x f32, ....>

(128 x 128 is the logical space which the loads/stores still index).

As you point out further above, the strides per this proposal are very different from the multi-dimensional strides I was referring to, and there is nothing affine about the latter. When you have strided convolutions, for example, the strides are actually the latter, i.e., when you have a stride 2 along each dimension of a 2-d convolution, the strides are just (2, 2) and not (2 * x_dimension_size, 2).

~ Uday


Mamy Ratsimbazafy

Sep 10, 2019, 11:21:21 PM
to MLIR
Strided tensors should be allowed to have negative or zero strides.
Also, an offset that points to the first element of the tensor should be added.

1) Zero-strides are needed to represent broadcasting operations.
2) Negative strides are needed to reverse a tensor dimension without copy.

This also makes encoding dynamic strides as -1 impossible, but we can use a flag instead.

The main advantage is that it would allow MLIR to support all shape operations supported by Numpy, DLPack (a cross-framework tensor representation effort - https://github.com/dmlc/dlpack), PyTorch and all other frameworks that follow the "rank + shape + strides + offset + data" paradigm.

Mehdi AMINI

Sep 11, 2019, 1:25:10 AM
to Alex Zinenko, MLIR
On Tue, Sep 10, 2019 at 2:02 PM 'Alex Zinenko' via MLIR <ml...@tensorflow.org> wrote:


On Tue, Sep 10, 2019 at 7:34 PM Mehdi Amini <ami...@google.com> wrote:


On Tue, Sep 10, 2019 at 6:01 AM 'Alex Zinenko' via MLIR <ml...@tensorflow.org> wrote:
Thanks for the proposal!  I appreciate the unification effort, and this looks like a step in the right direction given the tensor+memref layout discussions.  I have several questions:

1. If affine map composition and stride list are mutually exclusive, is it possible to just alternate them in the type signature to make it syntactically impossible to mix?
Like in memref<42x16xf32, [1, 64], /*memory space=*/2>
It does not seem difficult to parse.
More conceptually, do you intend for stride list and affine maps to eventually mix, or are strides more of an alternative layout representation for memrefs (tiled layout proposed for tensors and MKL can be other supported layout kinds).

2. (syntax nitpick): is it possible to use `?` instead of -1 for the stride list for dynamic sizes; `?` better reflects the fact that the stride is unknown, while `-1` can be interpreted as going in the backward direction.

3. The proposed lowering to LLVM seems to differ from what we currently have, in particular we only store _dynamic_ size whereas what you propose looks like storing _all_ sizes.  What is the rationale for that?

What about function boundaries? If you don't store all memref for a given rank the same way you need to make new copies across function boundaries I guess?
Inside a local scope, nothing prevent from constant-propagating and compressing a locally used struct from unused fields.


I don't think this is anyhow different from the sizes.  If the stride is static, it is present in the type itself.  If the stride is dynamic, we need a field in the struct to store that.
On concrete examples, `memref<42x?xf32>` is currently equivalent to `{float*, i64}` where the integer stores the dynamic size and 42 always remains a constant.  What this seems to propose (unless I am misreading the section), is to change that to be `{float*, i64, i64}` with integers containing _both_ sizes, regardless of them being static or dynamic in the type.  On the function boundary, you are not allowed to pass a memref<42x?xf32> where a memref<?x?xf32> is expected and vice versa, you need an exact match.
For strides, we can use exactly the same kind of encoding.  `memref<2x2xf32, [-1,1]>` will have one field for the dynamic stride, represented by -1 in the type and `memref<2x2xf32, [2,1]>` will not have any. This will pass function boundaries just as well as the current one for dynamic sizes does.  There is not necessarily a link between sizes being dynamic and strides being dynamic.

Right, what I was alluding to is that if you have a memref<8x?x?x?x?xf32> and you want to call a function that takes a memref<?x?x?x?x?xf32> you have to reallocate and copy the structure.
The advantage of always materializing the shape in the struct is that you don't need any conversion across function calls.
 

In fact, I originally proposed to store all sizes when going to LLVM.  We decided against it with the following arguments: (1) it creates a hazard of inconsistency between the static value visible in the type and the dynamic value stored; although we can declare such situation as undefined behavior, we did not want to have undefined behaviors at that point;

Such undefined behavior does not seem terrible to me, because it is one that is trivial to catch with runtime checks (we should avoid any UB that we don't have a "cheap" and principled way of detecting reliably).
 
(2) it makes it difficult to interface with C functions, without storing dynamic sizes, a statically-shaped memref is a bare pointer;

I'm not sure I totally follow what is more difficult here: if you have a static type then you can extract the pointer of the structure and pass it to a C function. If you don't have a static shape then you can't pass this to a C function?
 
(3) it creates storage overhead, and we did not (still don't) have CSE on the LLVM dialect.

This one seems short-sighted to me: I would not design core-components of the system based on the current optimizer (or lack of thereof) state (and there is CSE at the LLVM level, we should likely try to emit the code and improve LLVM if it does not match our IR patterns).

Best,

-- 
Mehdi

 

Alex Zinenko

Sep 11, 2019, 4:17:12 AM
to MLIR
Exactly.
I believe the point of "linearized" strides is to be able to represent a hyper-rectangular subset of memref elements, e.g. to crop off padding or to do a sort of data tiling.  This encoding is used in numpy and PyTorch, among others, so there is value in supporting it.  This does not mean we have to remove the affine maps.
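
As a concrete (purely illustrative) example of such a hyper-rectangular subset: the interior 6x6 region of a row-major 8x8 buffer can be described with sizes [6, 6], strides [8, 1] and offset 9, along the lines of:
```
#include <cstdint>

// Illustrative descriptor for viewing the interior 6x6 region of a
// row-major 8x8 buffer (cropping a one-element border on every side).
struct View2D {
  float *ptr;          // start of the underlying 8x8 buffer
  int64_t offset;      // 1 * 8 + 1 = 9, the first interior element
  int64_t sizes[2];    // {6, 6}
  int64_t strides[2];  // {8, 1}: rows of the view are still 8 elements apart
};

float *elementAt(const View2D &v, int64_t i, int64_t j) {
  return v.ptr + v.offset + i * v.strides[0] + j * v.strides[1];
}
```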
 


--
-- Alex

Alex Zinenko

Sep 11, 2019, 4:26:36 AM
to MLIR
On Wed, Sep 11, 2019 at 5:21 AM Mamy Ratsimbazafy <ma...@numforge.co> wrote:
Strided tensors should be allowed to have negative or zero strides.
Also an offset, that points to the first element of the tensor should be added

1) Zero-strides are needed to represent broadcasting operations.
2) Negative strides are needed to reverse a tensor dimension without copy.

This also makes dynamic striding with -1 impossible but we can use a flag instead.

The main advantage is that it would allow MLIR to support all shape operations supported by Numpy, DLPack (a cross-framework tensor representation effort - https://github.com/dmlc/dlpack), PyTorch and all other frameworks that follow the "rank + shape + strides + offset + data" paradigm.

The problem with implicit broadcasting, or actually any other shape+stride configuration that makes the subscripts<->address relation non-bijective, is that it makes any data-related analysis significantly more complex, if possible at all.  I suppose that's why the proposal explicitly prohibits such cases, which it calls "internal aliasing".  Modeling PyTorch at _memref_ level is not something that I would consider a priority at this point.  This needs to be defined at tensor level and lowered away if going to memrefs.
 


--
-- Alex

Volodymyr Arbatov

Sep 11, 2019, 5:11:39 AM
to Alex Zinenko, MLIR
Strides (we call them "pitch" in our compiler) are useful for aligning dimension pointers while keeping original tensor dimensions. Hardware may have alignment constraints or software may prefer alignment of dimension pointers for performance reasons (not necessarily alignment of fastest dimensions) while tensor shape doesn't have to be aligned. Index mapping functions, which in my understanding are conceptually similar to affine maps, work on top of that within dimensions of a tensor shape.

- Volodymyr.
 

Alex Zinenko

Sep 11, 2019, 5:32:21 AM
to MLIR
On Wed, Sep 11, 2019 at 7:25 AM Mehdi AMINI <joke...@gmail.com> wrote:


On Tue, Sep 10, 2019 at 2:02 PM 'Alex Zinenko' via MLIR <ml...@tensorflow.org> wrote:


On Tue, Sep 10, 2019 at 7:34 PM Mehdi Amini <ami...@google.com> wrote:


On Tue, Sep 10, 2019 at 6:01 AM 'Alex Zinenko' via MLIR <ml...@tensorflow.org> wrote:
Thanks for the proposal!  I appreciate the unification effort, and this looks like a step in the right direction given the tensor+memref layout discussions.  I have several questions:

1. If affine map composition and stride list are mutually exclusive, is it possible to just alternate them in the type signature to make it syntactically impossible to mix?
Like in memref<42x16xf32, [1, 64], /*memory space=*/2>
It does not seem difficult to parse.
More conceptually, do you intend for stride list and affine maps to eventually mix, or are strides more of an alternative layout representation for memrefs (tiled layout proposed for tensors and MKL can be other supported layout kinds).

2. (syntax nitpick): is it possible to use `?` instead of -1 for the stride list for dynamic sizes; `?` better reflects the fact that the stride is unknown, while `-1` can be interpreted as going in the backward direction.

3. The proposed lowering to LLVM seems to differ from what we currently have, in particular we only store _dynamic_ size whereas what you propose looks like storing _all_ sizes.  What is the rationale for that?

What about function boundaries? If you don't store all memref for a given rank the same way you need to make new copies across function boundaries I guess?
Inside a local scope, nothing prevent from constant-propagating and compressing a locally used struct from unused fields.


I don't think this is anyhow different from the sizes.  If the stride is static, it is present in the type itself.  If the stride is dynamic, we need a field in the struct to store that.
On concrete examples, `memref<42x?xf32>` is currently equivalent to `{float*, i64}` where the integer stores the dynamic size and 42 always remains a constant.  What this seems to propose (unless I am misreading the section), is to change that to be `{float*, i64, i64}` with integers containing _both_ sizes, regardless of them being static or dynamic in the type.  On the function boundary, you are not allowed to pass a memref<42x?xf32> where a memref<?x?xf32> is expected and vice versa, you need an exact match.
For strides, we can use exactly the same kind of encoding.  `memref<2x2xf32, [-1,1]>` will have one field for the dynamic stride, represented by -1 in the type and `memref<2x2xf32, [2,1]>` will not have any. This will pass function boundaries just as well as the current one for dynamic sizes does.  There is not necessarily a link between sizes being dynamic and strides being dynamic.

Right, what I was alluding to is that if you have a memref<8x?x?x?x?xf32> and you want to call a function that takes a memref<?x?x?x?x?xf32> you have to reallocate and copy the structure.
The advantage of always materializing the shape in the struct is that you don't need any conversion across function calls.

It's more at `memref_cast` that the structure reallocation must happen, independently of function calls.  But I agree with the reasoning.
 
 

In fact, I originally proposed to store all sizes when going to LLVM.  We decided against it with the following arguments:

It's a shame you weren't in the original discussion; these are very similar to my reasoning. 
 
(1) it creates a hazard of inconsistency between the static value visible in the type and the dynamic value stored; although we can declare such situation as undefined behavior, we did not want to have undefined behaviors at that point;

Such undefined behavior does not seem terrible to me, because it is one that is trivial to catch with runtime checks (we should avoid any UB that we don't have a "cheap" and principled way of detecting reliably).

Agreed.
Independently of this discussion, it would be great to have a sort of "assert" op for runtime checks.
  
(2) it makes it difficult to interface with C functions, without storing dynamic sizes, a statically-shaped memref is a bare pointer;

I'm not sure I totally follow what is more difficult here: if you have a static type then you can extract the pointer of the structure and pass it to a C function. If you don't have a static shape then you can't pass this to a C function?

There is no "extract pointer" or any notion of bare pointer at all when it comes to memrefs.  One option would be to use zero-rank memref for that and allow one to cast memrefs and change ranks.  Another is to have a separate pointer/buffer type.
You can pass a `struct {float*, int64_t}*` to a function just fine.
 
 
(3) it creates storage overhead, and we did not (still don't) have CSE on the LLVM dialect.

This one seems short-sighted to me: I would not design core-components of the system based on the current optimizer (or lack of thereof) state (and there is CSE at the LLVM level, we should likely try to emit the code and improve LLVM if it does not match our IR patterns).

Agreed.

I am generally supportive of the change, it makes the overall model simpler, but we need to make it consciously. 


--
-- Alex

Nicolas Vasilache

Sep 11, 2019, 12:09:04 PM
to Alex Zinenko, MLIR
Thanks for your replies all, sorry for the delay I was out yesterday.
Here are some answers below.

On Tue, Sep 10, 2019 at 9:01 AM 'Alex Zinenko' via MLIR <ml...@tensorflow.org> wrote:
Thanks for the proposal!  I appreciate the unification effort, and this looks like a step in the right direction given the tensor+memref layout discussions.  I have several questions:

1. If affine map composition and stride list are mutually exclusive, is it possible to just alternate them in the type signature to make it syntactically impossible to mix?
Like in memref<42x16xf32, [1, 64], /*memory space=*/2>

Sure, consider it done.
 
It does not seem difficult to parse.
More conceptually, do you intend for stride list and affine maps to eventually mix, or are strides more of an alternative layout representation for memrefs (tiled layout proposed for tensors and MKL can be other supported layout kinds).

Technically they can during lowering, as is pointed out a bit later by Volodymyr Arbatov. However, they may create undecidable problems for dependence analysis and force the insertion of case disjunctions early in the compilation chain, which seems undesirable. Since a strided memref is a (non-contiguous) view into an existing contiguous buffer, it is subject to aliasing, and it is safer to disallow the mix for now. 
 

2. (syntax nitpick): is it possible to use `?` instead of -1 for the stride list for dynamic sizes; `?` better reflects the fact that the stride is unknown, while `-1` can be interpreted as going in the backward direction.

We can, but we would have to use a syntax such as `memref<42x16xf32, 1x?, /*memory space=*/2>` because integer array list attributes won't parse. I don't personally find this much better, but I am open to suggestions. Maybe someone has a better proposal; in any case it will be easy to modify once implemented.
 

3. The proposed lowering to LLVM seems to differ from what we currently have, in particular we only store _dynamic_ size whereas what you propose looks like storing _all_ sizes.  What is the rationale for that?

Essentially because of the discussion you already had with Mehdi Amini a bit later: simplicity and passing function boundaries (esp. going to externally precompiled function calls, multiversioning etc) vs runtime checks. The compression of the struct seems like an optimization that we can reconsider in the future once we have enough experience at the function call boundary.
 

Nicolas Vasilache

Sep 11, 2019, 12:17:30 PM
to Uday Bondhugula, MLIR
On Tue, Sep 10, 2019 at 9:54 AM 'Uday Bondhugula' via MLIR <ml...@tensorflow.org> wrote:


On Tuesday, September 10, 2019 at 6:39:07 PM UTC+5:30, Alex Zinenko wrote:
How would you represent memref<2x5xf32, [3, 17]> ?

Are 3 and 17 the strides along the two respective dimensions? Why is this special? This should just work. The normlizeMemRefs utility will turn this memref into memref<4x29xf32> with an identity layout, and a load on the original one will be rewritten to:

              affine.load %0[%arg0 * 3, %arg1 * 7] : memref<4x29xf32>

`memref<2x5xf32, [3, 17]>` is a strided view into a non-contiguous region of at least (4 * 17) + (1 * 3) + 1 elements (indexed as Alex explains below).
The memref itself is a reference to 2x5 elements.
`memref<4x29xf32>` is a contiguous region of 4x29 elements.
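
A small C++ sketch (illustrative only) of the "at least N elements" computation for a strided view:
```
#include <cstdint>

// Minimum number of underlying elements a strided view can reach:
// e.g. sizes {2, 5} with strides {3, 17} -> (1 * 3) + (4 * 17) + 1 = 72.
int64_t minBufferSize(const int64_t *sizes, const int64_t *strides, int64_t rank) {
  int64_t farthest = 0;  // offset of the farthest element relative to the base
  for (int64_t i = 0; i < rank; ++i)
    farthest += (sizes[i] - 1) * strides[i];
  return farthest + 1;
}
```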
 


--
N

Nicolas Vasilache

Sep 11, 2019, 12:17:48 PM
to Alex Zinenko, Uday Bondhugula, MLIR
On Tue, Sep 10, 2019 at 12:35 PM 'Alex Zinenko' via MLIR <ml...@tensorflow.org> wrote:


On Tue, Sep 10, 2019 at 3:54 PM 'Uday Bondhugula' via MLIR <ml...@tensorflow.org> wrote:


On Tuesday, September 10, 2019 at 6:39:07 PM UTC+5:30, Alex Zinenko wrote:
How would you represent memref<2x5xf32, [3, 17]> ?

Are 3 and 17 the strides along the two respective dimensions? Why is this special? This should just work. The normlizeMemRefs utility will turn this memref into memref<4x29xf32> with an identity layout, and a load on the original one will be rewritten to:

              affine.load %0[%arg0 * 3, %arg1 * 7] : memref<4x29xf32>

I don't quite get where does 4x29 come from.

I suspect there might be a terminology confusion.  Stride in this proposal doesn't mean the same thing as in affine world.  It is a number of _linearized_ elements you need to step over to get to the next element along the given dimension.  So stride [3, 17] is not equivalent to (i, j) -> (3*i, 17*j).  Stride [51, 17] would have been equivalent to that.  In this particular case, the linearized offsets are as follows:
[0,0] -> 0
[0,1] -> 17
[0,2] -> 34
[0,3] -> 51
[0,4] -> 68
[1,0] -> 3
[1,1] -> 20
[1,2] -> 37
[1,3] -> 54
[1,4] -> 71
 



Independently of that, current implementation of Linalg transformation relies on stride information to be fully dynamic, which is not possible with affine maps.

Do you mean strides that are SSA values (and unknown at compile time)? This is just represented using semi-affine maps and normalized to identity maps. If your strides are (s0, s1) along the two dimensions, you'll have:

memref < 128 x 128 x f32, (d0, d1) -> (s0 * d0, s1 * d1) >

It seems to rather be memref<4096 x f32, (d0, d1) -> (s0*d0 + s1*d1)> for the stride model in question, which leads to premature linearization of accesses.

+1 

Nicolas Vasilache

Sep 11, 2019, 12:41:15 PM
to Mamy Ratsimbazafy, MLIR
On Tue, Sep 10, 2019 at 11:21 PM Mamy Ratsimbazafy <ma...@numforge.co> wrote:
Strided tensors should be allowed to have negative or zero strides.
Also an offset, that points to the first element of the tensor should be added

1) Zero-strides are needed to represent broadcasting operations.
2) Negative strides are needed to reverse a tensor dimension without copy.

This also makes dynamic striding with -1 impossible but we can use a flag instead.

The main advantage is that it would allow MLIR to support all shape operations supported by Numpy, DLPack (a cross-framework tensor representation effort - https://github.com/dmlc/dlpack), PyTorch and all other frameworks that follow the "rank + shape + strides + offset + data" paradigm.

These considerations make sense at the programmer/framework/library level where automatic compiler transformations do not occur.
The current proposal sits at a level below during which we want to perform compiler transformations, so it is more restrictive.
Without loss of generality, lowering from the numpy level to the strided memref level is possible (broadcast becomes a non-indexing loop, negative stride becomes a loop reversal).
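
A rough C++ sketch (purely illustrative, not the proposed lowering) of those two rewrites:
```
#include <cstdint>

// Broadcast: instead of a zero stride along the row dimension, the row
// index simply never indexes the broadcast operand.
void broadcastAdd(float *out, const float *vec, int64_t rows, int64_t cols) {
  for (int64_t i = 0; i < rows; ++i)
    for (int64_t j = 0; j < cols; ++j)
      out[i * cols + j] += vec[j];  // 'i' does not index 'vec'
}

// Dimension reversal: instead of a negative stride, the loop runs backwards.
void copyReversed(float *out, const float *in, int64_t n) {
  for (int64_t i = 0; i < n; ++i)
    out[i] = in[n - 1 - i];
}
```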

The offset is present in the LLVM struct during lowering but it is opaque at the memref level. 
This is also for compiler transformations: the offset is only manipulated by ops that create views (generalized transpose, slice, subview, view) and that come with higher-level static guarantees (even with dynamic strides/sizes/offset which would translate to symbolic case disjunctions early in the compilation flow).

Lastly, this has proven enough for now to target a lower level library abstraction (https://github.com/tensorflow/mlir/blob/master/test/mlir-cpu-runner/cblas_interface.cpp).
When we have evidence that we need to support richer behavior at the memref / compiler transformations level, we can revisit.

Does this make sense?
 


--
N

Nicolas Vasilache

Sep 11, 2019, 12:51:01 PM
to Alex Zinenko, MLIR

River Riddle

Sep 11, 2019, 1:51:47 PM
to MLIR
How exactly does this compose with the affine compositions already present on memrefs? I'm also a bit concerned with adding a bunch of things to types, especially given the recent discussion about adding a layout attribute to tensor types. Are we ever going to inter-mix affine and strides?


On Wednesday, September 11, 2019 at 9:09:04 AM UTC-7, Nicolas Vasilache wrote:
Thanks for your replies all, sorry for the delay I was out yesterday.
Here are some answers below.

On Tue, Sep 10, 2019 at 9:01 AM 'Alex Zinenko' via MLIR <ml...@tensorflow.org> wrote:
Thanks for the proposal!  I appreciate the unification effort, and this looks like a step in the right direction given the tensor+memref layout discussions.  I have several questions:

1. If affine map composition and stride list are mutually exclusive, is it possible to just alternate them in the type signature to make it syntactically impossible to mix?
Like in memref<42x16xf32, [1, 64], /*memory space=*/2>

Sure, consider it done.
 
It does not seem difficult to parse.
More conceptually, do you intend for stride list and affine maps to eventually mix, or are strides more of an alternative layout representation for memrefs (tiled layout proposed for tensors and MKL can be other supported layout kinds).

Technically they can during lowering, as is pointed out a bit later by Volodymyr Arbatov. However, they may create undecidable problems for dependence analysis and force the insertion of case disjunction early in the compilation chain which seems undesirable. Since a strided memref is a (non-contiguous) view into an existing contiguous buffer it is subject to aliasing and is safer to disallow for now. 
 

2. (syntax nitpick): is it possible to use `?` instead of -1 for the stride list for dynamic sizes; `?` better reflects the fact that the stride is unknown, while `-1` can be interpreted as going in the backward direction.

We can but we would have to use a syntax such as `memref<42x16xf32, 1x?, /*memory space=*/2>` because integer array list attributes won't parse, I don't personally find this much better but I am open to suggestions. Maybe someone has a better proposal, in any case it will be easy to modify once implemented.
I'd rather not bake in '-1' as dynamic in the IR; this is the exact reason why dimension lists use ?. 
 

Nicolas Vasilache

Sep 11, 2019, 2:45:53 PM
to River Riddle, MLIR
On Wed, Sep 11, 2019 at 1:51 PM 'River Riddle' via MLIR <ml...@tensorflow.org> wrote:
How exactly does this compose with the affine compositions already present on memrefs?
 
They are mutually exclusive, as explicitly stated. Taking into account Alex's comment, the EBNF looks like:

``` {.ebnf}
memref-type ::= `memref` `<` dimension-list-ranked tensor-memref-element-type
                ((`,` semi-affine-map-composition) | (`,` stride-list))?
                (`,` memory-space)? `>`


semi-affine-map-composition ::= (semi-affine-map `,` )* semi-affine-map
stride-list ::= `[` integer (`,` integer)* `]`
memory-space ::= integer-literal /* | TODO: address-space-id */
```

However, as Volodymyr points out: "Index mapping functions, which in my understanding are conceptually similar to affine maps, work on top of that within dimensions of a tensor shape."
This indeed opens the door to possible interactions during lowering but I do not expect such interactions will happen during analyses.
 
I'm also a bit concerned with adding a bunch of things to types, especially given the recent discussion about adding a layout attribute to tensor types.
Concerned how? 
I am much more concerned that memrefs just cannot represent non-contiguous input memory regions.
The only way to access a subtensor of a tensor in MLIR is by encoding the information in enclosing loop bounds, which does not pass function boundaries.

Are we ever going to inter-mix affine and strides?
Yes, affine (without layout maps) only requires the "injective" indexing property which is also a prerequisite for strides.
If one wants a contiguous memref from a strided memref one can either cast (+ runtime checks) or copy. 
 

On Wednesday, September 11, 2019 at 9:09:04 AM UTC-7, Nicolas Vasilache wrote:
Thanks for your replies all, sorry for the delay I was out yesterday.
Here are some answers below.

On Tue, Sep 10, 2019 at 9:01 AM 'Alex Zinenko' via MLIR <ml...@tensorflow.org> wrote:
Thanks for the proposal!  I appreciate the unification effort, and this looks like a step in the right direction given the tensor+memref layout discussions.  I have several questions:

1. If affine map composition and stride list are mutually exclusive, is it possible to just alternate them in the type signature to make it syntactically impossible to mix?
Like in memref<42x16xf32, [1, 64], /*memory space=*/2>

Sure, consider it done.
 
It does not seem difficult to parse.
More conceptually, do you intend for stride list and affine maps to eventually mix, or are strides more of an alternative layout representation for memrefs (tiled layout proposed for tensors and MKL can be other supported layout kinds).

Technically they can during lowering, as is pointed out a bit later by Volodymyr Arbatov. However, they may create undecidable problems for dependence analysis and force the insertion of case disjunction early in the compilation chain which seems undesirable. Since a strided memref is a (non-contiguous) view into an existing contiguous buffer it is subject to aliasing and is safer to disallow for now. 
 

2. (syntax nitpick): is it possible to use `?` instead of -1 for the stride list for dynamic sizes; `?` better reflects the fact that the stride is unknown, while `-1` can be interpreted as going in the backward direction.

We can but we would have to use a syntax such as `memref<42x16xf32, 1x?, /*memory space=*/2>` because integer array list attributes won't parse, I don't personally find this much better but I am open to suggestions. Maybe someone has a better proposal, in any case it will be easy to modify once implemented.
I'd rather not bake in '-1' as dynamic in the IR, this is the exact reason why dimension lists use ?. 

Ok, my first inclination was to use an integer list attribute which is a proper MLIR attribute.
Since the dimension list stuff is just a parser abstraction (which must end with an x + element type) I guess I can evolve that into: 
 `memref<42x16xf32, 1x?xf32, /*memory space=*/2>` and force the types of size and stride to match.
Do you have a better proposal? 
 


--
N

Mamy Ratsimbazafy

Sep 11, 2019, 3:03:41 PM
to MLIR
The cblas interface is quite outdated: it only works with contiguous matrices and unit stride in the non-leading dimension.

The BLIS interface, which allows arbitrary striding (including zero/negative), provides key benefits for many use cases:

  - spaCy, the leading NLP framework, now uses BLIS by default

Now, if we had a higher-level dialect exposing NumPy-like memrefs that high-level frameworks could target, and that lowered to the dialect in this proposal, that would also work, but it seems like extra work we could avoid.

Mamy

Andy Davis

Sep 11, 2019, 3:43:46 PM
to Mamy Ratsimbazafy, MLIR
Side note: in case it wasn't clear from Nicolas' initial email, this work is a subset of work from the same working group that is proposing changes to both the Tensor and MemRef types. We'll be giving a public talk in a couple weeks to detail the proposal.


Mehdi Amini

Sep 11, 2019, 3:50:06 PM
to Andy Davis, Mamy Ratsimbazafy, MLIR
On Wed, Sep 11, 2019 at 12:43 PM 'Andy Davis' via MLIR <ml...@tensorflow.org> wrote:
Side note: in case it wasn't clear from Nicolas' initial email, this work is a subset of work from the same working group that is proposing changes to both the Tensor and MemRef types. We'll be giving a public talk in a couple weeks to detail the proposal.

Can we get an email thread to discuss the proposal ahead of the meeting?
Otherwise folks can't necessarily wrap their heads around everything and prepare the right set of questions. Also, addressing the most common questions on the mailing list can leave more room for discussing the hard parts in the meeting.

Thanks,

-- 
Mehdi
 

Nicolas Vasilache

Sep 11, 2019, 4:05:15 PM
to Mamy Ratsimbazafy, MLIR
On Wed, Sep 11, 2019 at 3:03 PM Mamy Ratsimbazafy <ma...@numforge.co> wrote:
The cblas interface is quite outdated, it only works with contiguous matrices and unit-stride in the non-leading dimension.

Well it is a minimal integration test that demonstrates things run end to end and that MLIR can call external C++ functions and produce proper results (shaking off linking and ABI issues for example).
If you look a little deeper in cblas.cpp you'll see that the impl is just a simple inefficient triple nested C++ loop: MLIR does not want to depend on external libraries but we have to ensure they can be called.
More advanced use cases can easily be built; dynamically wrapping the fastest external library calls is the bread and butter of frameworks.
 

For several use-cases the BLIS interface that allows any striding (including zero/negative) provides key benefits for many use-cases:

  - spaCy, the leading NLP framework now uses BLIS by default

Yes we have ongoing discussions and prototypes re BLIS, MKL-DNN, Eigen etc but these do not belong in MLIR core.
In the bigger picture, stride 0 immediately breaks the injective property and can be easily pre/post processed so I don't see value in adding it at this level of representation.
Negative strides are something I pondered, and I don't see a compelling reason to avoid them (as long as the injective property is maintained).
OTOH they are not the most immediate concern and don't play nicely with the dimension list parser (e.g. you can't parse `16x-4x1xf32`) or the attribute parser unless we choose `0` to encode unknown which isn't great either.
Bottom line I am happy to punt on negative static strides for now and we can reconsider later once the unification work is landed.
 

Now if we have a higher-level dialect that exposes Numpy-like memref that high-level frameworks can target and that lowers to the dialect of the specification that would also work but it seems like extra work that we could avoid.

This is a tradeoff: accepting some restrictions enables analysis and compiler transformations.
That big switch + dynamic dispatch you find in virtually all libraries and frameworks to inject static behavior under a dynamic API has to be put somewhere.
Adding it early (framework) or late (library) makes sense. 
Adding it in the middle of the compiler stack leads to undecidable problems or heavy multi-versioning; the bar to add it there is very high IMO.
JIT'ing can help with some of these issues but isn't a panacea either and the risk of specialization too early is very high.
 


--
N

Nicolas Vasilache

Sep 11, 2019, 4:07:40 PM
to Andy Davis, Mamy Ratsimbazafy, MLIR
Indeed, this is what the first sentence was meant to convey, but it never hurts to clarify:
 `Hello everyone, this RFC addresses a subset of the more general discussion on tensor layout  and memref layouts.`

Thanks Andy!



--
N

Mahesh Ravishankar

Sep 11, 2019, 5:26:43 PM
to MLIR
If I am reading this correctly, then memref<5x5xf32> and memref<5x5xf32, [25, 1]> are different types.
1) Is it the responsibility of an operation on memrefs to check whether a given stride matches the semantics of the operation? For example, currently `addf` doesn't have any notion of strides. Presumably if it gets an operand with the stride specified, the verifier should (would?) fail. If the op truly doesn't care about the strides, then the operation verifier would have to be extended to allow that.
2) One could interpret memref<5x5xf32, [25, 1]> as a specialization of memref<5x5xf32>. So any op that has an operand of type memref<5x5xf32> should be happy with a memref<5x5xf32, [25, 1]>. Is there a general mechanism to allow for this? One way would be to have a MemrefStridedType that derives from MemrefType. That would capture the specialization, and ops that just need a memref-typed operand would be fine getting a MemrefStridedType operand. Are there downsides to that approach?
3) This might be outside the scope of this proposal, but is there a notion of "default strides"? I realize that having a default with dynamic shapes is tricky, but I just want to get some clarity on what is being considered along this line.


River Riddle

Sep 11, 2019, 5:29:35 PM
to MLIR


On Wednesday, September 11, 2019 at 2:26:43 PM UTC-7, Mahesh Ravishankar wrote:
If I am reading this correctly, then memref<5x5xf32> and memref<5x5xf32, [25, 1]> are different types.
1) Is it the responsibility of the operation on memrefs to see if a given stride matches the semantics of the operation. For example, currently 'addf` doesn't have any notion of strides. Presumably if it gets an operand with the stride specified, the verifier should (would?) fail. If the op truly doesn't care about the strides, then the operation verifier would have to be extended to allow that
Just a note, but `addf` and all of the other scalar operations do not operate on memref. 

Mahesh Ravishankar

unread,
Sep 11, 2019, 5:55:19 PM9/11/19
to MLIR
Right, I realized that as soon as I posted it. Sorry for the confusion. Read `addf` as "an operation that takes operands of memref type".

Nicolas Vasilache

unread,
Sep 11, 2019, 7:03:44 PM9/11/19
to Mahesh Ravishankar, MLIR
On Wed, Sep 11, 2019 at 5:26 PM 'Mahesh Ravishankar' via MLIR <ml...@tensorflow.org> wrote:
If I am reading this correctly, then memref<5x5xf32> and memref<5x5xf32, [25, 1]> are different types.
Yes, although in this very particular corner case we could promote memref<5x5xf32, [25, 1]> to memref<5x5xf32>
 
1) Is it the responsibility of the operation on memrefs to see if a given stride matches the semantics of the operation. For example, currently 'addf` doesn't have any notion of strides. Presumably if it gets an operand with the stride specified, the verifier should (would?) fail. If the op truly doesn't care about the strides, then the operation verifier would have to be extended to allow that
Yes, it is the responsibility of each op that takes memrefs to specify what it accepts, and verifiers should be adapted accordingly.
affine.load/store, std.load/store and std.dim work fine with strided memrefs.
 
2) One could interpret memref<5x5xf32, [25, 1]> as a specialization of the memref<5x5xf32>.
So any op that has an operand of type memref<5x5xf32> should be happy with a memref<5x5xf32, [25, 1]>. Is there a general mechanism to allow for this. One way would be to have MemrefStridedType which derives from a MemrefType. That would capture the specialization and ops that need just need a Memref type operand would be fine getting a MemrefStridedType opeand. Are there downsides of that approach?
In general, a strided memref is not a specialization of a contiguous memref; they are different types.
The case you point out is a special case that we may want to just promote automatically.
 
3) This might be outside of the scope of this proposal, but is there a notion of "default strides". I realize that having a default with dynamics shapes is tricky, but just want to get some clarity on what is being considered along this line.
Yes, a memref without strides implicitly has "default strides", which is what you'd get from a contiguous buffer.
Abusing notation, memref<MxNxPxf32> has strides [N * P, P, 1].
This is already implicit in the existing memref address calculation and does not change.
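
For illustration only (a sketch, not part of the proposal; the helper names are hypothetical), this is how the implicit default strides fall out of the shape, and how one could check whether an explicit stride list is just the default:
```
#include <cstdint>
#include <vector>

// Row-major "default strides" implied by a shape: the innermost dimension
// has stride 1 and each outer stride is the product of the inner sizes,
// e.g. shape {M, N, P} -> strides {N * P, P, 1}.
std::vector<int64_t> defaultStrides(const std::vector<int64_t> &shape) {
  std::vector<int64_t> strides(shape.size(), 1);
  for (int i = static_cast<int>(shape.size()) - 2; i >= 0; --i)
    strides[i] = strides[i + 1] * shape[i + 1];
  return strides;
}

// A strided memref whose explicit strides equal the default ones carries the
// same information as the stride-less form and could be promoted to it.
bool hasDefaultStrides(const std::vector<int64_t> &shape,
                       const std::vector<int64_t> &strides) {
  return strides == defaultStrides(shape);
}
```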
 


--
N

Uday Bondhugula

unread,
Sep 11, 2019, 10:51:55 PM9/11/19
to MLIR


On Wednesday, September 11, 2019 at 9:47:30 PM UTC+5:30, Nicolas Vasilache wrote:


On Tue, Sep 10, 2019 at 9:54 AM 'Uday Bondhugula' via MLIR <ml...@tensorflow.org> wrote:


On Tuesday, September 10, 2019 at 6:39:07 PM UTC+5:30, Alex Zinenko wrote:
How would you represent memref<2x5xf32, [3, 17]> ?

Are 3 and 17 the strides along the two respective dimensions? Why is this special? This should just work. The normalizeMemRefs utility will turn this memref into memref<4x29xf32> with an identity layout, and a load on the original one will be rewritten to:

              affine.load %0[%arg0 * 3, %arg1 * 7] : memref<4x29xf32>

`memref<2x5xf32, [3, 17]>` is a strided view into a non-contiguous region of at least (4 * 17) + (1 * 3) + 1 elements (indexed as Alex explains below).
The memref itself is a reference to 2x5 elements.
`memref<4x29xf32>` is a contiguous region of 4x29 elements.

It became clear later that memref<2x5xf32, [3,17]> as per this proposal is the same as

memref<2x5xf32, (d0, d1) -> (3*d0 + 17*d1) > 

with the current memref with affine map. Is this accurate? There is no linearization of the access here. You would still have load/stores represented as %A[%i, %j] since the logical space is 2 x 5 -- so it does represent a non-contiguous region. The linearization hides in the non-identity affine map until one normalizes the memref. 
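
As a worked check (illustration only, sketched in C++ rather than IR; the function name is hypothetical), both forms imply the same address arithmetic and the same minimal footprint:
```
#include <cstdint>

// memref<2x5xf32, [3, 17]> and memref<2x5xf32, (d0, d1) -> (3*d0 + 17*d1)>
// imply the same offset computation relative to the base pointer:
int64_t linearIndex(int64_t d0, int64_t d1) { return 3 * d0 + 17 * d1; }

// The logical index space stays 2 x 5; the underlying region must hold at
// least linearIndex(1, 4) + 1 = 3 * 1 + 17 * 4 + 1 = 72 elements, and the
// addresses touched within it are non-contiguous.
```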

@alex zinenko, coming back to both of your examples, FWIW, as far as the power of representation goes, it would be good to clarify whether this stride proposal allows one to represent anything that cannot already be represented through affine maps. What would be an example? Or is the motivation purely a more convenient representation for manipulation/lowering? Both examples Alex gave (non-contiguous regions, either with constant strides or with dynamic ones) are representable with current memrefs.

~ Uday

Uday Bondhugula

unread,
Sep 11, 2019, 11:04:30 PM9/11/19
to MLIR


On Thursday, September 12, 2019 at 12:15:53 AM UTC+5:30, Nicolas Vasilache wrote:
On Wed, Sep 11, 2019 at 1:51 PM 'River Riddle' via MLIR <ml...@tensorflow.org> wrote:
How exactly does this compose with the affine compositions already present on memrefs?
 
They are mutually exclusive, as explicitly stated. Taking into account Alex's comment, the EBNF looks like:

``` {.ebnf}
memref-type ::= `memref` `<` dimension-list-ranked tensor-memref-element-type
                ((`,` semi-affine-map-composition) | (`,` stride-list))?
                (`,` memory-space)? `>`

semi-affine-map-composition ::= (semi-affine-map `,` )* semi-affine-map
stride-list ::= `[` integer (`,` integer)* `]`
memory-space ::= integer-literal /* | TODO: address-space-id */
```

However, as Volodymyr points out: "Index mapping functions, which in my understanding are conceptually similar to affine maps, work on top of that within dimensions of a tensor shape."
This indeed opens the door to possible interactions during lowering but I do not expect such interactions will happen during analyses.
 
I'm also a bit concerned with adding a bunch of things to types, especially given the recent discussion about adding a layout attribute to tensor types.
Concerned how? 
I am much more concerned that memrefs just cannot represent non-continuous input memory regions.

This is the part that isn't clear to me. As per the stride proposal syntax, anything like:

memref < n1 x n2, [s0, s1] > 

is the same as the current

memref< n1 x n2, (d0, d1) -> (d0 * s0 + d1 * s1) >

You would need buffer/view abstractions either way to separate alloc'ed memrefs from aliased ones. So is this stride proposal adding any representational power? The current memrefs can already represent non-contiguous memory regions, albeit in a different way. 

While on this, note that affine maps are canonicalized so that you have (d0 * s0) instead of (s0 * d0). So extracting the stride (if you ever want to) is trivial. For code generation and linearization, it's not even needed because you would finally transform/lower the load to:


affine.load %A [ %i * %s0 + %j * %s1 ] :   

before the LLVM lowering happens, and the current lowering to the LLVM dialect just works out of the box on this. With this new proposal, one would have to redo the same lowering/linearization or extend the LLVM lowering to deal with the strides. This is straightforward, but isn't it just duplication, providing another path to achieve the same thing?

~ Uday

Uday Bondhugula

unread,
Sep 11, 2019, 11:42:11 PM9/11/19
to MLIR


On Thursday, September 12, 2019 at 8:21:55 AM UTC+5:30, Uday Bondhugula wrote:


On Wednesday, September 11, 2019 at 9:47:30 PM UTC+5:30, Nicolas Vasilache wrote:


On Tue, Sep 10, 2019 at 9:54 AM 'Uday Bondhugula' via MLIR <ml...@tensorflow.org> wrote:


On Tuesday, September 10, 2019 at 6:39:07 PM UTC+5:30, Alex Zinenko wrote:
How would you represent memref<2x5xf32, [3, 17]> ?

Are 3 and 17 the strides along the two respective dimensions? Why is this special? This should just work. The normalizeMemRefs utility will turn this memref into memref<4x29xf32> with an identity layout, and a load on the original one will be rewritten to:

              affine.load %0[%arg0 * 3, %arg1 * 7] : memref<4x29xf32>

`memref<2x5xf32, [3, 17]>` is a strided view into a non-contiguous region of at least (4 * 17) + (1 * 3) + 1 elements (indexed as Alex explains below).
The memref itself is a reference to 2x5 elements.
`memref<4x29xf32>` is a contiguous region of 4x29 elements.

It became clear later that memref<2x5xf32, [3,17]> as per this proposal is the same as

memref<2x5xf32, (d0, d1) -> (3*d0 + 17*d1) > 

with the current memref with affine map. Is this accurate? There is no linearization of the access here. You would still have load/stores represented as %A[%i, %j] since the logical space is 2 x 5 -- so it does represent a non-contiguous region. The linearization hides in the non-identity affine map until one normalizes the memref. 


In fact, I just ran this example; it does allocate a region of 72 elements (17*4 + 3*1 + 1) - the accessed space is non-contiguous. Is this the desired behavior? The current LLVM lowering just works on it.

INPUT

func @stride() {
  %A = alloc() : memref<2x5xf32, (d0,d1) -> (3*d0 + 17*d1)>
  affine.for %i = 0 to 2 {
    affine.for %j = 0 to 5 {
      affine.load %A[%i, %j] : memref<2x5xf32, (d0,d1) -> (3*d0 + 17*d1)>
    }
  }
  return
}

OUTPUT
$ mlir-opt -simplify-affine-structures  /tmp/stride.mlir 
#map0 = (d0, d1) -> (d0 * 3 + d1 * 17)

module {
  func @stride() {
    %0 = alloc() : memref<72xf32>
    affine.for %arg0 = 0 to 2 {
      affine.for %arg1 = 0 to 5 {
        %1 = affine.load %0[%arg0 * 3 + %arg1 * 17] : memref<72xf32>
      }
    }
    return
  }
}

Alex Zinenko

unread,
Sep 12, 2019, 3:49:35 AM9/12/19
to Nicolas Vasilache, Mahesh Ravishankar, MLIR


On Thu, Sep 12, 2019, 01:03 'Nicolas Vasilache' via MLIR <ml...@tensorflow.org> wrote:
On Wed, Sep 11, 2019 at 5:26 PM 'Mahesh Ravishankar' via MLIR <ml...@tensorflow.org> wrote:
If I am reading this correctly, then memref<5x5xf32> and memref<5x5xf32, [25, 1]> are different types.
Yes, although in this very particular corner case we could promote memref<5x5xf32, [25, 1]> to memref<5x5xf32>

This sounds like a reasonable behavior to me. We already do the same to eliminate identity maps from the memref.

 
1) Is it the responsibility of the operation on memrefs to see if a given stride matches the semantics of the operation. For example, currently 'addf` doesn't have any notion of strides. Presumably if it gets an operand with the stride specified, the verifier should (would?) fail. If the op truly doesn't care about the strides, then the operation verifier would have to be extended to allow that
Yes, it is the responsibility of each op that takes memrefs to specify what they want and verifiers should be adapted accordingly.
affine.load/store, std.load/store and std.dim work fine with strided memrefs.
 
2) One could interpret memref<5x5xf32, [25, 1]> as a specialization of the memref<5x5xf32>.
So any op that has an operand of type memref<5x5xf32> should be happy with a memref<5x5xf32, [25, 1]>. Is there a general mechanism to allow for this. One way would be to have MemrefStridedType which derives from a MemrefType. That would capture the specialization and ops that need just need a Memref type operand would be fine getting a MemrefStridedType opeand. Are there downsides of that approach?
In general a strided memref is not a specialization of a continuous memref, they are different types. 
The case you point out is a special case that we may want to just promote automatically.
 
3) This might be outside of the scope of this proposal, but is there a notion of "default strides". I realize that having a default with dynamics shapes is tricky, but just want to get some clarity on what is being considered along this line.
Yes a MemRef without strides implicitly has "default strides" which is what you'd get from a continuous buffer.
By abusing notations, memref<MxNxPxf32> has strides [N * P, P, 1].
This is already implicit in the existing memref address calculation and does not change.

Can we specify this in documentation, for clarity? This is also relevant for lowering to LLVM.

Alex Zinenko

unread,
Sep 12, 2019, 4:50:07 AM9/12/19
to Uday Bondhugula, MLIR
Indeed, this seems representable as affine maps in case of local allocations and static strides.

How would this look with dynamic strides?  I understand you want to model them as symbols, but it also means you will have to carry around the values of strides, which would be particularly painful on function boundaries.  More generally, what's the story with your memref canonicalization on function boundaries?  If I have an external function

func @foo(memref<2x5xf32, ?x?>)
or
func @foo(memref<2x5xf32, [s0,s1](d0,d1) -> (s0*d0 + s1*d1)>, index, index)  // you need to pass in the values for s0 and s1
if you prefer

I cannot just change its signature because I have no knowledge of its internals or of its other callers.  Do you have a cast operation that would go from memref<72xf32> back to memref<2x5xf32,  [s0,s1](d0,d1) -> (s0*d0 + s1*d1)>?  Conversely, inside such a function, how does one keep track of which values are strides and which values are just function arguments?  I am not saying it is not possible, but the amount of bookkeeping external to the IR becomes worrying.

As for the implementation workload, a version of strided view has existed in Linalg since March.  The required changes are mostly mechanical and I expect the net reduction in code size once everything is implemented, since this will allow us to remove some of the code duplication.
 


--
-- Alex

Uday Bondhugula

unread,
Sep 12, 2019, 5:02:53 AM9/12/19
to MLIR
SSA values are bound to the affine map's layout symbols just like they are bound to dynamic dimensions. One should be able to extract them the same way the dim op does for shape symbols across function boundaries / dominance scopes. Because there hasn't been a use case, I guess they have not been implemented. But the treatment is no different from the SSA values attached to dynamic dimensions. In particular:

1) we are missing an 

%v = extract_symbol %A, 0 : memref<..., (d0, d1) [s0] (...)

(This would be needed to get the SSA value associated with s0 when %A escapes the dominance of the SSA symbol that was bound at alloc time.)

2) The LLVM lowering would have to treat these additional SSA values the same way it would handle SSA values associated with dynamic dimensions. Is this meaningful? 

The symbols associated with affine layout maps serve multiple purposes, as you know, not just strides. They could represent parametric tile sizes for data tiling, or they could be parametric shifts/padding for example. You get all at once (including compositions) without having to specialize/reimplement support for each separately.

~ Uday

 

Volodymyr Arbatov

unread,
Sep 12, 2019, 10:19:10 AM9/12/19
to Uday Bondhugula, MLIR
This sounds interesting. As I understand the affine map on memref is the way to express data layout, right? 
One of the potential problems with strides is that they correspond to specific, though the most commonly used, data layouts (in other words, a particular form of index mapping). When it comes to more complex data layouts, for example a multi-level nested tiled data layout, these parameters might become useless or reflect a memory organization connected to the original dimensions in a complex way.
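
For illustration (a sketch, not tied to any particular framework's layout; the function name is hypothetical), a one-level tiled/blocked layout maps a logical index through div/mod on the tile size, which a single per-dimension stride cannot express:
```
#include <cstdint>

// One-level tiled (blocked) layout for a logical N x N matrix with T x T
// tiles: tiles are laid out row-major and elements within each tile are
// also row-major. The offset depends on i / T and i % T, so it cannot be
// written as one stride per logical dimension.
int64_t tiledOffset(int64_t i, int64_t j, int64_t N, int64_t T) {
  int64_t tilesPerRow = N / T;  // assume T divides N, for simplicity
  int64_t tileIndex = (i / T) * tilesPerRow + (j / T);
  int64_t inTile = (i % T) * T + (j % T);
  return tileIndex * T * T + inTile;
}
```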


Uday Bondhugula

unread,
Sep 12, 2019, 10:39:21 AM9/12/19
to MLIR


On Thursday, September 12, 2019 at 7:49:10 PM UTC+5:30, Volodymyr Arbatov wrote:
This sounds interesting. As I understand the affine map on memref is the way to express data layout, right? 

Yes, that's right. 
 
One of potential problems with strides is that they correspond to specific, though most commonly used, data layouts (or, in other words, it's a particular form of index mapping). When it comes to more complex data layouts, for example multi-level nested tiled data layout, these parameters might become useless or reflect memory organization connected to original dimensions in a complex way.

+1.  Multi-level/nested tiled layouts are more naturally expressed as index mappings - either concretely as mod's/div's w.r.t the tile sizes as is the case with affine maps or with some special syntax for tiled layouts that an earlier thread was discussing. With the current affine map in memref, there is one example here of a tiled layout (with one level of data tiling):

Andy Davis

unread,
Sep 12, 2019, 10:41:04 AM9/12/19
to Mehdi Amini, Mamy Ratsimbazafy, MLIR
Yes. We'll send an email thread out before the meeting, with lead time to allow people to discuss and formulate questions...

Nicolas Vasilache

unread,
Sep 12, 2019, 11:46:06 AM9/12/19
to Uday Bondhugula, MLIR
Yes, and in the general case what does it look like in terms of LLVM memory descriptor?
```
template <typename Elem, size_t NumDims, size_t NumSyms>
struct {
  Elem *ptr;                 // base pointer
  int64_t sizes[NumDims];    // dynamic sizes
  int64_t symbols[NumSyms];  // values bound to the layout map's symbols
};
```
If we went with an implementation/convention similar to the one that exists for MemRef -> LLVM today, NumDims can be an arbitrary value in [0, Rank] and NumSyms can be an arbitrary value (e.g. 42, because why not).

Part of this proposal is to use a standard form that is already widespread in the ML community (in which NumDims == NumSyms == Rank with an extra offset).
i.e.
```
template <typename Elem, size_t Rank>
struct {
  Elem *ptr;               // base pointer into the underlying buffer
  int64_t offset;          // offset from ptr, in number of elements
  int64_t sizes[Rank];     // size along each dimension
  int64_t strides[Rank];   // distance between consecutive elements,
                           // per dimension, in number of elements
};
```
Metadata operations like view, slice, arbitrary transpose, etc. are closed under this representation. A standard form is good: it reduces surprises and implementation complexity.
As mentioned in another reply, this is an intermediate level of abstraction between numpy, torch, mxnet/dlpack on one hand and affine on the other hand which helps progressive lowering.
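
As a sketch of what "closed under metadata operations" means for this descriptor (the rank-2 specialization and the names View2D/transpose/subview/elem below are illustrative, not the Linalg implementation):
```
#include <cstdint>
#include <utility>

// Hypothetical rank-2 instance of the descriptor above.
struct View2D {
  float *ptr;
  int64_t offset;
  int64_t sizes[2];
  int64_t strides[2];
};

// Element address: ptr[offset + i * strides[0] + j * strides[1]].
float &elem(const View2D &v, int64_t i, int64_t j) {
  return v.ptr[v.offset + i * v.strides[0] + j * v.strides[1]];
}

// Transpose only permutes sizes and strides; no data moves.
View2D transpose(View2D v) {
  std::swap(v.sizes[0], v.sizes[1]);
  std::swap(v.strides[0], v.strides[1]);
  return v;
}

// A rectangular subview only adjusts the offset and the sizes; the strides
// are unchanged (they would be multiplied by the step for a strided slice).
View2D subview(View2D v, int64_t i0, int64_t n0, int64_t i1, int64_t n1) {
  v.offset += i0 * v.strides[0] + i1 * v.strides[1];
  v.sizes[0] = n0;
  v.sizes[1] = n1;
  return v;
}
```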

Now let's consider what happens at function boundaries, I see 3 cases:
1. a memref + layout map doesn't lower to LLVM because you need a new type (a function) to encode each different layout. I suspect this is (one of) the reason(s) why normalizeMemRefs exists?
2. since 1. does not pass function boundaries, one would as per your example above "Run this to normalize (lower the non-identity layout); notice that (d0) -> (2 * d0) now disappears from the layout map, and the access becomes 2 * IV.". What this does is move the information from the data type into operations.
3. strided memrefs as proposed just pass the function boundary because they are in a normal form (i.e. the function in 1. is fixed).

Point 2. above could be seen as a workaround for passing memrefs across function boundaries by moving the burden of translation to a predefined set of operations.
Now, since memref is a standard type, let's imagine that there exists a dialect in which there are other types of ops than the affine ones and that take memrefs (MLIR is supposed to be extensible after all).
How does all this work in the context of new ops?

To make it super concrete, here is a fusion example in strided memref form with higher-level ops that have useful semantics (as in they can be lowered to existing implementations that run at 90%+ peak):

INPUT:
func @f2(%A: memref<?x?xf32, [?, 1]>, %B: memref<?x?xf32, [?, 1]>, %C: memref<?x?xf32, [?, 1]>, %D: memref<?x?xf32, [?, 1]>, %E: memref<?x?xf32, [?, 1]>) {
  linalg.matmul(%A, %B, %C) : memref<?x?xf32, [?, 1]>, memref<?x?xf32, [?, 1]>, memref<?x?xf32, [?, 1]>
  linalg.matmul(%C, %D, %E) : memref<?x?xf32, [?, 1]>, memref<?x?xf32, [?, 1]>, memref<?x?xf32, [?, 1]>
  return
}

OUTPUT:
mlir-opt ./third_party/llvm/llvm/projects/google_mlir/test/Dialect/Linalg/fusion.mlir -linalg-fusion -linalg-fusion-tile-sizes=8,16 -canonicalize -cse

Note that this would also work out of the box with parametric tile sizes (but not from the command line), but this is beside the point.
I'll refer you to the .td in the Linalg dialect for details. 

  func @f2(%arg0: memref<?x?xf32, [?, 1]>, %arg1: memref<?x?xf32, [?, 1]>, %arg2: memref<?x?xf32, [?, 1]>, %arg3: memref<?x?xf32, [?, 1]>, %arg4: memref<?x?xf32, [?, 1]>) {
    %c1 = constant 1 : index
    %c0 = constant 0 : index
    %c16 = constant 16 : index
    %c8 = constant 8 : index
    %0 = dim %arg2, 0 : memref<?x?xf32, [?, 1]>
    %1 = dim %arg3, 1 : memref<?x?xf32, [?, 1]>
    loop.for %arg5 = %c0 to %0 step %c8 {
      loop.for %arg6 = %c0 to %1 step %c16 {
        %2 = affine.apply #map0(%arg5)
        %3 = dim %arg2, 1 : memref<?x?xf32, [?, 1]>
        %4 = linalg.subview %arg2[%arg5, %2, %c1, %c0, %3, %c1] : memref<?x?xf32, [?, 1]>
        %5 = dim %arg3, 0 : memref<?x?xf32, [?, 1]>
        %6 = affine.apply #map1(%arg6)
        %7 = linalg.subview %arg3[%c0, %5, %c1, %arg6, %6, %c1] : memref<?x?xf32, [?, 1]>
        %8 = linalg.subview %arg4[%arg5, %2, %c1, %arg6, %6, %c1] : memref<?x?xf32, [?, 1]>
        %9 = dim %arg0, 1 : memref<?x?xf32, [?, 1]>
        %10 = linalg.subview %arg0[%arg5, %2, %c1, %c0, %9, %c1] : memref<?x?xf32, [?, 1]>
        %11 = linalg.subview %arg1[%c0, %9, %c1, %c0, %3, %c1] : memref<?x?xf32, [?, 1]>
        linalg.matmul(%10, %11, %4) : memref<?x?xf32, [?, 1]>, memref<?x?xf32, [?, 1]>, memref<?x?xf32, [?, 1]>
        linalg.matmul(%4, %7, %8) : memref<?x?xf32, [?, 1]>, memref<?x?xf32, [?, 1]>, memref<?x?xf32, [?, 1]>
      }
    }
    return
  }
 
Pass this through -linalg-lower-to-llvm and you get a bunch of IR, eventually containing:
module {
  func @f2(%arg0: !llvm<"{ float*, i64, [2 x i64], [2 x i64] }">, %arg1: !llvm<"{ float*, i64, [2 x i64], [2 x i64] }">, %arg2: !llvm<"{ float*, i64, [2 x i64], [2 x i64] }">, %arg3: !llvm<"{ float*, i64, [2 x i64], [2 x i64] }">, %arg4: !llvm<"{ float*, i64, [2 x i64], [2 x i64] }">) {
    llvm.br ^bb1(%1 : !llvm.i64)
   ...
    llvm.call @linalg_matmul_viewsxsxf32_viewsxsxf32_viewsxsxf32(%180, %181, %182) : (!llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">, !llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">, !llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">) -> ()
    ...
    llvm.call @linalg_matmul_viewsxsxf32_viewsxsxf32_viewsxsxf32(%183, %184, %185) : (!llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">, !llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">, !llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">) -> ()
    llvm.br ^bb3(%53 : !llvm.i64)
  ^bb5: // pred: ^bb3
    %186 = llvm.add %6, %3 : !llvm.i64
    llvm.br ^bb1(%186 : !llvm.i64)
  ^bb6: // pred: ^bb1
    llvm.return
  }
  func @linalg_matmul_viewsxsxf32_viewsxsxf32_viewsxsxf32(!llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">, !llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">, !llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">)

What would it take to make such a flow work from the NumDims + NumSyms representation in the absence of affine ops?
Is it always possible to take an arbitrary composition of layout maps and extract the proper semantics that matches an external API?
If so, isn't it akin to raising from a lower-level abstraction, information that can be propagated/progressively lowered from a more structured, higher-level abstraction? 
What would be the complexity of implementing and maintaining the features required to achieve this and how long would it block progress? 

Some concrete extra points to consider in addition to the concrete use case above:
1.  memref is a standard type but it currently has an affine-only layout; this is a first attempt at decoupling memref layout from affine semantics, with concrete use cases. Does this mean a more future-proof way is to have completely separate types for non-strictly affine data and computations? Note, this is precisely how Linalg has evolved so far, but I think it is time for unification.
2. to achieve the above Linalg transformation, non-strictly affine behavior is already used (i.e. generic tiling by parameters that may not be allowed as symbols in affine, boundary conditions on views that use min/max). Again, locking memrefs to strictly affine behavior is limiting.
3. the above reasoning also generalizes more broadly to more irregular data types (ragged, sparse) but this is not part of the current proposal.

Lastly, if there is a counter-proposal to achieve the same generic loop + library compilation that Linalg does but in the pure affine world, let's hear it and see some code.
In the meantime, I second Alex's opinion that "As for the implementation workload, a version of strided view has existed in Linalg since March.  The required changes are mostly mechanical and I expect the net reduction in code size once everything is implemented, since this will allow us to remove some of the code duplication.". The implementation is ready and functional.

The symbols associated with affine layout maps serve multiple purposes, as you know, not just strides. They could represent parametric tile sizes for data tiling, or they could be parametric shifts/padding for example. You get all at once (including compositions) without having to specialize/reimplement support for each separately.

Side note: once everything is composed into an affine layout blob of mixed tiling, shifts and padding with modulos and boundary conditions, how easy is it to generate fast code, or even better, reduce it to operations on metadata and call high-performance copy libraries with in-register vector transposes? I see similarities with extracting elemental transformations from an affine schedule, which is still an open problem AFAIK.
 


--
N

Mahesh Ravishankar

unread,
Sep 12, 2019, 1:02:47 PM9/12/19
to Alex Zinenko, Nicolas Vasilache, MLIR
On Thu, Sep 12, 2019 at 12:49 AM Alex Zinenko <zin...@google.com> wrote:


On Thu, Sep 12, 2019, 01:03 'Nicolas Vasilache' via MLIR <ml...@tensorflow.org> wrote:
On Wed, Sep 11, 2019 at 5:26 PM 'Mahesh Ravishankar' via MLIR <ml...@tensorflow.org> wrote:
If I am reading this correctly, then memref<5x5xf32> and memref<5x5xf32, [25, 1]> are different types.
Yes, although in this very particular corner case we could promote memref<5x5xf32, [25, 1]> to memref<5x5xf32>

This sounds like a reasonable behavior to me. We already do the same to eliminate identity maps from the memref.

 
1) Is it the responsibility of the operation on memrefs to see if a given stride matches the semantics of the operation. For example, currently 'addf` doesn't have any notion of strides. Presumably if it gets an operand with the stride specified, the verifier should (would?) fail. If the op truly doesn't care about the strides, then the operation verifier would have to be extended to allow that
Yes, it is the responsibility of each op that takes memrefs to specify what they want and verifiers should be adapted accordingly.
affine.load/store, std.load/store and std.dim work fine with strided memrefs.
 
2) One could interpret memref<5x5xf32, [25, 1]> as a specialization of the memref<5x5xf32>.
So any op that has an operand of type memref<5x5xf32> should be happy with a memref<5x5xf32, [25, 1]>. Is there a general mechanism to allow for this. One way would be to have MemrefStridedType which derives from a MemrefType. That would capture the specialization and ops that need just need a Memref type operand would be fine getting a MemrefStridedType opeand. Are there downsides of that approach?
In general a strided memref is not a specialization of a continuous memref, they are different types. 
The case you point out is a special case that we may want to just promote automatically.
 
3) This might be outside of the scope of this proposal, but is there a notion of "default strides". I realize that having a default with dynamics shapes is tricky, but just want to get some clarity on what is being considered along this line.
Yes a MemRef without strides implicitly has "default strides" which is what you'd get from a continuous buffer.
By abusing notations, memref<MxNxPxf32> has strides [N * P, P, 1].
This is already implicit in the existing memref address calculation and does not change.

Can we specify this in documentation, for clarity? This is also relevant for lowering to LLVM.

I second documenting this. I agree with all the points made here if there is an implicit notion of "default strides" for memrefs. AFAIK this isn't actually stated anywhere.


--
Mahesh

Uday Kumar Reddy B

unread,
Sep 16, 2019, 4:14:54 PM9/16/19
to Nicolas Vasilache, MLIR
Hi Nicolas

I'm going to respond to your numerous questions/comments in bulk here
instead of inline :-), since I think they are all inter-related, and I
wanted a particular order to it. (I'm skipping some of the questions
that I found digressing and unrelated to this strided memref proposal.)

1) I think my first question was on clarifying separately on
*representational power* and *ease of representation/transformation*.
Although I don't see an explicit response to the first part, it's clear
(and also with Alex confirming) that this proposal deals with a subset
of what affine maps can already represent. In your earlier response to
River's question, you stated
"I am much more concerned that memrefs just cannot represent
non-continuous input memory regions.", i.e., the current affine map
layout couldn't represent non-contiguous regions and that this proposal
did; but that is clearly inaccurate. It would be good to reconsider
things in view of that. Your previous email has thus been all about ease
of representation - so let's get to that.

2) Coming to the ease of representation: I think your point that it is
hard to pull out individual aspects from compound maps like strides,
tile sizes, offsets (for whatever reasons you need them) doesn't make
sense at all because you are really comparing apples to oranges there!
:-) This proposal in discussion is on strided memrefs, and if you look
at how those strides get into and are represented in affine maps, it's
just trivial to extract them out if the strides are what you wanted to
represent (just with a pattern match on the affine maps - see further
below).

Now, if the affine map is a compound map representing a composition of
several things, the comparison should be made to a design that is
roughly as powerful (representational power-wise), and is able to
represent all of that (strides, tiled layouts, nested tiled layout, all
useful compositions thereof such as strided intra-tile + permuted tile
space), and then evaluate which one is better. The representation in
this proposal really adds a very thin layer of abstraction to represent
a proper subset of what's possible with affine maps, at the expense of
new syntax, a new representation, and an alternate/duplicate lowering path.

3) Reg. your comment on the different number of dimensional arguments
and symbolic arguments to the AllocOp: I'm not sure why that's an issue.
One should let the compiler do its job here! :-) If there are really
going to be 42 symbolic arguments binding to a layout maps (albeit
unrealistic), there's probably a reason, and let the machinery propagate
stuff. The key is that the current affine map based design is unified
with the rest of the SSA IR and as such SSA values bound to the
dimensions, symbols of the alloc all benefit from the existing infra on
constant propagation into affine maps, and algebraic simplification and
canonicalization of maps.

4) Reg. your comment on hearing about a counter proposal on the affine
dialect, this is a bit surprising, since the counter proposal you are
asking for is already the current memref design we have! :-) It already
supports tiled layouts (including nested ones), strides, shifts, and
their compositions, in design, and the implementation works too when the
tile sizes/strides are constant. As I mentioned, extending to dynamic
ones has two implementation alternatives and there are reasons for both
to exist:

(a) normalizeMemRefs, which turns the memref into one with identity
layout, and so the LLVM lowering won't have to be updated and it would
just work as is. The C++ struct you show with "ArrayRef<Value *>
symbols" will not be needed since the normalized memref with an identity
map doesn't have symbols! And as I mentioned, if you have a memref like
this:
memref<? x ? x f32, (d0, d1)[s0, s1] -> (s0 * d0 + s1 * d1)>
(which is exactly what you get with strides s0, s1 in your proposal)

the normalized memref will just have (d0, d1) -> (d0, d1) and you get
the index remapping/composition with all the loads/stores, i.e.,
%A[%i, %j] becomes %A[%s0 * %i + %s1 * %j].
And you can do this normalization as late as you want - probably while
lowering away affine stuff. If you are normalizing it, you have to
normalize the memref everywhere. Note that when one updates the memref
layout at the caller, the memref's type has changed, and as such the
callee function would have to be cloned for the new memref type, which
is the right thing to do since a different layout map drastically
impacts how the function should be optimized. But this nice benefit is
common to all proposals that put layout info into the type.

(b) keep track of the symbols the same way as the shape SSA values
associated with the dynamic dimensions in the LLVM lowering. You need
this design only to preserve/hide complex non-identity layouts *all the
way* up until the LLVM lowering - normally, one would like to just
normalize during -lower-affine, because the linearized stuff can be left
alone as we are anyway out of affine.load/stores. Or you could do that
just before LLVM lowering in order to separate concerns and not make the
really nice LLVM lowering we currently have heavy. With all these design
points, the LLVM lowering will NOT see any symbolic operands on the
alloc's or views!

5) At several points, you claim that affine is limiting in many ways and
give orthogonal examples: but the thing in discussion is *this* layout
proposal and what is being proposed here is a small subset of what can
already be achieved with affine maps. It doesn't make sense to make the
affine restriction argument unless the proposal can deal with something
that can't already be represented with affine maps (for eg. for
affine.for, the case of non-constant steps is a valid one, but this
proposal isn't about loop structure). So let's stick to the topic at hand.

6) The danger with the proposed representation is that if someone needs
something more in the future, this specialized representation and syntax
are hard to adapt/extend to it, and you'll have to go back to the
drawing board. And consequently, one gets a potpourri of different
syntaxes/representations attached to the memref, with the grammar looking like:

memref<? x ? x f32, rep1 | rep2 | rep3 | ...>,

leading to more duplication in infra and confusion among users as to
which one to use.

In contrast, *I have yet to see a dense tensor layout used in practice
that can't be represented with a (semi-)affine map*. It would be great
to see an example, and even if you have one, that still doesn't impact
this proposal. Overall, decoupling memrefs from affine maps merely because
the memref is a standard/core type and the layout is an affine thing
isn't meaningful by itself - we should of course still try that, but in
the presence of a design that is either adding more representational
power or making it more convenient to represent/transform something.

7) On your remark on seeing code on a counter proposal: let's not merely
use the existence of code / the amount of additional code needed to
evaluate a design given the current context -- because as I mentioned,
the current proposal and the existing affine map based memref achieve
very different design end-goals from the point of view of
representational power. And in any case, I don't think either requires
any complex implementation at all. So, let's just primarily focus on the
design.

8) I also disagree with some of the other digressing comments like the
ones on non-affine behavior, "generating fast code, calling copy
libraries with in-register transpose", etc. Besides being unrelated to
the current layout proposal, I think the conclusions are inaccurate!
Using the affine dialect doesn't mean you can't do such things (MLIR
mixes affine with SSA and standard op variants after all), but I'd like
to avoid the discussion from diverging --- those would require equally
long responses/discussions as this one. :-)

To conclude, I think this proposal really adds a very thin layer of
abstraction at the expense of new syntax and a new representation; it
only achieves what is already possible with the current memref design,
and does so in a manner not as unified with the rest of MLIR as current
memrefs are. Without augmenting it further to represent more, I don't
think it justifies the cost of changing the IR and the type system.

~ Uday