RFC: Strided MemRef Proposal (a.k.a. Views)


Nicolas Vasilache

Sep 9, 2019, 4:20:38 PM
to MLIR
Hello everyone, this RFC addresses a subset of the more general discussion on tensor and memref layouts.

This RFC is for adding an optional stride to MemRefType as a way to:
1. unify existing codegen strategies (Affine and Linalg dialects) by using the same standard type (i.e. !linalg.view is replaced everywhere by a strided memref);
2. provide a standard type that is closed under a set of interesting operations and passes function boundaries in a generic way (i.e. no need to clone and specialize a new function for each small change in layout);
3. provide an intuitive entry point to interfacing with external library calls operating on strided array objects. Such libraries often perform dynamic dispatch to adapt the strategy to problem sizes, alignment and strides between elements (see for example the primitive interface to BLAS that MLIR targets from the Linalg dialect).

This is also related to the presentation by Intel in one of the past open design meetings (see Aug 22nd 2019 recording and a primer on strides in memory formats).

The current proposal prioritizes point 1. above by replacing the !linalg.view type with a strided memref type. This consists of optional stride information on the memref type that can be either static or dynamic. In the dynamic case, the information is opaque to the affine and linalg dialects and only materializes when lowering to LLVM IR for the purpose of address calculation.

Proposed syntax
An optional stride-list is added to the memref-type 
``` {.ebnf}
memref-type ::= `memref` `<` dimension-list-ranked tensor-memref-element-type
                (`,` semi-affine-map-composition)? (`,` memory-space)?
                (`,` stride-list)? `>`

semi-affine-map-composition ::= (semi-affine-map `,` )* semi-affine-map
memory-space ::= integer-literal
stride-list ::= `[` integer (`,` integer)* `]`
```

A stride specification is a list of integer values that are either non-negative (static case) or `-1` (dynamic case). Strides encode the distance in memory between successive entries along a particular dimension. For example, `memref<42x16xf32, [1, 64]>` specifies a view into a non-contiguous memory region of `42` by `16` `f32` elements in which the distance between two consecutive elements along the outer dimension is `sizeof(f32)` bytes and the distance between two consecutive elements along the inner dimension is `64 * sizeof(f32)` bytes. This corresponds to a column-major "view" of the non-contiguous memory region.
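
For illustration only, here is a minimal C++ sketch (not part of the proposal) of the element offsets this type implies; offsets are in `f32` elements relative to the view's base:
```
#include <cstdint>
#include <cstdio>

// Offset (in f32 elements) of element (i, j) of memref<42x16xf32, [1, 64]>:
// stride 1 along the outer (size-42) dimension, stride 64 along the inner one.
int64_t offsetOf(int64_t i, int64_t j) { return i * 1 + j * 64; }

int main() {
  std::printf("%lld\n", static_cast<long long>(offsetOf(1, 0))); // 1: outer neighbors are adjacent
  std::printf("%lld\n", static_cast<long long>(offsetOf(0, 1))); // 64: inner neighbors are 64 elements apart
  return 0;
}
```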

At this time, strided memrefs do not allow an affine map composition layout to be specified: the two properties are mutually exclusive.

Creation Ops
Strided memrefs represent a view abstraction over preallocated data and thus cannot be allocated directly. They are constructed with special ops operating on preallocated data. It is illegal to dealloc a strided memref. The first users of strided memrefs will be the Linalg dialect operations that operate on !linalg.view types. A useful property is that the type is closed under these transformations (generalized transpose, slice, subview, view).

Aliasing
The specification of strides must not alias internally: given an n-D strided memref, indices `(i1, ..., in)` and `(j1, ..., jn)` may not refer to the same memory address unless `i1 == j1, ..., in == jn` (for example, `memref<2x2xf32, [1, 1]>` would map both `(0, 1)` and `(1, 0)` to offset `1` and is therefore invalid). It is undefined behavior to create strided memrefs that alias internally.
Aliasing between different strided memrefs is the subject of ongoing investigation. One possible path is to provide aliasing sets at the function boundary and view creation ops with well-defined semantics and runtime guarantees (e.g. view intersection, difference, split, concat).

Lowering to LLVM
Strided memrefs correspond conceptually to the following templated C++ struct:
```
template <typename Elem, size_t Rank>
struct StridedMemRef {    // named here purely for exposition
  Elem *ptr;              // base pointer into the underlying buffer
  int64_t offset;         // offset (in elements) of the view's first element
  int64_t sizes[Rank];    // view sizes along each dimension
  int64_t strides[Rank];  // distances (in elements) between consecutive entries
};
```
The linearization procedure for address calculation for strided memrefs is the same as for linalg views: 
`base_offset + SUM_i index_i * stride_i`.
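
As a sketch (assuming the `StridedMemRef` struct shown above and element-granularity strides), this could be written as:
```
#include <cstddef>
#include <cstdint>

// Illustrative only: computes the address of element `indices` of an n-D
// strided memref descriptor, following base_offset + SUM_i index_i * stride_i.
template <typename Elem, size_t Rank>
Elem *addressOf(const StridedMemRef<Elem, Rank> &m,
                const int64_t (&indices)[Rank]) {
  int64_t linear = m.offset;              // base_offset
  for (size_t i = 0; i < Rank; ++i)
    linear += indices[i] * m.strides[i];  // index_i * stride_i
  return m.ptr + linear;
}
```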

Implementation implications
This will result in a number of simplifications on the Linalg side: the Linalg-specific DimOp, LoadOp, and StoreOp can disappear, and !linalg.view will be replaced by strided memref. This will also make it easier for Linalg to lower to Affine.

Uday Bondhugula

Sep 10, 2019, 12:15:52 AM
to MLIR

Just wanted to point out that strides along memref dimensions can already be represented, and work all the way through lowering to LLVM IR and execution, with the current memref AffineMap layout. Just use the stride as the scaling factor in your affine map. The normalizeMemRefs utility normalizes that to an identity layout (by recomputing the shape and remapping the indices), so the lowering to LLVM already works transparently. Here's an example with stride two.

func @stride() {
  %A = alloc() : memref<64xf32, (d0) -> (2*d0)>
  affine.for %i = 0 to 64 {
    %v = affine.load %A[%i] : memref<64xf32, (d0) -> (2 * d0)>
  }
  return
}


Run this to normalize (lower the non-identity layout); notice that (d0) -> (2 * d0) now disappears from the layout map, and the access becomes 2 * IV. 

$ mlir-opt -simplify-affine-structures stride.mlir
#map0 = (d0) -> (d0 * 2)
#map1 = () -> (0)
#map2 = () -> (64)


module {
  func @stride() {
    %0 = alloc() : memref<127xf32>
    affine.for %arg0 = 0 to 64 {
      %1 = affine.load %0[%arg0 * 2] : memref<127xf32>
    }
    return
  }
}


Lower to LLVM

$ bin/mlir-opt -simplify-affine-structures  /tmp/stride.mlir -lower-affine -lower-to-cfg -lower-to-llvm


module {
  func @stride() {
    %0 = llvm.mlir.constant(127 : index) : !llvm.i64
    %1 = llvm.mlir.constant(4 : index) : !llvm.i64
    %2 = llvm.mul %0, %1 : !llvm.i64
    %3 = llvm.call @malloc(%2) : (!llvm.i64) -> !llvm<"i8*">
    %4 = llvm.bitcast %3 : !llvm<"i8*"> to !llvm<"float*">
    %5 = llvm.mlir.constant(0 : index) : !llvm.i64
    %6 = llvm.mlir.constant(64 : index) : !llvm.i64
    %7 = llvm.mlir.constant(1 : index) : !llvm.i64
    llvm.br ^bb1(%5 : !llvm.i64)
  ^bb1(%8: !llvm.i64): // 2 preds: ^bb0, ^bb2
    %9 = llvm.icmp "slt" %8, %6 : !llvm.i64
    llvm.cond_br %9, ^bb2, ^bb3
  ^bb2: // pred: ^bb1
    %10 = llvm.mlir.constant(2 : index) : !llvm.i64
    %11 = llvm.mul %8, %10 : !llvm.i64
    %12 = llvm.mlir.constant(127 : index) : !llvm.i64
    %13 = llvm.getelementptr %4[%11] : (!llvm<"float*">, !llvm.i64) -> !llvm<"float*">
    %14 = llvm.load %13 {alignment = 16 : i32} : !llvm<"float*">
    %15 = llvm.add %8, %7 : !llvm.i64
    llvm.br ^bb1(%15 : !llvm.i64)
  ^bb3: // pred: ^bb1
    llvm.return
  }
  func @malloc(!llvm.i64) -> !llvm<"i8*">
}

Execute it:

bin/mlir-opt -simplify-affine-structures  /tmp/stride.mlir -lower-affine -lower-to-cfg -lower-to-llvm | mlir-cpu-runner -e stride -O3

Strides are just one special case of affine maps; if one wants to extract them from the map, that's easy as well. And if your goal for now is just to specify them and generate code, it already works.

~ Uday

Alex Zinenko

Sep 10, 2019, 9:01:34 AM
to MLIR
Thanks for the proposal!  I appreciate the unification effort, and this looks like a step in the right direction given the tensor+memref layout discussions.  I have several questions:

1. If affine map composition and stride list are mutually exclusive, is it possible to just alternate them in the type signature to make it syntactically impossible to mix?
Like in memref<42x16xf32, [1, 64], /*memory space=*/2>
It does not seem difficult to parse.
More conceptually, do you intend for stride list and affine maps to eventually mix, or are strides more of an alternative layout representation for memrefs (tiled layout proposed for tensors and MKL can be other supported layout kinds).

2. (syntax nitpick): is it possible to use `?` instead of -1 for the stride list for dynamic sizes; `?` better reflects the fact that the stride is unknown, while `-1` can be interpreted as going in the backward direction.

3. The proposed lowering to LLVM seems to differ from what we currently have, in particular we only store _dynamic_ size whereas what you propose looks like storing _all_ sizes.  What is the rationale for that?



--
-- Alex

Alex Zinenko

Sep 10, 2019, 9:09:07 AM
to MLIR
How would you represent memref<2x5xf32, [3, 17]> ?

Independently of that, current implementation of Linalg transformation relies on stride information to be fully dynamic, which is not possible with affine maps.


Uday Bondhugula

Sep 10, 2019, 9:54:54 AM
to MLIR


On Tuesday, September 10, 2019 at 6:39:07 PM UTC+5:30, Alex Zinenko wrote:
How would you represent memref<2x5xf32, [3, 17]> ?

Are 3 and 17 the strides along the two respective dimensions? Why is this special? This should just work. The normalizeMemRefs utility will turn this memref into memref<4x29xf32> with an identity layout, and a load on the original one will be rewritten to:

              affine.load %0[%arg0 * 3, %arg1 * 7] : memref<4x29xf32>



Independently of that, current implementation of Linalg transformation relies on stride information to be fully dynamic, which is not possible with affine maps.

Do you mean strides that are SSA values (and unknown at compile time)? This is just represented using semi-affine maps and normalized to identity maps. If your strides are (s0, s1) along the two dimensions, you'll have:

memref < 128 x 128 x f32, (d0, d1) -> (s0 * d0, s1 * d1) >

The normalizeMemRefs utility can easily be extended to deal with such semi-affine maps: you'll get s0 * %i and s1 * %j for the access functions (composition works with semi-affine maps), and the alloc dimension sizes will be scaled similarly. So, I don't see any difficulty handling semi-affine maps that represent strided things. The utility can assert out with anything more complex that is semi-affine. I think the advantage of this approach is that it leaves a lot of room to explore more complex layouts, and you could incrementally implement more and more on demand: e.g., tiled layouts where the tile sizes just bind to SSA values. Tiled layouts with constant tile sizes already work.

~ Uday
 


Alex Zinenko

Sep 10, 2019, 12:35:01 PM
to Uday Bondhugula, MLIR
On Tue, Sep 10, 2019 at 3:54 PM 'Uday Bondhugula' via MLIR <ml...@tensorflow.org> wrote:


On Tuesday, September 10, 2019 at 6:39:07 PM UTC+5:30, Alex Zinenko wrote:
How would you represent memref<2x5xf32, [3, 17]> ?

Are 3 and 17 the strides along the two respective dimensions? Why is this special? This should just work. The normlizeMemRefs utility will turn this memref into memref<4x29xf32> with an identity layout, and a load on the original one will be rewritten to:

              affine.load %0[%arg0 * 3, %arg1 * 7] : memref<4x29xf32>

I don't quite get where the 4x29 comes from.

I suspect there might be a terminology confusion.  Stride in this proposal doesn't mean the same thing as in the affine world.  It is the number of _linearized_ elements you need to step over to get to the next element along the given dimension.  So stride [3, 17] is not equivalent to (i, j) -> (3*i, 17*j).  Stride [51, 17] would have been equivalent to that.  In this particular case, the linearized offsets are as follows:
[0,0] -> 0
[0,1] -> 17
[0,2] -> 34
[0,3] -> 51
[0,4] -> 68
[1,0] -> 3
[1,1] -> 20
[1,2] -> 37
[1,3] -> 54
[1,4] -> 71
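
(A tiny C++ snippet, purely illustrative, that reproduces this table:)
```
#include <cstdio>

int main() {
  // Shape 2x5 with element strides [3, 17]: offset(i, j) = 3 * i + 17 * j.
  for (int i = 0; i < 2; ++i)
    for (int j = 0; j < 5; ++j)
      std::printf("[%d,%d] -> %d\n", i, j, 3 * i + 17 * j);
  return 0;
}
```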
 



Independently of that, current implementation of Linalg transformation relies on stride information to be fully dynamic, which is not possible with affine maps.

Do you mean strides that are SSA values (and unknown at compile time)? This is just represented using semi-affine maps and normalized to identity maps. If your strides are (s0, s1) along the two dimensions, you'll have:

memref < 128 x 128 x f32, (d0, d1) -> (s0 * d0, s1 * d1) >

It seems to rather be memref<4096 x f32, (d0, d1) -> (s0*d0 + s1*d1)> for the stride model in question, which leads to premature linearization of accesses.
 


--
-- Alex

Mehdi Amini

Sep 10, 2019, 1:34:22 PM
to Alex Zinenko, MLIR
On Tue, Sep 10, 2019 at 6:01 AM 'Alex Zinenko' via MLIR <ml...@tensorflow.org> wrote:
Thanks for the proposal!  I appreciate the unification effort, and this looks like a step in the right direction given the tensor+memref layout discussions.  I have several questions:

1. If affine map composition and stride list are mutually exclusive, is it possible to just alternate them in the type signature to make it syntactically impossible to mix?
Like in memref<42x16xf32, [1, 64], /*memory space=*/2>
It does not seem difficult to parse.
More conceptually, do you intend for stride list and affine maps to eventually mix, or are strides more of an alternative layout representation for memrefs (tiled layout proposed for tensors and MKL can be other supported layout kinds).

2. (syntax nitpick): is it possible to use `?` instead of -1 for the stride list for dynamic sizes; `?` better reflects the fact that the stride is unknown, while `-1` can be interpreted as going in the backward direction.

3. The proposed lowering to LLVM seems to differ from what we currently have, in particular we only store _dynamic_ size whereas what you propose looks like storing _all_ sizes.  What is the rationale for that?

What about function boundaries? If you don't store all memrefs of a given rank the same way, you need to make new copies across function boundaries, I guess?
Inside a local scope, nothing prevents constant-propagating and compressing unused fields out of a locally used struct.

-- 
Mehdi

 

Alex Zinenko

Sep 10, 2019, 5:02:19 PM
to MLIR
On Tue, Sep 10, 2019 at 7:34 PM Mehdi Amini <ami...@google.com> wrote:


On Tue, Sep 10, 2019 at 6:01 AM 'Alex Zinenko' via MLIR <ml...@tensorflow.org> wrote:
Thanks for the proposal!  I appreciate the unification effort, and this looks like a step in the right direction given the tensor+memref layout discussions.  I have several questions:

1. If affine map composition and stride list are mutually exclusive, is it possible to just alternate them in the type signature to make it syntactically impossible to mix?
Like in memref<42x16xf32, [1, 64], /*memory space=*/2>
It does not seem difficult to parse.
More conceptually, do you intend for stride list and affine maps to eventually mix, or are strides more of an alternative layout representation for memrefs (tiled layout proposed for tensors and MKL can be other supported layout kinds).

2. (syntax nitpick): is it possible to use `?` instead of -1 for the stride list for dynamic sizes; `?` better reflects the fact that the stride is unknown, while `-1` can be interpreted as going in the backward direction.

3. The proposed lowering to LLVM seems to differ from what we currently have, in particular we only store _dynamic_ size whereas what you propose looks like storing _all_ sizes.  What is the rationale for that?

What about function boundaries? If you don't store all memref for a given rank the same way you need to make new copies across function boundaries I guess?
Inside a local scope, nothing prevent from constant-propagating and compressing a locally used struct from unused fields.


I don't think this is anyhow different from the sizes.  If the stride is static, it is present in the type itself.  If the stride is dynamic, we need a field in the struct to store that.
On concrete examples, `memref<42x?xf32>` is currently equivalent to `{float*, i64}` where the integer stores the dynamic size and 42 always remains a constant.  What this seems to propose (unless I am misreading the section), is to change that to be `{float*, i64, i64}` with integers containing _both_ sizes, regardless of them being static or dynamic in the type.  On the function boundary, you are not allowed to pass a memref<42x?xf32> where a memref<?x?xf32> is expected and vice versa, you need an exact match.
For strides, we can use exactly the same kind of encoding.  `memref<2x2xf32, [-1,1]>` will have one field for the dynamic stride, represented by -1 in the type and `memref<2x2xf32, [2,1]>` will not have any. This will pass function boundaries just as well as the current one for dynamic sizes does.  There is not necessarily a link between sizes being dynamic and strides being dynamic.
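
To make the two encodings concrete, here is a hedged C++ sketch of the descriptor layouts being compared (the struct names are invented for the example):
```
#include <cstdint>

// Current lowering of memref<42x?xf32>: only the dynamic extent is stored;
// the static 42 lives in the type, not in the descriptor.
struct CurrentDescriptor {
  float *ptr;
  int64_t size1;  // the '?' dimension
};

// Reading of the proposal: all extents are stored, whether static or dynamic.
struct ProposedDescriptor {
  float *ptr;
  int64_t size0;  // 42, duplicated from the type
  int64_t size1;  // the '?' dimension
};
```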

In fact, I originally proposed to store all sizes when going to LLVM.  We decided against it with the following arguments: (1) it creates a hazard of inconsistency between the static value visible in the type and the dynamic value stored; although we can declare such situation as undefined behavior, we did not want to have undefined behaviors at that point; (2) it makes it difficult to interface with C functions, without storing dynamic sizes, a statically-shaped memref is a bare pointer; (3) it creates storage overhead, and we did not (still don't) have CSE on the LLVM dialect.


--
-- Alex

Uday Bondhugula

Sep 10, 2019, 11:18:31 PM
to MLIR


On Tuesday, September 10, 2019 at 10:05:01 PM UTC+5:30, Alex Zinenko wrote:


On Tue, Sep 10, 2019 at 3:54 PM 'Uday Bondhugula' via MLIR <ml...@tensorflow.org> wrote:


On Tuesday, September 10, 2019 at 6:39:07 PM UTC+5:30, Alex Zinenko wrote:
How would you represent memref<2x5xf32, [3, 17]> ?

Are 3 and 17 the strides along the two respective dimensions? Why is this special? This should just work. The normlizeMemRefs utility will turn this memref into memref<4x29xf32> with an identity layout, and a load on the original one will be rewritten to:

              affine.load %0[%arg0 * 3, %arg1 * 7] : memref<4x29xf32>

I don't quite get where does 4x29 come from.

I suspect there might be a terminology confusion.  Stride in this proposal doesn't mean the same thing as in affine world.  It is a number of _linearized_ elements you need to step over to get to the next element along the given dimension.  So stride [3, 17] is not equivalent to (i, j) -> (3*i, 17*j).  Stride [51, 17] would have been equivalent to that.  In this particular case, the linearized offsets are as follows:
[0,0] -> 0
[0,1] -> 17
[0,2] -> 34
[0,3] -> 51
[0,4] -> 68
[1,0] -> 3
[1,1] -> 20
[1,2] -> 37
[1,3] -> 54
[1,4] -> 71
 



Independently of that, current implementation of Linalg transformation relies on stride information to be fully dynamic, which is not possible with affine maps.

Do you mean strides that are SSA values (and unknown at compile time)? This is just represented using semi-affine maps and normalized to identity maps. If your strides are (s0, s1) along the two dimensions, you'll have:

memref < 128 x 128 x f32, (d0, d1) -> (s0 * d0, s1 * d1) >

It seems to rather be memref<4096 x f32, (d0, d1) -> (s0*d0 + s1*d1)> for the stride model in question, which leads to premature linearization of accesses.


memref< 4096 x f32, (d0, d1) -> (s0*d0 + s1*d1)> won't be a valid memref since the map here is 2-d while the memref has a 1-d shape. I guess you meant:

memref <128 x 128 x f32, (d0, d1) [s0, s1] -> (s0 * d0 + s1 * d1) >

The accesses themselves aren't linearized until one normalizes the memref to an identity layout, i.e., you'd still have:

%A[%i, %j]  : memref<128 x 128 x f32, ....>

(128 x 128 is the logical space which the loads/stores still index).

As you point out further above, the strides per this proposal are very different from the multi-dimensional strides I was referring to, and there is nothing affine about the latter. When you have strided convolutions, for example, the strides are actually the latter, i.e., when you have a stride 2 along each dimension of a 2-d convolution, the strides are just (2, 2) and not (2 * x_dimension_size, 2).

~ Uday


Mamy Ratsimbazafy

Sep 10, 2019, 11:21:21 PM
to MLIR
Strided tensors should be allowed to have negative or zero strides.
Also, an offset that points to the first element of the tensor should be added.

1) Zero-strides are needed to represent broadcasting operations.
2) Negative strides are needed to reverse a tensor dimension without copy.

This also makes encoding dynamic strides as -1 impossible, but we can use a flag instead.

The main advantage is that it would allow MLIR to support all shape operations supported by Numpy, DLPack (a cross-framework tensor representation effort - https://github.com/dmlc/dlpack), PyTorch and all other frameworks that follow the "rank + shape + strides + offset + data" paradigm.

Mehdi AMINI

Sep 11, 2019, 1:25:10 AM
to Alex Zinenko, MLIR
On Tue, Sep 10, 2019 at 2:02 PM 'Alex Zinenko' via MLIR <ml...@tensorflow.org> wrote:


On Tue, Sep 10, 2019 at 7:34 PM Mehdi Amini <ami...@google.com> wrote:


On Tue, Sep 10, 2019 at 6:01 AM 'Alex Zinenko' via MLIR <ml...@tensorflow.org> wrote:
Thanks for the proposal!  I appreciate the unification effort, and this looks like a step in the right direction given the tensor+memref layout discussions.  I have several questions:

1. If affine map composition and stride list are mutually exclusive, is it possible to just alternate them in the type signature to make it syntactically impossible to mix?
Like in memref<42x16xf32, [1, 64], /*memory space=*/2>
It does not seem difficult to parse.
More conceptually, do you intend for stride list and affine maps to eventually mix, or are strides more of an alternative layout representation for memrefs (tiled layout proposed for tensors and MKL can be other supported layout kinds).

2. (syntax nitpick): is it possible to use `?` instead of -1 for the stride list for dynamic sizes; `?` better reflects the fact that the stride is unknown, while `-1` can be interpreted as going in the backward direction.

3. The proposed lowering to LLVM seems to differ from what we currently have, in particular we only store _dynamic_ size whereas what you propose looks like storing _all_ sizes.  What is the rationale for that?

What about function boundaries? If you don't store all memref for a given rank the same way you need to make new copies across function boundaries I guess?
Inside a local scope, nothing prevent from constant-propagating and compressing a locally used struct from unused fields.


I don't think this is anyhow different from the sizes.  If the stride is static, it is present in the type itself.  If the stride is dynamic, we need a field in the struct to store that.
On concrete examples, `memref<42x?xf32>` is currently equivalent to `{float*, i64}` where the integer stores the dynamic size and 42 always remains a constant.  What this seems to propose (unless I am misreading the section), is to change that to be `{float*, i64, i64}` with integers containing _both_ sizes, regardless of them being static or dynamic in the type.  On the function boundary, you are not allowed to pass a memref<42x?xf32> where a memref<?x?xf32> is expected and vice versa, you need an exact match.
For strides, we can use exactly the same kind of encoding.  `memref<2x2xf32, [-1,1]>` will have one field for the dynamic stride, represented by -1 in the type and `memref<2x2xf32, [2,1]>` will not have any. This will pass function boundaries just as well as the current one for dynamic sizes does.  There is not necessarily a link between sizes being dynamic and strides being dynamic.

Right, what I was alluding to is that if you have a memref<8x?x?x?x?xf32> and you want to call a function that takes a memref<?x?x?x?x?xf32> you have to reallocate and copy the structure.
The advantage of always materializing the shape in the struct is that you don't need any conversion across function calls.
 

In fact, I originally proposed to store all sizes when going to LLVM.  We decided against it with the following arguments: (1) it creates a hazard of inconsistency between the static value visible in the type and the dynamic value stored; although we can declare such situation as undefined behavior, we did not want to have undefined behaviors at that point;

Such undefined behavior does not seem terrible to me, because it is one that is trivial to catch with runtime checks (we should avoid any UB that we don't have a "cheap" and principled way of detecting reliably).
 
(2) it makes it difficult to interface with C functions, without storing dynamic sizes, a statically-shaped memref is a bare pointer;

I'm not sure I totally follow what is more difficult here: if you have a static type then you can extract the pointer of the structure and pass it to a C function. If you don't have a static shape then you can't pass this to a C function?
 
(3) it creates storage overhead, and we did not (still don't) have CSE on the LLVM dialect.

This one seems short-sighted to me: I would not design core-components of the system based on the current optimizer (or lack of thereof) state (and there is CSE at the LLVM level, we should likely try to emit the code and improve LLVM if it does not match our IR patterns).

Best,

-- 
Mehdi

 

Alex Zinenko

Sep 11, 2019, 4:17:12 AM
to MLIR
Exactly.
I believe the point of "linearized" strides is to be able to represent a hyper-rectangular subset of memref elements, e.g. to crop off padding or to do a sort of data tiling.  This encoding is used in numpy and PyTorch, among others, so there is value in supporting it.  This does not mean we have to remove the affine maps.
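
As a concrete (purely illustrative) example of such a hyper-rectangular subset: the interior 6x6 region of a row-major 8x8 buffer can be described with sizes [6, 6], strides [8, 1] and offset 9, along the lines of:
```
#include <cstdint>

// Illustrative descriptor for viewing the interior 6x6 region of a
// row-major 8x8 buffer (cropping a one-element border on every side).
struct View2D {
  float *ptr;          // start of the underlying 8x8 buffer
  int64_t offset;      // 1 * 8 + 1 = 9, the first interior element
  int64_t sizes[2];    // {6, 6}
  int64_t strides[2];  // {8, 1}: rows of the view are still 8 elements apart
};

float *elementAt(const View2D &v, int64_t i, int64_t j) {
  return v.ptr + v.offset + i * v.strides[0] + j * v.strides[1];
}
```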
 


--
-- Alex

Alex Zinenko

Sep 11, 2019, 4:26:36 AM
to MLIR
On Wed, Sep 11, 2019 at 5:21 AM Mamy Ratsimbazafy <ma...@numforge.co> wrote:
Strided tensors should be allowed to have negative or zero strides.
Also an offset, that points to the first element of the tensor should be added

1) Zero-strides are needed to represent broadcasting operations.
2) Negative strides are needed to reverse a tensor dimension without copy.

This also makes dynamic striding with -1 impossible but we can use a flag instead.

The main advantage is that it would allow MLIR to support all shape operations supported by Numpy, DLPack (a cross-framework tensor representation effort - https://github.com/dmlc/dlpack), PyTorch and all other frameworks that follow the "rank + shape + strides + offset + data" paradigm.

The problem with implicit broadcasting, or actually any other shape+stride configuration that makes the subscripts<->address relation non-bijective, is that it makes any data-related analysis significantly more complex, if possible at all.  I suppose that's why the proposal explicitly prohibits such cases, which it calls "internal aliasing".  Modeling PyTorch at _memref_ level is not something that I would consider a priority at this point.  This needs to be defined at tensor level and lowered away if going to memrefs.
 


--
-- Alex

Volodymyr Arbatov

Sep 11, 2019, 5:11:39 AM
to Alex Zinenko, MLIR
Strides (we call them "pitch" in our compiler) are useful for aligning dimension pointers while keeping original tensor dimensions. Hardware may have alignment constraints or software may prefer alignment of dimension pointers for performance reasons (not necessarily alignment of fastest dimensions) while tensor shape doesn't have to be aligned. Index mapping functions, which in my understanding are conceptually similar to affine maps, work on top of that within dimensions of a tensor shape.

- Volodymyr.
 

Alex Zinenko

Sep 11, 2019, 5:32:21 AM
to MLIR
On Wed, Sep 11, 2019 at 7:25 AM Mehdi AMINI <joke...@gmail.com> wrote:


On Tue, Sep 10, 2019 at 2:02 PM 'Alex Zinenko' via MLIR <ml...@tensorflow.org> wrote:


On Tue, Sep 10, 2019 at 7:34 PM Mehdi Amini <ami...@google.com> wrote:


On Tue, Sep 10, 2019 at 6:01 AM 'Alex Zinenko' via MLIR <ml...@tensorflow.org> wrote:
Thanks for the proposal!  I appreciate the unification effort, and this looks like a step in the right direction given the tensor+memref layout discussions.  I have several questions:

1. If affine map composition and stride list are mutually exclusive, is it possible to just alternate them in the type signature to make it syntactically impossible to mix?
Like in memref<42x16xf32, [1, 64], /*memory space=*/2>
It does not seem difficult to parse.
More conceptually, do you intend for stride list and affine maps to eventually mix, or are strides more of an alternative layout representation for memrefs (tiled layout proposed for tensors and MKL can be other supported layout kinds).

2. (syntax nitpick): is it possible to use `?` instead of -1 for the stride list for dynamic sizes; `?` better reflects the fact that the stride is unknown, while `-1` can be interpreted as going in the backward direction.

3. The proposed lowering to LLVM seems to differ from what we currently have, in particular we only store _dynamic_ size whereas what you propose looks like storing _all_ sizes.  What is the rationale for that?

What about function boundaries? If you don't store all memref for a given rank the same way you need to make new copies across function boundaries I guess?
Inside a local scope, nothing prevent from constant-propagating and compressing a locally used struct from unused fields.


I don't think this is anyhow different from the sizes.  If the stride is static, it is present in the type itself.  If the stride is dynamic, we need a field in the struct to store that.
On concrete examples, `memref<42x?xf32>` is currently equivalent to `{float*, i64}` where the integer stores the dynamic size and 42 always remains a constant.  What this seems to propose (unless I am misreading the section), is to change that to be `{float*, i64, i64}` with integers containing _both_ sizes, regardless of them being static or dynamic in the type.  On the function boundary, you are not allowed to pass a memref<42x?xf32> where a memref<?x?xf32> is expected and vice versa, you need an exact match.
For strides, we can use exactly the same kind of encoding.  `memref<2x2xf32, [-1,1]>` will have one field for the dynamic stride, represented by -1 in the type and `memref<2x2xf32, [2,1]>` will not have any. This will pass function boundaries just as well as the current one for dynamic sizes does.  There is not necessarily a link between sizes being dynamic and strides being dynamic.

Right, what I was alluding to is that if you have a memref<8x?x?x?x?xf32> and you want to call a function that takes a memref<?x?x?x?x?xf32> you have to reallocate and copy the structure.
The advantage of always materializing the shape in the struct is that you don't need any conversion across function calls.

It's more at `memref_cast` that the structure reallocation must happen, independently of function calls.  But I agree with the reasoning.
 
 

In fact, I originally proposed to store all sizes when going to LLVM.  We decided against it with the following arguments:

It's a shame you weren't in the original discussion; these are very similar to my reasoning. 
 
(1) it creates a hazard of inconsistency between the static value visible in the type and the dynamic value stored; although we can declare such situation as undefined behavior, we did not want to have undefined behaviors at that point;

Such undefined behavior does not seem terrible to me, because it is one that is trivial to catch with runtime checks (we should avoid any UB that we don't have a "cheap" and principled way of detecting reliably).

Agreed.
Independently of this discussion, it would be great to have a sort of "assert" op for runtime checks.
  
(2) it makes it difficult to interface with C functions, without storing dynamic sizes, a statically-shaped memref is a bare pointer;

I'm not sure I totally follow what is more difficult here: if you have a static type then you can extract the pointer of the structure and pass it to a C function. If you don't have a static shape then you can't pass this to a C function?

There is no "extract pointer" or any notion of bare pointer at all when it comes to memrefs.  One option would be to use zero-rank memref for that and allow one to cast memrefs and change ranks.  Another is to have a separate pointer/buffer type.
You can pass a `struct {float*, int64_t}*` to a function just fine.
 
 
(3) it creates storage overhead, and we did not (still don't) have CSE on the LLVM dialect.

This one seems short-sighted to me: I would not design core-components of the system based on the current optimizer (or lack of thereof) state (and there is CSE at the LLVM level, we should likely try to emit the code and improve LLVM if it does not match our IR patterns).

Agreed.

I am generally supportive of the change, it makes the overall model simpler, but we need to make it consciously. 


--
-- Alex

Nicolas Vasilache

Sep 11, 2019, 12:09:04 PM
to Alex Zinenko, MLIR
Thanks for your replies all, sorry for the delay I was out yesterday.
Here are some answers below.

On Tue, Sep 10, 2019 at 9:01 AM 'Alex Zinenko' via MLIR <ml...@tensorflow.org> wrote:
Thanks for the proposal!  I appreciate the unification effort, and this looks like a step in the right direction given the tensor+memref layout discussions.  I have several questions:

1. If affine map composition and stride list are mutually exclusive, is it possible to just alternate them in the type signature to make it syntactically impossible to mix?
Like in memref<42x16xf32, [1, 64], /*memory space=*/2>

Sure, consider it done.
 
It does not seem difficult to parse.
More conceptually, do you intend for stride list and affine maps to eventually mix, or are strides more of an alternative layout representation for memrefs (tiled layout proposed for tensors and MKL can be other supported layout kinds).

Technically they can during lowering, as is pointed out a bit later by Volodymyr Arbatov. However, they may create undecidable problems for dependence analysis and force the insertion of case disjunctions early in the compilation chain, which seems undesirable. Since a strided memref is a (non-contiguous) view into an existing contiguous buffer, it is subject to aliasing, and it is safer to disallow the mix for now. 
 

2. (syntax nitpick): is it possible to use `?` instead of -1 for the stride list for dynamic sizes; `?` better reflects the fact that the stride is unknown, while `-1` can be interpreted as going in the backward direction.

We can, but we would have to use a syntax such as `memref<42x16xf32, 1x?, /*memory space=*/2>` because integer array list attributes won't parse. I don't personally find this much better, but I am open to suggestions. Maybe someone has a better proposal; in any case it will be easy to modify once implemented.
 

3. The proposed lowering to LLVM seems to differ from what we currently have, in particular we only store _dynamic_ size whereas what you propose looks like storing _all_ sizes.  What is the rationale for that?

Essentially because of the discussion you already had with Mehdi Amini a bit later: simplicity and passing function boundaries (esp. going to externally precompiled function calls, multiversioning etc) vs runtime checks. The compression of the struct seems like an optimization that we can reconsider in the future once we have enough experience at the function call boundary.
 

Nicolas Vasilache

Sep 11, 2019, 12:17:30 PM
to Uday Bondhugula, MLIR
On Tue, Sep 10, 2019 at 9:54 AM 'Uday Bondhugula' via MLIR <ml...@tensorflow.org> wrote:


On Tuesday, September 10, 2019 at 6:39:07 PM UTC+5:30, Alex Zinenko wrote:
How would you represent memref<2x5xf32, [3, 17]> ?

Are 3 and 17 the strides along the two respective dimensions? Why is this special? This should just work. The normlizeMemRefs utility will turn this memref into memref<4x29xf32> with an identity layout, and a load on the original one will be rewritten to:

              affine.load %0[%arg0 * 3, %arg1 * 7] : memref<4x29xf32>

`memref<2x5xf32, [3, 17]>` is a strided view into a non-contiguous region of at least (4 * 17) + (1 * 3) + 1 elements (indexed as Alex explains below).
The memref itself is a reference to 2x5 elements.
`memref<4x29xf32>` is a contiguous region of 4x29 elements.
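
A small C++ sketch (illustrative only) of the "at least N elements" computation for a strided view:
```
#include <cstdint>

// Minimum number of underlying elements a strided view can reach:
// e.g. sizes {2, 5} with strides {3, 17} -> (1 * 3) + (4 * 17) + 1 = 72.
int64_t minBufferSize(const int64_t *sizes, const int64_t *strides, int64_t rank) {
  int64_t farthest = 0;  // offset of the farthest element relative to the base
  for (int64_t i = 0; i < rank; ++i)
    farthest += (sizes[i] - 1) * strides[i];
  return farthest + 1;
}
```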
 


--
N

Nicolas Vasilache

Sep 11, 2019, 12:17:48 PM
to Alex Zinenko, Uday Bondhugula, MLIR
On Tue, Sep 10, 2019 at 12:35 PM 'Alex Zinenko' via MLIR <ml...@tensorflow.org> wrote:


On Tue, Sep 10, 2019 at 3:54 PM 'Uday Bondhugula' via MLIR <ml...@tensorflow.org> wrote:


On Tuesday, September 10, 2019 at 6:39:07 PM UTC+5:30, Alex Zinenko wrote:
How would you represent memref<2x5xf32, [3, 17]> ?

Are 3 and 17 the strides along the two respective dimensions? Why is this special? This should just work. The normlizeMemRefs utility will turn this memref into memref<4x29xf32> with an identity layout, and a load on the original one will be rewritten to:

              affine.load %0[%arg0 * 3, %arg1 * 7] : memref<4x29xf32>

I don't quite get where does 4x29 come from.

I suspect there might be a terminology confusion.  Stride in this proposal doesn't mean the same thing as in affine world.  It is a number of _linearized_ elements you need to step over to get to the next element along the given dimension.  So stride [3, 17] is not equivalent to (i, j) -> (3*i, 17*j).  Stride [51, 17] would have been equivalent to that.  In this particular case, the linearized offsets are as follows:
[0,0] -> 0
[0,1] -> 17
[0,2] -> 34
[0,3] -> 51
[0,4] -> 68
[1,0] -> 3
[1,1] -> 20
[1,2] -> 37
[1,3] -> 54
[1,4] -> 71
 



Independently of that, current implementation of Linalg transformation relies on stride information to be fully dynamic, which is not possible with affine maps.

Do you mean strides that are SSA values (and unknown at compile time)? This is just represented using semi-affine maps and normalized to identity maps. If your strides are (s0, s1) along the two dimensions, you'll have:

memref < 128 x 128 x f32, (d0, d1) -> (s0 * d0, s1 * d1) >

It seems to rather be memref<4096 x f32, (d0, d1) -> (s0*d0 + s1*d1)> for the stride model in question, which leads to premature linearization of accesses.

+1 

Nicolas Vasilache

Sep 11, 2019, 12:41:15 PM
to Mamy Ratsimbazafy, MLIR
On Tue, Sep 10, 2019 at 11:21 PM Mamy Ratsimbazafy <ma...@numforge.co> wrote:
Strided tensors should be allowed to have negative or zero strides.
Also an offset, that points to the first element of the tensor should be added

1) Zero-strides are needed to represent broadcasting operations.
2) Negative strides are needed to reverse a tensor dimension without copy.

This also makes dynamic striding with -1 impossible but we can use a flag instead.

The main advantage is that it would allow MLIR to support all shape operations supported by Numpy, DLPack (a cross-framework tensor representation effort - https://github.com/dmlc/dlpack), PyTorch and all other frameworks that follow the "rank + shape + strides + offset + data" paradigm.

These considerations make sense at the programmer/framework/library level where automatic compiler transformations do not occur.
The current proposal sits at a level below during which we want to perform compiler transformations, so it is more restrictive.
Without loss of generality, lowering from the numpy level to the strided memref level is possible (broadcast becomes a non-indexing loop, negative stride becomes a loop reversal).
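
A rough C++ sketch (purely illustrative, not the proposed lowering) of those two rewrites:
```
#include <cstdint>

// Broadcast: instead of a zero stride along the row dimension, the row
// index simply never indexes the broadcast operand.
void broadcastAdd(float *out, const float *vec, int64_t rows, int64_t cols) {
  for (int64_t i = 0; i < rows; ++i)
    for (int64_t j = 0; j < cols; ++j)
      out[i * cols + j] += vec[j];  // 'i' does not index 'vec'
}

// Dimension reversal: instead of a negative stride, the loop runs backwards.
void copyReversed(float *out, const float *in, int64_t n) {
  for (int64_t i = 0; i < n; ++i)
    out[i] = in[n - 1 - i];
}
```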

The offset is present in the LLVM struct during lowering but it is opaque at the memref level. 
This is also for compiler transformations: the offset is only manipulated by ops that create views (generalized transpose, slice, subview, view) and that come with higher-level static guarantees (even with dynamic strides/sizes/offset which would translate to symbolic case disjunctions early in the compilation flow).

Lastly, this has proven enough for now to target a lower level library abstraction (https://github.com/tensorflow/mlir/blob/master/test/mlir-cpu-runner/cblas_interface.cpp).
When we have evidence that we need to support richer behavior at the memref / compiler transformations level, we can revisit.

Does this make sense?
 


--
N

Nicolas Vasilache

Sep 11, 2019, 12:51:01 PM
to Alex Zinenko, MLIR

River Riddle

Sep 11, 2019, 1:51:47 PM
to MLIR
How exactly does this compose with the affine compositions already present on memrefs? I'm also a bit concerned with adding a bunch of things to types, especially given the recent discussion about adding a layout attribute to tensor types. Are we ever going to inter-mix affine and strides?


On Wednesday, September 11, 2019 at 9:09:04 AM UTC-7, Nicolas Vasilache wrote:
Thanks for your replies all, sorry for the delay I was out yesterday.
Here are some answers below.

On Tue, Sep 10, 2019 at 9:01 AM 'Alex Zinenko' via MLIR <ml...@tensorflow.org> wrote:
Thanks for the proposal!  I appreciate the unification effort, and this looks like a step in the right direction given the tensor+memref layout discussions.  I have several questions:

1. If affine map composition and stride list are mutually exclusive, is it possible to just alternate them in the type signature to make it syntactically impossible to mix?
Like in memref<42x16xf32, [1, 64], /*memory space=*/2>

Sure, consider it done.
 
It does not seem difficult to parse.
More conceptually, do you intend for stride list and affine maps to eventually mix, or are strides more of an alternative layout representation for memrefs (tiled layout proposed for tensors and MKL can be other supported layout kinds).

Technically they can during lowering, as is pointed out a bit later by Volodymyr Arbatov. However, they may create undecidable problems for dependence analysis and force the insertion of case disjunction early in the compilation chain which seems undesirable. Since a strided memref is a (non-contiguous) view into an existing contiguous buffer it is subject to aliasing and is safer to disallow for now. 
 

2. (syntax nitpick): is it possible to use `?` instead of -1 for the stride list for dynamic sizes; `?` better reflects the fact that the stride is unknown, while `-1` can be interpreted as going in the backward direction.

We can but we would have to use a syntax such as `memref<42x16xf32, 1x?, /*memory space=*/2>` because integer array list attributes won't parse, I don't personally find this much better but I am open to suggestions. Maybe someone has a better proposal, in any case it will be easy to modify once implemented.
I'd rather not bake in '-1' as dynamic in the IR; this is the exact reason why dimension lists use ?. 
 

Nicolas Vasilache

Sep 11, 2019, 2:45:53 PM
to River Riddle, MLIR
On Wed, Sep 11, 2019 at 1:51 PM 'River Riddle' via MLIR <ml...@tensorflow.org> wrote:
How exactly does this compose with the affine compositions already present on memrefs?
 
They are mutually exclusive, as explicitly stated. Taking into account Alex's comment, the EBNF looks like:

``` {.ebnf}
memref-type ::= `memref` `<` dimension-list-ranked tensor-memref-element-type
                ((`,` semi-affine-map-composition) | (`,` stride-list))?
                (`,` memory-space)? `>`


semi-affine-map-composition ::= (semi-affine-map `,` )* semi-affine-map
stride-list ::= `[` integer (`,` integer)* `]`
memory-space ::= integer-literal /* | TODO: address-space-id */
```

However, as Volodymyr points out: "Index mapping functions, which in my understanding are conceptually similar to affine maps, work on top of that within dimensions of a tensor shape."
This indeed opens the door to possible interactions during lowering but I do not expect such interactions will happen during analyses.
 
I'm also a bit concerned with adding a bunch of things to types, especially given the recent discussion about adding a layout attribute to tensor types.
Concerned how? 
I am much more concerned that memrefs just cannot represent non-contiguous input memory regions.
The only way to access a subtensor of a tensor in MLIR is by encoding the information in enclosing loop bounds, which does not pass function boundaries.

Are we ever going to inter-mix affine and strides?
Yes, affine (without layout maps) only requires the "injective" indexing property which is also a prerequisite for strides.
If one wants a contiguous memref from a strided memref one can either cast (+ runtime checks) or copy. 
 

On Wednesday, September 11, 2019 at 9:09:04 AM UTC-7, Nicolas Vasilache wrote:
Thanks for your replies all, sorry for the delay I was out yesterday.
Here are some answers below.

On Tue, Sep 10, 2019 at 9:01 AM 'Alex Zinenko' via MLIR <ml...@tensorflow.org> wrote:
Thanks for the proposal!  I appreciate the unification effort, and this looks like a step in the right direction given the tensor+memref layout discussions.  I have several questions:

1. If affine map composition and stride list are mutually exclusive, is it possible to just alternate them in the type signature to make it syntactically impossible to mix?
Like in memref<42x16xf32, [1, 64], /*memory space=*/2>

Sure, consider it done.
 
It does not seem difficult to parse.
More conceptually, do you intend for stride list and affine maps to eventually mix, or are strides more of an alternative layout representation for memrefs (tiled layout proposed for tensors and MKL can be other supported layout kinds).

Technically they can during lowering, as is pointed out a bit later by Volodymyr Arbatov. However, they may create undecidable problems for dependence analysis and force the insertion of case disjunction early in the compilation chain which seems undesirable. Since a strided memref is a (non-contiguous) view into an existing contiguous buffer it is subject to aliasing and is safer to disallow for now. 
 

2. (syntax nitpick): is it possible to use `?` instead of -1 for the stride list for dynamic sizes; `?` better reflects the fact that the stride is unknown, while `-1` can be interpreted as going in the backward direction.

We can but we would have to use a syntax such as `memref<42x16xf32, 1x?, /*memory space=*/2>` because integer array list attributes won't parse, I don't personally find this much better but I am open to suggestions. Maybe someone has a better proposal, in any case it will be easy to modify once implemented.
I'd rather not bake in '-1' as dynamic in the IR, this is the exact reason why dimension lists use ?. 

Ok, my first inclination was to use an integer list attribute which is a proper MLIR attribute.
Since the dimension list stuff is just a parser abstraction (which must end with an x + element type) I guess I can evolve that into: 
 `memref<42x16xf32, 1x?xf32, /*memory space=*/2>` and force the types of size and stride to match.
Do you have a better proposal? 
 


--
N

Mamy Ratsimbazafy

Sep 11, 2019, 3:03:41 PM
to MLIR
The cblas interface is quite outdated: it only works with contiguous matrices and unit stride in the non-leading dimension.

The BLIS interface, which allows arbitrary striding (including zero/negative), provides key benefits for many use cases:

  - spaCy, the leading NLP framework, now uses BLIS by default

Now, if we had a higher-level dialect exposing NumPy-like memrefs that high-level frameworks could target, and that lowered to the dialect in this proposal, that would also work, but it seems like extra work we could avoid.

Mamy

Andy Davis

Sep 11, 2019, 3:43:46 PM
to Mamy Ratsimbazafy, MLIR
Side note: in case it wasn't clear from Nicolas' initial email, this work is a subset of work from the same working group that is proposing changes to both the Tensor and MemRef types. We'll be giving a public talk in a couple weeks to detail the proposal.


Mehdi Amini

Sep 11, 2019, 3:50:06 PM
to Andy Davis, Mamy Ratsimbazafy, MLIR
On Wed, Sep 11, 2019 at 12:43 PM 'Andy Davis' via MLIR <ml...@tensorflow.org> wrote:
Side note: in case it wasn't clear from Nicolas' initial email, this work is a subset of work from the same working group that is proposing changes to both the Tensor and MemRef types. We'll be giving a public talk in a couple weeks to detail the proposal.

Can we get an email thread to discuss the proposal ahead of the meeting?
Otherwise folks can't necessarily wrap their heads around everything and prepare the right set of questions. Also, addressing the most common questions on the mailing list can leave more room for discussing the hard parts in the meeting.

Thanks,

-- 
Mehdi
 

Nicolas Vasilache

Sep 11, 2019, 4:05:15 PM
to Mamy Ratsimbazafy, MLIR
On Wed, Sep 11, 2019 at 3:03 PM Mamy Ratsimbazafy <ma...@numforge.co> wrote:
The cblas interface is quite outdated, it only works with contiguous matrices and unit-stride in the non-leading dimension.

Well it is a minimal integration test that demonstrates things run end to end and that MLIR can call external C++ functions and produce proper results (shaking off linking and ABI issues for example).
If you look a little deeper in cblas.cpp you'll see that the impl is just a simple inefficient triple nested C++ loop: MLIR does not want to depend on external libraries but we have to ensure they can be called.
More advanced use cases can easily be built; dynamically wrapping the fastest external library calls is the bread and butter of frameworks.
 

For several use-cases the BLIS interface that allows any striding (including zero/negative) provides key benefits for many use-cases:

  - spaCy, the leading NLP framework now uses BLIS by default

Yes we have ongoing discussions and prototypes re BLIS, MKL-DNN, Eigen etc but these do not belong in MLIR core.
In the bigger picture, stride 0 immediately breaks the injective property and can be easily pre/post processed so I don't see value in adding it at this level of representation.
Negative strides are something I pondered, and I don't see a compelling reason to avoid them (as long as the injective property is maintained).
OTOH they are not the most immediate concern and don't play nicely with the dimension list parser (e.g. you can't parse `16x-4x1xf32`) or the attribute parser unless we choose `0` to encode unknown which isn't great either.
Bottom line I am happy to punt on negative static strides for now and we can reconsider later once the unification work is landed.
 

Now if we have a higher-level dialect that exposes Numpy-like memref that high-level frameworks can target and that lowers to the dialect of the specification that would also work but it seems like extra work that we could avoid.

This is a tradeoff: accepting some restrictions enables analysis and compiler transformations.
That big switch + dynamic dispatch you find in virtually all libraries and frameworks to inject static behavior under a dynamic API has to be put somewhere.
Adding it early (framework) or late (library) makes sense. 
Adding it in the middle of the compiler stack leads to undecidable problems or heavy multi-versioning; the bar to add it there is very high IMO.
JIT'ing can help with some of these issues but isn't a panacea either and the risk of specialization too early is very high.
 


--
N

Nicolas Vasilache

Sep 11, 2019, 4:07:40 PM
to Andy Davis, Mamy Ratsimbazafy, MLIR
Indeed, this is what the first sentence was meant to convey, but it never hurts to clarify:
 `Hello everyone, this RFC addresses a subset of the more general discussion on tensor layout  and memref layouts.`

Thanks Andy!



--
N

Mahesh Ravishankar

Sep 11, 2019, 5:26:43 PM
to MLIR
If I am reading this correctly, then memref<5x5xf32> and memref<5x5xf32, [25, 1]> are different types.
1) Is it the responsibility of an operation on memrefs to check whether a given stride matches the semantics of the operation? For example, currently `addf` doesn't have any notion of strides. Presumably if it gets an operand with the stride specified, the verifier should (would?) fail. If the op truly doesn't care about the strides, then the operation verifier would have to be extended to allow that.
2) One could interpret memref<5x5xf32, [25, 1]> as a specialization of memref<5x5xf32>. So any op that has an operand of type memref<5x5xf32> should be happy with a memref<5x5xf32, [25, 1]>. Is there a general mechanism to allow for this? One way would be to have a MemrefStridedType that derives from MemrefType. That would capture the specialization, and ops that just need a memref-typed operand would be fine getting a MemrefStridedType operand. Are there downsides to that approach?
3) This might be outside the scope of this proposal, but is there a notion of "default strides"? I realize that having a default with dynamic shapes is tricky, but I just want to get some clarity on what is being considered along this line.


River Riddle

Sep 11, 2019, 5:29:35 PM
to MLIR


On Wednesday, September 11, 2019 at 2:26:43 PM UTC-7, Mahesh Ravishankar wrote:
If I am reading this correctly, then memref<5x5xf32> and memref<5x5xf32, [25, 1]> are different types.
1) Is it the responsibility of the operation on memrefs to see if a given stride matches the semantics of the operation. For example, currently 'addf` doesn't have any notion of strides. Presumably if it gets an operand with the stride specified, the verifier should (would?) fail. If the op truly doesn't care about the strides, then the operation verifier would have to be extended to allow that
Just a note, but `addf` and all of the other scalar operations do not operate on memref. 

Mahesh Ravishankar

unread,
Sep 11, 2019, 5:55:19 PM9/11/19
to MLIR
Right, I realized that as soon as I posted it. Sorry for the confusion. Read `addf` as "an operation that takes operands of memref type".

Nicolas Vasilache

unread,
Sep 11, 2019, 7:03:44 PM9/11/19
to Mahesh Ravishankar, MLIR
On Wed, Sep 11, 2019 at 5:26 PM 'Mahesh Ravishankar' via MLIR <ml...@tensorflow.org> wrote:
If I am reading this correctly, then memref<5x5xf32> and memref<5x5xf32, [25, 1]> are different types.
Yes, although in this very particular corner case we could promote memref<5x5xf32, [25, 1]> to memref<5x5xf32>
 
1) Is it the responsibility of the operation on memrefs to see if a given stride matches the semantics of the operation. For example, currently 'addf` doesn't have any notion of strides. Presumably if it gets an operand with the stride specified, the verifier should (would?) fail. If the op truly doesn't care about the strides, then the operation verifier would have to be extended to allow that
Yes, it is the responsibility of each op that takes memrefs to specify what it accepts, and verifiers should be adapted accordingly.
affine.load/store, std.load/store and std.dim work fine with strided memrefs.
 
2) One could interpret memref<5x5xf32, [25, 1]> as a specialization of the memref<5x5xf32>.
So any op that has an operand of type memref<5x5xf32> should be happy with a memref<5x5xf32, [25, 1]>. Is there a general mechanism to allow for this. One way would be to have MemrefStridedType which derives from a MemrefType. That would capture the specialization and ops that need just need a Memref type operand would be fine getting a MemrefStridedType opeand. Are there downsides of that approach?
In general, a strided memref is not a specialization of a contiguous memref; they are different types.
The case you point out is a special case that we may want to just promote automatically.
 
3) This might be outside of the scope of this proposal, but is there a notion of "default strides". I realize that having a default with dynamics shapes is tricky, but just want to get some clarity on what is being considered along this line.
Yes, a memref without strides implicitly has "default strides", which is what you'd get from a contiguous buffer.
Abusing notation, memref<MxNxPxf32> has strides [N * P, P, 1].
This is already implicit in the existing memref address calculation and does not change.
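
For illustration only (a sketch, not part of the proposal; the helper names are hypothetical), this is how the implicit default strides fall out of the shape, and how one could check whether an explicit stride list is just the default:
```
#include <cstdint>
#include <vector>

// Row-major "default strides" implied by a shape: the innermost dimension
// has stride 1 and each outer stride is the product of the inner sizes,
// e.g. shape {M, N, P} -> strides {N * P, P, 1}.
std::vector<int64_t> defaultStrides(const std::vector<int64_t> &shape) {
  std::vector<int64_t> strides(shape.size(), 1);
  for (int i = static_cast<int>(shape.size()) - 2; i >= 0; --i)
    strides[i] = strides[i + 1] * shape[i + 1];
  return strides;
}

// A strided memref whose explicit strides equal the default ones carries the
// same information as the stride-less form and could be promoted to it.
bool hasDefaultStrides(const std::vector<int64_t> &shape,
                       const std::vector<int64_t> &strides) {
  return strides == defaultStrides(shape);
}
```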
 


--
N

Uday Bondhugula

unread,
Sep 11, 2019, 10:51:55 PM9/11/19
to MLIR


On Wednesday, September 11, 2019 at 9:47:30 PM UTC+5:30, Nicolas Vasilache wrote:


On Tue, Sep 10, 2019 at 9:54 AM 'Uday Bondhugula' via MLIR <ml...@tensorflow.org> wrote:


On Tuesday, September 10, 2019 at 6:39:07 PM UTC+5:30, Alex Zinenko wrote:
How would you represent memref<2x5xf32, [3, 17]> ?

Are 3 and 17 the strides along the two respective dimensions? Why is this special? This should just work. The normalizeMemRefs utility will turn this memref into memref<4x29xf32> with an identity layout, and a load on the original one will be rewritten to:

              affine.load %0[%arg0 * 3, %arg1 * 7] : memref<4x29xf32>

`memref<2x5xf32, [3, 17]>` is a strided view into a non-contiguous region of at least (4 * 17) + (1 * 3) + 1 elements (indexed as Alex explains below).
The memref itself is a reference to 2x5 elements.
`memref<4x29xf32>` is a contiguous region of 4x29 elements.

It became clear later that memref<2x5xf32, [3,17]> as per this proposal is the same as

memref<2x5xf32, (d0, d1) -> (3*d0 + 17*d1) > 

with the current memref with affine map. Is this accurate? There is no linearization of the access here. You would still have load/stores represented as %A[%i, %j] since the logical space is 2 x 5 -- so it does represent a non-contiguous region. The linearization hides in the non-identity affine map until one normalizes the memref. 
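
As a worked check (illustration only, sketched in C++ rather than IR; the function name is hypothetical), both forms imply the same address arithmetic and the same minimal footprint:
```
#include <cstdint>

// memref<2x5xf32, [3, 17]> and memref<2x5xf32, (d0, d1) -> (3*d0 + 17*d1)>
// imply the same offset computation relative to the base pointer:
int64_t linearIndex(int64_t d0, int64_t d1) { return 3 * d0 + 17 * d1; }

// The logical index space stays 2 x 5; the underlying region must hold at
// least linearIndex(1, 4) + 1 = 3 * 1 + 17 * 4 + 1 = 72 elements, and the
// addresses touched within it are non-contiguous.
```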

@alex zinenko, coming back to both of your examples, FWIW, as far as the power of representation goes, it would be good to clarify whether this stride proposal allows one to represent anything that cannot already be represented through affine maps. What would be an example? Or is the motivation purely a more convenient representation for manipulation/lowering? Both examples Alex gave (non-contiguous regions, either with constant strides or with dynamic ones) are representable with current memrefs.

~ Uday

Uday Bondhugula

unread,
Sep 11, 2019, 11:04:30 PM9/11/19
to MLIR


On Thursday, September 12, 2019 at 12:15:53 AM UTC+5:30, Nicolas Vasilache wrote:
On Wed, Sep 11, 2019 at 1:51 PM 'River Riddle' via MLIR <ml...@tensorflow.org> wrote:
How exactly does this compose with the affine compositions already present on memrefs?
 
They are mutually exclusive, as explicitly stated. Taking into account Alex's comment, the EBNF looks like:

``` {.ebnf}
memref-type ::= `memref` `<` dimension-list-ranked tensor-memref-element-type
                ((`,` semi-affine-map-composition) | (`,` stride-list))?
                (`,` memory-space)? `>`

semi-affine-map-composition ::= (semi-affine-map `,` )* semi-affine-map
stride-list ::= `[` integer (`,` integer)* `]`
memory-space ::= integer-literal /* | TODO: address-space-id */
```

However, as Volodymyr points out: "Index mapping functions, which in my understanding are conceptually similar to affine maps, work on top of that within dimensions of a tensor shape."
This indeed opens the door to possible interactions during lowering but I do not expect such interactions will happen during analyses.
 
I'm also a bit concerned with adding a bunch of things to types, especially given the recent discussion about adding a layout attribute to tensor types.
Concerned how? 
I am much more concerned that memrefs just cannot represent non-continuous input memory regions.

This is the part that isn't clear to me. As per the stride proposal syntax, anything like:

memref < n1 x n2, [s0, s1] > 

is the same as the current

memref< n1 x n2, (d0, d1) -> (d0 * s0 + d1 * s1) >

You would need buffer/view abstractions either way to separate alloc'ed memrefs from aliased ones. So is this stride proposal adding any representational power? The current memrefs can already represent non-contiguous memory regions, albeit in a different way. 

While on this, note that affine maps are canonicalized so that you have (d0 * s0) instead of (s0 * d0). So extracting the stride (if you ever want to) is trivial. For code generation and linearization, it's not even needed because you would finally transform/lower the load to:


affine.load %A [ %i * %s0 + %j * %s1 ] :   

before the LLVM lowering happens, and the current lowering to the LLVM dialect just works out of the box on this. With this new proposal, one would have to redo the same lowering/linearization or extend the LLVM lowering to deal with the strides. This is straightforward, but isn't it just duplication, providing another path to achieve the same thing?

~ Uday

Uday Bondhugula

unread,
Sep 11, 2019, 11:42:11 PM9/11/19
to MLIR


On Thursday, September 12, 2019 at 8:21:55 AM UTC+5:30, Uday Bondhugula wrote:


On Wednesday, September 11, 2019 at 9:47:30 PM UTC+5:30, Nicolas Vasilache wrote:


On Tue, Sep 10, 2019 at 9:54 AM 'Uday Bondhugula' via MLIR <ml...@tensorflow.org> wrote:


On Tuesday, September 10, 2019 at 6:39:07 PM UTC+5:30, Alex Zinenko wrote:
How would you represent memref<2x5xf32, [3, 17]> ?

Are 3 and 17 the strides along the two respective dimensions? Why is this special? This should just work. The normalizeMemRefs utility will turn this memref into memref<4x29xf32> with an identity layout, and a load on the original one will be rewritten to:

              affine.load %0[%arg0 * 3, %arg1 * 7] : memref<4x29xf32>

`memref<2x5xf32, [3, 17]>` is a strided view into a non-contiguous region of at least (4 * 17) + (1 * 3) + 1 elements (indexed as Alex explains below).
The memref itself is a reference to 2x5 elements.
`memref<4x29xf32>` is a contiguous region of 4x29 elements.

It became clear later that memref<2x5xf32, [3,17]> as per this proposal is the same as

memref<2x5xf32, (d0, d1) -> (3*d0 + 17*d1) > 

with the current memref with affine map. Is this accurate? There is no linearization of the access here. You would still have load/stores represented as %A[%i, %j] since the logical space is 2 x 5 -- so it does represent a non-contiguous region. The linearization hides in the non-identity affine map until one normalizes the memref. 


In fact, I just ran this example; it does allocate a region of 72 elements (17*4 + 3*1 + 1) - the accessed space is non-contiguous. Is this the desired behavior? The current LLVM lowering just works on it.

INPUT

func @stride() {
  %A = alloc() : memref<2x5xf32, (d0,d1) -> (3*d0 + 17*d1)>
  affine.for %i = 0 to 2 {
    affine.for %j = 0 to 5 {
      affine.load %A[%i, %j] : memref<2x5xf32, (d0,d1) -> (3*d0 + 17*d1)>
    }
  }
  return
}

OUTPUT
$ mlir-opt -simplify-affine-structures  /tmp/stride.mlir 
#map0 = (d0, d1) -> (d0 * 3 + d1 * 17)

module {
  func @stride() {
    %0 = alloc() : memref<72xf32>
    affine.for %arg0 = 0 to 2 {
      affine.for %arg1 = 0 to 5 {
        %1 = affine.load %0[%arg0 * 3 + %arg1 * 17] : memref<72xf32>
      }
    }
    return
  }
}

Alex Zinenko

unread,
Sep 12, 2019, 3:49:35 AM9/12/19
to Nicolas Vasilache, Mahesh Ravishankar, MLIR


On Thu, Sep 12, 2019, 01:03 'Nicolas Vasilache' via MLIR <ml...@tensorflow.org> wrote:
On Wed, Sep 11, 2019 at 5:26 PM 'Mahesh Ravishankar' via MLIR <ml...@tensorflow.org> wrote:
If I am reading this correctly, then memref<5x5xf32> and memref<5x5xf32, [25, 1]> are different types.
Yes, although in this very particular corner case we could promote memref<5x5xf32, [25, 1]> to memref<5x5xf32>

This sounds like a reasonable behavior to me. We already do the same to eliminate identity maps from the memref.

 
1) Is it the responsibility of the operation on memrefs to see if a given stride matches the semantics of the operation. For example, currently 'addf` doesn't have any notion of strides. Presumably if it gets an operand with the stride specified, the verifier should (would?) fail. If the op truly doesn't care about the strides, then the operation verifier would have to be extended to allow that
Yes, it is the responsibility of each op that takes memrefs to specify what they want and verifiers should be adapted accordingly.
affine.load/store, std.load/store and std.dim work fine with strided memrefs.
 
2) One could interpret memref<5x5xf32, [25, 1]> as a specialization of the memref<5x5xf32>.
So any op that has an operand of type memref<5x5xf32> should be happy with a memref<5x5xf32, [25, 1]>. Is there a general mechanism to allow for this. One way would be to have MemrefStridedType which derives from a MemrefType. That would capture the specialization and ops that need just need a Memref type operand would be fine getting a MemrefStridedType opeand. Are there downsides of that approach?
In general a strided memref is not a specialization of a continuous memref, they are different types. 
The case you point out is a special case that we may want to just promote automatically.
 
3) This might be outside of the scope of this proposal, but is there a notion of "default strides". I realize that having a default with dynamics shapes is tricky, but just want to get some clarity on what is being considered along this line.
Yes a MemRef without strides implicitly has "default strides" which is what you'd get from a continuous buffer.
By abusing notations, memref<MxNxPxf32> has strides [N * P, P, 1].
This is already implicit in the existing memref address calculation and does not change.

Can we specify this in documentation, for clarity? This is also relevant for lowering to LLVM.

Alex Zinenko

unread,
Sep 12, 2019, 4:50:07 AM9/12/19
to Uday Bondhugula, MLIR
Indeed, this seems representable as affine maps in case of local allocations and static strides.

How would this look with dynamic strides?  I understand you want to model them as symbols, but it also means you will have to carry around the values of strides, which would be particularly painful on function boundaries.  More generally, what's the story with your memref canonicalization on function boundaries?  If I have an external function

func @foo(memref<2x5xf32, ?x?>)
or
func @foo(memref<2x5xf32, [s0,s1](d0,d1) -> (s0*d0 + s1*d1)>, index, index)  // you need to pass in the values for s0 and s1
if you prefer

I cannot just change its signature because I have no knowledge of its internals or of its other callers.  Do you have a cast operation that would go from memref<72xf32> back to memref<2x5xf32,  [s0,s1](d0,d1) -> (s0*d0 + s1*d1)>?  Conversely, inside such a function, how does one keep track of which values are strides and which values are just function arguments?  I am not saying it is not possible, but the amount of bookkeeping external to the IR becomes worrying.

As for the implementation workload, a version of strided view has existed in Linalg since March.  The required changes are mostly mechanical and I expect the net reduction in code size once everything is implemented, since this will allow us to remove some of the code duplication.
 


--
-- Alex

Uday Bondhugula

unread,
Sep 12, 2019, 5:02:53 AM9/12/19
to MLIR
SSA values are bound to the affine map's layout symbols just like they are bound to dynamic dimensions. One should be able to extract them the same way the dim op does for shape symbols across function boundaries / dominance scopes. Because there hasn't been a use case, I guess they have not been implemented. But the treatment is no different from the SSA values attached to dynamic dimensions. In particular:

1) we are missing an 

%v = extract_symbol %A, 0 : memref<..., (d0, d1) [s0] (...)

(This would be needed to get the SSA value associated with s0 when %A escapes the dominance of the SSA symbol that was bound at alloc time.)

2) The LLVM lowering would have to treat these additional SSA values the same way it would handle SSA values associated with dynamic dimensions. Is this meaningful? 

The symbols associated with affine layout maps serve multiple purposes, as you know, not just strides. They could represent parametric tile sizes for data tiling, or they could be parametric shifts/padding for example. You get all at once (including compositions) without having to specialize/reimplement support for each separately.

~ Uday

 

Volodymyr Arbatov

unread,
Sep 12, 2019, 10:19:10 AM9/12/19
to Uday Bondhugula, MLIR
This sounds interesting. As I understand the affine map on memref is the way to express data layout, right? 
One of the potential problems with strides is that they correspond to specific, though the most commonly used, data layouts (in other words, a particular form of index mapping). When it comes to more complex data layouts, for example a multi-level nested tiled data layout, these parameters might become useless or reflect a memory organization connected to the original dimensions in a complex way.
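
For illustration (a sketch, not tied to any particular framework's layout; the function name is hypothetical), a one-level tiled/blocked layout maps a logical index through div/mod on the tile size, which a single per-dimension stride cannot express:
```
#include <cstdint>

// One-level tiled (blocked) layout for a logical N x N matrix with T x T
// tiles: tiles are laid out row-major and elements within each tile are
// also row-major. The offset depends on i / T and i % T, so it cannot be
// written as one stride per logical dimension.
int64_t tiledOffset(int64_t i, int64_t j, int64_t N, int64_t T) {
  int64_t tilesPerRow = N / T;  // assume T divides N, for simplicity
  int64_t tileIndex = (i / T) * tilesPerRow + (j / T);
  int64_t inTile = (i % T) * T + (j % T);
  return tileIndex * T * T + inTile;
}
```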


Uday Bondhugula

unread,
Sep 12, 2019, 10:39:21 AM9/12/19
to MLIR


On Thursday, September 12, 2019 at 7:49:10 PM UTC+5:30, Volodymyr Arbatov wrote:
This sounds interesting. As I understand the affine map on memref is the way to express data layout, right? 

Yes, that's right. 
 
One of potential problems with strides is that they correspond to specific, though most commonly used, data layouts (or, in other words, it's a particular form of index mapping). When it comes to more complex data layouts, for example multi-level nested tiled data layout, these parameters might become useless or reflect memory organization connected to original dimensions in a complex way.

+1.  Multi-level/nested tiled layouts are more naturally expressed as index mappings - either concretely as mod's/div's w.r.t the tile sizes as is the case with affine maps or with some special syntax for tiled layouts that an earlier thread was discussing. With the current affine map in memref, there is one example here of a tiled layout (with one level of data tiling):

Andy Davis

unread,
Sep 12, 2019, 10:41:04 AM9/12/19
to Mehdi Amini, Mamy Ratsimbazafy, MLIR
Yes. We'll send an email thread out before the meeting, with lead time to allow people to discuss and formulate questions...

Nicolas Vasilache

unread,
Sep 12, 2019, 11:46:06 AM9/12/19
to Uday Bondhugula, MLIR
Yes, and in the general case what does it look like in terms of LLVM memory descriptor?
```
template <typename Elem, size_t NumDims, size_t NumSyms>
struct {
  Elem *ptr;                 // base pointer
  int64_t sizes[NumDims];    // dynamic sizes
  int64_t symbols[NumSyms];  // values bound to the layout map's symbols
};
```
If we went with an implementation/convention similar to the one that exists for MemRef -> LLVM today, NumDims can be an arbitrary value in [0, Rank] and NumSyms can be an arbitrary value (e.g. 42, because why not).

Part of this proposal is to use a standard form that is already widespread in the ML community (in which NumDims == NumSyms == Rank with an extra offset).
i.e.
```
template <typename Elem, size_t Rank>
struct {
  Elem *ptr;               // base pointer into the underlying buffer
  int64_t offset;          // offset from ptr, in number of elements
  int64_t sizes[Rank];     // size along each dimension
  int64_t strides[Rank];   // distance between consecutive elements,
                           // per dimension, in number of elements
};
```
Metadata operations like view, slice, arbitrary transpose, etc. are closed under this representation. A standard form is good: it reduces surprises and implementation complexity.
As mentioned in another reply, this is an intermediate level of abstraction between numpy, torch, mxnet/dlpack on one hand and affine on the other hand which helps progressive lowering.
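
As a sketch of what "closed under metadata operations" means for this descriptor (the rank-2 specialization and the names View2D/transpose/subview/elem below are illustrative, not the Linalg implementation):
```
#include <cstdint>
#include <utility>

// Hypothetical rank-2 instance of the descriptor above.
struct View2D {
  float *ptr;
  int64_t offset;
  int64_t sizes[2];
  int64_t strides[2];
};

// Element address: ptr[offset + i * strides[0] + j * strides[1]].
float &elem(const View2D &v, int64_t i, int64_t j) {
  return v.ptr[v.offset + i * v.strides[0] + j * v.strides[1]];
}

// Transpose only permutes sizes and strides; no data moves.
View2D transpose(View2D v) {
  std::swap(v.sizes[0], v.sizes[1]);
  std::swap(v.strides[0], v.strides[1]);
  return v;
}

// A rectangular subview only adjusts the offset and the sizes; the strides
// are unchanged (they would be multiplied by the step for a strided slice).
View2D subview(View2D v, int64_t i0, int64_t n0, int64_t i1, int64_t n1) {
  v.offset += i0 * v.strides[0] + i1 * v.strides[1];
  v.sizes[0] = n0;
  v.sizes[1] = n1;
  return v;
}
```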

Now let's consider what happens at function boundaries, I see 3 cases:
1. a memref + layout map doesn't lower to LLVM because you need a new type (a function) to encode each different layout. I suspect this is (one of) the reason(s) why normalizeMemRefs exists?
2. since 1. does not pass function boundaries, one would as per your example above "Run this to normalize (lower the non-identity layout); notice that (d0) -> (2 * d0) now disappears from the layout map, and the access becomes 2 * IV.". What this does is move the information from the data type into operations.
3. strided memrefs as proposed just pass the function boundary because they are in a normal form (i.e. the function in 1. is fixed).

Point 2. above could be seen as a workaround for passing memrefs across function boundaries by moving the burden of translation to a predefined set of operations.
Now, since memref is a standard type, let's imagine that there exists a dialect in which there are other types of ops than the affine ones and that take memrefs (MLIR is supposed to be extensible after all).
How does all this work in the context of new ops?

To make it super concrete, here is a fusion example in strided memref form with higher-level ops that have useful semantics (as in they can be lowered to existing implementations that run at 90%+ peak):

INPUT:
func @f2(%A: memref<?x?xf32, [?, 1]>, %B: memref<?x?xf32, [?, 1]>, %C: memref<?x?xf32, [?, 1]>, %D: memref<?x?xf32, [?, 1]>, %E: memref<?x?xf32, [?, 1]>) {
  linalg.matmul(%A, %B, %C) : memref<?x?xf32, [?, 1]>, memref<?x?xf32, [?, 1]>, memref<?x?xf32, [?, 1]>
  linalg.matmul(%C, %D, %E) : memref<?x?xf32, [?, 1]>, memref<?x?xf32, [?, 1]>, memref<?x?xf32, [?, 1]>
  return
}

OUTPUT:
mlir-opt ./third_party/llvm/llvm/projects/google_mlir/test/Dialect/Linalg/fusion.mlir -linalg-fusion -linalg-fusion-tile-sizes=8,16 -canonicalize -cse

Note that this would also work out of the box with parametric tile sizes (but not from the command line), but this is beside the point.
I'll refer you to the .td in the Linalg dialect for details. 

  func @f2(%arg0: memref<?x?xf32, [?, 1]>, %arg1: memref<?x?xf32, [?, 1]>, %arg2: memref<?x?xf32, [?, 1]>, %arg3: memref<?x?xf32, [?, 1]>, %arg4: memref<?x?xf32, [?, 1]>) {
    %c1 = constant 1 : index
    %c0 = constant 0 : index
    %c16 = constant 16 : index
    %c8 = constant 8 : index
    %0 = dim %arg2, 0 : memref<?x?xf32, [?, 1]>
    %1 = dim %arg3, 1 : memref<?x?xf32, [?, 1]>
    loop.for %arg5 = %c0 to %0 step %c8 {
      loop.for %arg6 = %c0 to %1 step %c16 {
        %2 = affine.apply #map0(%arg5)
        %3 = dim %arg2, 1 : memref<?x?xf32, [?, 1]>
        %4 = linalg.subview %arg2[%arg5, %2, %c1, %c0, %3, %c1] : memref<?x?xf32, [?, 1]>
        %5 = dim %arg3, 0 : memref<?x?xf32, [?, 1]>
        %6 = affine.apply #map1(%arg6)
        %7 = linalg.subview %arg3[%c0, %5, %c1, %arg6, %6, %c1] : memref<?x?xf32, [?, 1]>
        %8 = linalg.subview %arg4[%arg5, %2, %c1, %arg6, %6, %c1] : memref<?x?xf32, [?, 1]>
        %9 = dim %arg0, 1 : memref<?x?xf32, [?, 1]>
        %10 = linalg.subview %arg0[%arg5, %2, %c1, %c0, %9, %c1] : memref<?x?xf32, [?, 1]>
        %11 = linalg.subview %arg1[%c0, %9, %c1, %c0, %3, %c1] : memref<?x?xf32, [?, 1]>
        linalg.matmul(%10, %11, %4) : memref<?x?xf32, [?, 1]>, memref<?x?xf32, [?, 1]>, memref<?x?xf32, [?, 1]>
        linalg.matmul(%4, %7, %8) : memref<?x?xf32, [?, 1]>, memref<?x?xf32, [?, 1]>, memref<?x?xf32, [?, 1]>
      }
    }
    return
  }
 
Pass this through -linalg-lower-to-llvm and you get a bunch of IR, eventually containing:
module {
  func @f2(%arg0: !llvm<"{ float*, i64, [2 x i64], [2 x i64] }">, %arg1: !llvm<"{ float*, i64, [2 x i64], [2 x i64] }">, %arg2: !llvm<"{ float*, i64, [2 x i64], [2 x i64] }">, %arg3: !llvm<"{ float*, i64, [2 x i64], [2 x i64] }">, %arg4: !llvm<"{ float*, i64, [2 x i64], [2 x i64] }">) {
    llvm.br ^bb1(%1 : !llvm.i64)
   ...
    llvm.call @linalg_matmul_viewsxsxf32_viewsxsxf32_viewsxsxf32(%180, %181, %182) : (!llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">, !llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">, !llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">) -> ()
    ...
    llvm.call @linalg_matmul_viewsxsxf32_viewsxsxf32_viewsxsxf32(%183, %184, %185) : (!llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">, !llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">, !llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">) -> ()
    llvm.br ^bb3(%53 : !llvm.i64)
  ^bb5: // pred: ^bb3
    %186 = llvm.add %6, %3 : !llvm.i64
    llvm.br ^bb1(%186 : !llvm.i64)
  ^bb6: // pred: ^bb1
    llvm.return
  }
  func @linalg_matmul_viewsxsxf32_viewsxsxf32_viewsxsxf32(!llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">, !llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">, !llvm<"{ float*, i64, [2 x i64], [2 x i64] }*">)

What would it take to make such a flow work from the NumDims + NumSyms representation in the absence of affine ops?
Is it always possible to take an arbitrary composition of layout maps and extract the proper semantics that matches an external API?
If so, isn't it akin to raising from a lower-level abstraction, information that can be propagated/progressively lowered from a more structured, higher-level abstraction? 
What would be the complexity of implementing and maintaining the features required to achieve this and how long would it block progress? 

Some concrete extra points to consider in addition to the concrete use case above:
1.  memref is a standard type but it currently has an affine-only layout; this is a first attempt at decoupling memref layout from affine semantics, with concrete use cases. Does this mean a more future-proof way is to have completely separate types for non-strictly affine data and computations? Note, this is precisely how Linalg has evolved so far, but I think it is time for unification.
2. to achieve the above Linalg transformation, non-strictly affine behavior is already used (i.e. generic tiling by parameters that may not be allowed as symbols in affine, boundary conditions on views that use min/max). Again, locking memrefs to strictly affine behavior is limiting.
3. the above reasoning also generalizes more broadly to more irregular data types (ragged, sparse) but this is not part of the current proposal.

Lastly, if there is a counter-proposal to achieve the same generic loop + library compilation that Linalg does but in the pure affine world, let's hear it and see some code.
In the meantime, I second Alex's opinion that "As for the implementation workload, a version of strided view has existed in Linalg since March.  The required changes are mostly mechanical and I expect the net reduction in code size once everything is implemented, since this will allow us to remove some of the code duplication.". The implementation is ready and functional.

The symbols associated with affine layout maps serve multiple purposes, as you know, not just strides. They could represent parametric tile sizes for data tiling, or they could be parametric shifts/padding for example. You get all at once (including compositions) without having to specialize/reimplement support for each separately.

Side note: once everything is composed into an affine layout blob of mixed tiling, shifts and padding with modulos and boundary conditions, how easy is it to generate fast code, or even better, reduce it to operations on metadata and call high-performance copy libraries with in-register vector transposes? I see similarities with extracting elemental transformations from an affine schedule, which is still an open problem AFAIK.
 


--
N

Mahesh Ravishankar

unread,
Sep 12, 2019, 1:02:47 PM9/12/19
to Alex Zinenko, Nicolas Vasilache, MLIR
On Thu, Sep 12, 2019 at 12:49 AM Alex Zinenko <zin...@google.com> wrote:


On Thu, Sep 12, 2019, 01:03 'Nicolas Vasilache' via MLIR <ml...@tensorflow.org> wrote:
On Wed, Sep 11, 2019 at 5:26 PM 'Mahesh Ravishankar' via MLIR <ml...@tensorflow.org> wrote:
If I am reading this correctly, then memref<5x5xf32> and memref<5x5xf32, [25, 1]> are different types.
Yes, although in this very particular corner case we could promote memref<5x5xf32, [25, 1]> to memref<5x5xf32>

This sounds like a reasonable behavior to me. We already do the same to eliminate identity maps from the memref.

 
1) Is it the responsibility of the operation on memrefs to see if a given stride matches the semantics of the operation. For example, currently 'addf` doesn't have any notion of strides. Presumably if it gets an operand with the stride specified, the verifier should (would?) fail. If the op truly doesn't care about the strides, then the operation verifier would have to be extended to allow that
Yes, it is the responsibility of each op that takes memrefs to specify what they want and verifiers should be adapted accordingly.
affine.load/store, std.load/store and std.dim work fine with strided memrefs.
 
2) One could interpret memref<5x5xf32, [25, 1]> as a specialization of the memref<5x5xf32>.
So any op that has an operand of type memref<5x5xf32> should be happy with a memref<5x5xf32, [25, 1]>. Is there a general mechanism to allow for this. One way would be to have MemrefStridedType which derives from a MemrefType. That would capture the specialization and ops that need just need a Memref type operand would be fine getting a MemrefStridedType opeand. Are there downsides of that approach?
In general a strided memref is not a specialization of a continuous memref, they are different types. 
The case you point out is a special case that we may want to just promote automatically.
 
3) This might be outside of the scope of this proposal, but is there a notion of "default strides". I realize that having a default with dynamics shapes is tricky, but just want to get some clarity on what is being considered along this line.
Yes a MemRef without strides implicitly has "default strides" which is what you'd get from a continuous buffer.
By abusing notations, memref<MxNxPxf32> has strides [N * P, P, 1].
This is already implicit in the existing memref address calculation and does not change.

Can we specify this in documentation, for clarity? This is also relevant for lowering to LLVM.

I second documenting this. I agree with all the points made here if there is an implicit notion of "default strides" for memrefs. AFAIK this isn't actually stated anywhere.


--
Mahesh

Uday Kumar Reddy B

unread,
Sep 16, 2019, 4:14:54 PM9/16/19
to Nicolas Vasilache, MLIR
Hi Nicolas

I'm going to respond to your numerous questions/comments in bulk here
instead of inline :-), since I think they are all inter-related, and I
wanted a particular order to it. (I'm skipping some of the questions
that I found digressing and unrelated to this strided memref proposal.)

1) I think my first question was on clarifying separately on
*representational power* and *ease of representation/transformation*.
Although I don't see an explicit response to the first part, it's clear
(and also with Alex confirming) that this proposal deals with a subset
of what affine maps can already represent. In your earlier response to
River's question, you stated
"I am much more concerned that memrefs just cannot represent
non-continuous input memory regions.", i.e., the current affine map
layout couldn't represent non-contiguous regions and that this proposal
did; but that is clearly inaccurate. It would be good to reconsider
things in view of that. Your previous email has thus been all about ease
of representation - so let's get to that.

2) Coming to the ease of representation: I think your point that it is
hard to pull out individual aspects from compound maps like strides,
tile sizes, offsets (for whatever reasons you need them) doesn't make
sense at all because you are really comparing apples to oranges there!
:-) This proposal in discussion is on strided memrefs, and if you look
at how those strides get into and are represented in affine maps, it's
just trivial to extract them out if the strides are what you wanted to
represent (just with a pattern match on the affine maps - see further
below).

Now, if the affine map is a compound map representing a composition of
several things, the comparison should be made to a design that is
roughly as powerful (representational power-wise), and is able to
represent all of that (strides, tiled layouts, nested tiled layout, all
useful compositions thereof such as strided intra-tile + permuted tile
space), and then evaluate which one is better. The representation in
this proposal really adds a very thin layer of abstraction to represent
a proper subset of what's possible with affine maps, at the expense of
new syntax, a new representation, and an alternate/duplicate lowering path.

3) Reg. your comment on the different number of dimensional arguments
and symbolic arguments to the AllocOp: I'm not sure why that's an issue.
One should let the compiler do its job here! :-) If there are really
going to be 42 symbolic arguments binding to a layout maps (albeit
unrealistic), there's probably a reason, and let the machinery propagate
stuff. The key is that the current affine map based design is unified
with the rest of the SSA IR and as such SSA values bound to the
dimensions, symbols of the alloc all benefit from the existing infra on
constant propagation into affine maps, and algebraic simplification and
canonicalization of maps.

4) Reg. your comment on hearing about a counter proposal on the affine
dialect, this is a bit surprising, since the counter proposal you are
asking for is already the current memref design we have! :-) It already
supports tiled layouts (including nested ones), strides, shifts, and
their compositions, in design, and the implementation works too when the
tile sizes/strides are constant. As I mentioned, extending to dynamic
ones has two implementation alternatives and there are reasons for both
to exist:

(a) normalizeMemRefs, which turns the memref into one with identity
layout, and so the LLVM lowering won't have to be updated and it would
just work as is. The C++ struct you show with "ArrayRef<Value *>
symbols" will not be needed since the normalized memref with an identity
map doesn't have symbols! And as I mentioned, if you have a memref like
this:
memref<? x ? x f32, (d0, d1)[s0, s1] -> (s0 * d0 + s1 * d1)>
(which is exactly what you get with strides s0, s1 in your proposal)

the normalized memref will just have (d0, d1) -> (d0, d1) and you get
the index remapping/composition with all the loads/stores, i.e.,
%A[%i, %j] becomes %A[%s0 * %i + %s1 * %j].
And you can do this normalization as late as you want - probably while
lowering away affine stuff. If you are normalizing it, you have to
normalize the memref everywhere. Note that when one updates the memref
layout at the caller, the memref's type has changed, and as such the
callee function would have to be cloned for the new memref type, which
is the right thing to do since a different layout map drastically
impacts how the function should be optimized. But this nice benefit is
common to all proposals that put layout info into the type.

(b) keep track of the symbols the same way as the shape SSA values
associated with the dynamic dimensions in the LLVM lowering. You need
this design only to preserve/hide complex non-identity layouts *all the
way* up until the LLVM lowering - normally, one would like to just
normalize during -lower-affine, because the linearized stuff can be left
alone as we are anyway out of affine.load/stores. Or you could do that
just before LLVM lowering in order to separate concerns and not make the
really nice LLVM lowering we currently have heavy. With all these design
points, the LLVM lowering will NOT see any symbolic operands on the
alloc's or views!

5) At several points, you claim that affine is limiting in many ways and
give orthogonal examples: but the thing in discussion is *this* layout
proposal and what is being proposed here is a small subset of what can
already be achieved with affine maps. It doesn't make sense to make the
affine restriction argument unless the proposal can deal with something
that can't already be represented with affine maps (for eg. for
affine.for, the case of non-constant steps is a valid one, but this
proposal isn't about loop structure). So let's stick to the topic at hand.

6) The danger with the proposed representation is that if someone needs
something more in the future, this specialized representation and syntax
are hard to adapt/extend to it, and you'll have to go back to the
drawing board. And consequently, one gets a potpourri of different
syntaxes/representations attached to the memref, with the grammar looking like:

memref<? x ? x f32, rep1 | rep2 | rep3 | ...>,

leading to more duplication in infra and confusion among users as to
which one to use.

In contrast, *I have yet to see a dense tensor layout used in practice
that can't be represented with a (semi-)affine map*. It would be great
to see an example, and even if you have one, that still doesn't impact
this proposal. Overall, decoupling memrefs from affine maps merely because
the memref is a standard/core type and the layout is an affine thing
isn't meaningful by itself - we should of course still try that, but in
the presence of a design that is either adding more representational
power or making it more convenient to represent/transform something.

7) On your remark on seeing code on a counter proposal: let's not merely
use the existence of code / the amount of additional code needed to
evaluate a design given the current context -- because as I mentioned,
the current proposal and the existing affine map based memref achieve
very different design end-goals from the point of view of
representational power. And in any case, I don't think either requires
any complex implementation at all. So, let's just primarily focus on the
design.

8) I also disagree with some of the other digressing comments like the
ones on non-affine behavior, "generating fast code, calling copy
libraries with in-register transpose", etc. Besides being unrelated to
the current layout proposal, I think the conclusions are inaccurate!
Using the affine dialect doesn't mean you can't do such things (MLIR
mixes affine with SSA and standard op variants after all), but I'd like
to avoid the discussion from diverging --- those would require equally
long responses/discussions as this one. :-)

To conclude, I think this proposal really adds a very thin layer of
abstraction at the expense of new syntax and a new representation; it
only achieves what is already possible with the current memref design,
and does so in a manner not as unified with the rest of MLIR as current
memrefs are. Without augmenting it further to represent more, I don't
think it justifies the cost of changing the IR and the type system.

~ Uday