[llvm-dev] [RFC] Proposal for TLX: Tensor LLVM eXtensions

Kothari, Akash via llvm-dev

Nov 13, 2021, 6:26:01 PM
to llvm...@lists.llvm.org, Mendis, Charith, Adve, Vikram Sadanand, Luo, Yuanke, Khaldi, Dounia, Girkar, Milind, Noor, Abdul Rafae, Sengupta, Sudipta

**** Proposal for TLX: Tensor LLVM eXtensions
===================================================================================

 

Authors: Akash Kothari (UIUC), Abdul Rafae Noor (UIUC), Dounia Khaldi (Intel),
    Vikram Adve (UIUC), Yuanke Luo (Intel), Sudipta Sengupta (Amazon AWS),
    Milind Girkar (Intel), Charith Mendis (UIUC)

 

------------------------------------------------------------------------------------

 

 

 

*** RATIONALE
========================================================================================
Diverse hardware vendors are developing new hardware support for (mostly dense)
tensor computations, which have become increasingly important for machine
learning applications. These include both ISA extensions on CPUs and GPUs (such
as Intel AMX, Power MMA, NVIDIA’s tensor cores, AMD’s matrix cores, and
Qualcomm’s HVX vector ISA) and dedicated accelerators for compute offload (such
as NVIDIA’s NVDLA, Amazon’s Inferentia and Trainium, and numerous ML
accelerators from smaller companies). While ML workloads are the primary
motivation and likely to be the dominant use cases, other tensor-intensive
application domains, such as image processing, scientific computing, quantum
simulations, and financial modeling, can also benefit from this hardware
support, via languages like C++, DPC++, Julia, Fortran, Halide, CUDA, OpenCL,
and others.

 

LLVM can play a crucial role in making it easier for these vendors to create
optimizing compiler back-ends for their emerging hardware, if the existing
vector and matrix support in LLVM were generalized to support tensor
operations. LLVM is already widely used today by many of the vendors that
develop these tensor architectures, e.g., to target CPUs and GPUs. LLVM is
highly retargetable, by design. For CPU targets, LLVM allows an integrated
code generation framework for tensor operations, with optimized intermixing of
scalar, 1-D vector and 2-D matrix operations in the same code section (e.g., a
loop body). And LLVM has front-ends for a wide range of high-level languages,
including essentially all the languages widely used in the relevant
application domains today.

 

No existing infrastructure we know of meets these needs. MLIR is likely
the best option, and we believe it is entirely complementary to LLVM.
MLIR provides strong support for high-level tensor operations in
TOSA, relevant optimizations in Affine and Linalg, and lowering paths to
accelerators, GPUs and (via the LLVM dialect) CPUs. Crucially, however, MLIR
does not have a separate low-level code generation framework that is
retargetable to diverse hardware: it relies on LLVM for this purpose. If LLVM
were extended with tensor operations and a corresponding retargetable tensor
code generation framework, MLIR could leverage this as well. Moreover, there
are enough vendors and languages that rely heavily on LLVM (but do not use
MLIR) that it seems worthwhile to have high-quality tensor code generation in
both LLVM and MLIR. Ideally, both systems would largely share the same code.

 

The broad goal of our project is to add a retargetable tensor code generation
framework to LLVM. We are currently working on a prototype implementation
with our collaborators at Amazon AWS, Intel, IBM and Qualcomm. This RFC
focuses on the first stage: extending the LLVM IR with tensor operations,
which we refer to as TLX (Tensor LLVM eXtensions).

 



*** OVERALL PROJECT OBJECTIVES
===============================================================================
* A unified retargetable code generation and optimization framework for LLVM
  to target diverse tensor architectures with a common set of IR extensions,
  instead of using target-specific solutions.

* (Subject of this RFC.) A single set of target-agnostic tensor extensions in
  LLVM that higher-level tensor code generation frameworks such as XLA, Halide,
  TVM, MLIR, etc. can target instead of lowering to target-specific intrinsics
  in LLVM, while retaining the optimizations in these high-level frameworks.

* A pathway for LLVM-based languages such as C/C++, DPC++, Fortran, Rust,
  Julia, etc. that do not have frontends for compiler systems like MLIR, TVM,
  XLA, etc. to target modern tensor architectures by lowering to our tensor
  extensions in LLVM.

* Target-independent optimizations (e.g. peephole and generic SSA-based
  optimizations) and flexible code generation capabilities in LLVM that can
  mix instructions operating on vector and rectangular registers, together
  with cost models that help reduce register spills and maximize usage of
  available hardware resources.

* Contribute our tensor extensions (this RFC) and retargetable code generation
  framework (as a followup) to the LLVM project for the community to experiment
  with and provide feedback.

 


*** RFC: INTRODUCTION OF TENSOR CONCEPT IN LLVM
======================================================================
To achieve our objectives, we need to introduce the concept of tensors in
LLVM. To do this, we need to add a tensor type: an N-dimensional data type
that generalizes 1-D vectors and 2-D matrices. We also need crucial tensor
operations that front-ends for high-level languages can target, and that
represent, or can be implemented via, the ISAs of different tensor
architectures.


*** IMPLEMENTATION OF TENSOR TYPE IN LLVM
=======================================================================

** OVERVIEW:
----------------------
The concept of dense tensors can be implemented as a new, first-class
n-dimensional vector type in LLVM. However, doing this would be extremely
intrusive, since it would require changes to hundreds of files in LLVM. While
this may be the correct option in the long term, once the design has been
properly evaluated and refined, the effort required to do so for an initial
prototype and evaluation is not justified. So we propose to implement the
tensor concept as an LLVM intrinsic called llvm.tensor.typeinfo, while
representing tensor data in “flattened” form as ordinary LLVM vector types.
The intrinsic takes as operands a “flattened” LLVM vector, together with
shape, layout and padding vectors, and returns a value of LLVM token type.
Because it returns a token value, the intrinsic is not at risk of being
eliminated by optimizations (in particular, dead code elimination) while it
still has uses. This intrinsic is marked with the 'readnone' and
'speculatable' attributes so that it does not inhibit optimizations like
redundancy elimination, dead code elimination, code motion, etc.

token llvm.tensor.typeinfo(<llvm-vector-type> %tensor, <n x i32> %shape,
                           <n x i32> %layout, <n x i32> %padding)

** OPERANDS:
-----------------------
==============================================================================|
Operand     | Description                                                     |
============|=================================================================|
%tensor     | n-dimensional tensor value represented as a “flattened” vector  |
------------|-----------------------------------------------------------------|
%shape      | Vector of dimension values of a tensor                          |
------------|-----------------------------------------------------------------|
%layout     | Vector of permutation of dimension indices ranging from 0 to n-1|
------------|-----------------------------------------------------------------|
%padding    | Vector of padding values along every dimension of a tensor      |
==============================================================================|

** RESULT:
-----------------------
==============================================================================|
Result           | Description                                                |
=================|============================================================|
token value      | LLVM value of token type associated with a tensor value    |
==============================================================================|

** SEMANTICS:
-----------------------
The ‘llvm.tensor.typeinfo’ intrinsic is used to produce a unique token value
associated with a tensor value represented as a “flattened” vector. The layout
operand of this intrinsic is expressed as a permutation of dimension indices
(from 0 to n-1 for an n-dimensional tensor); this represents tensor layouts in
LLVM in a generic way. The number of elements in the shape, layout and padding
vectors must be the same and equal to the number of dimensions of the given
tensor.

Note that this intrinsic is only meant to hold information such as the shape,
layout and padding of a tensor value in LLVM IR. It does not read or write
memory, does not perform any computation, and does not exhibit any kind of
undefined behavior.
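
For concreteness, the following C sketch shows one plausible way in which the
layout permutation could relate a logical index to a position in the flattened
vector. This is an illustration only, not part of the proposal: it assumes a
row-major ordering in which layout[0] is the outermost (slowest-varying)
dimension, and it ignores padding.

#include <stdint.h>

/* Map a logical n-dimensional index to an offset in the flattened vector,
   assuming row-major order of the dimensions as permuted by %layout. */
uint64_t flattened_offset(const uint32_t *index,  /* logical index, length n  */
                          const uint32_t *shape,  /* %shape operand, length n */
                          const uint32_t *layout, /* %layout operand, length n */
                          unsigned n) {
  uint64_t offset = 0;
  for (unsigned i = 0; i < n; ++i) {
    unsigned d = layout[i];
    offset = offset * shape[d] + index[d];
  }
  return offset;
}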


** EXAMPLE:
-----------------------
; The first argument (%tensor) is the tensor that is being modelled as a
; flattened vector. The second argument is the shape (16 x 5 x 3), the third
; argument is the layout (<0, 1, 2>) and the fourth argument is the padding
; (<3, 2, 1> along the corresponding dimensions) for the given tensor.

%input = call token @llvm.tensor.typeinfo(<240 x float> %tensor,
  <3 x i32> <i32 16, i32 5, i32 3>, <3 x i32> <i32 0, i32 1, i32 2>,
  <3 x i32> <i32 3, i32 2, i32 1>)

; The first argument is the input virtual tensor register and the second
; argument is the new permutation of the layout of the input tensor. This
; operation produces a tensor with layout <2, 0, 1>.

%output = call <240 x float> @llvm.tensor.transpose(token %input,
                                          <3 x i32> <i32 2, i32 0, i32 1>)

; The first argument (%output) is the output tensor that is being modelled as
; a flattened vector. The second argument is the new shape (3 x 16 x 5), the
; third argument is the new layout (<2, 0, 1>) and the fourth argument is the
; new padding (<1, 3, 2> along the corresponding dimensions) for the output
; tensor.

%typed_output = call token @llvm.tensor.typeinfo(<240 x float> %output,
  <3 x i32> <i32 3, i32 16, i32 5>, <3 x i32> <i32 2, i32 0, i32 1>,
  <3 x i32> <i32 1, i32 3, i32 2>)



*** TENSOR OPERATIONS IN LLVM
=================================================================================

** INTRINSIC: llvm.tensor.load
================================

** OVERVIEW:
---------------------
This operation loads a tensor or sub-tensor with the given shape, layout and
padding from memory into a register. Unlike the existing load instruction in
LLVM, this operation is strided, so that sub-tensors can be loaded from memory.
This intrinsic is marked with the 'speculatable' attribute to prevent it from
inhibiting optimizations like redundancy elimination, dead code elimination,
code motion, etc.

token llvm.tensor.load(<element_type>* %mem_ptr, <n x i32> %shape,
                       <n x i32> %layout, <n x i32> %pad, <n x i32> %strides)

** OPERANDS:
---------------------

==============================================================================|
Operand      | Description                                                    |
=============|================================================================|
%mem_ptr     | Starting address of a tensor/subtensor in memory               |
-------------|----------------------------------------------------------------|
%shape       | Vector of dimension values of the loaded tensor/sub-tensor     |
-------------|----------------------------------------------------------------|
%layout      | Vector of permutation of dimension indices ranging from 0 to   |
             | n-1                                                            |
-------------|----------------------------------------------------------------|
%padding     | Vector of padding values along every dimension of the loaded   |
             | tensor/sub-tensor                                              |
-------------|----------------------------------------------------------------|
%strides     | Vector of strides in memory along every dimension of the loaded|
             | tensor/sub-tensor                                              |
==============================================================================|

** RESULT:
------------------
==============================================================================|
Result       | Description                                                    |
=============|================================================================|
token value  | LLVM value of token type associated with a tensor value        |
==============================================================================|

** SEMANTICS:
-----------------------
The ‘llvm.tensor.load’ intrinsic loads a tensor or sub-tensor with the given
shape, layout and padding from memory into a register. Unlike the existing load
instruction in LLVM, this operation is strided based on %strides, so that
sub-tensors, which are not laid out contiguously in memory, can be loaded. This
intrinsic reads from memory, but does not write to memory.
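
To make the addressing concrete, here is an illustrative scalar reference loop
for loading a 2-D sub-tensor. Note that the examples in this RFC express
strides in bytes in one place and in elements in another; this sketch assumes
element-granularity strides (as in the llvm.tensor.store example below) and
zero padding, so it is an assumption rather than the definitive semantics.

#include <stdint.h>

/* Reference model: gather a rows x cols sub-tensor whose rows are
   row_stride elements apart in memory into a densely packed destination. */
void tensor_load_2d_ref(const float *base, float *dst,
                        uint32_t rows, uint32_t cols, uint32_t row_stride) {
  for (uint32_t r = 0; r < rows; ++r)
    for (uint32_t c = 0; c < cols; ++c)
      dst[r * cols + c] = base[r * row_stride + c];
}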

** EXAMPLE:
---------------------
; This loads a sub-tensor from the memory location pointed to by %mem_ptr. The
; sub-tensor has the shape <16 x 6 x 4> (second argument), layout <0, 1, 2>
; (third argument) and zero padding (fourth argument). The strides in memory
; along every dimension are <0, 0, 8>, which means that the rows of the loaded
; sub-tensor have a distance of 8 bytes in memory. This produces a unique token
; %tensor.

%tensor = call token @llvm.tensor.load(i8* %mem_ptr,
                                     <3 x i32> <i32 16, i32 6, i32 4>,
                                     <3 x i32> <i32 0, i32 1, i32 2>,
                                     <3 x i32> <i32 0, i32 0, i32 0>,
                                     <3 x i32> <i32 0, i32 0, i32 8>)



** INTRINSIC: llvm.tensor.store
=================================

** OVERVIEW:
---------------------
This operation stores a tensor or subtensor from a register into memory.
Unlike the existing store instruction in LLVM, this operation is strided, so
that sub-tensors can be stored into memory. This intrinsic writes to memory
but does not read it (i.e., it can be marked 'writeonly' rather than
'readnone'), so that it does not unnecessarily inhibit optimizations like
redundancy elimination, dead code elimination, code motion, etc.

void llvm.tensor.store(<element_type>* %mem_ptr, token %tensor,
                       <n x i32> %strides)

** OPERANDS:
----------------------
==============================================================================|
Operand      | Description                                                    |
=============|================================================================|
%mem_ptr     | Starting address of a tensor/subtensor in memory               |
-------------|----------------------------------------------------------------|
%tensor      | Stored subtensor/tensor                                        |
-------------|----------------------------------------------------------------|
%strides     | Vector of strides in memory along every dimension of the stored|
             | tensor/sub-tensor                                              |
==============================================================================|

** RESULT:
------------------
Intrinsic does not return anything.


** SEMANTICS:
-----------------------
The ‘llvm.tensor.store’ intrinsic stores a tensor or subtensor from a register
into memory. Unlike the existing store instruction in LLVM, this operation is
strided based on %strides, so that sub-tensors, which are not laid out
contiguously in memory, can be stored. This intrinsic writes to memory, but
does not read from memory.


** EXAMPLE:
---------------------
%tensor_token = call token @llvm.tensor.typeinfo(<240 x float> %tensor,
                                           <3 x i32> <i32 16, i32 6, i32 4>,
                                           <3 x i32> <i32 0, i32 1, i32 2>,
                                           <3 x i32> <i32 0, i32 0, i32 0>)

; This stores a tensor to the memory location pointed to by %mem_ptr; the
; second argument is the stored tensor itself. The strides in memory along
; every dimension are <0, 12, 10> (third argument), which means that the rows
; of %tensor are stored 10*sizeof(float) bytes apart and columns of %tensor
; are 12*sizeof(float) bytes apart in memory.

call void @llvm.tensor.store(float* %mem_ptr, token %tensor_token,
                                          <3 x i32> <i32 0, i32 12, i32 10>)



** INTRINSIC: llvm.tensor.matmul
=================================

** OVERVIEW:
---------------------
This intrinsic performs batched matrix multiplication between the inner
dimensions of two multidimensional tensors. This intrinsic is marked with
the 'readnone' and 'speculatable' attributes to prevent it from inhibiting
optimizations like redundancy elimination, dead code elimination, code motion,
etc.

<vector_ty> llvm.tensor.matmul(token %input1, token %input2)


** OPERANDS:
----------------------
==============================================================================|
Operand      | Description                                                    |
=============|================================================================|
%input1      | Token value representing the first input tensor                |
-------------|----------------------------------------------------------------|
%input2      | Token value representing the second input tensor               |
==============================================================================|

** RESULT:
------------------
==============================================================================|
Result       | Description                                                    |
=============|================================================================|
vector value | Output tensor value represented as a “flattened” vector        |
==============================================================================|

** SEMANTICS:
-----------------------
The ‘llvm.tensor.matmul’ intrinsic performs batched matrix multiplication
between two input tensors. The inner two dimensions of the input tensors must
have valid matrix multiplication dimensions, and any further outer dimensions
must have matching batch sizes. This intrinsic does not read or write memory,
nor does it exhibit any kind of undefined behavior.
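
As an illustration of the intended shapes, the following C sketch gives a
scalar reference for a batched matrix multiplication of tensors of shape
[batch x M x K] and [batch x K x N]; the [batch x ...] ordering is an
assumption consistent with the text above (inner two dimensions multiplied,
outer dimensions treated as the batch), not a normative definition.

#include <stdint.h>

/* c[p] = a[p] * b[p] for every batch index p, where a[p] is M x K,
   b[p] is K x N and c[p] is M x N, all stored row-major. */
void batched_matmul_ref(const float *a, const float *b, float *c,
                        uint32_t batch, uint32_t m, uint32_t k, uint32_t n) {
  for (uint32_t p = 0; p < batch; ++p)
    for (uint32_t i = 0; i < m; ++i)
      for (uint32_t j = 0; j < n; ++j) {
        float acc = 0.0f;
        for (uint32_t q = 0; q < k; ++q)
          acc += a[(p * m + i) * k + q] * b[(p * k + q) * n + j];
        c[(p * m + i) * n + j] = acc;
      }
}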


** EXAMPLE:
---------------------
%input1 = call token @llvm.tensor.typeinfo(<12 x float> %tensor1,
                                          <2 x i32> <i32 3, i32 4>,
                                          <2 x i32> <i32 0, i32 1>,
                                          <2 x i32> <i32 0, i32 0>)

%input2 = call token @llvm.tensor.typeinfo(<12 x float> %tensor2,
                                           <2 x i32> <i32 4, i32 3>,
                                           <2 x i32> <i32 0, i32 1>,
                                           <2 x i32> <i32 0, i32 0>)

%output = call <9 x float> @llvm.tensor.matmul(token %input1,
                                                     token %input2)

%typed_output = call token @llvm.tensor.typeinfo(<9 x float> %output,
                                       <2 x i32> <i32 3, i32 3>,
                                       <2 x i32> <i32 0, i32 1>,
                                       <2 x i32> <i32 0, i32 0>)



** INTRINSIC: llvm.tensor.contract
===================================

** OVERVIEW:
---------------------
This intrinsic performs tensor contraction on two multidimensional tensors.
This intrinsic is marked with the 'readnone' and 'speculatable' attributes
to prevent it from inhibiting optimizations like redundancy elimination,
dead code elimination, code motion, etc.

<vector_ty> llvm.tensor.contract(token %input1, token %input2,
                             <m x i32> %in1_contraction_axes,
                             <m x i32> %in2_contraction_axes)


** OPERANDS:
----------------------

==============================================================================|
Operand               | Description                                           |
======================|=======================================================|
%input1               | Token value representing the first input tensor       |
----------------------|-------------------------------------------------------|
%input2               | Token value representing the second input tensor      |
----------------------|-------------------------------------------------------|
%in1_contraction_axes | Vector of m axes for the first input tensor along     |
                      | which contraction/reduction is performed              |
----------------------|-------------------------------------------------------|
%in2_contraction_axes | Vector of m axes for the second input tensor along    |
                      | which contraction/reduction is performed              |
==============================================================================|

** RESULT:
------------------

==============================================================================|
Result       | Description                                                    |
=============|================================================================|
vector value | Output tensor value represented as a “flattened” vector        |
==============================================================================|

** SEMANTICS:
-----------------------
The ‘llvm.tensor.contract’ intrinsic multiplies and contracts two input tensors
along the given axes, %in1_contraction_axes and %in2_contraction_axes. The axes
vectors contain the list of dimension indices of the input tensors along which
the reduction takes place. This intrinsic does not read or write memory, nor
does it exhibit any kind of undefined behavior.


** EXAMPLE:
---------------------
%input1 = call token @llvm.tensor.typeinfo(<8 x float> %tensor1,
                                          <3 x i32> <i32 2, i32 2, i32 2>,
                                          <3 x i32> <i32 0, i32 1, i32 2>,
                                          <3 x i32> <i32 0, i32 0, i32 0>)

%input2 = call token @llvm.tensor.typeinfo(<8 x float> %tensor2,
                                           <3 x i32> <i32 2, i32 2, i32 2>,
                                           <3 x i32> <i32 0, i32 1, i32 2>,
                                           <3 x i32> <i32 0, i32 0, i32 0>)

%output = call <8 x float> @llvm.tensor.contract(token %input1, token %input2,
                                                <2 x i32> <i32 0, i32 2>,
                                                <2 x i32> <i32 0, i32 1>)

%typed_output = call token @llvm.tensor.typeinfo(<8 x float> %output,
                                             <3 x i32> <i32 2, i32 2, i32 2>,
                                             <3 x i32> <i32 0, i32 1, i32 2>,
                                             <3 x i32> <i32 0, i32 0, i32 0>)



** INTRINSIC: llvm.tensor.umma
================================

** OVERVIEW:
---------------------
This intrinsic performs matrix multiplication between two given input tensors
and then accumulates the result using the third input tensor. This intrinsic
is marked with the 'readnone' and 'speculatable' attributes to prevent it
from inhibiting optimizations like redundancy elimination, dead code
elimination, code motion, etc.

<vector_ty> llvm.tensor.umma(token %acc, token %op1, token %op2)

** OPERANDS:
----------------------

==============================================================================|
Operand      | Description                                                    |
=============|================================================================|
%acc         | Token value representing a tensor which accumulates results    |
-------------|----------------------------------------------------------------|
%op1         | Token value representing the first input tensor                |
-------------|----------------------------------------------------------------|
%op2         | Token value representing the second input tensor               |
==============================================================================|

** RESULT:
------------------
==============================================================================|
Result       | Description                                                    |
=============|================================================================|
vector value | Output tensor value represented as a “flattened” vector        |
==============================================================================|

** SEMANTICS:
-----------------------
The ‘llvm.tensor.umma’ intrinsic performs matrix-multiply-accumulation between
two unsigned operands and accumulates the result with the given register:

%output = %acc + matmul(%op1, %op2)

When the full product (e.g. i8 * i8 -> i16) is needed, the unsigned input
tensors must be zero-extended first.
Note that this intrinsic does not read or write memory, nor does it exhibit
any kind of undefined behavior.
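
As an illustration of the widening behavior, the following C sketch models one
accumulator element of llvm.tensor.umma for i8 inputs and an i32 accumulator;
it is a scalar reference under the assumption above (unsigned inputs widened by
zero extension), not a definitive specification.

#include <stdint.h>

/* acc += dot(a_row, b_col) over k unsigned 8-bit elements, keeping the
   full product by widening to 32 bits before the multiply. */
int32_t umma_elem_ref(int32_t acc, const uint8_t *a_row, const uint8_t *b_col,
                      uint32_t k) {
  for (uint32_t i = 0; i < k; ++i)
    acc += (int32_t)((uint32_t)a_row[i] * (uint32_t)b_col[i]);
  return acc;
}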


** EXAMPLE:
---------------------
%acc = call token @llvm.tensor.typeinfo(<9 x i32> %acc_tensor,
                                        <2 x i32> <i32 3, i32 3>,
                                        <2 x i32> <i32 0, i32 1>,
                                        <2 x i32> <i32 0, i32 0>)

%op1 = call token @llvm.tensor.typeinfo(<12 x i32> %tensor1,
                                        <2 x i32> <i32 3, i32 4>,
                                        <2 x i32> <i32 0, i32 1>,
                                        <2 x i32> <i32 0, i32 0>)

%op2 = call token @llvm.tensor.typeinfo(<12 x i32> %tensor2,
                                        <2 x i32> <i32 4, i32 3>,
                                        <2 x i32> <i32 0, i32 1>,
                                        <2 x i32> <i32 0, i32 0>)

; The first argument is the accumulator virtual tensor register, and the
; second and third arguments are the input virtual tensor registers.
%output = call <9 x i32> @llvm.tensor.umma(token %acc, token %op1, token %op2)



** INTRINSIC: llvm.tensor.smma
===============================

** OVERVIEW:
---------------------
This intrinsic performs matrix multiplication between two given input tensors
and then accumulates the result using the third input tensor. This intrinsic
is marked with the 'readnone' and 'speculatable' attributes to prevent it from
inhibiting optimizations like redundancy elimination, dead code elimination,
code motion, etc.

<vector_ty> llvm.tensor.smma(token %acc, token %op1, token %op2)


** OPERANDS:
----------------------

==============================================================================|
Operand      | Description                                                    |
=============|================================================================|
%acc         | Token value representing a tensor which accumulates results    |
-------------|----------------------------------------------------------------|
%op1         | Token value representing the first input tensor                |
-------------|----------------------------------------------------------------|
%op2         | Token value representing the second input tensor               |
==============================================================================|

** RESULT:
------------------
==============================================================================|
Result       | Description                                                    |
=============|================================================================|
vector value | Output tensor value represented as a “flattened” vector        |
==============================================================================|

** SEMANTICS:
---------------------
The ‘llvm.tensor.smma’ intrinsic performs matrix-multiply-accumulation between
two signed operands and accumulates the result with the given register:

%output = %acc + matmul(%op1, %op2)

When the full product (e.g. i8 * i8 -> i16) is needed, the input tensors must
be sign-extended first.
Note that this intrinsic does not read or write memory, nor does it exhibit
any kind of undefined behavior.

** EXAMPLE:
---------------------
%acc = call token @llvm.tensor.typeinfo(<9 x i32> %acc_tensor,
                                        <2 x i32> <i32 3, i32 3>,
                                        <2 x i32> <i32 0, i32 1>,
                                        <2 x i32> <i32 0, i32 0>)

%op1 = call token @llvm.tensor.typeinfo(<12 x i32> %tensor1,
                                        <2 x i32> <i32 3, i32 4>,
                                        <2 x i32> <i32 0, i32 1>,
                                        <2 x i32> <i32 0, i32 0>)

%op2 = call token @llvm.tensor.typeinfo(<12 x i32> %tensor2,
                                        <2 x i32> <i32 4, i32 3>,
                                        <2 x i32> <i32 0, i32 1>,
                                        <2 x i32> <i32 0, i32 0>)

; The first argument is the accumulator virtual tensor register, and the
; second and third arguments are the input virtual tensor registers.
%output = call <9 x i32> @llvm.tensor.smma(token %acc, token %op1, token %op2)



** INTRINSIC: llvm.tensor.usmma
=================================

** OVERVIEW:
---------------------
This intrinsic performs matrix multiplication between two given input tensors
and then accumulates the result using the third input tensor. This intrinsic
is marked with the 'readnone' and 'speculatable' attributes to prevent it from
inhibiting optimizations like redundancy elimination, dead code elimination,
code motion, etc.

<vector_ty> llvm.tensor.usmma(token %acc, token %op1, token %op2)


** OPERANDS:
----------------------
==============================================================================|
Operand      | Description                                                    |
=============|================================================================|
%acc         | Token value representing a tensor which accumulates results    |
-------------|----------------------------------------------------------------|
%op1         | Token value representing the first input tensor                |
-------------|----------------------------------------------------------------|
%op2         | Token value representing the second input tensor               |
==============================================================================|

** RESULT:
------------------

==============================================================================|
Result       | Description                                                    |
=============|================================================================|
vector value | Output tensor value represented as a “flattened” vector        |
==============================================================================|

** SEMANTICS:
-----------------------
The ‘llvm.tensor.usmma’ intrinsic performs matrix-multiply-accumulation between
an unsigned and a signed operand and accumulates the result with the given
register:

%output = %acc + matmul(%op1, %op2)

When the full product (e.g. i8 * i8 -> i16) is needed, the input tensors must
be extended according to their signedness (the unsigned operand zero-extended,
the signed operand sign-extended) first.
Note that this intrinsic does not read or write memory, nor does it exhibit
any kind of undefined behavior.

** EXAMPLE:
---------------------
%acc = call token @llvm.tensor.typeinfo(<9 x i32> %acc_tensor,
                                        <2 x i32> <i32 3, i32 3>,
                                        <2 x i32> <i32 0, i32 1>,
                                        <2 x i32> <i32 0, i32 0>)

%op1 = call token @llvm.tensor.typeinfo(<12 x i32> %tensor1,
                                        <2 x i32> <i32 3, i32 4>,
                                        <2 x i32> <i32 0, i32 1>,
                                        <2 x i32> <i32 0, i32 0>)

%op2 = call token @llvm.tensor.typeinfo(<12 x i32> %tensor2,
                                        <2 x i32> <i32 4, i32 3>,
                                        <2 x i32> <i32 0, i32 1>,
                                        <2 x i32> <i32 0, i32 0>)

; The first argument is the accumulator virtual tensor register,
; and the second and third arguments are the input virtual tensor
; registers.
%output = call <9 x i32> @llvm.tensor.usmma(token %acc,
                                         token %op1, token %op2)



** INTRINSIC: llvm.tensor.summa
=================================

** OVERVIEW:
---------------------
This intrinsic performs matrix multiplication between two given input tensors
and then accumulates the result using the third input tensor. This intrinsic
is marked with the 'readnone' and 'speculatable' attributes to prevent it from
inhibiting optimizations like redundancy elimination, dead code elimination,
code motion, etc.

<vector_ty> llvm.tensor.summa(token %acc, token %op1, token %op2)


** OPERANDS:
----------------------

==============================================================================|
Operand      | Description                                                    |
=============|================================================================|
%acc         | Token value representing a tensor which accumulates results    |
-------------|----------------------------------------------------------------|
%op1         | Token value representing the first input tensor                |
-------------|----------------------------------------------------------------|
%op2         | Token value representing the second input tensor               |
==============================================================================|

** RESULT:
------------------

==============================================================================|
Result       | Description                                                    |
=============|================================================================|
vector value | Output tensor value represented as a “flattened” vector        |
==============================================================================|

** SEMANTICS:
-----------------------
The ‘llvm.tensor.summa’ intrinsic performs matrix-multiply-accumulation between
a signed and an unsigned operand and accumulates the result with the given
register:

%output = %acc + matmul(%op1, %op2)

When the full product (e.g. i8 * i8 -> i16) is needed, the input tensors must
be extended according to their signedness (the signed operand sign-extended,
the unsigned operand zero-extended) first.
Note that this intrinsic does not read or write memory, nor does it exhibit
any kind of undefined behavior.


** EXAMPLE:
---------------------
%acc = call token @llvm.tensor.typeinfo(<9 x i32> %acc_tensor,
                                        <2 x i32> <i32 3, i32 3>,
                                        <2 x i32> <i32 0, i32 1>,
                                        <2 x i32> <i32 0, i32 0>)

%op1 = call token @llvm.tensor.typeinfo(<12 x i32> %tensor1,
                                        <2 x i32> <i32 3, i32 4>,
                                        <2 x i32> <i32 0, i32 1>,
                                        <2 x i32> <i32 0, i32 0>)

%op2 = call token @llvm.tensor.typeinfo(<12 x i32> %tensor2,
                                        <2 x i32> <i32 4, i32 3>,
                                        <2 x i32> <i32 0, i32 1>,
                                        <2 x i32> <i32 0, i32 0>)

; The first argument is the accumulator virtual tensor register, and the
; second and third arguments are the input virtual tensor registers.
%output = call <9 x i32> @llvm.tensor.summa(token %acc, token %op1,
                                                        token %op2)



** INTRINSIC: llvm.tensor.convolution
======================================

** OVERVIEW:
---------------------
This intrinsic performs convolution between input and kernel tensors. This
intrinsic is marked with the 'readnone' and 'speculatable' attributes to
prevent it from inhibiting optimizations like redundancy elimination, dead
code elimination, code motion, etc.

<vector_ty> llvm.tensor.convolution(token %input, token %kernel,
                                   <vector_ty> %strides,
                                   <vector_ty> %input_dilations,
                                   <vector_ty> %kernel_dilations)


** OPERANDS:
----------------------
==============================================================================|
Operand           | Description                                               |
==================|===========================================================|
%input            | Token value representing the input tensor                 |
------------------|-----------------------------------------------------------|
%kernel           | Token value representing the kernel tensor                |
------------------|-----------------------------------------------------------|
%strides          | Vector containing strides values for the sliding kernel   |
------------------|-----------------------------------------------------------|
%input_dilations  | Vector containing dilation values for the input           |
------------------|-----------------------------------------------------------|
%kernel_dilations | Vector containing dilation values for the kernel          |
==============================================================================|

** RESULT:
------------------

 

==============================================================================|
Result       | Description                                                    |
=============|================================================================|
vector value | Output tensor value represented as a “flattened” vector        |
==============================================================================|

** SEMANTICS:
-----------------------
The ‘llvm.tensor.convolution’ intrinsic performs the convolution between the
input and kernel tensors, sliding the kernel over the input with the step given
by %strides and applying the dilation factors in %input_dilations and
%kernel_dilations to the input and the kernel, respectively.
This intrinsic does not read or write memory, nor does it exhibit any kind of
undefined behavior.
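
To make the roles of the strides and dilations concrete, here is an
illustrative scalar reference for a 2-D convolution. It reflects assumptions
rather than normative semantics: a dilation value of 0 is taken to mean
adjacent kernel taps (matching the zero dilations in the example below), only
valid (in-bounds) output positions are produced, and input dilation is omitted
for brevity.

#include <stdint.h>

/* Valid 2-D convolution of an ih x iw input with a kh x kw kernel,
   kernel strides (sh, sw) and kernel dilations (dh, dw). */
void conv2d_ref(const float *in, uint32_t ih, uint32_t iw,
                const float *ker, uint32_t kh, uint32_t kw,
                float *out, uint32_t sh, uint32_t sw,
                uint32_t dh, uint32_t dw) {
  uint32_t oh = (ih - ((kh - 1) * (dh + 1) + 1)) / sh + 1;
  uint32_t ow = (iw - ((kw - 1) * (dw + 1) + 1)) / sw + 1;
  for (uint32_t y = 0; y < oh; ++y)
    for (uint32_t x = 0; x < ow; ++x) {
      float acc = 0.0f;
      for (uint32_t ky = 0; ky < kh; ++ky)
        for (uint32_t kx = 0; kx < kw; ++kx)
          acc += in[(y * sh + ky * (dh + 1)) * iw + (x * sw + kx * (dw + 1))] *
                 ker[ky * kw + kx];
      out[y * ow + x] = acc;
    }
}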


** EXAMPLE:
---------------------
%input = call token @llvm.tensor.typeinfo(<12 x float> %tensor1,
                                          <2 x i32> <i32 3, i32 4>,
                                          <2 x i32> <i32 0, i32 1>,
                                          <2 x i32> <i32 0, i32 0>)

%kernel = call token @llvm.tensor.typeinfo(<12 x float> %tensor2,
                                           <2 x i32> <i32 4, i32 3>,
                                           <2 x i32> <i32 0, i32 1>,
                                           <2 x i32> <i32 0, i32 0>)

; The third argument is a vector of kernel stride values along every dimension,
; and the fourth and fifth arguments are vectors of input and kernel dilation
; values along every dimension.
%output = call <9 x float> @llvm.tensor.convolution(token %input, token %kernel,
                                                    <2 x i32> <i32 1, i32 2>,
                                                    <2 x i32> <i32 0, i32 0>,
                                                    <2 x i32> <i32 0, i32 0>)



** INTRINSIC: llvm.tensor.transpose
====================================

** OVERVIEW:
---------------------
This intrinsic changes the layout of a given tensor by permuting the indices
of its dimensions. This intrinsic is marked with the 'readnone' and
'speculatable' attributes to prevent it from inhibiting optimizations like
redundancy elimination, dead code elimination, code motion, etc.

<vector_ty> llvm.tensor.transpose(token %input, <n x i32> %new_layout)


** OPERANDS:
----------------------

==============================================================================|
Operand      | Description                                                    |
=============|================================================================|
%input       | Token value representing the input tensor                      |
-------------|----------------------------------------------------------------|
%new_layout  | This is the new permutation of tensor layout                   |
==============================================================================|

** RESULT:
------------------


==============================================================================|
Result       | Description                                                    |
=============|================================================================|
vector value | Output tensor value represented as a “flattened” vector        |
==============================================================================|

** SEMANTICS:
-----------------------
The ‘llvm.tensor.transpose’ intrinsic operates on the given tensor and produces
an output tensor with the given layout. This operation changes the physical
layout of the input tensor, and its shape and padding change accordingly; the
number of dimensions does not change.
Note that this intrinsic does not read or write memory, nor does it exhibit
any kind of undefined behavior.
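
The following C sketch illustrates the element movement implied by a 3-D
transpose under one plausible interpretation of the example below: dimension
%new_layout[i] of the input becomes dimension i of the output, data are stored
row-major, and padding handling is omitted. This is an illustration under those
assumptions, not a definitive specification.

#include <stdint.h>

/* out has shape (s[p[0]], s[p[1]], s[p[2]]), where s is the input shape
   and p is the %new_layout permutation. */
void transpose3d_ref(const float *in, float *out,
                     const uint32_t s[3], const uint32_t p[3]) {
  uint32_t os[3] = { s[p[0]], s[p[1]], s[p[2]] };
  for (uint32_t i = 0; i < s[0]; ++i)
    for (uint32_t j = 0; j < s[1]; ++j)
      for (uint32_t k = 0; k < s[2]; ++k) {
        uint32_t idx[3] = { i, j, k };
        out[(idx[p[0]] * os[1] + idx[p[1]]) * os[2] + idx[p[2]]] =
            in[(i * s[1] + j) * s[2] + k];
      }
}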


** EXAMPLE:
---------------------

%input = call token @llvm.tensor.typeinfo(<240 x float> %tensor,
  <3 x i32> <i32 16, i32 5, i32 3>, <3 x i32> <i32 0, i32 1, i32 2>,
  <3 x i32> <i32 3, i32 2, i32 1>)

; The first argument is the input virtual tensor register and the second
; argument is the new permutation of the layout of the input tensor. This
; operation produces a tensor with layout <2, 0, 1>.

%output = call <240 x float> @llvm.tensor.transpose(token %input,
                                          <3 x i32> <i32 2, i32 0, i32 1>)

%typed_output = call token @llvm.tensor.typeinfo(<240 x float> %output,
  <3 x i32> <i32 3, i32 16, i32 5>, <3 x i32> <i32 2, i32 0, i32 1>,
  <3 x i32> <i32 1, i32 3, i32 2>)



*** DESIGN OF TENSOR EXTENSIONS IN LLVM
=========================================================================================
The tensor extensions we have added to LLVM are described in the document here
(https://docs.google.com/document/d/1A3xbrtouckRsPz94v2XttjoaTSqQlz1pSzVe80-Jmro/edit?usp=sharing).




===============================================================================|
LLVM Tensor Intrinsics     |  Frontend Equivalent  |  Target Equivalent        |
===========================|=======================|===========================|
llvm.tensor.matmul         | XLA dot op            |                           |
---------------------------|-----------------------|---------------------------|
llvm.tensor.contract       | XLA dot general op    |                           |
---------------------------|-----------------------|---------------------------|
llvm.tensor.umma           |                       | Intel AMX mma instruction,|
                           |                       | Power MMA instruction     |
---------------------------|-----------------------|---------------------------|
llvm.tensor.smma           |                       | Intel AMX mma instruction,|
                           |                       | Power MMA instruction     |
---------------------------|-----------------------|---------------------------|
llvm.tensor.usmma          |                       | Intel AMX mma instruction,|
                           |                       | Power MMA instruction     |
---------------------------|-----------------------|---------------------------|
llvm.tensor.summa          |                       | Intel AMX mma instruction,|
                           |                       | Power MMA instruction     |
---------------------------|-----------------------|---------------------------|
llvm.tensor.convolution    | XLA convolution op    | NVDLA convolution         |
                           |                       | instruction               |
---------------------------|-----------------------|---------------------------|
llvm.tensor.tanh           | XLA element-wise op   | NVDLA element-wise        |
                           |                       | instruction               |
---------------------------|-----------------------|---------------------------|
llvm.tensor.sigmoid        |                       | NVDLA element-wise        |
                           |                       | instruction               |
---------------------------|-----------------------|---------------------------|
llvm.tensor.relu           |                       | NVDLA element-wise        |
                           |                       | instruction               |
---------------------------|-----------------------|---------------------------|
llvm.tensor.broadcast      | XLA broadcast op      | Intel AMX fill instruction|
---------------------------|-----------------------|---------------------------|
llvm.tensor.load           |                       | Intel AMX load instruction|
---------------------------|-----------------------|---------------------------|
llvm.tensor.store          |                       | Intel AMX store           |
                           |                       | instruction               |
---------------------------|-----------------------|---------------------------|
llvm.tensor.reduce.max     | XLA reduce window op  | NVDLA pooling instruction |
---------------------------|-----------------------|---------------------------|
llvm.tensor.reduce.min     | XLA reduce window op  | NVDLA pooling instruction |
---------------------------|-----------------------|---------------------------|
llvm.tensor.reduce.add     | XLA reduce window op  |                           |
---------------------------|-----------------------|---------------------------|
llvm.tensor.reduce.mul     | XLA reduce window op  |                           |
---------------------------|-----------------------|---------------------------|
llvm.tensor.reduce.and     | XLA reduce window op  |                           |
---------------------------|-----------------------|---------------------------|
llvm.tensor.reduce.or      | XLA reduce window op  |                           |
---------------------------|-----------------------|---------------------------|
llvm.tensor.reduce.xor     | XLA reduce window op  |                           |
---------------------------|-----------------------|---------------------------|
llvm.tensor.reshape.block  | OneDNN Layouts        |                           |
---------------------------|-----------------------|---------------------------|
llvm.tensor.reshape.permute| Tensorflow reshape op |                           |
---------------------------|-----------------------|---------------------------|
llvm.tensor.transpose      | Tensorflow transpose  | NVDLA reshape instruction |
                           | op                    |                           |
---------------------------|-----------------------|---------------------------|
llvm.tensor.pad            | XLA pad op            |                           |
---------------------------|-----------------------|---------------------------|
llvm.tensor.concat         | XLA concat op         | NVDLA reshape instruction |
---------------------------|-----------------------|---------------------------|
llvm.tensor.tovector       |                       | Power unprime instruction |
---------------------------|-----------------------|---------------------------|
llvm.vector.totensor       |                       | Power prime instruction   |
===============================================================================|



*** LOWERING STRATEGY
================================================================================================

The lowering strategy is divided into three stages:
* Lower the N-dimensional tensor operations to 2-dimensional tensor operations.
* Lower 2-dimensional tensor operations to target-agnostic 1-D and 2-D tensor
  intrinsics.
* Legalize the target-agnostic 1-D and 2-D tensor intrinsics to target-specific
  intrinsics.


*** LOWERING EXAMPLE
------------------------------------------------------------

As an example, we want to lower the following LLVM IR code, which uses the
matmul intrinsic:

%tensor1_token = call token @llvm.tensor.typeinfo(<6400000 x i32>* byval(<6400000 x i32>) %tensor1,
                                                  <4 x i32> <i32 1, i32 10, i32 800, i32 800>,
                                                  <4 x i32> <i32 0, i32 1, i32 2, i32 3>,
                                                  <4 x i32> <i32 0, i32 0, i32 0, i32 0>)

%tensor2_token = call token @llvm.tensor.typeinfo(<6400000 x i32>* byval(<6400000 x i32>) %tensor2,
                                                  <4 x i32> <i32 1, i32 10, i32 800, i32 800>,
                                                  <4 x i32> <i32 0, i32 1, i32 2, i32 3>,
                                                  <4 x i32> <i32 0, i32 0, i32 0, i32 0>)

%tensor3 = call <6400000 x i32> @llvm.tensor.matmul(token %tensor1_token, token %tensor2_token)

%tensor3_token = call token @llvm.tensor.typeinfo(<6400000 x i32> %tensor3,
                                                  <4 x i32> <i32 1, i32 10, i32 800, i32 800>,
                                                  <4 x i32> <i32 0, i32 1, i32 2, i32 3>,
                                                  <4 x i32> <i32 0, i32 0, i32 0, i32 0>)


The code gets lowered to the following code in the first lowering stage:

%malloc_ptr1 = call i8* @malloc(…)
%malloc_ptr2 = call i8* @malloc(…)
%malloc_ptr3 = call i8* @malloc(…)

%tensor_ptr1 = bitcast i8* %malloc_ptr1 to i32*
%tensor_ptr2 = bitcast i8* %malloc_ptr2 to i32*
%tensor_ptr3 = bitcast i8* %malloc_ptr3 to i32*

for (unsigned I = 0; I < 1; I++)
    for (unsigned J = 0; J < 10; J++) {
          // Compute the indices into %tensor1 and %tensor2 and load the
          // feature maps
          ….
          %matrix_ptr1 = getelementptr i32* %tensor_ptr1, ….
          %matrix_ptr2 = getelementptr i32* %tensor_ptr2, ….
          %matrix_ptr3 = getelementptr i32* %tensor_ptr3, ….

          %matrix1_token = call token @llvm.tensor.load(i32* %matrix_ptr1,
                                                      <2 x i32> <i32 800, i32 800>,
                                                      ….)
          %matrix2_token = call token @llvm.tensor.load(i32* %matrix_ptr2,
                                                      <2 x i32> <i32 800, i32 800>,
                                                      ….)
          %matrix3 = call <640000 x i32> @llvm.tensor.matmul(token %matrix1_token,
                                                      token %matrix2_token)
          %matrix3_token = call token @llvm.tensor.typeinfo(<640000 x i32> %matrix3,
                                                      <2 x i32> <i32 800, i32 800>,
                                                      <2 x i32> <i32 0, i32 1>,
                                                      <2 x i32> <i32 0, i32 0>)

          call void @llvm.tensor.store(i32* %matrix_ptr3, token %matrix3_token, …)
          ….
    }




After the second lowering stage, the 2-dimensional matmul intrinsic gets
lowered to tiled loops around 2-dimensional target-agnostic intrinsics:

%malloc_ptr1 = call i8* @malloc(…)
%malloc_ptr2 = call i8* @malloc(…)
%malloc_ptr3 = call i8* @malloc(…)

%tensor_ptr1 = bitcast i8* %malloc_ptr1 to i32*
%tensor_ptr2 = bitcast i8* %malloc_ptr2 to i32*
%tensor_ptr3 = bitcast i8* %malloc_ptr3 to i32*

for (unsigned I = 0; I < 1; I++)
    for (unsigned J = 0; J < 10; J++)
        for (unsigned M = 0; M < 800; M+=16)
            for (unsigned N = 0; N < 800; N+=16) {
                for (unsigned K = 0; K < 800; K+=16) {
                    // Compute the indices into %tensor1 and %tensor2
                    // and load the tiles
                    ….
                    %tile1_token = call token @llvm.tensor.load(i32* %tile_ptr1,
                                                          <2 x i32> <i32 16, i32 16>,
                                                          ….)
                    %tile2_token = call token @llvm.tensor.load(i32* %tile_ptr2,
                                                          <2 x i32> <i32 16, i32 16>,
                                                          ….)
                    %acc_token = call token @llvm.tensor.load(i32* %acc_ptr,
                                                          <2 x i32> <i32 16, i32 16>,
                                                          ….)
                    %acc = call <256 x i32> @llvm.tensor.smma(token %acc_token,
                                                          token %tile1_token,
                                                          token %tile2_token)
                    %new_acc_token = call token @llvm.tensor.typeinfo(<256 x i32> %acc,
                                                          <2 x i32> <i32 16, i32 16>,
                                                          ...)
                    ….
                }
                call void @llvm.tensor.store(i32* %acc_ptr, token %new_acc_token, …)
                ….
            }



The last stage of lowering is the legalization stage. In this example, we
lower down to Intel AMX intrinsics:

%malloc_ptr1 = call i8* @malloc(…)
%malloc_ptr2 = call i8* @malloc(…)
%malloc_ptr3 = call i8* @malloc(…)

%tensor_ptr1 = bitcast i8* %malloc_ptr1 to i32*
%tensor_ptr2 = bitcast i8* %malloc_ptr2 to i32*
%tensor_ptr3 = bitcast i8* %malloc_ptr3 to i32*

for (unsigned I = 0; I < 1; I++)
    for (unsigned J = 0; J < 10; J++)
        for (unsigned M = 0; M < 800; M+=16)
            for (unsigned N = 0; N < 800; N+=16) {
                for (unsigned K = 0; K < 800; K+=16) {
                    // Compute the indices into %tensor1 and %tensor2 and
                    // load the tiles
                    ….
                    %cast_tile_ptr1 = bitcast i32* %tile_ptr1 to i8*
                    %cast_tile_ptr2 = bitcast i32* %tile_ptr2 to i8*
                    %cast_acc_ptr = bitcast i32* %acc_ptr to i8*
                    %tile1_amx = call x86_amx @llvm.x86.tileloadd64.internal(i16 16,
                                              i16 64, i8* %cast_tile_ptr1, i64 3200)
                    %tile2_amx = call x86_amx @llvm.x86.tileloadd64.internal(i16 16,
                                              i16 64, i8* %cast_tile_ptr2, i64 3200)
                    %acc_amx = call x86_amx @llvm.x86.tileloadd64.internal(i16 16,
                                              i16 64, i8* %cast_acc_ptr, i64 3200)
                    %mma_amx = call x86_amx @llvm.x86.tdpbssd.internal(i16 16,
                                              i16 64, i16 64, x86_amx %acc_amx,
                                              x86_amx %tile2_amx, x86_amx %tile1_amx)
                    .....
                }
                call void @llvm.x86.tilestored64.internal(i16 16, i16 64, i8* %cast_acc_ptr,
                                                          i64 3200, x86_amx %mma_amx)
                ….
            }



*** COMPATIBILITY WITH AND BENEFITS OVER MATRIX EXTENSIONS
==========================================================================================
The existing matrix extensions, which model matrices as LLVM vectors, can
co-exist with and be used alongside the tensor extensions that we propose. We
argue that our tensor extensions provide an extensible and flexible long-term
solution that LLVM developers can experiment with and adopt over time. We
believe that our tensor extensions provide the following benefits over the
existing matrix extensions:


* Our tensor extensions support an arbitrary number of dimensions for tensors. This 
  affords LLVM developers the flexibility to use higher-dimensional tensors rather 
  than being restricted to two-dimensional tensors only. This generality also makes 
  the tensor extensions easier to maintain in the future.
* Currently, information about matrix shapes and layouts is encoded within the 
  matrix intrinsics in LLVM; there is no separation between matrix properties and 
  matrix operations. This makes the existing matrix extensions rigid and difficult 
  to extend: if developers decide to encode more matrix properties in the IR, they 
  would have to modify all matrix intrinsics and a substantial amount of code that 
  uses them. Our tensor extensions separate the tensor concept from the tensor 
  operations, so the tensor properties represented in the IR can be extended without 
  modifying the operations that consume tensors (see the sketch after this list). 
  This flexibility also makes it easier to support new kinds of tensors (such as 
  sparse and ragged tensors) in the future.
* Matrix padding is modelled using vector shuffle instructions, which forces 
  optimizations, analyses and transformations to infer padding information by 
  carefully inspecting all the shuffle instructions and their masks. We encode 
  tensor padding information as a tensor property directly represented and readily 
  available in the IR, and we use an intrinsic to represent a padding operation.
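
To make the contrast concrete, the sketch below compares the two encodings for a 
small matrix multiply. The 4x4 shapes are made up purely for illustration; the first 
call uses the existing llvm.matrix.multiply intrinsic, and the tensor intrinsics 
follow the signatures proposed above.

  ; Existing matrix extensions: the 4x4 shapes are operands of the multiply itself,
  ; so encoding a new matrix property means changing this (and every other) intrinsic.
  %c = call <16 x float> @llvm.matrix.multiply.v16f32.v16f32.v16f32(
                             <16 x float> %a, <16 x float> %b, i32 4, i32 4, i32 4)

  ; Proposed tensor extensions: shape, layout and padding live in llvm.tensor.typeinfo,
  ; and the operation only consumes the resulting tokens.
  %a_token = call token @llvm.tensor.typeinfo(<16 x float> %a,
                             <2 x i32> <i32 4, i32 4>,    ; shape
                             <2 x i32> <i32 0, i32 1>,    ; layout
                             <2 x i32> <i32 0, i32 0>)    ; padding
  %b_token = call token @llvm.tensor.typeinfo(<16 x float> %b,
                             <2 x i32> <i32 4, i32 4>,
                             <2 x i32> <i32 0, i32 1>,
                             <2 x i32> <i32 0, i32 0>)
  %c2 = call <16 x float> @llvm.tensor.matmul(token %a_token, token %b_token)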


*** CURRENT STATUS OF THE IMPLEMENTATION
======================================================================================
* Lowering of most high-level tensor operations to LLVM scalar and vector 
  instructions is supported.
* The tensor code generation framework is capable of targeting Intel AMX. 
  Support for targeting NVDLA and NVIDIA tensor cores is in progress.
* Lowering support to target Intel VNNI and the Hexagon Vector Extension (HVX) is 
  underway.
* An example of lowering from Julia to the proposed tensor extensions is in the 
  design document 
  (https://docs.google.com/document/d/1A3xbrtouckRsPz94v2XttjoaTSqQlz1pSzVe80-Jmro/edit#heading=h.17j13gwxto8i).


*** CURRENT TESTING SUPPORT
========================================================================================

Currently, the tests are written in C/C++. The tensor operations are written using 
“dummy” functions such as tensor_typeinfo, tensor_matmul, and so on, as shown in the 
following example:

typedef int _tensor_t  __attribute__((__vector_size__(25600000)));
typedef int _shape_t   __attribute__((__vector_size__(16), __aligned__(4)));
typedef int _layout_t  __attribute__((__vector_size__(16), __aligned__(4)));
typedef int _padding_t __attribute__((__vector_size__(16), __aligned__(4)));
typedef int _token_t;

void example(_tensor_t tensor1, _tensor_t tensor2) {
  _shape_t shape = {1, 10, 800, 800};
  _layout_t layout = {0, 1, 2, 3};
  _padding_t padding = {0, 0, 0, 1};

  /* Define type information for the input tensors */
  _token_t tensor1_token = tensor_typeinfo(tensor1, shape, layout, padding);
  _token_t tensor2_token = tensor_typeinfo(tensor2, shape, layout, padding);

  /* Perform Matmul */
  _tensor_t tensor3 = tensor_matmul(tensor1_token, tensor2_token);

  /* Define type information for the output tensor */
  _token_t tensor3_token = tensor_typeinfo(tensor3, shape, layout, padding);
}



The above code gets translated into tensor intrinsics in LLVM IR:

define void @example(<6400000 x i32>* byval(<6400000 x i32>) %tensor1, 
                     <6400000 x i32>* byval(<6400000 x i32>) %tensor2) {
   %tensor1_token = call token @llvm.tensor.typeinfo(
                    <6400000 x i32>* byval(<6400000 x i32>) %tensor1, 
                    <4 x i32> <i32 1, i32 10, i32 800, i32 800>, 
                    <4 x i32> <i32 0, i32 1, i32 2, i32 3>, 
                    <4 x i32> <i32 0, i32 0, i32 0, i32 0>)
   %tensor2_token = call token @llvm.tensor.typeinfo(
                    <6400000 x i32>* byval(<6400000 x i32>) %tensor2, 
                    <4 x i32> <i32 1, i32 10, i32 800, i32 800>, 
                    <4 x i32> <i32 0, i32 1, i32 2, i32 3>, 
                    <4 x i32> <i32 0, i32 0, i32 0, i32 0>)
   %tensor3 = call <6400000 x i32> @llvm.tensor.matmul(token %tensor1_token, 
                                                       token %tensor2_token)
   %tensor3_token = call token @llvm.tensor.typeinfo(
                                      <6400000 x i32> %tensor3, 
                                      <4 x i32> <i32 1, i32 10, i32 800, i32 800>, 
                                      <4 x i32> <i32 0, i32 1, i32 2, i32 3>, 
                                      <4 x i32> <i32 0, i32 0, i32 0, i32 0>)
   ret void
}


Link to the Google doc for this proposal:

https://docs.google.com/document/d/1r3bHerTFqloldHH-OcMNF2kuCaYOB5JEeGtb7ftLmNg/edit?usp=sharing

Kothari, Akash via llvm-dev

unread,
Nov 15, 2021, 1:18:33 PM11/15/21
to llvm...@lists.llvm.org
For those who may have been having trouble viewing the RFC in plain text format, we have our proposal in a Google doc: https://docs.google.com/document/d/1IW6VIJ4lMYbGRTOle7S5QXP7Sb5UlucZ3gf-L-4Ccfs/edit?usp=sharing. It would be great if y’all could comment in the google doc or respond via email.

Thanks,
Akash Kothari

Florian Hahn via llvm-dev

unread,
Nov 23, 2021, 12:32:25 PM11/23/21
to Kothari, Akash, llvm...@lists.llvm.org
Hi,

Thanks for sharing the proposal! I think the matrix extension has shown that it is feasible to use a ‘flat vector’ encoding to support more complex operations. Decoupling the shape information from the ‘operational’ intrinsics seems very neat!


Below are some additional initial questions.

* The proposal itself is very big, both in terms of text as well as in the code that will be required to implement it. Have you thought about how to bring up support in a way that allows using a (smaller) subset of intrinsics end-to-end?

* What will the hardware specific lowering look like? I think you mentioned you are planning to support a set of different hardware architectures. Will this require a separate lowering pass for each of those?

* What’s the motivation for some intrinsics returning a vector and others returning a token type? Could all intrinsics return vector? This would be more consistent and the type info is associated to the value itself in any case.

* Will variable shapes/sizes be supported? IIRC you mentioned that the type intrinsic can take arbitrary values as arguments. But some intrinsics return vectors, so they would need a fixed size?

* You mentioned Julia and Halide as potential adopters. Do you know if there’s concrete interest to switch to using the new intrinsics by the maintainers ? What would the anticipated timeframe be? I think this could be a crucial argument for having this in LLVM directly, if we have people who are going to use this ASAP.

* What will Clang support for arbitrary tensors look like? If Clang won’t support arbitrary tensors, why not?

* AFAICT this should completely subsume the matrix extension and if we decide to add the more general extension the matrix extension should be removed. How will the transition from the current matrix intrinsics to the new tensor intrinsics work? Can existing IR be auto-upgraded?

Cheers,
Florian
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Kothari, Akash via llvm-dev

unread,
Nov 23, 2021, 1:57:38 PM11/23/21
to Florian Hahn, Mendis, Charith, Adve, Vikram Sadanand, llvm...@lists.llvm.org, Khaldi, Dounia, Girkar, Milind, Sengupta, Sudipta
(+ Vikram, Charith, Dounia, Rafae, Milind and Sudipta)

Hello Florian,

Thanks for your feedback and questions. Please see my inlined comments.

On Nov 23, 2021, at 11:32 AM, Florian Hahn <floria...@apple.com> wrote:

Hi,

Thanks for sharing the proposal! I think the matrix extension has shown that it is feasible to use a ‘flat vector’ encoding to support more complex operations. Decoupling the shape information from the ‘operational’ intrinsics seems very neat!


Below are some additional initial questions.

* The proposal itself is very big, both in terms of text as well as in the code that will be required to implement it. Have you thought about how to bring up support in a way that allows using a (smaller) subset of intrinsics end-to-end?
  
    We intend to submit code review requests for patches that add the proposed intrinsics such as typeinfo, the different MMA intrinsics, matmul, transpose, padding, etc. to LLVM, along with support for legalizing them to target-specific intrinsics for Intel AMX, NVIDIA Tensor Cores, and other targets — lowering for different targets could be separate code review requests. We can also implement and submit a code review request for a pass that converts the current matrix extensions to our intrinsics, so that people using the existing matrix extensions can try out our intrinsics and the lowering support for different targets. In subsequent code review requests, we will submit support for N-D to 2-D lowering as that implementation becomes more mature on our end.


* What will the hardware specific lowering look like? I think you mentioned you are planning to support a set of different hardware architectures. Will this require a separate lowering pass for each of those?

    Yes, the lowering for different hardware targets entails lowering from target-agnostic intrinsics to target-specific ones in LLVM IR. So each target would require a separate pass.


* What’s the motivation for some intrinsics returning a vector and others returning a token type? Could all intrinsics return vector? This would be more consistent and the type info is associated to the value itself in any case.

     Actually, all intrinsics return vectors except the typeinfo and tensor load intrinsics. The tensor load intrinsic returns a token type because it explicitly returns a tensor and the tensor info is known at the time of performing the load. LLVM already has an instruction for loading vectors.


* Will variable shapes/sizes be supported? IIRC you mentioned that the type intrinsic can take arbitrary values as arguments. But some intrinsics return vectors, so they would need a fixed size?

    Variable shapes/sizes can be represented with our intrinsics. In this situation, our intrinsics could return values of vector type with vscale. We do not use this vector information when lowering. We use the shape information available in typeinfo when lowering, so the use of vscale in the IR suffices to indicate that the size of a tensor is not known at compile time.
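
    For instance, a purely illustrative sketch (the scalable element count and the way the shape vector is built from SSA values below are assumptions, not taken from the proposal) of a 2-D tensor whose dimensions %n and %m are only known at run time:

        ; Build the shape operand from runtime SSA values rather than constants.
        %shape.0 = insertelement <2 x i32> undef, i32 %n, i32 0
        %shape   = insertelement <2 x i32> %shape.0, i32 %m, i32 1
        ; The tensor value itself can be a scalable vector; lowering uses the shape
        ; information from the typeinfo intrinsic, not the vector length.
        %t_token = call token @llvm.tensor.typeinfo(<vscale x 4 x i32> %tensor,
                                   <2 x i32> %shape,
                                   <2 x i32> <i32 0, i32 1>,
                                   <2 x i32> <i32 0, i32 0>)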


* You mentioned Julia and Halide as potential adopters. Do you know if there’s concrete interest to switch to using the new intrinsics by the maintainers ? What would the anticipated timeframe be? I think this could be a crucial argument for having this in LLVM directly, if we have people who are going to use this ASAP.


     We intend to get in touch with people working on Julia soon. We have not spoken with people working on Halide specifically but the lead of a compiler team at Qualcomm, Anshu Dasgupta, has been involved in our project. Currently, Halide generates target-specific intrinsics in LLVM IR to be able to target different architectures and there is some interest in using the target-agnostic intrinsics. However, we have not discussed any timeline with them yet.

* What will Clang support for arbitrary tensors look like? If Clang won’t support arbitrary tensors, why not?

    Clang would just need to generate our intrinsics with appropriate tensor shape, layout information, etc. that is available in the application written in a frontend language supported by LLVM. I would refer you to an example we have in our documentation about lowering from Julia to our tensor intrinsics here


* AFAICT this should completely subsume the matrix extension and if we decide to add the more general extension the matrix extension should be removed. How will the transition from the current matrix intrinsics to the new tensor intrinsics work? Can existing IR be auto-upgraded?


      We are thinking about implementing a pass that can convert LLVM IR with matrix extensions to IR with our proposed intrinsics and vice versa. If people would be interested in this, we could contribute these passes to LLVM.

Cheers,
Florian

Chris Lattner via llvm-dev

unread,
Nov 27, 2021, 8:57:35 PM11/27/21
to Kothari, Akash, llvm...@lists.llvm.org
Thank you for the interesting proposal Akash (et al).  I have a few other questions:

Florian pointed out that this is a very large proposal which is introducing a bunch of new concepts which makes it difficult to review.  My major concern with it is that it is proposing a single tensor model for LLVM, something that is inappropriate for a wide variety of frameworks, and doesn’t appear to be very general.  For example, it isn’t clear how to model the strided tensor model of pytorch, doesn’t appear to support dynamic shapes, sparse tensors, and it isn’t clear (in a quick reading) what the op-extensibility story is.  Further, there are a bunch of design decisions inherent to this approach (e.g. putting the layout information on the ops, instead of in the types) that make certain optimizations (e.g. layout transformations) more complicated.

This isn’t to say that this is the _wrong_ design, merely that it is only one of many plausible and important designs.  Standardizing "one thing" in LLVM can have a chilling effect on innovation (particularly for such a rapidly evolving field) which is one of the reasons that MLIR favors an “open extensibility” approach.

In terms of detailed design, it isn’t clear to me that representing heap allocated things like this as a token type will work out well.  There have been a variety of proposals over the years (incl adding F90 style arrays as a first class entity) that haven’t worked well because of a wide variety of design assumptions in LLVM.  The token type in particular is not composable with control flow, function calls and other things, and ML models frequently have loops and other control flow in them - how do you plan to represent that?

In your convolution operation, it doesn’t look like you’re handling the various edge conditions (replicating, mirroring, etc) common in ML frameworks. How do you plan to handle that?  Similarly, how do you handle quantization?

As per the motivation section, you point out "Crucially, however, MLIR does not have a low-level code generation framework that is retargetable to diverse hardware: it relies on LLVM for this purpose.”  I happen to agree with you, but the lack of this in MLIR isn’t evidence that LLVM IR is the natural place to put matrix lowering support.  Why do you think LLVM IR is a better place to put this than a high level IR?  Whether it is MLIR, XLA, or something else, it seems that there is a very clear separation of concerns here, and (as you point out) LLVM is being successfully used as the backend for a wide variety of tensor compilers already.

Finally, I’m also a bit concerned because the IR extensions are not really the meat of this proposal - this is effectively proposing something akin to the entire LLVM “CodeGen” framework but for tensors.  The IR abstractions and framework need to be co-designed together, and it isn’t clear how general or powerful the framework will turn out to be.  We’ve seen a *LOT* of ML compiler frameworks (incl notably Glow, XLA, TVM, etc) that are successful handling important subsets of the ML inference space, but very few have scaled up to solving the full generality of the problem.

-Chris


Kothari, Akash via llvm-dev

unread,
Dec 3, 2021, 6:59:26 PM12/3/21
to clat...@nondot.org, llvm...@lists.llvm.org, Mendis, Charith, Adve, Vikram Sadanand, Sharif, Hashim, llvm-tens...@lists.cs.illinois.edu, Khaldi, Dounia, Luo, Yuanke, Girkar, Milind, Noor, Abdul Rafae, Sengupta, Sudipta
Hi Chris,

Thank you for your questions and comments. I have reordered your questions/comments to respond to them in a logical progression. In particular, you have some questions about the technical design of TLX (extensibility, memory allocation, etc.) and some about *whether a tensor / matrix code-gen framework belongs in LLVM in the first place*. I thought I should address the latter first. 

-Akash

On Nov 27, 2021, at 7:57 PM, Chris Lattner <clat...@nondot.org> wrote:

Thank you for the interesting proposal Akash (et al).  I have a few other questions:


As per the motivation section, you point out "Crucially, however, MLIR does not have a low-level code generation framework that is retargetable to diverse hardware: it relies on LLVM for this purpose.”  I happen to agree with you, but the lack of this in MLIR isn’t evidence that LLVM IR is the natural place to put matrix lowering support.  Why do you think LLVM IR is a better place to put this than a high level IR?  Whether it is MLIR, XLA, or something else, it seems that there is a very clear separation of concerns here, and (as you point out) LLVM is being successfully used as the backend for a wide variety of tensor compilers already.

I think LLVM is the natural place to put support for tensor lowering for the following reasons:

  • Compilers such as TVM, Halide, XLA, Glow, etc. use LLVM for backend code generation for different hardware architectures, so it makes sense to add tensor lowering support in an abstraction layer shared across multiple compilers. Today, compilers such as TVM and Halide, for instance, have separate backends to generate target-specific intrinsics for different targets, which is a serious weakness. These compilers could instead target a common set of target-agnostic intrinsics to reach multiple tensor architectures and benefit from community-wide shared improvements and efforts. 
  • Languages such as C/C++, Rust, DPC++, Julia, etc. do not have frontends for compilers like MLIR, XLA, TVM, etc. yet. Developing frontends for these languages for production use requires non-trivial engineering effort. Extending LLVM with our extensions and getting the existing language frontends to target them would require relatively less engineering effort and time. 
  • TLX could be added to the LLVM dialect in MLIR. Also, lessons learned from the experience of supporting retargetable code generation in LLVM for modern tensor architectures could be very valuable and could help inspire ideas for new dialects and abstractions to make MLIR retargetable, too.


Finally, I’m also a bit concerned because the IR extensions are not really the meat of this proposal - this is effectively proposing something akin to the entire LLVM “CodeGen” framework but for tensors.  The IR abstractions and framework need to be co-designed together, and it isn’t clear how general or powerful the framework will turn out to be.  We’ve seen a *LOT* of ML compiler frameworks (incl notably Glow, XLA, TVM, etc) that are successful handling important subsets of the ML inference space, but very few have scaled up to solving the full generality of the problem.

I agree with you that the IR extensions and the code generation framework must be co-designed together. We have done exactly that on our end in collaboration with folks from Intel, Qualcomm, IBM and AWS. There are three main parts to this end-to-end tensor support in LLVM: (1) the tensor IR (TLX), (2) code generation that includes lowering from N-d to 2-d and the legalization support from target-agnostic to target-specific intrinsics, and (3) extensions to Target Transform Info (TTI) for efficient matrix (2-d) code generation. In the future, we will post two more RFCs about TTI enhancements and lowering strategies. Our core proposal is here: https://docs.google.com/document/d/1IW6VIJ4lMYbGRTOle7S5QXP7Sb5UlucZ3gf-L-4Ccfs/edit?usp=sharing. People who may be interested in learning more details about the IR extensions and the other parts of this support can take a look at the specification document here: https://docs.google.com/document/d/1A3xbrtouckRsPz94v2XttjoaTSqQlz1pSzVe80-Jmro/edit?usp=sharing.


Florian pointed out that this is a very large proposal which is introducing a bunch of new concepts which makes it difficult to review.  My major concern with it is that it is proposing a single tensor model for LLVM, something that is inappropriate for a wide variety of frameworks, and doesn’t appear to be very general.  For example, it isn’t clear how to model the strided tensor model of pytorch, doesn’t appear to support dynamic shapes, sparse tensors, and it isn’t clear (in a quick reading) what the op-extensibility story is.  Further, there are a bunch of design decisions inherent to this approach (e.g. putting the layout information on the ops, instead of in the types) that make certain optimizations (e.g. layout transformations) more complicated.


I think there is some misunderstanding about some of the extensions we are proposing. We do not propose that the tensor operations carry layout information on them; in fact, we propose that an intrinsic called llvm.tensor.typeinfo hold the layout information (and other tensor type information) and help decouple tensor type information from the tensor operations. The tensor load intrinsic also has layout information embedded. The current matrix extensions have type information for matrices embedded in the intrinsics for matrix operations such as transpose, matrix multiply, etc. We also support strided tensor loads and stores — these are akin to strided tensors in PyTorch.

Decoupling tensor type information and intrinsics for tensor operations, allows us to extend the tensor type information, if needed, without having to make any changes to other intrinsics for tensor operations. We do not propose extensions for sparse tensors, but one can support sparse tensors by introducing a new variant of typeinfo intrinsic to describe sparse tensors and continue using the intrinsics for tensor operations we propose. Supporting sparse tensors would also require adding new intrinsics for operations such as coalescing, for example. Same applies for ragged tensors. 
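
As a purely hypothetical sketch (none of these sparse intrinsics exist or are part of this RFC), a CSR-style variant of the typeinfo intrinsic might describe the sparse operand while the existing operation intrinsics consume the token unchanged:

    ; Hypothetical sparse typeinfo: non-zero values plus CSR row pointers and
    ; column indices, together with the logical 128x128 shape.
    %sparse_token = call token @llvm.tensor.sparse.typeinfo(
                            <1024 x float> %values,
                            <129 x i32> %row_ptrs,
                            <1024 x i32> %col_indices,
                            <2 x i32> <i32 128, i32 128>)
    ; The dense result is still produced by the ordinary matmul intrinsic.
    %out = call <16384 x float> @llvm.tensor.matmul(token %sparse_token,
                                                    token %dense_token)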

By representing shape information in llvm.tensor.typeinfo as a vector of dimension sizes, dynamic shapes can be represented in this intrinsic using a vector of SSA values of dimension sizes (and not just constant values). 


This isn’t to say that this is the _wrong_ design, merely that it is only one of many plausible and important designs.  Standardizing "one thing" in LLVM can have a chilling effect on innovation (particularly for such a rapidly evolving field) which is one of the reasons that MLIR favors an “open extensibility” approach.


While it’s true that this is only one of many plausible designs, that will be true of every LLVM extension that is ever proposed. We do not think that adding TLX is going to stifle innovation. It is absolutely true that LLVM itself has only limited support for extensibility, whereas MLIR is inherently more extensible, but that is not a reason to avoid adding new functionality to LLVM within the limitations of what is possible. The tensor extensions we propose are extensible for the aforementioned reasons. We expect these extensions to be experimented with and refined further by the community. We are open to any ideas that you and other folks in the LLVM community may have to make these extensions more general and more extensible. Note that we have added a section on the methodology to extend TLX in our proposal document: https://docs.google.com/document/d/1IW6VIJ4lMYbGRTOle7S5QXP7Sb5UlucZ3gf-L-4Ccfs/edit#heading=h.ltfq7r4wczwl.



In terms of detailed design, it isn’t clear to me that representing heap allocated things like this as a token type will work out well.  There have been a variety of proposals over the years (incl adding F90 style arrays as a first class entity) that haven’t worked well because of a wide variety of design assumptions in LLVM.  The token type in particular is not composable with control flow, function calls and other things, and ML models frequently have loops and other control flow in them - how do you plan to represent that?

We are *not* proposing to represent heap-allocated objects using token type. Values of token type merely represent SSA tensor values. Tensor loads and stores contain the information about the tensors they read from or write to memory. We have implemented these intrinsics in LLVM already and have not encountered any problems so far. 

In order to handle cases where tensor information from block1 and block2 has to be used in block3:

block1:
   ….
    %tensor1_info = call token @llvm.tensor.typeinfo(<256 x i8> %tensor1, <3 x i32> %shape1, <3 x i32> %layout1, <3 x i32> %padding1)
    br %block3

block2:
    .....
    %tensor2_info = call token @llvm.tensor.typeinfo(<256 x i8> %tensor2, <3 x i32> %shape2, <3 x i32> %layout2, <3 x i32> %padding2)
    br %block3

block3:
  ….
  %tensor3 = phi <256 x i8> [%tensor1, %block1], [%tensor2, %block2]
  %shape3 = phi <3 x i32> [%shape1, %block1], [%shape2, %block2]
  %layout3 = phi <3 x i32> [%layout1, %block1], [%layout2, %block2]
  %padding3 = phi <3 x i32> [%padding1, %block1], [%padding2, %block2]
  %tensor3_info = call token @llvm.tensor.typeinfo(<256 x i8> %tensor3, <3 x i32> %shape3, <3 x i32> %layout3, <3 x i32> %padding3)
  …..


We do not discuss how tensor information could be passed across function call boundaries, but we could use 3 new parameter attributes: tensorshape, tensorlayout, tensorpad. These attributes indicate what property of a tensor parameter they represent. We could also introduce an attribute named tensorargid to give each set of parameters representing a tensor and its shape, layout and padding a unique ID.

define void @callee(<256 x i8> tensorargid 0 %tensor1, 
                    <2 x i32> tensorshape tensorargid 0 %shape1, 
                    <2 x i32> tensorlayout tensorargid 0 %layout1, 
                    <2 x i32> tensorpad tensorargid 0 %pad1, 
                    <256 x i8> tensorargid 1 %tensor2, 
                    <2 x i32> tensorshape tensorargid 1 %shape2, 
                    <2 x i32> tensorlayout tensorargid 1 %layout2, 
                    <2 x i32> tensorpad tensorargid 1 %pad2) {

  ; Define typed input tensors
  %typed_tensor1 = call token @llvm.tensor.typeinfo(<256 x i8> %tensor1, <2 x i32> %shape1, 
                                                    <2 x i32> %layout1, <2 x i32> %pad1)
  %typed_tensor2 = call token @llvm.tensor.typeinfo(<256 x i8> %tensor2, <2 x i32> %shape2, 
                                                    <2 x i32> %layout2, <2 x i32> %pad2)
   ….
}



In your convolution operation, it doesn’t look like you’re handling the various edge conditions (replicating, mirroring, etc) common in ML frameworks. How do you plan to handle that?  Similarly, how do you handle quantization?

We could extend our convolution intrinsic to include one more operand with number of feature groups along every outer dimension to support depthwise convolutions. I am not sure what you mean by other edge conditions such as mirroring, replicating, etc. 

We do not propose any intrinsics for quantization in this proposal, but quantization could be added as a set of additional intrinsics that use the same tensor typeinfo intrinsic. We are looking for feedback/ideas from the LLVM community on which specific intrinsics should be absolutely added along with this RFC. As Aditya Atluri pointed out in the comment section of the specification document, we may have to think more about how to allow vendors to support custom types and how to allow vendors to specify the legal conversions between other custom and existing LLVM types. So this is an open question that merits more discussion. However, this concern is relatively minor and should not impact most of the rest of the design details in the RFC.  