Hi,
Intel Advanced Matrix Extensions (Intel AMX) is a new programming paradigm consisting of two components: a set of 2-dimensional registers (tiles) representing sub-arrays from a larger 2-dimensional memory image, and accelerators able to operate on tiles. Capability of Intel AMX implementation is enumerated by palettes. Two palettes are supported: palette 0 represents the initialized state and palette 1 consists of 8 tile registers of up to 1 KB size, which is controlled by a tile control register.
The instruction manual is posted at https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html.
The AMX ABI proposal is posted at https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4.
This email is to discuss the programming model for AMX. Florian has introduced the matrix type and intrinsics in the LLVM community. We’d like to adopt some ideas from it.
Here is what we propose for the AMX programming model.
1. Data type.
We’d like to have a fixed vector type for AMX. Since the shape of an AMX register is configurable, the vector size is the maximum size of an AMX register, that is, 1024 bytes.
The C code may look like this.
typedef int _tile_data __attribute__((__vector_size__(1024), __aligned__(64)));
_tile_data tile;
And the LLVM IR may look like this.
@tile = dso_local local_unnamed_addr global <256 x i32> zeroinitializer, align 64
For LLVM IR, it would be nice to have a new type, x86_amxtile, that can be mapped to AMX registers.
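As a rough illustration of where the 1024-byte figure comes from, here is a minimal sketch. It assumes the palette 1 limits are 16 rows of 64 bytes each (my reading of the instruction manual, not stated in this email); the helper name tile_bytes and the constants are hypothetical and not part of the proposal.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed palette 1 limits: 16 rows of 64 bytes, i.e. 1 KB per tile. */
enum { TILE_MAX_ROWS = 16, TILE_MAX_COLSB = 64 };

/* The fixed vector type must be able to hold the largest tile. */
static_assert(TILE_MAX_ROWS * TILE_MAX_COLSB == 1024,
              "largest tile is expected to be 1024 bytes");

/* Bytes occupied by a tile with `rows` rows and `colsb` bytes per row,
   or 0 if the shape is empty or exceeds the assumed limits. */
static size_t tile_bytes(unsigned rows, unsigned colsb) {
    if (rows == 0 || colsb == 0)
        return 0;
    if (rows > TILE_MAX_ROWS || colsb > TILE_MAX_COLSB)
        return 0;
    return (size_t)rows * colsb;
}
```

A smaller shape such as 4 rows of 16 bytes occupies only part of the 1024-byte vector, which is why the shape has to travel separately from the data.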
2. AMX Intrinsics.
The internal intrinsics map 1:1 to AMX instructions. The parameters m, n and k identify the shape of the tile. The shape can be variable, but it cannot exceed the size that the AMX hardware supports. The compiler can deduce the shape of a tile from the AMX intrinsics.
_tile_data _tile_loadd_internal(char m, short n, const void *base, int stride);
_tile_data _tile_dpbssd_internal(char m, short n, short k, _tile_data dst, _tile_data src1, _tile_data src2);
_tile_data _tile_dpbf16ps_internal(char m, short n, short k, _tile_data dst, _tile_data src1, _tile_data src2);
void _tile_stored_internal(char m, short n, void *base, int stride, _tile_data tile);
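To make the role of m, n and k concrete, here is a scalar reference sketch of a signed int8 dot-product accumulate, which is how I understand the dpbssd operation. The function name ref_dpbssd, the dense row-major layout, and the choice to count n and k in 32-bit elements rather than bytes are all assumptions of this sketch, not the semantics of the proposed intrinsics.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference model (illustrative only).
   dst  : m rows of n int32 accumulators
   src1 : m rows of 4*k int8 values
   src2 : k rows of 4*n int8 values
   Each accumulator gains the sum of k groups of four int8 products. */
static void ref_dpbssd(int m, int n, int k,
                       int32_t *dst, const int8_t *src1, const int8_t *src2) {
    assert(m > 0 && n > 0 && k > 0);
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            int32_t acc = dst[i * n + j];
            for (int p = 0; p < k; ++p)
                for (int b = 0; b < 4; ++b)
                    acc += (int32_t)src1[i * 4 * k + 4 * p + b] *
                           (int32_t)src2[p * 4 * n + 4 * j + b];
            dst[i * n + j] = acc;
        }
    }
}
```

The three shape parameters are shared between operands: src1 and dst agree on m, src2 and dst agree on n, and src1 and src2 agree on k, which is what lets the compiler deduce tile shapes from the intrinsic call.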
3. User interfaces.
The tile shape and tile data are combined into a struct in C. The shape of a tile may only be initialized once. The user interface looks like this.
#define __DEFAULT_FN_AMX \
  __attribute__((__always_inline__, __nodebug__, __target__("amx-int8")))

/* Tile shape plus tile data. The shape fields are const, so they can
   only be set when the struct is initialized. */
typedef struct __tile_str {
  const char row;
  const short col;
  _tile_data tile;
} __tile;

__DEFAULT_FN_AMX
void __tile_loadd(__tile *dst, const void *base, long stride) {
  dst->tile = _tile_loadd_internal(dst->row, dst->col, base, stride);
}

/* Note: this wrapper is named dpbsud but calls the dpbssd internal intrinsic. */
__DEFAULT_FN_AMX
void __tile_dpbsud(__tile *dst, __tile src1, __tile src2) {
  dst->tile = _tile_dpbssd_internal(src1.row, src2.col, src1.col, dst->tile, src1.tile, src2.tile);
}

__DEFAULT_FN_AMX
void __tile_stored(void *base, long stride, __tile src) {
  _tile_stored_internal(src.row, src.col, base, stride, src.tile);
}
4. Example code
The example shows how to use the user interface in a function.
/* buf, buf2 and STRIDE are defined elsewhere in the source file. */
void api(int cond, short row, short col) {
  __tile a = {row, col};
  __tile b = {row, col};
  __tile c = {row, col};

  if (cond) {
    __tile_loadd(&a, buf, STRIDE);
    __tile_loadd(&b, buf, STRIDE);
    __tile_loadd(&c, buf, STRIDE);
  } else {
    __tile_loadd(&a, buf2, STRIDE);
    __tile_loadd(&b, buf2, STRIDE);
    __tile_loadd(&c, buf2, STRIDE);
  }
  __tile_dpbsud(&c, a, b);
  __tile_stored(buf, STRIDE, c);
}
5. LLVM IR
The LLVM IR intrinsics take the row and column information as input parameters, so that the compiler can deduce the shape of the tile data. The remaining parameters are what the AMX instructions require. This is the LLVM IR corresponding to the example code.
define dso_local void @api(i32 %cond, i16 signext %row, i16 signext %col) local_unnamed_addr #2 {
entry:
  %tobool = icmp eq i32 %cond, 0
  %sext = shl i16 %col, 8
  %conv.i31 = ashr exact i16 %sext, 8
  br i1 %tobool, label %if.else, label %if.then

if.then: ; preds = %entry
  %0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
  %1 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
  %2 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
  br label %if.end

if.else: ; preds = %entry
  %3 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
  %4 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
  %5 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
  br label %if.end

if.end: ; preds = %if.else, %if.then
  %a.sroa.1186.0 = phi <256 x i32> [ %3, %if.else ], [ %0, %if.then ]
  %b.sroa.1068.0 = phi <256 x i32> [ %4, %if.else ], [ %1, %if.then ]
  %c.sroa.1149.0 = phi <256 x i32> [ %5, %if.else ], [ %2, %if.then ]
  %6 = tail call <256 x i32> @llvm.x86.tdpbssd(i16 %row, i16 %conv.i31, i16 %conv.i31, <256 x i32> %c.sroa.1149.0, <256 x i32> %a.sroa.1186.0, <256 x i32> %b.sroa.1068.0) #3
  tail call void @llvm.x86.tilestored64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32, <256 x i32> %6) #3
  ret void
}
6. Shape propagation
In an -O0 build, the front end generates some general loads and stores of the tile vector. We need to start from the AMX intrinsics and propagate the shape information to the virtual tile registers. If an AMX intrinsic uses the result of a load instruction, the shape is propagated to the load and the load is transformed into a tile load intrinsic. If a store instruction uses the result of an AMX intrinsic, the shape is propagated to the store and the store is transformed into a tile store intrinsic.
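The two propagation rules can be sketched on a toy instruction list. Everything here is hypothetical: the Op enum, the Inst struct and propagate_shapes are invented for illustration, each instruction has at most one operand, and a single shape per intrinsic is assumed, which is much simpler than real LLVM IR.

```c
#include <assert.h>

enum Op { OP_LOAD, OP_STORE, OP_TILE_OP, OP_TILE_LOAD, OP_TILE_STORE };

struct Inst {
    enum Op op;
    int src;       /* index of the instruction producing the operand, or -1 */
    int row, col;  /* tile shape; -1 while unknown */
};

/* AMX intrinsics (OP_TILE_OP) are the roots that carry a known shape. */
static void propagate_shapes(struct Inst *insts, int count) {
    for (int i = 0; i < count; ++i) {
        struct Inst *in = &insts[i];
        if (in->src < 0)
            continue;
        assert(in->src < count);
        struct Inst *def = &insts[in->src];
        if (in->op == OP_TILE_OP && def->op == OP_LOAD) {
            /* Rule 1: an intrinsic uses a plain load, so the load
               becomes a tile load with the intrinsic's shape. */
            def->op = OP_TILE_LOAD;
            def->row = in->row;
            def->col = in->col;
        } else if (in->op == OP_STORE && def->op == OP_TILE_OP) {
            /* Rule 2: a plain store uses an intrinsic result, so the
               store becomes a tile store with that shape. */
            in->op = OP_TILE_STORE;
            in->row = def->row;
            in->col = def->col;
        }
    }
}
```

For a three-instruction sequence of load, intrinsic, store, the load and store end up as tile load and tile store with the intrinsic's row and column.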
7. Machine IR
Since the AMX intrinsics take the row and column as input parameters, we can create pseudo instructions corresponding to them. The AMX intrinsics are lowered to pseudo AMX instructions, which have extra row and column operands corresponding to those of the intrinsics. The real AMX instructions don’t need the row and column operands. The row and column information must be configured by ldtilecfg before any AMX instruction executes.
8. Register allocation
AMX registers are special. They need to be configured before use, and the config instruction is expensive. To avoid unnecessary tile configuration, we collect as much tile shape information as possible and combine it into one ldtilecfg instruction. The ldtilecfg instruction should dominate every AMX instruction that accesses a tile register. On the other hand, the ldtilecfg should post-dominate the instructions that define the tile shapes. When a tile register is spilled, we should avoid a re-configuration caused by a different tile shape: the spilled value should be reloaded into a register that has the same tile shape. Since tile register allocation is special, and it may allocate general virtual registers to configure the tile registers, we can add a separate pass that runs before the general register allocation pass. After register allocation, the tile shape information is no longer needed, so we can transform the pseudo AMX instructions into real AMX instructions by removing the row and column operands.
9. Use recommendation
Because of the shape configuration issue, we recommend that users define the tile shape at function entry and inline functions as much as possible. The AMX instructions focus on computation rather than storage, so global variables for tile data are not recommended.
Thanks
Yuanke
This interface looks convenient, but what happens if one of these
types appears on a function-call boundary? Does this force
everything to be spilled and restored from the stack? Maybe this
type needs some additional attribute to give it a custom
register-passing convention?
Can you take advantage of our IPRA capability so that internal function calls might avoid this reconfiguration if the necessary configuration is always done in the caller?
How will the implementation of __builtin_setjmp/longjmp be affected?
Thanks again,
Hal
_______________________________________________ LLVM Developers mailing list llvm...@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-- Hal Finkel Lead, Compiler Technology and Programming Languages Leadership Computing Facility Argonne National Laboratory
This seems complicated.
Reading through the documentation, there appears to be a single global tile config for all tile registers at any time.
Why not simply model this tile config as a designated special register and the tile instructions as having an implicit use of this register? That would seem to ensure that the register allocator has all the constraints needed. You'd need to teach it how to spill the special registers with the appropriate instructions, but that seems a lot more straightforward?
[Yuanke] We prefer that tile data be passed through memory across function calls, because passing it in registers is not as efficient as passing it through memory. The compiler allocates and configures the tile registers in the callee; because the tile registers are re-configured in the callee, all tile data registers are cleared to zero. So yes, this forces everything to be spilled to and restored from the stack.
[Yuanke] I am not familiar with the IPRA capability and I am very interested in it. Could you post some links that introduce IPRA?
How will the implementation of __builtin_setjmp/longjmp be affected?
[Yuanke] That depends on the ABI. We propose that all tile registers be caller-saved, so I think setjmp/longjmp is not affected.
[Philip]
This seems complicated.
Reading through the documentation, there appears to be a single global tile config for all tile registers at any time.
Why not simply model this tile config as a designated special register and the tile instructions as having an implicit use of this register? That would seem to ensure that the register allocator has all the constraints needed. You'd need to teach it how to spill the special registers with the appropriate instructions, but that seems a lot more straightforward?
[Yuanke] In that case users would need to configure the tile registers themselves. Spilling the configuration register is very expensive, because it clears all the tile data registers to zero. In our proposal, the compiler is responsible for deducing the shape of each virtual tile data register, allocating physical registers for them, and then configuring those physical registers. We may build the dependency as you proposed, and it can be used as a machine IR check to ensure that each tile data register is configured before use.
From: Philip Reames <list...@philipreames.com>
Sent: Saturday, August 15, 2020 1:17 AM
To: Luo, Yuanke <yuank...@intel.com>; llvm...@lists.llvm.org; floria...@apple.com; Kaylor, Andrew <andrew...@intel.com>; Topper, Craig <craig....@intel.com>; Lu, Hongjiu <hongj...@intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.
On 8/14/20 6:27 AM, Luo, Yuanke via llvm-dev wrote:
From: Hal Finkel <hfi...@anl.gov>
Sent: Friday, August 14, 2020 11:27 PM
To: Luo, Yuanke <yuank...@intel.com>; llvm...@lists.llvm.org; floria...@apple.com; Kaylor, Andrew <andrew...@intel.com>; Topper, Craig <craig....@intel.com>; Lu, Hongjiu <hongj...@intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.
On 8/14/20 8:27 AM, Luo, Yuanke via llvm-dev wrote:
Can you take advantage of our IPRA capability so that internal function calls might avoid this reconfiguration if the necessary configuration is always done in the caller?
[Yuanke] I am not familiar with the IPRA capability and I am very interested in it. Could you post some links that introduce IPRA?
Interestingly, it looks like some documentation was written but
never committed: https://reviews.llvm.org/D23980
- in general, if you search for IPRA in LLVM, you'll see the
relevant pieces. The really short description is that functions
are emitted in topological order, leaves of the call graph first,
so that customized clobber register masks can be attached to call
sites of relevant internal functions.
-Hal
I find your answer unconvincing. I'm not going to debate it as I don't wish to take the time to build the appropriate context, but my initial response is skepticism.
Philip
From: Hal Finkel <hfi...@anl.gov>
Sent: Saturday, August 15, 2020 8:46 AM
To: Luo, Yuanke <yuank...@intel.com>; llvm...@lists.llvm.org; floria...@apple.com; Kaylor, Andrew <andrew...@intel.com>; Topper, Craig <craig....@intel.com>; Lu, Hongjiu <hongj...@intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.
On 8/14/20 6:39 PM, Luo, Yuanke wrote:
[Yuanke] Thank you. I think IPRA should help to reduce tile register re-configuration. I need more time to understand the details. I also noticed an explicit cc discussion at http://lists.llvm.org/pipermail/llvm-dev/2019-January/129195.html, but it seems it has not landed in LLVM.
Hi Philip,
Your idea makes sense to me at first thought. Thank you for it. I will take more time to think it over and see whether it can help reduce the complexity of tile register allocation.
Yuanke
Has some thought gone into how to make the config instruction less expensive?
I have, for a long time, thought that we need cleverer RAM.
E.g. A single read request that would, for example, return 64 bytes,
with each byte having been spaced out. I.e. Byte 1, skip 99 bytes,
Byte 2, skip 99 bytes Byte 3.
Or, instead of "read the next instruction", "read the next basic block
in one operation". (group of instructions).
This would massively reduce the amount of transactions between the CPU
and the RAM chips.
It would be the RAM chip itself that would do the operation, and not the CPU.
It could also be expanded to have the RAM chip do some simple
computations. E.g. Atomic loads/saves/counters/xor/not/xchg, if they
were cheap to do.
Essentially making the RAM chip able to work better, more efficiently,
with larger chunks of data per transaction.
Kind Regards
James
The AMX registers are complicated. The single configuration register (which is mostly used implicitly, similar to MXCSR for floating point) controls the shape of all the tile registers, and if you change the tile configuration every single tile register is cleared. In practice, if we have to change the configuration while any of the tile registers are live, performance is going to be terrible. We need to handle this case for correctness, but users of this programming interface will need to have enough awareness of the performance issues and the hardware details to prevent this. We’ll also want a diagnostic that lets the user know when this has happened.
When the tile configuration is set, the shape of each tile is locked in, so the individual tile registers aren’t interchangeable at that point. If a function needs 2x4 tiles, 4x2 tiles, and 4x4 tiles, the configuration needs to be set with this in mind. The shape isn’t explicit in every instruction and intrinsic. It must be deduced. And again, we’ll need a way to tell the user when efficient allocation can’t be done. In practice, I don’t expect any function to be using more than three tile shapes.
The implication of all this is that I don’t think the greedy register allocator is well suited to figure all of this out. We need a special pass to pre-allocate these registers. If the function is written in a way that makes good performance possible, it should be a relatively simple task to allocate everything with minimal spilling. If it isn’t possible to get good performance, we don’t need to do anything especially clever. We can just do something straightforward that is correct and let the user know that they aren’t going to be happy with the results.
-Andy
Hi, Andy,
I don't quite understand everything that's going on here. Could we model this as:
1. Define a collection of register classes, one for 2x4 tiles, one for 4x2 tiles, etc. each populated with a set of tile registers. Registers can have aliasing relationships (instead of worrying about any kind of subregister/superregister relationships -- these won't be useful anyway).
2. Define the tile-configuration instructions so that they implicitly define all of the registers in all of the classes.
Then you would still need to pre-schedule the tile operations as
you've described, and collect the configuration information in
order to add the ldtilecfgs, but the regular register allocator
can handle the allocation itself in the usual way. What do you
think?
-Hal
Hi Hal,
There are 3 aspects to be solved.
1. The HW supports a maximum shape of 16x16, so there are many register classes, from 1x1 to 16x16. We would need 256 register classes.
2. We want to support variable shapes, so the compiler doesn’t know which register class fits a tile shape, as the shape is only known at runtime.
3. The tile configuration configures physical tile registers, so we need to allocate registers first; only then do we know the shape of each physical tile register and can configure it.
I think your suggestion helps reduce the complexity if we only support fixed (constant) tile shapes.
-Yuanke
Thanks, Yuanke.
It's not clear to me that having 256 register classes is, in itself, a problem. Is it?
What does it mean to support variable-shape tiles in this context? Do you do something other than conservatively assume that they are 16x16 for register-allocation purposes?
-Hal
There is no problem with having 256 register classes. It just seems like a lot of register classes to me.
We don’t assume the shape of each physical register to be 16x16; it is defined by the user. By variable shape, I mean that the shape is known at runtime and unknown at compile time. Take the code below as an example: %row and %col are variables instead of constants. The compiler recognizes llvm.x86.tileloadd64 and deduces that the shape of %0 is %row x %col.
%0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %col, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32)
When the tile shape is unknown at compile time, how do you plan to do the register allocation of the tiles? My question is: do you do the allocation for this case in the same way as you would if you knew the size was 16x16 (i.e., conservatively assume the largest size)?
Thanks again,
Hal
We can do some basic analysis to classify the virtual tile registers into several groups. Each group shares the same shape. So, the tile registers with the constant shape 16x16 are in one group. Within a group the registers can be allocated by the general RA scheme. To answer your question: we do the allocation for this case in the same way as we would if we knew the size was 16x16.
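A minimal sketch of such a grouping, assuming each virtual tile's row and column are named by an integer id that stands for either a constant or an SSA value. The VTile struct and group_by_shape are hypothetical names, and comparing ids for equality is only the simplest possible analysis.

```c
#include <assert.h>

struct VTile {
    int row_id, col_id;  /* ids naming the row/column value (constant or SSA) */
    int group;           /* filled in by group_by_shape */
};

/* Tiles whose row and column ids both match land in the same group.
   Returns the number of distinct shapes found. */
static int group_by_shape(struct VTile *t, int count) {
    int groups = 0;
    assert(count >= 0);
    for (int i = 0; i < count; ++i) {
        t[i].group = -1;
        for (int j = 0; j < i; ++j) {
            if (t[j].row_id == t[i].row_id && t[j].col_id == t[i].col_id) {
                t[i].group = t[j].group;
                break;
            }
        }
        if (t[i].group < 0)
            t[i].group = groups++;
    }
    return groups;
}
```

Tiles whose ids cannot be proven equal fall into different groups, so the grouping is conservative when the shapes are runtime values.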
> When the tile shape is unknown at compile time, how do you plan to do the register allocation of the tiles? My question is: do you do the allocation for this case in the same way as you would if you knew the size was 16x16 (i.e., conservatively assume the largest size)?
I think what will happen is that the registers are allocated based on a number of runtime values that are assumed to be different from one another but less than or equal to 16. So, for example, we’ll allocate registers for MxN tiles, NxM tiles and MxM tiles without knowing what M and N are. Then at runtime the values of these variables will be used to create the actual tile configuration. The instructions that need to know the shape take these runtime values as operands.
There may be some artifacts coming from the front end that conservatively assume a 16x16 tile, but I think those generally go away in SROA or later specialized passes. Yuanke can confirm or correct my understanding of this.
So you're going to multiversion the code?
In any case, my point is that you probably don't need a custom register allocator. If you just define the tile registers and make sure that the ldtilecfgs implicitly defines them all, then the regular infrastructure likely works. You'll have a bunch of register classes, but that's not necessarily a problem. I recommend trying this, and let us know what you discover, before we go down the road of a new, dedicated allocator just for these registers.
-Hal
The width and height can be runtime values that we would just copy into the 64-byte configuration block we pass to ldtilecfg. So the code doesn’t need to be multiversioned. The user code would also use those values to update pointers in the loops they write using the tiles. If we can’t determine that two tiles were defined with the same width and height, we need to assume the shapes are different and try to avoid ever assigning them the same tile register.
Hal, under your suggestion, would the assignment of physical registers to register classes be defined dynamically before register allocation?
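Copying runtime shapes into the configuration block could look roughly like the sketch below. The byte layout used here (palette in byte 0, start row in byte 1, 16-bit bytes-per-row entries from byte 16, 8-bit row counts from byte 48) is my reading of the instruction manual and should be checked against it; build_tile_config and NUM_TILES are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { NUM_TILES = 8, TILE_CFG_BYTES = 64 };

/* Fill a 64-byte tile configuration block from runtime shapes.
   Assumed layout:
     byte 0        palette id
     byte 1        start row
     bytes 16..47  bytes per row of each tile (16 bits, little-endian)
     bytes 48..63  number of rows of each tile (8 bits)
   All other bytes are left as zero. */
static void build_tile_config(uint8_t cfg[TILE_CFG_BYTES],
                              const uint8_t rows[NUM_TILES],
                              const uint16_t colsb[NUM_TILES]) {
    memset(cfg, 0, TILE_CFG_BYTES);
    cfg[0] = 1;  /* palette 1 */
    for (int i = 0; i < NUM_TILES; ++i) {
        assert(rows[i] <= 16 && colsb[i] <= 64);
        cfg[16 + 2 * i]     = (uint8_t)(colsb[i] & 0xff);
        cfg[16 + 2 * i + 1] = (uint8_t)(colsb[i] >> 8);
        cfg[48 + i]         = rows[i];
    }
}
```

Because the block is plain memory, the shapes can be ordinary runtime values, which is why no multiversioning is needed; the resulting buffer would be the memory operand of ldtilecfg.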
This is quite interesting. Did you put some thoughts on how this
extension will be exposed besides going through Clang? There has been
a lot of work recently on the MLIR side to represent multidimensional
vectors. Nicolas Vasilache presented it in the open design meeting
last week (slides:
https://drive.google.com/file/d/1_zPPxOILAIHOWoSM7GALwioYOGEgD2Xe/view,
recording: https://drive.google.com/file/d/13jY4GTe7ZjFxqh3TCMBUh15HWoSGcswj/view).
It would be great to have MLIR also target AMX in the future. Looking
at the design I think a lot of it would match well with the direction
MLIR has taken. One thing that is not supported at the time - even
though it has been discussed - is dynamic vector size. Do you expect
this to be a common use case or is it supported for completeness?
It would be great to hear your thoughts on how AMX could be targeted
by MLIR if you have looked at it at all.
Thanks,
Thomas
The width and height can be runtime values that we would just copy into the 64-byte configuration block we pass to ldtilecfg. So the code doesn’t need to be multiversioned. The user code would also use those values to update pointers in the loops they write using the tiles. If we can’t determine that two tiles were defined with the same width and height, we need to assume the shapes are different and try to avoid ever assigning them the same tile register.
Hal, under your suggestion, would the assignment of physical registers to register classes be defined dynamically before register allocation?
Here's my thought:
First, you have a set of intrinsics that take tile values along
with tile configuration parameters (which, presently, seem just to
be the sizes). These get lowered into pseudo-instructions that do
the same. Thus, you have some register class that represents these
arbitrarily-sized tile registers that you'll assign to these
pseudo-instruction operands (i.e., they take virtual tile
registers right after instruction selection). You might use the
16x16 tile register class for this purpose, but it shouldn't
really matter.
Second, you run this configuration-placement pass. This pass looks at all of the AMX pseudo-instructions and identifies regions in which the pseudo-instructions use the same configuration parameters (i.e., the same SSA values and/or constants). This pass might reorder the pseudo-instructions when legal in order to form larger regions. Then it places the ldtilecfg at the start of each region (in some common dominating position). ldtilecfg implicitly defines all of the tile registers in every concrete class of tile registers (all 256 of them, or whatever). The pseudo-instructions are replaced by real MI instructions taking a tile register class appropriate for the configuration (which will default to the 16x16 class for cases where the configuration is not a compile-time-known constant). When the configuration is a known constant, the instructions take operands with a register class appropriate for that configuration (e.g., 1x1, 4x4).
Third, the rest of the framework runs as usual. Tile registers
from the appropriate class are allocated by the register
allocator. No live range of any virtual tile register can pass
through the ldtilecfg (because it defines them all), but that's
okay: none of the live ranges will, by construction (the
configuration-placement pass ensures this).
I think I’m still missing something here. The configuration is per tile. The multiply instructions take a MxK tile and multiply it by a KxN tile and accumulate into an MxN tile. So the configuration needs to know how many of each size of tile it needs to avoid a spill. Wouldn’t the register allocator then need to know which physical tiles have been configured to which sizes so that it only chooses those tiles for an operand that needs that size?
~Craig
Ignore my spill comment for now. That’s more of an optimization.
Let’s say I have a 2x3 tile and a 3x2 tile, and I multiply them to make a 2x2 tile. I have 3 different sizes of tiles, so my instruction uses 3 different register classes for its virtual registers.
The pass that inserts the ldtilecfg needs to configure the physical tiles, so let’s say it configures tmm0 to 2x3, tmm1 to 3x2 and tmm2 to 2x2.
Register classes as I know them in LLVM have a static list of physical registers in them. So do all 3 of the register classes for my virtual registers contain all 8 physical tmm registers? How does the register allocator know to use tmm0 for the 2x3 virtual register, tmm1 for the 3x2 virtual register, and tmm2 for the 2x2 virtual register?
~Craig
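To make the example concrete, here is a minimal sketch of how the three operand shapes follow from the (m, n, k) parameters of a tile multiply, and how many distinct shapes one instruction can require. Shape, DotShapes, dotProductShapes and countDistinctShapes are hypothetical names for illustration, and shapes are counted in elements rather than in the rows/bytes encoding the hardware uses.

```cpp
#include <cassert>
#include <set>
#include <utility>

using Shape = std::pair<unsigned, unsigned>;  // {rows, cols} in elements

struct DotShapes {
  Shape dst;   // MxN accumulator
  Shape src1;  // MxK left operand
  Shape src2;  // KxN right operand
};

// dst (MxN) += src1 (MxK) * src2 (KxN)
DotShapes dotProductShapes(unsigned m, unsigned n, unsigned k) {
  return {Shape(m, n), Shape(m, k), Shape(k, n)};
}

// Number of different shapes one multiply needs configured.
unsigned countDistinctShapes(const DotShapes &s) {
  std::set<Shape> distinct;
  distinct.insert(s.dst);
  distinct.insert(s.src1);
  distinct.insert(s.src2);
  return static_cast<unsigned>(distinct.size());
}
```

With m = 2, n = 2, k = 3 this yields the 2x3, 3x2 and 2x2 tiles of the example, so three differently configured physical tiles are needed for a single instruction.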
From: Hal Finkel <hfi...@anl.gov>
Sent: Thursday, August 20, 2020 1:27 PM
To: Topper, Craig <craig....@intel.com>; Kaylor, Andrew <andrew...@intel.com>; Luo, Yuanke <yuank...@intel.com>; Philip Reames <list...@philipreames.com>; llvm...@lists.llvm.org; floria...@apple.com; Lu, Hongjiu <hongj...@intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.
On 8/20/20 2:47 PM, Topper, Craig wrote:
I think I’m still missing something here. The configuration is per tile. The multiply instructions take a MxK tile and multiply it by a KxN tile and accumulate into an MxN tile. So the configuration needs to know how many of each size of tile it needs to avoid a spill. Wouldn’t the register allocator then need to know which physical tiles have been configured to which sizes so that it only chooses those tiles for an operand that needs that size?
Yes, I think so. But it will because that information is essentially encoded in the virtual register classes. I certainly could be missing something. It seems like you first figure that out, and then you assign virtual tile registers corresponding to the correct tile sizes. Perhaps this comes down to what you mean by "avoid a spill." We still might spill, and I assume that the infrastructure always needs to deal with that. We should continue to do instruction scheduling in order to minimize register pressure. Once we assign the right virtual register classes to the AMX instructions, shouldn't this automatically happen? If we do spill, since none of the original live ranges cross the ldtilecfg, then there shouldn't be any fundamental issue with using a regular load/store spill implementation.
I'm definitely not an expert in this instruction set, so I may just not understand some aspect of this. If there's something I'm overlooking, a little example would be helpful.
Thanks again,
Hal
Ignore my spill comment for now. That’s more of an optimization.
Let’s say I have a 2x3 tile and a 3x2 tile, and I multiply them to make a 2x2 tile. I have 3 different sizes of tiles, so my instruction uses 3 different register classes for its virtual registers.
The pass that inserts the ldtilecfg needs to configure the physical tiles, so let’s say it configures tmm0 to 2x3, tmm1 to 3x2 and tmm2 to 2x2.
Register classes as I know them in LLVM have a static list of physical registers in them. So do all 3 of the register classes for my virtual registers contain all 8 physical tmm registers? How does the register allocator know to use tmm0 for the 2x3 virtual register, tmm1 for the 3x2 virtual register, and tmm2 for the 2x2 virtual register?
~Craig
Ah, okay. I think I see why we're not on the same page. The architectural definition has 8 tile registers, tmm0-tmm7, but I was thinking that you would not model it that way. Instead, we could have registers:
tmm0_1x1 ... tmm7_1x1
...
tmm0_16x16 ... tmm7_16x16
where tmm0_1x1 aliases tmm0_1x2, ..., tmm0_16x16, and so on,
and corresponding register classes RegClassTmm1x1, ..., RegClassTmm16x16 (I don't mean to imply this exact naming convention). So, within each region, you assign the relevant virtual registers to have a register class of RegClassTmm1x1, or whatever, and then once register allocation is done, you adjust the ldtilecfg data for each region so that it actually gives whatever registers were assigned the right tile sizes.
You would not want to have N^2 versions of all of the instructions either, but I think you can just have the instructions defined to take some overall register class (containing all of the registers) and then you can call constrainRegClass in the configuration-placement pass.
Thinking about it however, maybe having the different physical registers isn't actually needed. If you know which tile config each register needed based on the instructions, maybe you can have only 8 of them and just update the ldtilecfg based on the usage information after allocation regardless.
Hi Hal,
The proposal is attractive to me, but there is something I still can’t figure out. Let’s take the MIR below as an example. We assume we have 256 register classes (vtile1x1, vtile1x2, …, vtile16x16).
1. After instruction selection, the pseudo AMX instructions are generated. The names of the pseudo instructions have a ‘P’ prefix. At this point all the AMX pseudo instructions take vtile as their register class. Let’s assume %13 is the constant 3, %10 is the constant 4 and %14 is a variable.
%1:vtile = PTILELOADDV %13:gr16, %10:gr16, %17:gr64, 1, %18:gr64_nosp, 0, $noreg
%2:vtile = PTILELOADDV %10:gr16, %14:gr16, %17:gr64, 1, %18:gr64_nosp, 0, $noreg
%3:vtile = PTILELOADDV %13:gr16, %14:gr16, %17:gr64, 1, %18:gr64_nosp, 0, $noreg
%21:vtile = PTDPBSSDV %13:gr16, %10:gr16, %14:gr16, %3:vtile(tied-def 0), %1:vtile, %2:vtile
2. The configuration-placement pass looks at all of the AMX pseudo-instructions and identifies regions in which the pseudo-instructions use the same configuration parameters. It first replaces the register class of every tile register whose shape is known at compile time. Since the shape of %1 is constant, it replaces %1:vtile with %1:vtile3x4, which changes the register class and morphs the pseudo instruction into a real AMX instruction. The shapes of %2 and %3 are unknown at compile time, so the pass arbitrarily picks tile register classes that have not been assigned before and assigns them to %2 and %3. After register class assignment, the code is transformed as follows, with %2:vtile1x1 and %3:vtile1x2 as the chosen register classes.
PLDTILECFG
%1:vtile3x4 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
%2:vtile1x1 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
%3:vtile1x2 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
%21:vtile1x2 = TDPBSSDV %3:vtile1x2(tied-def 0), %1:vtile3x4, %2:vtile1x1
Some things I have not figured out:
a. I am not sure whether an AMX instruction’s inputs and outputs can fit multiple register classes (vtile1x1, …, vtile16x16); otherwise we need 256 pseudo instructions.
b. Whether 256 register classes are enough. There may be more than 256 tile registers of unknown shape.
c. In this pass we also find the proper point (common dominator) to insert ldtilecfg, but at this time the registers are not yet allocated, so we don’t know the shape of each physical tile register. So we just insert a pseudo tile config instruction.
3. All tile register classes share the same register units. We do register allocation with the existing framework, and the code is transformed as follows.
$tmm0 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
$tmm1 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
$tmm2 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
$tmm2 = TDPBSSDV $tmm2(tied-def 0), $tmm0, $tmm1
4. Run the config pass to collect the shape of each physical tile register and configure them. The code can be generated as below. Here is the problem: how can we know the shape of each physical tile register?
MOV row, col info to %stack.0 for each physical tile register ??????
LDTILECFG %stack.0, 1, $noreg, 0, $noreg, implicit-def $tmm0, implicit-def $tmm1, implicit-def $tmm2, implicit-def $tmm3, implicit-def $tmm4, implicit-def $tmm5, implicit-def $tmm6, implicit-def $tmm7
$tmm0 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
$tmm1 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
$tmm2 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
$tmm2 = TDPBSSDV $tmm2(tied-def 0), $tmm0, $tmm1
Thanks
Yuanke
It seems I made a mistake about sharing register units. Can tile registers in different tile register classes (each register class has a different tile shape) share a register unit? Think about two virtual tile registers, %2:vtile1x1 and %3:vtile1x2. First %2 is allocated to $tmm0; after that %2 is killed and %3 is allocated to $tmm0. This is not allowed, because when $tmm0 is allocated to %2, its shape is configured to 1x1. If we reallocate $tmm0 to %3, then we need to re-configure $tmm0 to 1x2, which causes $tmm0-$tmm7 to be clobbered.
Yuanke
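The constraint described here can be stated as a simple check over one configuration region: every physical tile register may be paired with only one shape. The sketch below is only an illustration of that rule; Assignment and shapesConsistent are hypothetical names, and live ranges are deliberately ignored because the rule holds even when they do not overlap.

```cpp
#include <cassert>
#include <utility>
#include <vector>

using Shape = std::pair<unsigned, unsigned>;  // {rows, cols}

// One allocated virtual tile register: the physical register it received
// (0..7 for tmm0..tmm7) and the shape it requires.
struct Assignment {
  unsigned physReg;
  Shape shape;
};

// Returns true if, within one configuration region, no physical tile
// register is assigned to virtual registers of two different shapes.
bool shapesConsistent(const std::vector<Assignment> &region) {
  bool used[8] = {false};
  Shape configured[8];
  for (const Assignment &a : region) {
    if (a.physReg >= 8) return false;  // not a tile register
    if (used[a.physReg] && configured[a.physReg] != a.shape)
      return false;  // reuse would need a re-config that clobbers all tiles
    used[a.physReg] = true;
    configured[a.physReg] = a.shape;
  }
  return true;
}
```

For the case above, assigning both %2 (1x1) and %3 (1x2) to $tmm0 makes the check fail, while giving %3 a different register, or the same shape, makes it pass.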
Hi, Yuanke,
Thanks for writing this up. Let me back up a bit because the scheme I proposed last week doesn't work without further modification: within a particular "configuration region" (i.e., the code in between the LDTILECFG and the TILERELEASE (or next LDTILECFG)), each tile register can only be used with one shape, and in addition, no register can have its shape changed without zeroing out all of the tile registers. Thus, just using different register classes for the different shapes, as I had suggested, isn't sufficient to model the allocation requirements. That would not prevent the same register from essentially being assigned to differently-shaped virtual registers with non-overlapping live ranges within one configuration region.
Also, as you point out, when multiple non-static tile shapes are in use, if you use one register class for each shape, you would need different register classes for these too. Luckily, I don't think that using the separate register classes actually buys us anything, so please disregard that suggestion of mine. Use only one register class.
Once the configuration regions are identified, you'll know how many tile register shapes are required. If this number is greater than eight, then you'll need to cut the region (requiring all live tiles to be spilled and restored around each re-configuration point). After that, we'll assume that we have eight or fewer distinct shapes.
Now the problem is that you need to allocate registers, satisfying all of the usual constraints (non-overlapping live ranges, etc.), but with an additional constraint: once a physical register has been used with some particular tile shape, it cannot be assigned to any other tile shape.
I think that the current infrastructure can support this as follows:
1. Add an override of X86RegisterInfo::getRegAllocationHints. Like SystemZRegisterInfo::getRegAllocationHints does sometimes, when hinting the tile registers, the function will return true (to indicate a hard constraint). As registers are assigned in RegAllocGreedy, getRegAllocationHints is called for each virtual register. For virtual tile registers, look at the passed VirtRegMap, etc. for already-assigned tile virtual registers with different shape requirements than the current virtual register (you'll need to cache the shape requirements in X86MachineFunctionInfo for this to be efficient), and return a hints list consisting of all other non-reserved tile registers.
2. To support RegAllocFast, which doesn't use getRegAllocationHints, you would need to make the configuration regions small enough that it doesn't matter (and if you're doing this around every tile instruction, this is automatically true).
3. To support RegAllocPBQP (which is likely a good thing to do,
but probably not required), I believe you can support this by
adding custom constraints to the solver (kind of like what
AArch64PBQPRegAlloc.cpp does).
Once the allocation process is complete, you'll need to go back
and update the LDTILECFG data to reflect the chosen shape ->
register mapping.
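As a rough model of the hint computation in step 1 and of the final shape -> register mapping, consider the sketch below. It is not LLVM code: RegionState, allowedRegs and bindShape are hypothetical names, the real hook would work on VirtRegMap and MCPhysReg values, and live-range interference (which the allocator checks separately) is left out.

```cpp
#include <cassert>
#include <utility>
#include <vector>

using Shape = std::pair<unsigned, unsigned>;  // {rows, cols}
const unsigned NumTileRegs = 8;               // tmm0..tmm7

// Per-region record of which shape each physical tile register is bound to.
// After allocation this is also the data needed to fill in LDTILECFG.
struct RegionState {
  bool hasShape[NumTileRegs] = {};
  Shape shape[NumTileRegs];
};

// Registers a virtual tile register of shape `want` may still use: every
// register that is unbound or already bound to the same shape.
std::vector<unsigned> allowedRegs(const RegionState &st, Shape want) {
  std::vector<unsigned> out;
  for (unsigned r = 0; r < NumTileRegs; ++r)
    if (!st.hasShape[r] || st.shape[r] == want) out.push_back(r);
  return out;
}

// Record that `reg` was chosen for a virtual register of shape `s`.
void bindShape(RegionState &st, unsigned reg, Shape s) {
  assert(reg < NumTileRegs);
  assert(!st.hasShape[reg] || st.shape[reg] == s);
  st.hasShape[reg] = true;
  st.shape[reg] = s;
}
```

Returning this list as a hard constraint is what keeps a register from being reused with a second shape; once allocation finishes, the contents of RegionState give the per-register rows and columns to write into the configuration data.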
What I don't know, however, is how well the getRegAllocationHints
method will work. The benefit is that you don't need to write a
custom pre-allocator allocator. On the other hand, it might visit
the virtual registers to assign in a suboptimal order because it
doesn't really understand the constraint being imposed (generally,
we just assign larger live ranges first). On the other hand, it is
a greedy algorithm and if you want something systematically closer
to optimal, maybe you should be using PBQP anyway. If you do end
up needing a custom allocator for these, I recommend looking at
the PBQP solver (which, as I recall, is independently reusable).
Hopefully, this is more-helpful advice.
-Hal
Hi Hal,
Thank you for the ideas that help us to improve the design, and sorry for replying late. There are some things I am not able to figure out, and there are some special traits of tile RA.
1. X86RegisterInfo::getRegAllocationHints can tell RA which physical register is preferred, but it can’t force RA to allocate only the hinted register. If the hinted register is not available, RA would allocate another register.
2. The shape information should be attached to each virtual register and each allocated physical register. How can we store and retrieve the shape information with limited code changes to the existing RA? When a tile register is spilled, the shape should also be bound to the corresponding spill stack slot, so that the value can be reloaded into a physical tile register with the same shape.
3. There is no mov/copy instruction for tile registers. To copy a tile register, we need to store it to memory and load the data from memory into another register. So a lot of the live interval splitting code in Greedy RA is unnecessary for tile register allocation.
4. The compiler can support register spills, but spills should be avoided for performance. We prefer reporting a warning on a register spill, so that users notice it and can adjust their code to avoid it.
If there is no easy way to take advantage of the current RA infrastructure, there are some pros to having a separate RA for tile registers.
1. We can limit the risk of breaking RA for general registers on each arch. If there are bugs in tile RA, only applications that use AMX are affected.
2. We can customize the special traits (config, split, spill) of tile registers more freely in the separate RA.
For RegAllocFast, I agree with you. Each region is small, and since performance is not the first priority, we can insert multiple configs, one for each small region.
As you recommend looking at the PBQP solver, I’ll take some time to investigate it and get back to you.
Thanks
-Yuanke
Hi Hal,
Thank you for the ideas that help us to improve the design, and sorry for replying late. There are some things I am not able to figure out, and there are some special traits of tile RA.
You're quite welcome.
1. X86RegisterInfo::getRegAllocationHints can tell RA which physical register is preferred, but it can’t force RA to allocate only the hinted register. If the hinted register is not available, RA would allocate another register.
I addressed this below, but I could have been clearer. Like
SystemZRegisterInfo::getRegAllocationHints does sometimes, when
hinting the tile registers, the function will return true. This
turns the preference into a hard constraint, and the allocator
will not allocate any other register. That's my understanding from
reading the code.
2. The shape information should be attached to each virtual register and each allocated physical register. How can we store and retrieve the shape information with limited code changes to the existing RA?
For each virtual register, getRegAllocationHints could just recompute the shape information. If this isn't a constant-time operation, however, you'll probably want to cache the computed shape requirements in X86MachineFunctionInfo. You can add a map from registers to shape information in that class, and accesses it from getRegAllocationHints. You can store information about the physical registers there too.
Regarding the physical registers, you can grab this information
in the pre-rewrite phase. Override addPreRewrite in
X86TargetMachine.cpp. You'll need a small pass that records
relevant information about the assignments (which, I imagine, is
the same small pass that updates the LDTILECFG instructions). For
an example of such a pass, see AMDGPU/GCNNSAReassign.cpp
When a tile register is spilled, the shape should also be bound to the corresponding spill stack slot, so that the value can be reloaded into a physical tile register with the same shape.
I'm not sure what you mean. If you don't want to just be
conservative about the spill size allocation, you do need to know
the shape in order to compute the spill-location size. I assume
that you can grab that out of X86MachineFunctionInfo from
storeRegToStackSlot/loadRegFromStackSlot or eliminateFrameIndex
(or copyPhysReg) as needed.
3. There is no mov/copy instruction for tile registers. To copy a tile register, we need to store it to memory and load the data from memory into another register. So a lot of the live interval splitting code in Greedy RA is unnecessary for tile register allocation.
Yes, but this just means that you need to support copying through
memory. Setting CopyCost = -1 in X86RegisterInfo.td might help as
well.
4. The compiler can support register spills, but spills should be avoided for performance. We prefer reporting a warning on a register spill, so that users notice it and can adjust their code to avoid it.
If you want to emit a diagnostic, you may be able to do that from
storeRegToStackSlot. In any case, please make use of the
optimization-remark infrastructure. For an example of how to do
this, see RAGreedy::reportNumberOfSplillsReloads in
RegAllocGreedy.cpp.
If there is no easy way to take advantage of the current RA infrastructure, there are some pros to having a separate RA for tile registers.
1. We can limit the risk of breaking RA for general registers on each arch. If there are bugs in tile RA, only applications that use AMX are affected.
That's true. But I also worry about that. Any time you need to
write non-trivial code that will be used relatively rarely, it's
likely to have bugs that take a long time to show up. If you can
plug into the generic infrastructure, you benefit from the fact
that it's highly-covered, often-used code. Not that you might not
run into bugs, of course, especially if you're using it in a new
way, but the base logic is likely to already be robust.
2. We can customize the special traits (config, split, spill) of tile registers more freely in the separate RA.
True.
-Hal
Hi Hal,
Generally, your proposal to adapt tile RA to Greedy RA looks good to me. Thank you! I plan to prototype the proposal. Since there are 3 register allocators in the LLVM infrastructure, we need 3 schemes, one to adapt tile RA to each existing allocator. Would you like to finalize the 3 schemes first, or would you like to review the rest of the AMX programming model? We have some limitations in supporting dynamic shapes and I’d like to hear your advice. A dynamic shape requires that the ldtilecfg come after the point that defines the shape (the shape definition must dominate the ldtilecfg), so we encourage users to define their shapes at the entry of the function. Take the code below as an example. Ideally, we would insert ldtilecfg at line 57 to configure a, b and c, but in this function c’s shape {row, col} is defined in each if/else clause. So at line 57, the shape of c is unknown. Do you have any advice for this problem?
52 void kernel(int cond) {
53   _tile a = {row, 8};
54   _tile b = {8, col};
55
56   // copy shape to stack slot
57   // ldtilecfg
58   if(cond) {
59     short row = get_row();
60     short col = get_row();
61     _tile c = {row, col};
62     __tile_loadd(&a, buf, STRIDE);
63     __tile_loadd(&b, buf, STRIDE);
64     __tile_loadd(&c, buf, STRIDE);
65   } else {
66     short row = get_row();
67     short col = get_row();
68     _tile c = {row, col};
69     __tile_loadd(&a, buf2, STRIDE);
70     __tile_loadd(&b, buf2, STRIDE);
71     __tile_loadd(&c, buf2, STRIDE);
72   }
73   __tile_dpbsud(&c, a, b);
74   __tile_stored(buf, STRIDE, c);
75 }
Thanks
Yuanke
Fix typo
From: Luo, Yuanke
Sent: Friday, September 4, 2020 9:47 PM
To: 'Hal Finkel' <hfi...@anl.gov>; Topper, Craig <craig....@intel.com>; Kaylor, Andrew <andrew...@intel.com>; Philip Reames <list...@philipreames.com>; llvm...@lists.llvm.org; floria...@apple.com; Lu, Hongjiu <hongj...@intel.com>
Subject: RE: [llvm-dev] Intel AMX programming model discussion.
Hi Hal,
Generally, your proposal to adapt tile RA to Greedy RA looks good to me. Thank you! I plan to prototype the proposal. Since there are 3 register allocators in the LLVM infrastructure, we need 3 schemes, one to adapt tile RA to each existing allocator. Would you like to finalize the 3 schemes first, or would you like to review the rest of the AMX programming model? We have some limitations in supporting dynamic shapes and I’d like to hear your advice. A dynamic shape requires that the ldtilecfg come after the point that defines the shape (the shape definition must dominate the ldtilecfg), so we encourage users to define their shapes at the entry of the function. Take the code below as an example. Ideally, we would insert ldtilecfg at line 57 to configure a, b and c, but in this function c’s shape {row, col} is defined in each if/else clause. So at line 57, the shape of c is unknown. Do you have any advice for this problem?
52 void kernel(int cond) {
53 _tile a = {row, 8};
54 _tile b = {8, col};
55
56 // copy shape to stack slot
57 // ldtilecfg a, b, c
In the example below, I'm going to assume that the function calls are actually to get_row1() and get_row2(), neither of which can be hoisted.
Just to think about this: First, we're starting the MIR with intrinsics that take the shape parameters directly. Now you need to:
1. Identify "configuration regions". Because reconfiguring must be done for all registers at once, and because reconfiguring zeros all of the tile registers, each configuration region is a connected component in the union of the live ranges of all virtual tile registers. Thus, first collect the configuration regions via trivial clustering (two instructions are part of the same configuration region if they share any live range of a tile register).
2. If the region will require more than eight distinct shapes,
then you'll need to calculate a min cut of the region and split the
region by inserting spills/restores, so that each resulting region
requires at most eight shapes.
3. If you do it this way, all of the instructions in your code
below will be part of one, big configuration region. Generally,
you want to put the ldtilecfg at the common dominating point of
all of the tile instructions in the region. Now, as you point out
in your example below, we can't simply put the ldtilecfg at the
common dominating point: that point might not actually be
dominated by the definitions of all of the shape inputs needed.
4. One thing that you might do is iterative splitting. If not all of the definitions of the shape inputs dominate the desired insertion point, first you might try iteratively hoisting the defining instructions to make it so the definitions do dominate. If they still don't, then split the ldtilecfg into each successor of the desired insertion point. Do this recursively until, for each ldtilecfg, the inputs for each dynamic-shape tile register size dominate the insertion point.
5. This procedure, alone, might fail in the case where the
ldtilecfg is sunk past the point of definition of one of the tile
registers. Imagine, in your example below, that there was some use
of the tile registers a and b before the if. In that case, you'll
need to split those live ranges by spilling into memory around the
desired ldtilecfg insertion point. That creates a new
configuration region that you'll insert into the queue of
configuration regions to process.
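The "trivial clustering" of step 1 can be sketched with a union-find structure, as below. This is an illustration under simplifying assumptions, not the pass itself: instructions are plain indices, each live range is given as the list of tile instructions that define or use one virtual tile register, and UnionFind and clusterRegions are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Minimal union-find over instruction indices.
struct UnionFind {
  std::vector<unsigned> parent;
  explicit UnionFind(unsigned n) : parent(n) {
    std::iota(parent.begin(), parent.end(), 0u);
  }
  unsigned find(unsigned x) {
    while (parent[x] != x) {
      parent[x] = parent[parent[x]];  // path halving
      x = parent[x];
    }
    return x;
  }
  void unite(unsigned a, unsigned b) { parent[find(a)] = find(b); }
};

// liveRanges[v] lists the instructions touching virtual tile register v.
// Returns, for every instruction, a representative id of its configuration
// region: two instructions get the same id if a chain of shared live ranges
// connects them.
std::vector<unsigned> clusterRegions(
    unsigned numInstrs,
    const std::vector<std::vector<unsigned>> &liveRanges) {
  UnionFind uf(numInstrs);
  for (const std::vector<unsigned> &range : liveRanges)
    for (std::size_t i = 1; i < range.size(); ++i)
      uf.unite(range[0], range[i]);
  std::vector<unsigned> region(numInstrs);
  for (unsigned i = 0; i < numInstrs; ++i) region[i] = uf.find(i);
  return region;
}
```

Each resulting region would then be checked against the eight-shape limit of step 2 before an insertion point for its ldtilecfg is chosen.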
I'm sure that this is not the only possible heuristic. This would be easier, I think, if the hardware did not zero all of the registers when you reconfigured any of them, but I suppose that it is what it is at this point.
-Hal
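The recursive splitting in step 4 might be modeled as follows. This is a sketch under strong assumptions: the CFG is loop-free, hoisting has already been tried, the immediate dominators are given, and each block carries the set of shape values an ldtilecfg placed there would have to read. Block, available and placeConfig are hypothetical names, and shape values are represented by small integers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

struct Block {
  int idom = -1;                // immediate dominator, -1 for the entry block
  std::vector<unsigned> succs;  // CFG successors
  std::set<int> defs;           // shape values defined in this block
  std::set<int> needed;         // shape values an ldtilecfg here would read
};

// Shape values whose definitions dominate the end of block b.
std::set<int> available(const std::vector<Block> &cfg, unsigned b) {
  std::set<int> out;
  for (int cur = static_cast<int>(b); cur >= 0;
       cur = cfg[static_cast<std::size_t>(cur)].idom) {
    const Block &blk = cfg[static_cast<std::size_t>(cur)];
    out.insert(blk.defs.begin(), blk.defs.end());
  }
  return out;
}

// Try to place the ldtilecfg in block b; if some input is not available
// there, sink it into every successor instead. Returns false when a block
// without successors is reached and the inputs are still unavailable.
bool placeConfig(const std::vector<Block> &cfg, unsigned b,
                 std::set<unsigned> &visited, std::vector<unsigned> &points) {
  if (!visited.insert(b).second) return true;  // already handled via another path
  std::set<int> avail = available(cfg, b);
  if (std::includes(avail.begin(), avail.end(), cfg[b].needed.begin(),
                    cfg[b].needed.end())) {
    points.push_back(b);
    return true;
  }
  if (cfg[b].succs.empty()) return false;
  for (unsigned s : cfg[b].succs)
    if (!placeConfig(cfg, s, visited, points)) return false;
  return true;
}
```

Applied to the kernel example with distinct, non-hoistable shape definitions in the two branches, the entry block fails the check and the configuration is split into the if and else blocks. The case from step 5, where a tile is already live before the chosen point, is not handled here and would still need the spill-based split.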
Hi Hal,
The RA that you proposed basically works. I have a prototype patch at https://reviews.llvm.org/D87981. I’d like to leave the identification and splitting of configuration regions to the next prototyping phase, since it is a little bit complex. At the language level, is it possible to require end users to define their shapes at the entry of the function?