Just a few initial comments.
Graham Hunter <Graham...@arm.com> writes:
> ``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the same number of
> bytes.
"scalable" instead of "scalable x."
> For derived types, a function (getSizeExpressionInBits) to return a pair of
> integers (one to indicate unscaled bits, the other for bits that need to be
> scaled by the runtime multiple) will be added. For backends that do not need to
> deal with scalable types, another function (getFixedSizeExpressionInBits) that
> only returns unscaled bits will be provided, with a debug assert that the type
> isn't scalable.
Can you explain a bit about what the two integers represent? What's the
"unscaled" part for?
The name "getSizeExpressionInBits" makes me think that a Value
expression will be returned (something like a ConstantExpr that uses
vscale). I would be surprised to get a pair of integers back. Do
clients actually need constant integer values or would a ConstantExpr
suffice? We could add a ConstantVScale or something to make it work.
> Comparing two of these sizes together is straightforward if only unscaled sizes
> are used. Comparisons between scaled sizes are also simple when comparing sizes
> within a function (or across functions with the inherit flag mentioned in the
> changes to the type), but cannot be compared otherwise. If a mix is present,
> then any number of unscaled bits will not be considered to have a greater size
> than a smaller number of scaled bits, but a smaller number of unscaled bits
> will be considered to have a smaller size than a greater number of scaled bits
> (since the runtime multiple is at least one).
If we went the ConstantExpr route and added ConstantExpr support to
ScalarEvolution, then SCEVs could be compared to do this size
comparison. We have code here that adds ConstantExpr support to
ScalarEvolution. We just didn't know if anyone else would be interested
in it since we added it solely for our Fortran frontend.
> We have added an experimental `vscale` intrinsic to represent the runtime
> multiple. Multiplying the result of this intrinsic by the minimum number of
> elements in a vector gives the total number of elements in a scalable vector.
I think this may be a case where adding a full-fledged Instruction might
be worthwhile. Because vscale is intimately tied to addressing, it
seems like things such as ScalarEvolution support will be important. I
don't know what's involved in making intrinsics work with
ScalarEvolution but it seems strangely odd that a key component of IR
computation would live outside the IR proper, in the sense that all
other fundamental addressing operations are Instructions.
> For constants consisting of a sequence of values, an experimental `stepvector`
> intrinsic has been added to represent a simple constant of the form
> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
> start can be added, and changing the step requires multiplying by a splat.
This is another case where an Instruction might be better, for the same
reasons as with vscale.
Also, "iota" is the name Cray has traditionally used for this operation
as it is the mathematical name for the concept. It's also used by C++
and Go and so should be familiar to many people.
> Future Work
> -----------
>
> Intrinsics cannot currently be used for constant folding. Our downstream
> compiler (using Constants instead of intrinsics) relies quite heavily on this
> for good code generation, so we will need to find new ways to recognize and
> fold these values.
As above, we could add ConstantVScale and also ConstantStepVector (or
ConstantIota). They won't fold to compile-time values but the
expressions could be simplified. I haven't really thought through the
implications of this, just brainstorming ideas. What does your
downstream compiler require in terms of constant support? What kinds of
queries does it need to do?
-David
Same here.
> If we went the ConstantExpr route and added ConstantExpr support to
> ScalarEvolution, then SCEVs could be compared to do this size
> comparison.
This sounds like a cleaner solution.
> I think this may be a case where adding a full-fledged Instruction might
> be worthwhile. Because vscale is intimately tied to addressing, it
> seems like things such as ScalarEvolution support will be important. I
> don't know what's involved in making intrinsics work with
> ScalarEvolution but it seems strangely odd that a key component of IR
> computation would live outside the IR proper, in the sense that all
> other fundamental addressing operations are Instructions.
...
> This is another case where an Instruction might be better, for the same
> reasons as with vscale.
This is a side-effect of the original RFC a few years ago. The general
feeling was that we can start with intrinsics and, if they work, we
change the IR.
We can have a work-in-progress implementation before fully committing
SCEV and other more core ideas in, and then slowly, and with more
certainty, move where it makes more sense.
> Also, "iota" is the name Cray has traditionally used for this operation
> as it is the mathematical name for the concept. It's also used by C++
> and Go and so should be familiar to many people.
That sounds better, but stepvector is more than C++'s iota; it's
a plain scalar evolution sequence like {start, op, step}. In
C++'s iota (AFAICS), the step is always 1.
Anyway, I don't mind any name, really. Whatever is more mnemonic.
> ;; Create sequence for scalable vector
> %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
> %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
> %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
Once stepvector (or iota) becomes a proper IR instruction, I'd like to
see this restricted to inlined syntax. The sequence { step*vec +
offset } only makes sense in the scalable context and the intermediate
results should not be used elsewhere.
--
cheers,
--renato
>> This is another case where an Instruction might be better, for the same
>> reasons as with vscale.
>
> This is a side-effect of the original RFC a few years ago. The general
> feeling was that we can start with intrinsics and, if they work, we
> change the IR.
>
> We can have a work-in-progress implementation before fully committing
> SCEV and other more core ideas in, and then slowly, and with more
> certainty, move where it makes more sense.
Ok, that makes sense. I do think the goal should be making these proper
Instructions.
>> Also, "iota" is the name Cray has traditionally used for this operation
>> as it is the mathematical name for the concept. It's also used by C++
>> and Go and so should be familiar to many people.
>
> That sounds better, but stepvector is more than C++'s iota and it's
> just a plain scalar evolution sequence like {start, op, step}. In
> C++'s iota (AFAICS), the step is always 1.
I thought stepvector was also always step one, as Graham states a
multiply by a constant splat must be used for other step values.
> Anyway, I don't mind any name, really. Whatever is more mnemonic.
Me neither. I was simply noting some of the history surrounding the
operation and naming in other familiar places.
>> ;; Create sequence for scalable vector
>> %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>> %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
>> %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
>
> Once stepvector (or iota) becomes a proper IR instruction, I'd like to
> see this restricted to inlined syntax. The sequence { step*vec +
> offset } only makes sense in the scalable context and the intermediate
> results should not be used elsewhere.
I'm not so sure. iota is a generally useful operation and scaling it to
various step values is also useful. It's used often for strided memory
access, which would be done via gather/scatter in LLVM but generating a
vector GEP via stepvector would be convenient and convey more semantic
information than, say, loading a constant vector of indices to feed the
GEP.
-David
Thanks for taking a look.
> On 5 Jun 2018, at 16:23, d...@cray.com wrote:
>
> Hi Graham,
>
> Just a few initial comments.
>
> Graham Hunter <Graham...@arm.com> writes:
>
>> ``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the same number of
>> bytes.
>
> "scalable" instead of "scalable x."
Yep, missed that in the conversion from the old <n x m x ty> format.
>
>> For derived types, a function (getSizeExpressionInBits) to return a pair of
>> integers (one to indicate unscaled bits, the other for bits that need to be
>> scaled by the runtime multiple) will be added. For backends that do not need to
>> deal with scalable types, another function (getFixedSizeExpressionInBits) that
>> only returns unscaled bits will be provided, with a debug assert that the type
>> isn't scalable.
>
> Can you explain a bit about what the two integers represent? What's the
> "unscaled" part for?
'Unscaled' just means 'exactly this many bits', whereas 'scaled' is 'this many bits
multiplied by vscale'.
>
> The name "getSizeExpressionInBits" makes me think that a Value
> expression will be returned (something like a ConstantExpr that uses
> vscale). I would be surprised to get a pair of integers back. Do
> clients actually need constant integer values or would a ConstantExpr
> suffice? We could add a ConstantVScale or something to make it work.
I agree the name is not ideal and I'm open to suggestions -- I was thinking of the two
integers representing the known-at-compile-time terms in an expression:
'(scaled_bits * vscale) + unscaled_bits'.
Assuming the pair is of the form (unscaled, scaled), then for a type with a size known at
compile time like <4 x i32> the size would be (128, 0).
For a scalable type like <scalable 4 x i32> the size would be (0, 128).
For a struct with, say, a <scalable 32 x i8> and an i64, it would be (64, 256).
When calculating the offset for memory addresses, you just need to multiply the scaled
part by vscale and add the unscaled as is.
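As a rough IR sketch of that last point (using the experimental vscale
intrinsic name that appears elsewhere in this thread; nothing here is final):

``
;; Size in bits of { <scalable 32 x i8>, i64 }, i.e. (256 * vscale) + 64:
%vscale64 = call i64 @llvm.experimental.vector.vscale.64()
%scaledbits = mul i64 %vscale64, 256
%sizeinbits = add i64 %scaledbits, 64
``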
>
>> Comparing two of these sizes together is straightforward if only unscaled sizes
>> are used. Comparisons between scaled sizes are also simple when comparing sizes
>> within a function (or across functions with the inherit flag mentioned in the
>> changes to the type), but cannot be compared otherwise. If a mix is present,
>> then any number of unscaled bits will not be considered to have a greater size
>> than a smaller number of scaled bits, but a smaller number of unscaled bits
>> will be considered to have a smaller size than a greater number of scaled bits
>> (since the runtime multiple is at least one).
>
> If we went the ConstantExpr route and added ConstantExpr support to
> ScalarEvolution, then SCEVs could be compared to do this size
> comparison. We have code here that adds ConstantExpr support to
> ScalarEvolution. We just didn't know if anyone else would be interested
> in it since we added it solely for our Fortran frontend.
We added a dedicated SCEV expression class for vscale instead; I suspect it works
either way.
>
>> We have added an experimental `vscale` intrinsic to represent the runtime
>> multiple. Multiplying the result of this intrinsic by the minimum number of
>> elements in a vector gives the total number of elements in a scalable vector.
>
> I think this may be a case where adding a full-fledged Instruction might
> be worthwhile. Because vscale is intimately tied to addressing, it
> seems like things such as ScalarEvolution support will be important. I
> don't know what's involved in making intrinsics work with
> ScalarEvolution but it seems strangely odd that a key component of IR
> computation would live outside the IR proper, in the sense that all
> other fundamental addressing operations are Instructions.
We've tried it as both an instruction and as a 'Constant', and both work fine with
ScalarEvolution. I have not yet tried it with the intrinsic.
>
>> For constants consisting of a sequence of values, an experimental `stepvector`
>> intrinsic has been added to represent a simple constant of the form
>> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
>> start can be added, and changing the step requires multiplying by a splat.
>
> This is another case where an Instruction might be better, for the same
> reasons as with vscale.
>
> Also, "iota" is the name Cray has traditionally used for this operation
> as it is the mathematical name for the concept. It's also used by C++
> and Go and so should be familiar to many people.
Iota would be fine with me; I forget the reason we didn't go with that initially. We
also had 'series_vector' in the past, but that was a more generic form with start
and step parameters instead of requiring additional IR instructions to multiply and
add for the result as we do for stepvector.
>
>> Future Work
>> -----------
>>
>> Intrinsics cannot currently be used for constant folding. Our downstream
>> compiler (using Constants instead of intrinsics) relies quite heavily on this
>> for good code generation, so we will need to find new ways to recognize and
>> fold these values.
>
> As above, we could add ConstantVScale and also ConstantStepVector (or
> ConstantIota). They won't fold to compile-time values but the
> expressions could be simplified. I haven't really thought through the
> implications of this, just brainstorming ideas. What does your
> downstream compiler require in terms of constant support? What kinds of
> queries does it need to do?
It makes things a little easier to pattern match (just looking for a constant to start
instead of having to match multiple different forms of vscale or stepvector multiplied
and/or added in each place you're looking for them).
The bigger reason we currently depend on them being constant is that code generation
generally looks at a single block at a time, and there are several expressions using
vscale that we don't want to be generated in one block and passed around in a register,
since many of the load/store addressing forms for instructions will already scale properly.
We've done this downstream by having them be Constants, but if there's a good way
of doing them with intrinsics we'd be fine with that too.
-Graham
You're right! Sorry, it's been a while. Step-vector is a simple iota;
the multiplier and offset come from the operations.
> I'm not so sure. iota is a generally useful operation and scaling it to
> various step values is also useful. It's used often for strided memory
> access, which would be done via gather/scatter in LLVM but generating a
> vector GEP via stepvector would be convenient and convey more semantic
> information than, say, loading a constant vector of indices to feed the
> GEP.
My point is that those patterns will be generated by C-level
intrinsics or IR optimisation passes (like vectorisers), so they have
a specific meaning in that context.
What I fear is that some other pass like CSE will find those patterns
and common them up at the top of the function/BB, and then the back-end
will lose sight of what they were and won't be able to generate the
step increment instruction in the end.
--
cheers,
--renato
+CC Sanjoy to confirm: I think intrinsics should be fine to add support for in SCEV.
>> Can you explain a bit about what the two integers represent? What's the
>> "unscaled" part for?
>
> 'Unscaled' just means 'exactly this many bits', whereas 'scaled' is 'this many bits
> multiplied by vscale'.
Right, but what do they represent? If I have <scalable 4 x i32> is "32"
"unscaled" and "4" "scaled?" Or is "128" "scaled?" Or something else?
I see you answered this below.
>> The name "getSizeExpressionInBits" makes me think that a Value
>> expression will be returned (something like a ConstantExpr that uses
>> vscale). I would be surprised to get a pair of integers back. Do
>> clients actually need constant integer values or would a ConstantExpr
>> suffice? We could add a ConstantVScale or something to make it work.
>
> I agree the name is not ideal and I'm open to suggestions -- I was thinking of the two
> integers representing the known-at-compile-time terms in an expression:
> '(scaled_bits * vscale) + unscaled_bits'.
>
> Assuming the pair is of the form (unscaled, scaled), then for a type with a size known at
> compile time like <4 x i32> the size would be (128, 0).
>
> For a scalable type like <scalable 4 x i32> the size would be (0, 128).
>
> For a struct with, say, a <scalable 32 x i8> and an i64, it would be (64, 256).
>
> When calculating the offset for memory addresses, you just need to multiply the scaled
> part by vscale and add the unscaled as is.
Ok, now I understand what you're getting at. A ConstantExpr would
encapsulate this computation. We already have "non-static-constant"
values for ConstantExpr like sizeof and offsetof. I would see
VScaleConstant in that same tradition. In your struct example,
getSizeExpressionInBits would return:
add(mul(256, vscale), 64)
Does that satisfy your needs?
Is there anything about vscale or a scalable vector that requires a
minimum bit width? For example, is this legal?
<scalable 1 x double>
I know it won't map to an SVE type. I'm simply curious because
traditionally Cray machines defined vectors in terms of
machine-dependent "maxvl" with an element type, so with the above vscale
would == maxvl. Not that we make any such things anymore. But maybe
someone else does?
>> If we went the ConstantExpr route and added ConstantExpr support to
>> ScalarEvolution, then SCEVs could be compared to do this size
>> comparison. We have code here that adds ConstantExpr support to
>> ScalarEvolution. We just didn't know if anyone else would be interested
>> in it since we added it solely for our Fortran frontend.
>
> We added a dedicated SCEV expression class for vscale instead; I suspect it works
> either way.
Yes, that's probably true. A vscale SCEV is less invasive.
> We've tried it as both an instruction and as a 'Constant', and both work fine with
> ScalarEvolution. I have not yet tried it with the intrinsic.
vscale as a Constant is interesting. It's a target-dependent Constant
like sizeof and offsetof. It doesn't have a statically known value and
maybe isn't "constant" across functions. So it's a strange kind of
constant.
Ultimately whatever is easier for LLVM to analyze in the long run is
best. Intrinsics often block optimization. I don't know whether vscale
would be "eaiser" as a Constant or an Instruction.
>> As above, we could add ConstantVScale and also ConstantStepVector (or
>> ConstantIota). They won't fold to compile-time values but the
>> expressions could be simplified. I haven't really thought through the
>> implications of this, just brainstorming ideas. What does your
>> downstream compiler require in terms of constant support? What kinds of
>> queries does it need to do?
>
> It makes things a little easier to pattern match (just looking for a constant to start
> instead of having to match multiple different forms of vscale or stepvector multiplied
> and/or added in each place you're looking for them).
Ok. Normalization could help with this but I certainly understand the
issue.
> The bigger reason we currently depend on them being constant is that code generation
> generally looks at a single block at a time, and there are several expressions using
> vscale that we don't want to be generated in one block and passed around in a register,
> since many of the load/store addressing forms for instructions will already scale properly.
This is kind of like X86 memop folding. If a load has multiple uses, it
won't be folded, on the theory that one load is better than many folded
loads. If a load has exactly one use, it will fold. There's explicit
predicate code in the X86 backend to enforce this requirement. I
suspect if the X86 backend tried to fold a single load into multiple
places, Bad Things would happen (needed SDNodes might disappear, etc.).
Codegen probably doesn't understand non-statically-constant
ConstantExprs, since sizeof or offsetof can be resolved by the target
before instruction selection.
> We've done this downstream by having them be Constants, but if there's a good way
> of doing them with intrinsics we'd be fine with that too.
If vscale/stepvector as Constants works, it seems fine to me.
-David
Got it. Graham hit this point as well. I took your suggestion as
"fusing" the iota/scale/offset together. I would still want to be able
to generate things like stepvector without scaling and adding offsets (I
suppose a scale of 1 and offset of 0 would be ok, but ugly). I don't
really care if we prevent CSE of such things.
-David
>>> The name "getSizeExpressionInBits" makes me think that a Value
>>> expression will be returned (something like a ConstantExpr that uses
>>> vscale). I would be surprised to get a pair of integers back. Do
>>> clients actually need constant integer values or would a ConstantExpr
>>> suffice? We could add a ConstantVScale or something to make it work.
>>
>> I agree the name is not ideal and I'm open to suggestions -- I was thinking of the two
>> integers representing the known-at-compile-time terms in an expression:
>> '(scaled_bits * vscale) + unscaled_bits'.
>>
>> Assuming the pair is of the form (unscaled, scaled), then for a type with a size known at
>> compile time like <4 x i32> the size would be (128, 0).
>>
>> For a scalable type like <scalable 4 x i32> the size would be (0, 128).
>>
>> For a struct with, say, a <scalable 32 x i8> and an i64, it would be (64, 256).
>>
>> When calculating the offset for memory addresses, you just need to multiply the scaled
>> part by vscale and add the unscaled as is.
>
> Ok, now I understand what you're getting at. A ConstantExpr would
> encapsulate this computation. We already have "non-static-constant"
> values for ConstantExpr like sizeof and offsetof. I would see
> VScaleConstant in that same tradition. In your struct example,
> getSizeExpressionInBits would return:
>
> add(mul(256, vscale), 64)
>
> Does that satisfy your needs?
Ah, I think the use of 'expression' in the name definitely confuses the issue then. This
isn't for expressing the size in IR, where you would indeed just multiply by vscale and
add any fixed-length size.
This is for the analysis code around the IR -- lots of code asks for the size of a Type in
bits to determine what it can do to a Value with that type. Some of them are specific to
scalar Types, like determining whether a sign/zero extend is needed. Others would
apply to vector types (including scalable vectors), such as checking whether two
Types have the exact same size so that a bitcast can be used instead of a more
expensive operation like copying to memory and back to convert.
See 'getTypeSizeInBits' and 'getTypeStoreSizeInBits' in DataLayout -- they're used
a few hundred times throughout the codebase, and to properly support scalable
types we'd need to return something that isn't just a single integer. Since most
backends won't support scalable vectors I suggested having a 'FixedSize' method
that just returns the single integer, but it may be better to just leave the existing method
as is and create a new method with 'Scalable' or 'VariableLength' or similar in the
name to make it more obvious in common code.
There are a few places where changes in IR may be needed; 'lifetime.start' markers in
IR embed size data, and we would need to either add a scalable term to that or
find some other way of indicating the size. That can be dealt with when we try to
add support for the SVE ACLE though.
>
> Is there anything about vscale or a scalable vector that requires a
> minimum bit width? For example, is this legal?
>
> <scalable 1 x double>
>
> I know it won't map to an SVE type. I'm simply curious because
> traditionally Cray machines defined vectors in terms of
> machine-dependent "maxvl" with an element type, so with the above vscale
> would == maxvl. Not that we make any such things anymore. But maybe
> someone else does?
That's legal in IR, yes, and we believe it should be usable to represent the vectors for
RISC-V's 'V' extension. The main problem there is that they have a dynamic vector
length within the loop so that they can perform the last iterations of a loop within vector
registers when there's less than a full register worth of data remaining. SVE uses
predication (masking) to achieve the same effect.
For the 'V' extension, vscale would indeed correspond to 'maxvl', and I'm hoping that a
'setvl' intrinsic that provides a predicate will avoid the need for modelling a change in
dynamic vector length -- reducing the vector length is effectively equivalent to an implied
predicate on all operations. This avoids needing to add a token operand to all existing
instructions that work on vector types.
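To sketch the idea (with the caveat that the 'setvl' intrinsic below is
hypothetical, and its name, signature, and the masked-load mangling are
illustrative only):

``
;; Hypothetical: request up to vscale * 4 elements, get back a predicate
;; covering only the lanes that will actually be processed this iteration.
%pred = call <scalable 4 x i1> @llvm.experimental.vector.setvl.nxv4i1(i64 %remaining)
%vals = call <scalable 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<scalable 4 x i32>* %ptr, i32 4, <scalable 4 x i1> %pred, <scalable 4 x i32> undef)
``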
-Graham
>> Ok, now I understand what you're getting at. A ConstantExpr would
>> encapsulate this computation. We already have "non-static-constant"
>> values for ConstantExpr like sizeof and offsetof. I would see
>> VScaleConstant in that same tradition. In your struct example,
>> getSizeExpressionInBits would return:
>>
>> add(mul(256, vscale), 64)
>>
>> Does that satisfy your needs?
>
> Ah, I think the use of 'expression' in the name definitely confuses the issue then. This
> isn't for expressing the size in IR, where you would indeed just multiply by vscale and
> add any fixed-length size.
Ok, thanks for clarifying. The use of "expression" is confusing.
> This is for the analysis code around the IR -- lots of code asks for the size of a Type in
> bits to determine what it can do to a Value with that type. Some of them are specific to
> scalar Types, like determining whether a sign/zero extend is needed. Others would
> apply to vector types (including scalable vectors), such as checking whether two
> Types have the exact same size so that a bitcast can be used instead of a more
> expensive operation like copying to memory and back to convert.
If this method returns two integers, how does LLVM interpret the
comparison? If the return value is { <unscaled>, <scaled> } then how
do, say { 1024, 0 } and { 0, 128 } compare? Doesn't it depend on the
vscale? They could be the same size or not, depending on the target
characteristics.
Are bitcasts between scaled types and non-scaled types disallowed? I
could certainly see an argument for disallowing it. I could argue that
for bitcasting purposes that the unscaled and scaled parts would have to
exactly match in order to do a legal bitcast. Is that the intent?
>> Is there anything about vscale or a scalable vector that requires a
>> minimum bit width? For example, is this legal?
>>
>> <scalable 1 x double>
>>
>> I know it won't map to an SVE type. I'm simply curious because
>> traditionally Cray machines defined vectors in terms of
>> machine-dependent "maxvl" with an element type, so with the above vscale
>> would == maxvl. Not that we make any such things anymore. But maybe
>> someone else does?
>
> That's legal in IR, yes, and we believe it should be usable to represent the vectors for
> RISC-V's 'V' extension. The main problem there is that they have a dynamic vector
> length within the loop so that they can perform the last iterations of a loop within vector
> registers when there's less than a full register worth of data remaining. SVE uses
> predication (masking) to achieve the same effect.
>
> For the 'V' extension, vscale would indeed correspond to 'maxvl', and I'm hoping that a
> 'setvl' intrinsic that provides a predicate will avoid the need for modelling a change in
> dynamic vector length -- reducing the vector length is effectively equivalent to an implied
> predicate on all operations. This avoids needing to add a token operand to all existing
> instructions that work on vector types.
Right. In that way the RISC-V method is very much like what the old
Cray machines did with the Vector Length register.
So in LLVM IR you would have "setvl" return a predicate and then apply
that predicate to operations using the current select method? How does
instruction selection map that back onto a simple setvl + unpredicated
vector instructions?
For conditional code both vector length and masking must be taken into
account. If "setvl" returns a predicate then that predicate would have
to be combined in some way with the conditional predicate (typically via
an AND operation in an IR that directly supports predicates). Since
LLVM IR doesn't have predicates _per_se_, would it turn into nested
selects or something? Untangling that in instruction selection seems
difficult but perhaps I'm missing something.
> On 6 Jun 2018, at 17:36, David A. Greene <d...@cray.com> wrote:
>
> Graham Hunter via llvm-dev <llvm...@lists.llvm.org> writes:
>
>>> Ok, now I understand what you're getting at. A ConstantExpr would
>>> encapsulate this computation. We already have "non-static-constant"
>>> values for ConstantExpr like sizeof and offsetof. I would see
>>> VScaleConstant in that same tradition. In your struct example,
>>> getSizeExpressionInBits would return:
>>>
>>> add(mul(256, vscale), 64)
>>>
>>> Does that satisfy your needs?
>>
>> Ah, I think the use of 'expression' in the name definitely confuses the issue then. This
>> isn't for expressing the size in IR, where you would indeed just multiply by vscale and
>> add any fixed-length size.
>
> Ok, thanks for clarifying. The use of "expression" is confusing.
>
>> This is for the analysis code around the IR -- lots of code asks for the size of a Type in
>> bits to determine what it can do to a Value with that type. Some of them are specific to
>> scalar Types, like determining whether a sign/zero extend is needed. Others would
>> apply to vector types (including scalable vectors), such as checking whether two
>> Types have the exact same size so that a bitcast can be used instead of a more
>> expensive operation like copying to memory and back to convert.
>
> If this method returns two integers, how does LLVM interpret the
> comparison? If the return value is { <unscaled>, <scaled> } then how
> do, say { 1024, 0 } and { 0, 128 } compare? Doesn't it depend on the
> vscale? They could be the same size or not, depending on the target
> characteristics.
I did have a paragraph on that in the RFC, but perhaps a list would be
a better format (assuming X, Y, etc. are non-zero):
{ X, 0 } <cmp> { Y, 0 }:     Normal unscaled comparison.
{ 0, X } <cmp> { 0, Y }:     Normal comparison within a function, or across
                             functions that inherit vector length. Cannot be
                             compared across non-inheriting functions.
{ X, 0 } > { 0, Y }:         Cannot return true.
{ X, 0 } = { 0, Y }:         Cannot return true.
{ X, 0 } < { 0, Y }:         Can return true.
{ Xu, Xs } <cmp> { Yu, Ys }: Gets complicated; need to subtract common
                             terms and try the above comparisons. It
                             may not be possible to get a good answer.
I don't know if we need a 'maybe' result for cases comparing scaled
vs. unscaled; I believe the gcc implementation of SVE allows for such
results, but that supports a generic polynomial length representation.
I think in code, we'd have an inline function to deal with the first case
and a likely-not-taken call to a separate function to handle all the
scalable cases.
> Are bitcasts between scaled types and non-scaled types disallowed? I
> could certainly see an argument for disallowing it. I could argue that
> for bitcasting purposes that the unscaled and scaled parts would have to
> exactly match in order to do a legal bitcast. Is that the intent?
I would propose disallowing bitcasts, but allowing extracting a subvector
if the minimum number of scaled bits matches the number of unscaled bits.
My idea is for the RISC-V backend to recognize when a setvl intrinsic has
been used, and replace the use of its value in AND operations with an
all-true value (with constant folding to remove unnecessary ANDs) then
replace any masked instructions (generally loads, stores, anything else
that might generate an exception or modify state that it shouldn't) with
target-specific nodes that understand the dynamic vlen.
This could be part of lowering, or maybe a separate IR pass, rather than ISel.
I *think* this will work, but if someone can come up with some IR where it
wouldn't work then please let me know (e.g. global-state-changing instructions
that could move out of blocks where one setvl predicate is used and into one
where another is used).
Unfortunately, I can't find a description of the instructions included in
the 'V' extension in the online manual (other than setvl or configuring
registers), so I can't tell if there's something I'm missing.
-Graham
Future Work
-----------
Since we cannot determine the exact size of a scalable vector, the
existing logic for alias detection won't work when multiple accesses
share a common base pointer with different offsets.
However, SVE's predication will mean that a dynamic 'safe' vector length
can be determined at runtime, so after initial support has been added we
can work on vectorizing loops using runtime predication to avoid aliasing
problems.
Alternatives Considered
-----------------------
Marking scalable vectors as unsized doesn't work well, as many parts of
llvm dealing with loads and stores assert that 'isSized()' returns true
and make use of the size when calculating offsets.
We have considered introducing multiple helper functions instead of
using direct size queries, but that doesn't cover all cases. It may
still be a good idea to introduce them to make the purpose in a given
case more obvious, e.g. 'isBitCastableTo(Type*,Type*)'.
``
%index.next = add i64 %index, mul (i64 %vscale64, i64 4)
;; <check and branch>
``
===========================
4. Generating Vector Values
===========================
For constant vector values, we cannot specify all the elements as we can for
fixed-length vectors; fortunately only a small number of easily synthesized
patterns are required for autovectorization. The `zeroinitializer` constant
can be used in the same manner as fixed-length vectors for a constant zero
splat. This can then be combined with `insertelement` and `shufflevector`
to create arbitrary value splats in the same manner as fixed-length vectors.
For constants consisting of a sequence of values, an experimental `stepvector`
intrinsic has been added to represent a simple constant of the form
`<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
start can be added, and changing the step requires multiplying by a splat.
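As an illustrative sketch (not part of the RFC text itself), a sequence with
start `%start` and step `%step` could be built from these pieces, using the
insertelement/shufflevector splat idiom described above:

``
%stepvec = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
;; splat the step and scale the sequence by it
%step.ins = insertelement <scalable 4 x i32> undef, i32 %step, i32 0
%step.splat = shufflevector <scalable 4 x i32> %step.ins, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
%scaled = mul <scalable 4 x i32> %stepvec, %step.splat
;; splat the start value and add it in, giving <start, start+step, ...>
%start.ins = insertelement <scalable 4 x i32> undef, i32 %start, i32 0
%start.splat = shufflevector <scalable 4 x i32> %start.ins, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
%seq = add <scalable 4 x i32> %scaled, %start.splat
``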
Glad to hear other people are thinking about RVV codegen!
On 7 June 2018 at 18:10, Graham Hunter <Graham...@arm.com> wrote:
>
> Hi,
>
> > On 6 Jun 2018, at 17:36, David A. Greene <d...@cray.com> wrote:
> >
> > Graham Hunter via llvm-dev <llvm...@lists.llvm.org> writes:
> >>> Is there anything about vscale or a scalable vector that requires a
> >>> minimum bit width? For example, is this legal?
> >>>
> >>> <scalable 1 x double>
> >>>
> >>> I know it won't map to an SVE type. I'm simply curious because
> >>> traditionally Cray machines defined vectors in terms of
> >>> machine-dependent "maxvl" with an element type, so with the above vscale
> >>> would == maxvl. Not that we make any such things anymore. But maybe
> >>> someone else does?
> >>
> >> That's legal in IR, yes, and we believe it should be usable to represent the vectors for
> >> RISC-V's 'V' extension. The main problem there is that they have a dynamic vector
> >> length within the loop so that they can perform the last iterations of a loop within vector
> >> registers when there's less than a full register worth of data remaining. SVE uses
> >> predication (masking) to achieve the same effect.
Yes, <scalable 1 x T> should be allowed in the IR type system (even <1
x T> is currently allowed, and unlike the scalable variant it's not
even useful), and such types would be the sole legal vector types in
the RISC-V backend.
>
> >> For the 'V' extension, vscale would indeed correspond to 'maxvl', and I'm hoping that a
> >> 'setvl' intrinsic that provides a predicate will avoid the need for modelling a change in
> >> dynamic vector length -- reducing the vector length is effectively equivalent to an implied
> >> predicate on all operations. This avoids needing to add a token operand to all existing
> >> instructions that work on vector types.
Yes, vscale would be the *maximum* vector length (how many elements
fit into each register), not the *active* vector length (how many
elements are operated on in the current loop iteration).
This has nothing to do with tokens, though. The tokens I proposed were
to encode the fact that even 'maxvl' varies on a function by function
basis. This RFC approaches the same issue differently, but it's still
there -- in terms of this RFC, operations on scalable vectors depend
on `vscale`, which is "not necessarily [constant] across functions".
That implies, for example, that an unmasked <scalable 4 x i32> load or
store (which accesses vscale * 16 bytes of memory) can't generally be
moved from one function to another unless it's somehow ensured that
both functions will have the same vscale. For that matter, the same
restriction applies to calls to `vscale` itself.
The evolution of the active vector length is a separate problem and
one that doesn't really impact the IR type system (nor one that can
easily be solved by tokens).
> >
> > Right. In that way the RISC-V method is very much like what the old
> > Cray machines did with the Vector Length register.
> >
> > So in LLVM IR you would have "setvl" return a predicate and then apply
> > that predicate to operations using the current select method? How does
> > instruction selection map that back onto a simple setvl + unpredicated
> > vector instructions?
> >
> > For conditional code both vector length and masking must be taken into
> > account. If "setvl" returns a predicate then that predicate would have
> > to be combined in some way with the conditional predicate (typically via
> > an AND operation in an IR that directly supports predicates). Since
> > LLVM IR doesn't have predicates _per_se_, would it turn into nested
> > selects or something? Untangling that in instruction selection seems
> > difficult but perhaps I'm missing something.
>
> My idea is for the RISC-V backend to recognize when a setvl intrinsic has
> been used, and replace the use of its value in AND operations with an
> all-true value (with constant folding to remove unnecessary ANDs) then
> replace any masked instructions (generally loads, stores, anything else
> that might generate an exception or modify state that it shouldn't) with
> target-specific nodes that understand the dynamic vlen.
I am not quite so sure about turning the active vector length into
just another mask. It's true that the effects on arithmetic, load,
stores, etc. are the same as if everything executed under a mask like
<1, 1, ..., 1, 0, 0, ..., 0> with the number of ones equal to the
active vector length. However, actually materializing the masks in the
IR means the RISCV backend has to reverse-engineer what it must do
with the vl register for any given (masked or unmasked) vector
operation. The stakes for that are rather high, because (1) it applies
to pretty much every single vector operation ever, and (2) when it
fails, the codegen impact is incredibly bad.
(1) The vl register affects not only loads, stores and other
operations with side effects, but all vector instructions, even pure
computation (and reg-reg moves, but that's not relevant for IR). An
integer vector add, for example, only computes src1[i] + src2[i] for 0
<= i < vl and the remaining elements of the destination register (from
vl upwards) are zeroed. This is very natural for strip-mined loops
(you'll never need those elements), but it means an unmasked IR level
vector add is a really bad fit for the RISC-V 'vadd' instruction.
Unless the backend can prove that only the first vl elements of the
result will ever be observed, it will have to temporarily set vl to
MAXVL so that the RVV instruction will actually compute the "full"
result. Establishing that seems like it will require at least some
custom data flow analysis, and it's unclear how robust it can be made.
(2) Failing to properly use vl for some vector operation is worse than
e.g. materializing a mask you wouldn't otherwise need. It requires
that too (if the operation is masked), but more importantly it needs
to save vl, change it to MAXVL, and finally restore the old value.
That's quite expensive: besides the ca. 3 extra instructions and the
scratch GPR required, this save/restore dance can have other nasty
effects depending on uarch style. I'd have to consult the hardware
people to be sure, but from my understanding risks include pipeline
stalls and expensive roundtrips between decoupled vector and scalar
units.
To be clear: I have not yet experimented with any of this, so I'm not
saying this is a deal breaker. A well-engineered "demanded elements"
analysis may very well be good enough in practice. But since we
broached the subject, I wanted to mention this challenge. (I'm
currently side stepping it by not using built-in vector instructions
but instead intrinsics that treat vl as magic extra state.)
> This could be part of lowering, or maybe a separate IR pass, rather than ISel.
> I *think* this will work, but if someone can come up with some IR where it
> wouldn't work then please let me know (e.g. global-state-changing instructions
> that could move out of blocks where one setvl predicate is used and into one
> where another is used).
There are some operations that use vl for things other than simple
masking. To give one example, "speculative" loads (which silence
some exceptions to safely permit vectorization of some loops with
data-dependent exits, such as strlen) can shrink vl as a side effect.
I believe this can be handled by modelling all relevant operations
(including setvl itself) as intrinsics that have side effects or
read/write inaccessible memory. However, if you want to have the
"current" vl (or equivalent mask) around as SSA value, you need to
"reload" it after any operation that updates vl. That seems like it
could get a bit complex if you want to do it efficiently (in the
limit, it seems equivalent to SSA construction).
>
>
> Unfortunately, I can't find a description of the instructions included in
> the 'V' extension in the online manual (other than setvl or configuring
> registers), so I can't tell if there's something I'm missing.
I'm very sorry for that, I know how frustrating it can be. I hope the
above gives a clearer picture of the constraints involved. Exact
instructions, let alone encodings, are still in flux as Bruce said.
Cheers,
Robin
Aside: I was curious, so I grepped and found that this specific predicate
already exists under the name CastInst::isBitCastable.
%index.next = add i64 %index, mul (i64 %vscale64, i64 4)
Just to check, is the nesting `add i64 %index, mul (i64 %vscale64, i64 4)` a
pseudo-IR shorthand or an artifact from when vscale was proposed as a constant
expression or something? I would have expected:
```
%vscale64 = call i64 @llvm.experimental.vector.vscale.64()
%vscale64.x4 = mul i64 %vscale64, 4
%index.next = add i64 %index, %vscale64.x4
```
How does this work in practice? Let's say I populate a vector with a
splat. Presumably, that gives me a "full length" vector. Let's say the
type is <scalable 2 x double>. How do I split the vector and get
something half the width? What is its type? How do I split it again
and get something a quarter of the width? What is its type? How do I
use half- and quarter-width vectors? Must I resort to predication?
This split question comes into play for backward compatibility. How
would one take a scalable vector and pass it into a NEON library? It is
likely that some math functions, for example, will not have SVE versions
available.
Is there a way to represent "double width" vectors? In mixed-data-size
loops it is sometimes convenient to reason about double-width vectors
rather than having to split them (to legalize for the target
architecture) and keep track of their parts early on. I guess the more
fundamental question is about how such loops should be handled.
What do insertelement and extractelement mean for scalable vectors?
Your examples showed insertelement at index zero. How would I, say,
insertelement into the upper half of the vector? Or any arbitrary
place? Does insertelement at index 10 of a <scalable 2 x double> work,
assuming vscale is large enough? It is sometimes useful to constitute a
vector out of various scalar pieces and insertelement is a convenient
way to do it.
Thanks for your patience. :)
-David
responses inline.
-Graham
Agreed.
I can see where the concern comes from; we had problems reconstructing
semantics when experimenting with search loop vectorization and often
had to fall back on default (slow) generic cases.
My main reason for proposing this was to try and ensure that the size was
consistent from the point of view of the query functions we were discussing
in the main thread. If you're fine with all size queries assuming maxvl (so
things like stack slots would always use the current configured maximum
length), then I don't think there's a problem with dropping this part of the
proposal and letting you find a better representation of active length.
> (1) The vl register affects not only loads, stores and other
> operations with side effects, but all vector instructions, even pure
> computation (and reg-reg moves, but that's not relevant for IR). An
> integer vector add, for example, only computes src1[i] + src2[i] for 0
> <= i < vl and the remaining elements of the destination register (from
> vl upwards) are zeroed. This is very natural for strip-mined loops
> (you'll never need those elements), but it means an unmasked IR level
> vector add is a really bad fit for the RISC-V 'vadd' instruction.
> Unless the backend can prove that only the first vl elements of the
> result will ever be observed, it will have to temporarily set vl to
> MAXVL so that the RVV instruction will actually compute the "full"
> result. Establishing that seems like it will require at least some
> custom data flow analysis, and it's unclear how robust it can be made.
>
> (2) Failing to properly use vl for some vector operation is worse than
> e.g. materializing a mask you wouldn't otherwise need. It requires
> that too (if the operation is masked), but more importantly it needs
> to save vl, change it to MAXVL, and finally restore the old value.
> That's quite expensive: besides the ca. 3 extra instructions and the
> scratch GPR required, this save/restore dance can have other nasty
> effects depending on uarch style. I'd have to consult the hardware
> people to be sure, but from my understanding risks include pipeline
> stalls and expensive roundtrips between decoupled vector and scalar
> units.
Ah, I hadn't appreciated you might need to save/restore the VL like that.
I'd worked through a couple of small example loops and it seemed fine,
but hadn't looked at more complicated cases.
> To be clear: I have not yet experimented with any of this, so I'm not
> saying this is a deal breaker. A well-engineered "demanded elements"
> analysis may very well be good enough in practice. But since we
> broached the subject, I wanted to mention this challenge. (I'm
> currently side stepping it by not using built-in vector instructions
> but instead intrinsics that treat vl as magic extra state.)
>
>> This could be part of lowering, or maybe a separate IR pass, rather than ISel.
>> I *think* this will work, but if someone can come up with some IR where it
>> wouldn't work then please let me know (e.g. global-state-changing instructions
>> that could move out of blocks where one setvl predicate is used and into one
>> where another is used).
>
> There are some operations that use vl for things other than simple
> masking. To give one example, "speculative" loads (which silence
> some exceptions to safely permit vectorization of some loops with
> data-dependent exits, such as strlen) can shrink vl as a side effect.
> I believe this can be handled by modelling all relevant operations
> (including setvl itself) as intrinsics that have side effects or
> read/write inaccessible memory. However, if you want to have the
> "current" vl (or equivalent mask) around as SSA value, you need to
> "reload" it after any operation that updates vl. That seems like it
> could get a bit complex if you want to do it efficiently (in the
> limit, it seems equivalent to SSA construction).
Ok; the fact that there are more instructions that can change vl and that you might
need to reload it is useful to know.
SVE uses predication to achieve the same via the first-faulting/no-faulting
load instructions and the ffr register.
I think SVE having 16 predicate registers (vs. 8 for RVV and AVX-512) has led
to us using the feature quite widely with our own experiments; I'll try looking for
non-predicated solutions as well when we try to expand scalable vectorization
capabilities.
>> Unfortunately, I can't find a description of the instructions included in
>> the 'V' extension in the online manual (other than setvl or configuring
>> registers), so I can't tell if there's something I'm missing.
>
> I'm very sorry for that, I know how frustrating it can be. I hope the
> above gives a clearer picture of the constraints involved. Exact
> instructions, let alone encodings, are still in flux as Bruce said.
Yes, the above information is definitely useful, even if I don't have a complete
picture yet. Thanks.
> This split question comes into play for backward compatibility. How
> would one take a scalable vector and pass it into a NEON library? It is
> likely that some math functions, for example, will not have SVE versions
> available.
> Is there a way to represent "double width" vectors? In mixed-data-size
> loops it is sometimes convenient to reason about double-width vectors
> rather than having to split them (to legalize for the target
> architecture) and keep track of their parts early on. I guess the more
> fundamental question is about how such loops should be handled.
> What do insertelement and extractelement mean for scalable vectors?
> Your examples showed insertelement at index zero. How would I, say,
> insertelement into the upper half of the vector? Or any arbitrary
> place? Does insertelement at index 10 of a <scalable 2 x double> work,
> assuming vscale is large enough? It is sometimes useful to constitute a
> vector out of various scalar pieces and insertelement is a convenient
> way to do it.
> To split a <scalable 2 x double> in half, you'd use a shufflevector in much the
> same way you would for fixed-length vector types.
>
> e.g.
> ``
> %sv = call <scalable 1 x i32> @llvm.experimental.vector.stepvector.nxv1i32()
> %halfvec = shufflevector <scalable 2 x double> %fullvec, <scalable 2 x double> undef, <scalable 1 x i32> %sv
> ``
>
> You can't split it any further than a <scalable 1 x <ty>>, since there may only be
> one element in the actual hardware vector at runtime. The same restriction applies to
> a <1 x <ty>>. This is why we have a minimum number of lanes in addition to the
> scalable flag so that we can concatenate and split vectors, since SVE registers have
> the same number of bytes and will therefore decrease the number of elements per
> register as the element type increases in size.
Right. So let's say the hardware width is 1024. If I have a
<scalable 2 x double> it is 1024 bits. If I split it, it's a
<scalable 1 x double> (right?) with 512 bits. There is no
way to create a 256-bit vector.
It's probably the case that for pure VL-agnostic code, this is ok. Our
experience with the X1/X2, which also supported VL-agnostic code, was
that at times compiler awareness of the hardware MAXVL allowed us to
generate better code, better enough that we "cheated" regularly. The
hardware guys loved us. :)
I'm not at all saying that's a good idea for SVE, just recounting
experience and wondering what the implications would be for SVE and more
generally, LLVM IR. Would the RISC-V people be interested in a
non-VL-agnostic compilation mode?
> If you want to extract something other than the first part of a vector, you need to add
> offsets based on a calculation from vscale (e.g. adding vscale * (min_elts/2) allows you
> to reach the high half of a larger register).
Sure, that makes sense.
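For the record, I'd expect the high-half extract to look something like this
sketch (borrowing the intrinsic names and the splat idiom used elsewhere in
the thread):

``
;; High half of a <scalable 4 x i32> %vec: indices vscale*2 .. vscale*4-1.
%vscale64 = call i64 @llvm.experimental.vector.vscale.64()
%vscale32 = trunc i64 %vscale64 to i32
%half = mul i32 %vscale32, 2                ;; vscale * (min_elts / 2)
%half.ins = insertelement <scalable 2 x i32> undef, i32 %half, i32 0
%half.splat = shufflevector <scalable 2 x i32> %half.ins, <scalable 2 x i32> undef, <scalable 2 x i32> zeroinitializer
%sv = call <scalable 2 x i32> @llvm.experimental.vector.stepvector.nxv2i32()
%idx = add <scalable 2 x i32> %sv, %half.splat
%hi = shufflevector <scalable 4 x i32> %vec, <scalable 4 x i32> undef, <scalable 2 x i32> %idx
``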
> For floating point types, we do use predication to allow the use of otherwise illegal
> types like <scalable 1 x double>, but that's limited to the AArch64 backend and does
> not need to be represented in IR.
This is something done during or after isel?
> This split question comes into play for backward compatibility. How
> would one take a scalable vector and pass it into a NEON library? It is
> likely that some math functions, for example, will not have SVE versions
> available.
>
> I don't believe we intend to support this, but instead provide libraries with
> SVE versions of functions instead. The problem is that you don't know how
> many NEON-size subvectors exist within an SVE vector at compile time.
> While you could create a loop with 'vscale' number of iterations and try to
> extract those subvectors, I suspect the IR would end up being quite messy
> and potentially hard to recognize and optimize.
Yes, that was my concern. The vscale loop is what I came up with as
well. It is technically feasible, but ugly. I'm a little concerned
about what vendors will do with this. Not everyone is going to have the
resources to convert all of their NEON libraries, certainly not all at
once.
Just something to think about.
> The other problem with calling non-SVE functions is that any live SVE
> registers must be spilled to the stack and filled after the call, which is
> likely to be quite expensive.
Understood.
> Is there a way to represent "double width" vectors? In mixed-data-size
> loops it is sometimes convenient to reason about double-width vectors
> rather than having to split them (to legalize for the target
> architecture) and keep track of their parts early on. I guess the more
> fundamental question is about how such loops should be handled.
>
> For SVE, it's fine to generate IR with types that are 'double size' or larger,
> and just leave it to legalization at SelectionDAG level to split into multiple
> legal size registers.
Ok, great. If something is larger than "double size," how can it be
split, given the "split once" restriction above?
> What do insertelement and extractelement mean for scalable vectors?
> Your examples showed insertelement at index zero. How would I, say,
> insertelement into the upper half of the vector? Or any arbitrary
> place? Does insertelement at index 10 of a <scalable 2 x double> work,
> assuming vscale is large enough? It is sometimes useful to constitute a
> vector out of various scalar pieces and insertelement is a convenient
> way to do it.
>
> So you can insert or extract any element known to exist (in other words, it's
> within the minimum number of elements). Using a constant index outside
> that range will fail, as we won't know whether the element actually exists
> until we're running on a cpu.
In that case to "insert" into the higher elements one would insert into
the known range and then shufflevector, I suppose. Ok.
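For an arbitrary lane, I suppose a compare-and-select against stepvector
would do as well; a sketch with the thread's experimental intrinsic names,
placing scalar %val at runtime lane %idx of %vec:

``
%step = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
%idx.ins = insertelement <scalable 4 x i32> undef, i32 %idx, i32 0
%idx.splat = shufflevector <scalable 4 x i32> %idx.ins, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
%lane.eq = icmp eq <scalable 4 x i32> %step, %idx.splat   ;; true only at lane %idx
%val.ins = insertelement <scalable 4 x i32> undef, i32 %val, i32 0
%val.splat = shufflevector <scalable 4 x i32> %val.ins, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
%res = select <scalable 4 x i1> %lane.eq, <scalable 4 x i32> %val.splat, <scalable 4 x i32> %vec
``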
> Our downstream compiler supports inserting and extracting arbitrary elements
> from calculated offsets as part of our experiment on search loop vectorization,
> but that generates the offsets based on a count of true bits within partitioned
> predicates. I was planning on proposing new intrinsics to improve predicate use
> within llvm at a later date.
Ok, I look forward to seeing them!
> We have been able to implement various types of known shuffles (like the high/low
> half extract, zip, concatenation, etc) with vscale, stepvector, and the existing IR
> instructions.
Yes, I can certainly see how all of those would be implemented. The
main case I'm thinking about is something that is "scalarized" within a
vector loop context. I'm wondering about the best way to reconstitute a
vector from scalar pieces (or vice-versa).
Thanks for the explanations!
Oh yeah, sure. The size of a vector is in terms of vscale, even from
the perspective of the RISC-V ISA. The active vector length just
modifies *operations* to produce partially-zero results and access
less memory. It's also perfectly reasonable to "mix" vectors computed
with different active vector lengths. For example, it might be useful
to create a loop-invariant vector in full before the loop and keep
using it in the loop body with varying vl. Add to that IR design
considerations and it's clear-cut to me that scalable vectors should
have *one* size and all operations should produce "full-width" (but
maybe partially-zeroed) vector results.
Partly for that reason (and also out of general hesitation to commit
to *any* model at this stage) I am lukewarm to the "active_vscale"
concept you outlined in a later email.
So, agreed, let's completely leave that part out for now. I'll come
back with specific proposals when the RVV work has matured and we have
more experience with this issue.
Responses inline.
-Graham
>> You can't split it any further than a <scalable 1 x <ty>>, since there may only be
>> one element in the actual hardware vector at runtime. The same restriction applies to
>> a <1 x <ty>>. This is why we have a minimum number of lanes in addition to the
>> scalable flag so that we can concatenate and split vectors, since SVE registers have
>> the same number of bytes and will therefore decrease the number of elements per
>> register as the element type increases in size.
>
> Right. So let's say the hardware width is 1024. If I have a
> <scalable 2 x double> it is 1024 bits. If I split it, it's a
> <scalable 1 x double> (right?) with 512 bits. There is no
> way to create a 256-bit vector.
Correct; you'd need to use predication to force smaller vectors.
> It's probably the case that for pure VL-agnostic code, this is ok. Our
> experience with the X1/X2, which also supported VL-agnostic code, was
> that at times compiler awareness of the hardware MAXVL allowed us to
> generate better code, better enough that we "cheated" regularly. The
> hardware guys loved us. :)
I think if people want to take advantage of knowing the hardware vector
length for their particular machine, the existing fixed-length types will
suffice. We haven't implemented that for SVE at present, though -- a lot
of work would be required.
> I'm not at all saying that's a good idea for SVE, just recounting
> experience and wondering what the implications would be for SVE and more
> generally, LLVM IR. Would the RISC-V people be interested in a
> non-VL-agnostic compilation mode?
>
>> If you want to extract something other than the first part of a vector, you need to add
>> offsets based on a calculation from vscale (e.g. adding vscale * (min_elts/2) allows you
>> to reach the high half of a larger register).
>
> Sure, that makes sense.
>
>> For floating point types, we do use predication to allow the use of otherwise illegal
>> types like <scalable 1 x double>, but that's limited to the AArch64 backend and does
>> not need to be represented in IR.
>
> This is something done during or after isel?
A combination of custom lowering and isel.
>> This split question comes into play for backward compatibility. How
>> would one take a scalable vector and pass it into a NEON library? It is
>> likely that some math functions, for example, will not have SVE versions
>> available.
>>
>> I don't believe we intend to support this, but instead provide libraries with
>> SVE versions of functions instead. The problem is that you don't know how
>> many NEON-size subvectors exist within an SVE vector at compile time.
>> While you could create a loop with 'vscale' number of iterations and try to
>> extract those subvectors, I suspect the IR would end up being quite messy
>> and potentially hard to recognize and optimize.
>
> Yes, that was my concern. The vscale loop is what I came up with as
> well. It is technically feasible, but ugly. I'm a little concerned
> about what vendors will do with this. Not everyone is going to have the
> resources to convert all of their NEON libraries, certainly not all at
> once.
>
> Just something to think about.
Understood; it may be that using predication (or a kernel call to limit
the maximum length to 128b) would suffice in those cases -- you can't take
advantage of the extra vector length, but since the bottom 128b of each
vector register aliases with the NEON registers, you could just call existing
NEON code without needing to do anything special. I guess it's a
trade-off as to whether you want to reuse existing tuned code or take
advantage of more powerful hardware. We haven't implemented anything like
this though.
>> The other problem with calling non-SVE functions is that any live SVE
>> registers must be spilled to the stack and filled after the call, which is
>> likely to be quite expensive.
>
> Understood.
>
>> Is there a way to represent "double width" vectors? In mixed-data-size
>> loops it is sometimes convenient to reason about double-width vectors
>> rather than having to split them (to legalize for the target
>> architecture) and keep track of their parts early on. I guess the more
>> fundamental question is about how such loops should be handled.
>>
>> For SVE, it's fine to generate IR with types that are 'double size' or larger,
>> and just leave it to legalization at SelectionDAG level to split into multiple
>> legal size registers.
>
> Ok, great. If something is larger than "double size," how can it be
> split, given the "split once" restriction above?
There's not a 'split once' restriction, but 'n' in <scalable n x <ty>> has to
be an integer. A "double width" vector would just double 'n', so
<scalable 4 x double> would be such a type. That's not a legal type for SVE,
so it would be split into 2 <scalable 2 x double> values during legalization
(or nxv2f64, since it's at SDAG at that point).
If actual scalarization is required, SVE needs to use predicates; there are
instructions that allow moving a single predicate bit forward one element at
a time, and you can then extract the element, perform whatever operation is
needed, and insert the result back into a register at the same location.
There's obviously more overhead than just moving directly from/to a fixed
number of lanes, but it works. Hopefully we won't need to use scalarization
often.
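As an illustration (not how SVE would lower it, which uses predicates as
described above), a generic scalarization loop over the runtime element count
can be written with the proposed intrinsics; %in, %div, and the block names
are placeholders:
  entry:
    %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
    %n = mul i64 %vscale64, 2    ;; element count of a <scalable 2 x double>
    br label %loop
  loop:
    %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
    %vec = phi <scalable 2 x double> [ %in, %entry ], [ %vec.next, %loop ]
    %elt = extractelement <scalable 2 x double> %vec, i64 %i
    %res = fdiv double %elt, %div
    %vec.next = insertelement <scalable 2 x double> %vec, double %res, i64 %i
    %i.next = add i64 %i, 1
    %done = icmp eq i64 %i.next, %n
    br i1 %done, label %exit, label %loop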
I've updated the RFC slightly based on the discussion within the thread, reposted below. Let me know if I've missed anything or if more clarification is needed.
Thanks,
-Graham
==========
Background
==========
========
Contents
========
5. An explanation of splitting/concatenating scalable vectors.
6. A brief note on code generation of these new operations for AArch64.
7. An example of C code and matching IR using the proposed extensions.
8. A list of patches demonstrating the changes required to emit SVE instructions
========
1. Types
========
IR Textual Form
---------------
``<scalable <n> x <type>>``
``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of
bytes.
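As a minimal example, operations on scalable vectors use the same syntax as
fixed-length vectors:
``
%sum = add <scalable 4 x i32> %a, %b
``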
IR Bitcode Form
---------------
Alternatives Considered
-----------------------
This is a proposal for how to deal with querying the size of scalable types for
analysis of IR. While it has not been implemented in full, the general approach
works well for calculating offsets into structures with scalable types in a
modified version of ComputeValueVTs in our downstream compiler.
For current IR types that have a known size, all query functions return a single
integer constant. For scalable types a second integer is needed to indicate the
number of bytes/bits which need to be scaled by the runtime multiple to obtain
the actual length.
For primitive types, `getPrimitiveSizeInBits()` will function as it does today,
except that it will no longer return a size for vector types (it will return 0,
as it does for other derived types). The majority of calls to this function are
already for scalar rather than vector types.
For derived types, a function `getScalableSizePairInBits()` will be added, which
returns a pair of integers (one to indicate unscaled bits, the other for bits
that need to be scaled by the runtime multiple). For backends that do not need
to deal with scalable types the existing methods will suffice, but a debug-only
assert will be added to them to ensure they aren't used on scalable types.
Similar functionality will be added to DataLayout.
Comparisons between sizes will use the following methods, assuming that X and
Y are non-zero integers and the form is of { unscaled, scaled }.
{ X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.
{ 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across
functions that inherit vector length. Cannot be
compared across non-inheriting functions.
{ X, 0 } > { 0, Y }: Cannot return true.
{ X, 0 } = { 0, Y }: Cannot return true.
{ X, 0 } < { 0, Y }: Can return true.
{ Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common
terms and try the above comparisons; it
may not be possible to get a good answer.
It's worth noting that we don't expect the last case (mixed scaled and
unscaled sizes) to occur. Richard Sandiford's proposed C extensions
(http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly
prohibit mixing fixed-size types into sizeless structs.
I don't know if we need a 'maybe' or 'unknown' result for cases comparing scaled
vs. unscaled; I believe the gcc implementation of SVE allows for such
results, but that supports a generic polynomial length representation.
My current intention is to rely on the functions that clone or copy values to
check whether they are being used to copy scalable vectors across function
boundaries without the inherit-vlen attribute, and to raise an error there,
instead of requiring that the Function a given type size came from be passed
in for each comparison. If there's a strong preference for moving the check
into the size comparison functions, let me know; I will start work on patches
for this later in the year if there are no major problems with the idea.
Future Work
-----------
Since we cannot determine the exact size of a scalable vector, the
existing logic for alias detection won't work when multiple accesses
share a common base pointer with different offsets.
However, SVE's predication will mean that a dynamic 'safe' vector length
can be determined at runtime, so after initial support has been added we
can work on vectorizing loops using runtime predication to avoid aliasing
problems.
Alternatives Considered
-----------------------
Marking scalable vectors as unsized doesn't work well, as many parts of
llvm dealing with loads and stores assert that 'isSized()' returns true
and make use of the size when calculating offsets.
We have considered introducing multiple helper functions instead of
using direct size queries, but that doesn't cover all cases. It may
still be a good idea to introduce them to make the purpose in a given
case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'.
Fixed-Length Code
-----------------
===========================================
5. Splitting and Combining Scalable Vectors
===========================================
Splitting and combining scalable vectors in IR is done in the same manner as
for fixed-length vectors, but with a non-constant mask for the shufflevector.
The following is an example of splitting a <scalable 4 x double> into two
separate <scalable 2 x double> values.
``
%vscale64 = call i64 @llvm.experimental.vector.vscale.64()
;; Stepvector generates the element ids for the first subvector
%sv1 = call <scalable 2 x i64> @llvm.experimental.vector.stepvector.nxv2i64()
;; Add vscale * 2 to get the starting element for the second subvector
%ec = mul i64 %vscale64, 2
%ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0
%ec.splat = shufflevector <scalable 2 x i64> %ec.ins, <scalable 2 x i64> undef, <scalable 2 x i32> zeroinitializer
%sv2 = add <scalable 2 x i64> %ec.splat, %sv1
;; Perform the extracts
%res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv1
%res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv2
``
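Concatenating two <scalable 2 x double> values into a <scalable 4 x double>
can be sketched along the same lines, assuming a stepvector overload for the
wider index type: a stepvector mask of the result width selects the elements
of both sources in order.
``
%mask = call <scalable 4 x i64> @llvm.experimental.vector.stepvector.nxv4i64()
%cat = shufflevector <scalable 2 x double> %lo, <scalable 2 x double> %hi, <scalable 4 x i64> %mask
``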
==================
6. Code Generation
==================
IR splats will be converted to an experimental splatvector intrinsic in
SelectionDAGBuilder.
All three intrinsics are custom lowered and legalized in the AArch64 backend.
Two new AArch64ISD nodes have been added to represent the same concepts
at the SelectionDAG level, while splatvector maps onto the existing
AArch64ISD::DUP.
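For reference, the IR splat pattern mentioned above is the usual
insertelement followed by a zero-mask shufflevector, e.g.:
``
%ins = insertelement <scalable 2 x i64> undef, i64 %x, i32 0
%splat = shufflevector <scalable 2 x i64> %ins, <scalable 2 x i64> undef, <scalable 2 x i32> zeroinitializer
``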
GlobalISel
----------
Since GlobalISel was enabled by default on AArch64, it was necessary to add
scalable vector support to the LowLevelType implementation. A single bit was
added to the raw_data representation for vectors and vectors of pointers.
In addition, types that only exist in destination patterns are now included in
the enumeration of available types for generated code. While this may not be
necessary in future, generating an all-true 'ptrue' value was necessary to
convert a predicated instruction into an unpredicated one.
==========
7. Example
==========
C Code
------
==========
8. Patches
==========
Hi,
I am the main author of RV, the Region Vectorizer
(github.com/cdl-saarland/rv). I want to share our standpoint as
potential users of the proposed vector-length agnostic IR (RISC-V,
ARM SVE).
-- support for `llvm.experimental.vector.reduce.*` intrinsics --
RV relies heavily on predicate reductions (`or` and `and` reduction) to tame divergent loops and provide a vector-length agnostic programming model on LLVM IR. I'd really like to see these adopted early on in the new VLA backends so we can fully support these targets from the start. Without these generic intrinsics, we would either need to emit target specific ones or go through the painful process of VLA-style reduction trees with loops or the like.
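As a sketch, such a predicate reduction would look like the following,
assuming the experimental reduction intrinsics gain a scalable-vector
overload (the exact name mangling here is illustrative):
  ;; true if any lane of the predicate is set
  %any = call i1 @llvm.experimental.vector.reduce.or.nxv4i1(<scalable 4 x i1> %mask)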
I really like the idea of the `inherits_vlen` attribute. Absence
of this attribute in a callee means we can safely stop tracking
the vector length across the call boundary.
However, I think there are some issues with the `vlen token`
approach.
* Why do you need an explicit vlen token if there is a 1 : 1-0
correspondence between functions and vlen tokens?
* My main concern is that you are navigating towards a local
optimum here. All is well as long as there is only one vector
length per function. However, if the architecture supports
changing the vector length at any point but you explicitly forbid
it, programmers will complain, well, I will for one ;-) Once you
give in to that demand you are facing the situation that multiple
vector length tokens are live within the same function. This means
you have to stop transformations from mixing vector operations
with different vector lengths: these would otherwise incur an
expensive state change at every vlen transition. However, there is
no natural way to express that two SSA values (vlen tokens) must
not be live at the same program point.
There are some operations that use vl for things other than simple masking. To give one example, "speculative" loads (which silence some exceptions to safely permit vectorization of some loops with data-dependent exits, such as strlen) can shrink vl as a side effect. I believe this can be handled by modelling all relevant operations (including setvl itself) as intrinsics that have side effects or read/write inaccessible memory. However, if you want to have the "current" vl (or equivalent mask) around as SSA value, you need to "reload" it after any operation that updates vl. That seems like it could get a bit complex if you want to do it efficiently (in the limit, it seems equivalent to SSA construction).
By relying on memory dependence, this also implies that
arithmetic operations can be re-ordered freely as long as
vlen_state does not change between them (SLP, "loop mix (CGO16)",
..).
Regarding function calls, if the callee does not have the 'inherits_vlen' attribute, the target can use a default value at function entry (max width or "undef"). Otherwise, the vector length needs to be communicated from caller to callee. However, the `vlen_state` variable already achieves that for a first implementation.
Last but not least, thank you all for working on this! I am
really looking forward to playing around with vla architectures in
LLVM.
Regards,
Simon
--
Simon Moll
Researcher / PhD Student
Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31
Tel. +49 (0)681 302-57521 : mo...@cs.uni-saarland.de
Fax. +49 (0)681 302-3065 : http://compilers.cs.uni-saarland.de/people/moll
Replies inline.
-Graham
> I am the main author of RV, the Region Vectorizer (github.com/cdl-saarland/rv). I want to share our standpoint as potential users of the proposed vector-length agnostic IR (RISC-V, ARM SVE).
> -- support for `llvm.experimental.vector.reduce.*` intrinsics --
> RV relies heavily on predicate reductions (`or` and `and` reduction) to tame divergent loops and provide a vector-length agnostic programming model on LLVM IR. I'd really like to see these adopted early on in the new VLA backends so we can fully support these targets from the start. Without these generic intrinsics, we would either need to emit target specific ones or go through the painful process of VLA-style reduction trees with loops or the like.
The vector reduction intrinsics were originally created to support SVE in order to avoid loops, so we'll definitely be using them.
> -- setting the vector length (MVL) --
> I really like the idea of the `inherits_vlen` attribute. Absence of this attribute in a callee means we can safely stop tracking the vector length across the call boundary.
> However, I think there are some issues with the `vlen token` approach.
> * Why do you need an explicit vlen token if there is a 1 : 1-0 correspondence between functions and vlen tokens?
I think there's a bit of a mix-up here... my proposal doesn't feature tokens. Robin's proposal earlier in the year did, but I think we've reached a consensus that they aren't necessary.
We do need to decide where to place the appropriate checks for which function an instruction is from before allowing copying:
1. Solely within the passes that perform cross-function optimizations. Light-weight, but easier to get it wrong.
2. Within generic methods that insert instructions into blocks. Probably more code changes than method 1. May run into problems if an instruction is cloned first (and therefore has no parent to check -- looking at operands/uses may suffice though).
3. Within size queries. Probably insufficient in places where entire blocks are copied without looking at the types of each individual instruction, and also suffers from problems when cloning instructions.
My current idea is to proceed with option 2 with some additional checks where needed.
> * My main concern is that you are navigating towards a local optimum here. All is well as long as there is only one vector length per function. However, if the architecture supports changing the vector length at any point but you explicitly forbid it, programmers will complain, well, I will for one ;-) Once you give in to that demand you are facing the situation that multiple vector length tokens are live within the same function. This means you have to stop transformations from mixing vector operations with different vector lengths: these would otherwise incur an expensive state change at every vlen transition. However, there is no natural way to express that two SSA values (vlen tokens) must not be live at the same program point.
So I think we've agreed that the notion of vscale inside a function is consistent, so that all size comparisons and stack allocations will use the maximum size for that function.
However, use of setvl or predication changes the effective length inside the function. This is already the case for masked loads and stores -- although an AVX512 vector is 512 bits in size, a different amount of data can be transferred to/from memory.
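For example, with the existing (fixed-length) masked load intrinsic, the mask
already decouples the amount of data transferred from the register width:
  ;; Only the lanes enabled in %mask are loaded; the rest take the passthru value.
  %v = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* %p, i32 4, <8 x i1> %mask, <8 x i32> undef)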
Robin will be working on the best way to represent setvl, whereas SVE will just use <scalable n x i1> predicate vectors to control length.
> On 06/11/2018 05:47 PM, Robin Kruppe via llvm-dev wrote:
>> There are some operations that use vl for things other than simple
>> masking. To give one example, "speculative" loads (which silence
>> some exceptions to safely permit vectorization of some loops with
>> data-dependent exits, such as strlen) can shrink vl as a side effect.
>> I believe this can be handled by modelling all relevant operations
>> (including setvl itself) as intrinsics that have side effects or
>> read/write inaccessible memory. However, if you want to have the
>> "current" vl (or equivalent mask) around as SSA value, you need to
>> "reload" it after any operation that updates vl. That seems like it
>> could get a bit complex if you want to do it efficiently (in the
>> limit, it seems equivalent to SSA construction).
>>
> I think modeling the vector length as state isn't as bad as it may sound at first. In fact, how about modeling the "hard" vector length as a thread_local global variable? That way there is exactly one valid vector length value at every point (defined by the value of the thread_local global variable of the exact name). There is no need for a "demanded vlen" analysis: the global variable yields the value immediately. The RISC-V backend can map the global directly to the vlen register. If a target does not support a re-configurable vector length (SVE), it is safe to run SSA construction during legalization and use explicit predication instead. You'd perform SSA construction only at the backend/legalization phase.
> Vice versa coming from IR targeted at LLVM SVE, you can go the other way, run a demanded vlen analysis, and encode it explicitly in the program. vlen changes are expensive and should be rare anyway.
This was in response to my suggestion to model setvl with predicates; I've withdrawn the idea. The vscale intrinsic is enough to represent 'maxvl', and based on the IR samples I've seen for RVV, a setvl intrinsic would return the dynamic length in order to correctly update offset/induction variables.
> ; explicit vlen_state modelling in RV could look like this:
>
> @vlen_state = thread_local global token ; this gives AA a fixed point to constraint vlen-dependent operations
>
> llvm.vla.setvl(i32 %n) ; implicitly writes-only %vlen_state
> i32 llvm.vla.getvl() ; implicitly reads-only %vlen_state
>
> llvm.vla.fadd.f64(f64, f64) ; implicitly reads-only %vlen_state
> llvm.vla.fdiv.f64(f64, f64) : .. same
>
> ; this implements the "speculative" load mentioned in the quote above (writes %vlen_state. I suppose it also reads it first?)
> <scalable 1 x f64> llvm.riscv.probe.f64(%ptr)
Having separate getvl and setvl intrinsics may work nicely, but I'll leave that to Robin to decide.
> By relying on memory dependence, this also implies that arithmetic operations can be re-ordered freely as long as vlen_state does not change between them (SLP, "loop mix (CGO16)", ..).
> Regarding function calls, if the callee does not have the 'inherits_vlen' attribute, the target can use a default value at function entry (max width or "undef"). Otherwise, the vector length needs to be communicated from caller to callee. However, the `vlen_state` variable already achieves that for a first implementation.
I got the impression that the RVV team wanted to be able to reconfigure registers (and therefore potentially change max vector length/number of available registers) for each function; if a call to a function is required from inside a vectorized loop then I think maxvl/vscale has to match and the callee must not reconfigure registers. I suspect there will be a complicated cost model to decide whether to change configuration or stick with a default of all registers enabled.
> Last but not least, thank you all for working on this! I am really looking forward to playing around with vla architectures in LLVM.
Glad to hear it; there is an SVE emulator[1] available so once we've managed to get some code committed you'll be able to try some of this out, at least on one of the architectures.
[1] https://developer.arm.com/products/software-development-tools/hpc/arm-instruction-emulator
thanks again. The changes and additions all make sense to me, I just
have one minor comment about shufflevector.
Seems reasonable.
It makes sense to have runtime-computed shuffle masks for some
architectures, especially those with runtime-variable vector lengths,
but lifting the current restriction that the shufflevector mask is a
constant affects all code that inspects the indices. There's a lot of such
code and as far as I've seen a fair amount of that code crucially
depends on the mask being constant. I'm not opposed to lifting the
restriction, but I want to call attention to it and double-check
everyone's okay with it because it seems like a big step and, unlike
other IR changes in this RFC, it isn't really necessary (we could also
use an intrinsic for these shuffles).
to add to what Graham said...
On 2 July 2018 at 17:08, Simon Moll via llvm-dev
<llvm...@lists.llvm.org> wrote:
> Hi,
>
> I am the main author of RV, the Region Vectorizer
> (github.com/cdl-saarland/rv). I want to share our standpoint as potential
> users of the proposed vector-length agnostic IR (RISC-V, ARM SVE).
Thanks, this perspective is very useful, just as our discussion at EuroLLVM was.
Can you elaborate on what use cases you have in mind for this? I'm
very curious because I'm only aware of one case where changing vscale
at a specific point is desirable: when you have two independent code
regions (e.g., separate loop nests with no vector values flowing
between them) that are substantially different in their demands from
the vector unit. And even in that case, you only need a way to give
the backend the freedom to change it if beneficial, not to exert
actual control over the size of the vector registers [1]. As mentioned
in my RFC in April, I believe we can still support that case
reasonably well with an optimization pass in the backend (operating on
MIR, i.e., no IR changes).
Everything else I know of that falls under "changing vector lengths"
is better served by predication or RISC-V's "active vector length"
(vl) register. Even tricks for running code that is intended for
packed SIMD of a particular width on top of the variable-vector-length
RISC-V ISA only need to fiddle with the active vector length! And to
be clear, the active vector length is a completely separate mechanism
from the concept that's called vscale in this RFC, vlen in my previous
RFC, and MVL in the RISC-V ISA.
The restriction of 1 function : 1 vscale is not one that was adopted
lightly. In some sense, yes, it's just a local optimum and one can
think about IR designs that don't couple vscale to function
boundaries. However, there are multiple factors that make it
challenging to do in LLVM IR, and challenging to implement in the
backend too (some of them outlined in my RFC from April). After
tinkering with the problem for months, I'm fairly certain that
multiple vscales in one function is several orders of magnitude more
complex and difficult to add to LLVM IR (some different IRs would work
better), so I'd really like to understand any and all reasons
programmers might have for wanting to change vscale, and hopefully
find ways to support what they want to do without opening this
Pandora's box.
[1] Software has extremely little control over vscale/MVL/etc.
anyway: on packed SIMD machines vscale = 1 is hardwired, on RISC-V you
can only change the vector unit configuration and thereby affect the
MVL in very indirect and uarch-dependent ways, and on SVE it's
implementation-defined which multiples of 128 bit are supported.
> On 06/11/2018 05:47 PM, Robin Kruppe via llvm-dev wrote:
>
> There are some operations that use vl for things other than simple
> masking. To give one example, "speculative" loads (which silence
> some exceptions to safely permit vectorization of some loops with
> data-dependent exits, such as strlen) can shrink vl as a side effect.
> I believe this can be handled by modelling all relevant operations
> (including setvl itself) as intrinsics that have side effects or
> read/write inaccessible memory. However, if you want to have the
> "current" vl (or equivalent mask) around as SSA value, you need to
> "reload" it after any operation that updates vl. That seems like it
> could get a bit complex if you want to do it efficiently (in the
> limit, it seems equivalent to SSA construction).
>
> I think modeling the vector length as state isn't as bad as it may sound at
> first. In fact, how about modeling the "hard" vector length as a
The term "hard" vector length here makes me suspect you might be
mixing up the physical size of vector registers, which is derived from
vscale in IR terms or the vector unit configuration in the RISC-V ISA,
with the _active_ vector length which is effectively just encoding a
particular kind of predication to limit processing to a subset of
lanes in the physical registers. Since everything below makes more
sense when read as referring to the active vector length, I will
assume you meant that, but just to be sure could you please clarify?
> thread_local global variable? That way there is exactly one valid vector
> length value at every point (defined by the value of the thread_local global
> variable of the exact name). There is no need for a "demanded vlen"
> analysis: the global variable yields the value immediately. The RISC-V
> backend can map the global directly to the vlen register. If a target does
> not support a re-configurable vector length (SVE), it is safe to run SSA
> construction during legalization and use explicit predication instead. You'd
> perform SSA construction only at the backend/legalization phase.
> Vice versa coming from IR targeted at LLVM SVE, you can go the other way,
> run a demanded vlen analysis, and encode it explicitly in the program. vlen
> changes are expensive and should be rare anyway.
For the active vector length, yes, modelling the architectural state
as memory read and written by intrinsics works fine and is in fact
roughly what I'm currently doing. Globals can't be of type token, but
I use the existing "reads/write hidden memory" flag on intrinsics
instead of globals anyway (which increases AA precision, and doesn't
require special casing these intrinsics in AA code).
However, these intrinsics also have some downsides that might make
different solutions better in the long run. For example, the
artificial memory accesses block some optimizations that don't bother
reasoning about memory in detail, and many operations controlled by
the vector length are so similar to the existing add, mul, etc.
instructions that we'd duplicate the vast majority of optimizations
that apply to those instructions (if we want equal optimization power,
that is).
> ; explicit vlen_state modelling in RV could look like this:
>
> @vlen_state = thread_local global token ; this gives AA a fixed point to
> constraint vlen-dependent operations
>
> llvm.vla.setvl(i32 %n) ; implicitly writes-only %vlen_state
> i32 llvm.vla.getvl() ; implicitly reads-only %vlen_state
>
> llvm.vla.fadd.f64(f64, f64) ; implicitly reads-only %vlen_state
> llvm.vla.fdiv.f64(f64, f64) : .. same
>
> ; this implements the "speculative" load mentioned in the quote above
> (writes %vlen_state. I suppose it also reads it first?)
> <scalable 1 x f64> llvm.riscv.probe.f64(%ptr)
>
> By relying on memory dependence, this also implies that arithmetic
> operations can be re-ordered freely as long as vlen_state does not change
> between them (SLP, "loop mix (CGO16)", ..).
>
> Regarding function calls, if the callee does not have the 'inherits_vlen'
> attribute, the target can use a default value at function entry (max width
> or "undef"). Otherwise, the vector length needs to be communicated from
> caller to callee. However, the `vlen_state` variable already achieves that
> for a first implementation.
As Graham said, on RISC-V a vector function call means that the caller
needs to configure the vector unit in a particular way (determined by
the callee's ABI), and the callee runs with that configuration. (And
the configuration determines the hardware vector length / vscale.)
This is a backend concern, except the backend needs to know whether a
function expects to be called in this way, or whether it can and needs
to pick a configuration for itself and set it up on function entry.
That's what motivates this attribute.
In terms of IR semantics, I would say vscale is *unspecified* on entry
into a function without the inherits_vscale (let's rename it to fit
this RFC's terminology) attribute. That means it's a program error to
assume you get the caller's vscale -- in scenarios where that's what
you want, you need to add the attribute everywhere. This corresponds
to the fact that in the absence of the attribute, the RISC-V backend
will make up a configuration for you and you'll get whatever vscale
that configuration implies, rather than the caller's.
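As a purely illustrative sketch (using string-attribute syntax; the final
spelling is to be decided), a callee that must observe its caller's vscale
might look like:
  define <scalable 4 x i32> @callee(<scalable 4 x i32> %v) "inherits_vscale" {
    %r = add <scalable 4 x i32> %v, %v
    ret <scalable 4 x i32> %r
  }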
Cheers,
Robin
> On 7 Jul 2018, at 14:46, Robin Kruppe <robin....@gmail.com> wrote:
>
> Hi Graham,
>
> thanks again. The changes and additions all make sense to me, I just
> have one minor comment about shufflevector.
Thanks, I think we're getting to the point where reviewing the patches makes sense
as a next step.
>> ===========================================
>> 5. Splitting and Combining Scalable Vectors
>> ===========================================
>>
>> Splitting and combining scalable vectors in IR is done in the same manner as
>> for fixed-length vectors, but with a non-constant mask for the shufflevector.
>
> It makes sense to have runtime-computed shuffle masks for some
> architectures, especially those with runtime-variable vector lengths,
> but lifting the current restriction that the shufflevector mask is a
> constant affects all code that inspects the indices. There's a lot of such
> code and as far as I've seen a fair amount of that code crucially
> depends on the mask being constant. I'm not opposed to lifting the
> restriction, but I want to call attention to it and double-check
> everyone's okay with it because it seems like a big step and, unlike
> other IR changes in this RFC, it isn't really necessary (we could also
> use an intrinsic for these shuffles).
The way we implemented this was to have a second getShuffleMask function
which takes a pointer to a Value instead of a Constant; this function
returns false if the mask is either scalable or not a constant, or returns
true and sets the contents of the provided SmallVectorImpl reference to the
constant values.
This means all existing code for other targets doesn't need to change since
they will just call the existing method. Some common code will require changes,
but would also need to change anyway to add support for a variable shufflevector
intrinsic.
See patch https://reviews.llvm.org/D47775 -- the extra getShuffleMask is
implemented in lib/IR/Instructions.cpp and an example of its use is in
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp.
I did mention a possible VECTOR_SHUFFLE_VAR intrinsic in that patch, to be
used in the same manner as experimental_vector_splatvector as a stand-in
for an ISD node when lowering to SelectionDAG. In our downstream compiler
both of those are generic ISD nodes instead of intrinsics, and I think we'd
try to get that upstream eventually.
If another method is preferred, that's fine, but I think this minimizes the
changes required.
-Graham
> Everything else I know of that falls under "changing vector lengths"
> is better served by predication or RISC-V's "active vector length"
> (vl) register.
Agreed. A "vl" register is slightly more efficient in some cases
because forming predicates can be bothersome.
I also want to caution about predication in LLVM IR. The way it's done
now is, I think, not quite kosher. We use select to represent a
predicated operation, but select says nothing about suppressing the
evaluation of either input. Therefore, there is nothing in the IR to
prevent code motion of Values outside the select. Indeed, I ran into
this very problem a couple of months ago, where a legitimate (according
to the IR) code motion resulted in wrong answers in vectorized code
because what was supposed to be predicated was not. We had to disable
the transformation to get things working.
Another consequence of this setup is that we need special intrinsics to
convey evaluation requirements. We have masked
load/store/gather/scatter intrinsics and will be getting masked
floating-point intrinsics (or something like them).
Years ago we had some discussion about how to represent predication as a
first-class IR construct but at the time it was considered too
difficult. With more and more architectures turning to predication for
performance, perhaps it's time to revisit that conversation.
-David
On 07/09/2018 08:01 AM, David A. Greene via llvm-dev wrote:
> Robin Kruppe <robin....@gmail.com> writes:
>
>> Everything else I know of that falls under "changing vector lengths"
>> is better served by predication or RISC-V's "active vector length"
>> (vl) register.
> Agreed. A "vl" register is slightly more efficient in some cases
> because forming predicates can be bothersome.
>
> I also want to caution about predication in LLVM IR. The way it's done
> now is, I think, not quite kosher. We use select to represent a
> predicated operation, but select says nothing about suppressing the
> evaluation of either input. Therefore, there is nothing in the IR to
> prevent code motion of Values outside the select. Indeed, I ran into
> this very problem a couple of months ago, where a legitimate (according
> to the IR) code motion resulted in wrong answers in vectorized code
> because what was supposed to be predicated was not. We had to disable
> the transformation to get things working.
>
> Another consequence of this setup is that we need special intrinsics to
> convey evaluation requirements. We have masked
> load/store/gather/scatter intrinsics and will be getting masked
> floating-point intrinsics (or something like them).
>
> Years ago we had some discussion about how to represent predication as a
> first-class IR construct but at the time it was considered too
> difficult. With more and more architectures turning to predication for
> performance, perhaps it's time to revisit that conversation.
I've also been seeing an increasing need for (at least some form of)
predication support in the IR/optimizer. Right now, I'm mainly
concerned with what our canonical form should look like. I think that
adding first-class predication is probably overkill at the moment.
In my case, I'm mostly interested in predicated scalar load and store at
the moment. We could trivially extend
@llvm.masked.load/@llvm.masked.store to handle the scalar cases, but the
more I think about it, I'm not sure this is actually a good canonical
form since it requires updating huge portions of the optimizer to handle
what is essentially a new instruction.
One idea I've been playing with is to represent predicated operations as
selects over the inputs, feeding an operation that is then guaranteed to be
non-faulting. For instance, a predicated load might look like:
%pred_addr = select i1 %cnd, i32* %actual_addr, i32* %safe_addr
%pred_load = load i32, i32* %pred_addr
where %safe_addr is something like an empty alloca or reserved global
variable. The key idea is that this can be pattern matched to an
actually predicated load if available in hardware, but can also be
optimized normally (i.e. prove the select condition).
This basic idea can be extended to any potentially faulting instruction
by selecting a "safe" input (e.g. dividing by select(pred, actual, 1),
etc.). The obvious downside is that the patterns can be broken up by the
optimizer in arbitrarily complex ways, but I wonder if that might be a
net pro rather than a con.
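For instance, a sketch of a predicated divide in this scheme (all names
illustrative):
  %safe_div = select i1 %cnd, i32 %d, i32 1      ; 1 is a non-faulting divisor
  %div = sdiv i32 %x, %safe_div
  %res = select i1 %cnd, i32 %div, i32 %fallback ; discard the result where %cnd is false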
At the moment, this is just one possible idea. I'm not at the point
of making any actual proposals just yet.
Philip
Are there any objections to going ahead with this? If not, we'll try to get the patches reviewed and committed after the 7.0 branch occurs.
-Graham
> I strongly suspect that there remains widespread concern with the
> direction of this, I know I have them.
>
> I don't think that many of the people who have that concern have had
> time to come back to this RFC and make progress on it, likely because
> of other commitments or simply the amount of churn around SVE related
> patches and such. That is at least why I haven't had time to return to
> this RFC and try to write more detailed feedback.
We believe ARM SVE will be an important architecture going forward. As
such, it's important to us that these questions and concerns get posted
and discussed, whatever the outcome may be. If there are objections,
alternative proposals would be helpful.
I see a lot of SVE patches on Phab that are described as "not for
review." I don't know how helpful that is. It would be more helpful to
have actual patches intended for review/commit. It is difficult to know
which is which in Phab. Could patches not intended for review either be
removed if not needed, or their subjects updated to indicate they are
not for review but for discussion purposes so that it's easier to filter
search results?
-David
I strongly suspect that there remains widespread concern with the direction of this, I know I have them.
I don't think that many of the people who have that concern have had time to come back to this RFC and make progress on it, likely because of other commitments or simply the amount of churn around SVE related patches and such. That is at least why I haven't had time to return to this RFC and try to write more detailed feedback.
Certainly, I would want to see pretty clear and considered support for this change to the IR type system from Hal, Chris, Eric and/or other long time maintainers of core LLVM IR components before it moves forward, and I don't see that in this thread.
> 2. I know that there has been some discussion around support for
> changing the vector length during program execution (e.g., to account
> for some (proposed?) RISC-V feature), perhaps even during the
> execution of a single function. I'm very concerned about this idea
> because it is not at all clear to me how to limit information transfer
> contaminated with the vector size from propagating between different
> regions. As a result, I'm concerned about trying to add this on later,
> and so if this is part of the plan, I think that we need to think
> through the details up front because it could have a major impact on
> the design.
Can you elaborate a bit on your concerns? I'm not sure how allowing
vector length changes impacts the design of this proposal. As far as I
understand things, this proposal is about dealing with unknown vector
lengths, providing types and intrinsics where needed to support
necessary operations. It seems to me that building support for changing
the vscale at runtime is somewhat orthogonal. That is, anyone doing
such a thing will probably have to provide some more intrinsics to
capture the dependency chains and prevent miscompilation, but the basic
types and other intrinsics would remain the same.
What I'm going to say below is from my (narrow) perspective of machines
I've worked on. It's not meant to cover all possibilities, things
people might do with RISC-V etc. I intend it as a (common) example for
discussion.
Changing vector length during execution of a loop (for the last
iteration, typically) is very common for architectures without
predication. Traditional Cray processors, for example, had a vector
length register. The compiler had to manage updates to the vl register
just like any other register and instructions used vl as an implicit
operand.
I'm not sure exactly how the SVE proposal would address this kind of
operation. llvm.experimental.vector.vscale is a vector length read. I
could imagine a load intrinsic that takes a vscale value as an operand,
thus connecting the vector length to the load and the transitive closure
of its uses. I could also imagine an intrinsic to change the vscale.
The trick would be to disallow reordering vector length reads and
writes. None of this seems to require changes to the proposed type
system, only the addition of some (target-specific?) intrinsics.
I think it would be unlikely for anyone to need to change the vector
length during evaluation of an in-register expression. That is, vector
length changes would normally be made only at observable points in the
program (loads, stores, etc.) and probably only at certain control-flow
boundaries (loop back-edges, function calls/returns and so on). Thus we
wouldn't need intrinsics or other new IR for every possible operation in
LLVM, only at the boundaries.
-David
SVE uses predication. The physical number of lanes doesn't have to
change to have the same effect (alignment, tails).
> I think it would be unlikely for anyone to need to change the vector
> length during evaluation of an in-register expression.
The worry here is not within each instruction but across instructions.
SVE (and I think RISC-V) allow register size to be dynamically set.
For example, on the same machine, it may be 256 for one process and
512 for another (for example, to save power).
But the change is via a system register, so in theory, anyone can
write inline asm at the beginning of a function and change the
vector length to whatever they want.
Worse still, people can do that inside loops, or in a tail loop,
thinking it's a good idea (or this is a Cray machine :).
AFAIK, the interface for changing the register length will not be
exposed programmatically, so in theory, we should not worry about it.
Any inline asm hack can be considered out of scope / user error.
However, Hal's concern seems to be that, in the event of anyone
planning to add it to their APIs, we need to make sure the proposed
semantics can cope with it (do we need to update the predicates again?
what will vscale mean, then and when?).
If not, we may have to enforce that this will not come to pass in its
current form. In this case, changing it later will require *a lot*
more effort than doing it now.
So, it would be good to get a clear response from the two fronts (SVE
and RISC-V) about the future intention to expose that or not.
--
cheers,
--renato
> On Mon, 30 Jul 2018 at 20:57, David A. Greene via llvm-dev
> <llvm...@lists.llvm.org> wrote:
>> I'm not sure exactly how the SVE proposal would address this kind of
>> operation.
>
> SVE uses predication. The physical number of lanes doesn't have to
> change to have the same effect (alignment, tails).
Right. My wording was poor. The current proposal doesn't directly
support a more dynamic vscale target but I believe it could be simply
extended to do so.
>> I think it would be unlikely for anyone to need to change the vector
>> length during evaluation of an in-register expression.
>
> The worry here is not within each instruction but across instructions.
> SVE (and I think RISC-V) allow register size to be dynamically set.
I wasn't talking about within an instruction but rather across
instructions in the same expression tree. Something like this would be
weird:
A = load with VL
B = load with VL
C = A + B # VL implicit
VL = <something>
D = ~C # VL implicit
store D
Here and beyond, read "VL" as "vscale with minimum element count 1."
The points where VL would be changed are limited and I think would
require limited, straightforward additions on top of this proposal.
> For example, on the same machine, it may be 256 for one process and
> 512 for another (for example, to save power).
Sure.
> But the change is via a system register, so in theory, anyone can
> write inline asm at the beginning of a function and change the
> vector length to whatever they want.
>
> Worse still, people can do that inside loops, or in a tail loop,
> thinking it's a good idea (or this is a Cray machine :).
>
> AFAIK, the interface for changing the register length will not be
> exposed programmatically, so in theory, we should not worry about it.
> Any inline asm hack can be considered out of scope / user error.
That's right. This proposal doesn't expose a way to change vscale, but
I don't think it precludes a later addition to do so.
> However, Hal's concern seems to be that, in the event of anyone
> planning to add it to their APIs, we need to make sure the proposed
> semantics can cope with it (do we need to update the predicates again?
> what will vscale mean, then and when?).
I don't see why predicate values would be affected at all. If a machine
with variable vector length has predicates, then typically the resulting
operation would operate on the bitwise AND of the predicate and a
conceptual all 1's predicate of length VL.
As I understand it, vscale is the runtime multiple of some minimal,
guaranteed vector length. For SVE that minimum is whatever gives a bit
width of 128. My guess is that for a machine with a more dynamic vector
length, the minimum would be 1. vscale would then be the vector length
and would change accordingly if the vector length is changed.
Changing vscale would be no different than changing any other value in
the program. The dataflow determines its possible values at various
program points. vscale is an extra (implicit) operand to all vector
operations with scalable type.
> If not, we may have to enforce that this will not come to pass in its
> current form.
Why? If a user does asm or some other such trick to change what vscale
means, that's on the user. If a machine has a VL that changes
iteration-to-iteration, typically the compiler would be responsible for
controlling it.
If the vendor provides some target intrinsics to let the user write
low-level vector code that changes vscale in a high-level language, then
the vendor would be responsible for adding the necessary bits to the
frontend and LLVM. I would not recommend a vendor try to do this. :)
It wouldn't necessarily be hard to do, but it would be wasted work IMO
because it would be better to improve the vectorizer that already
exists.
> In this case, changing it later will require *a lot* more effort than
> doing it now.
I don't see why. Anyone adding ability to change vscale would need to
add intrinsics and specify their semantics. That shouldn't change
anything about this proposal and any such additions shouldn't be
hampered by this proposal.
Another way to think of vscale/vector length is as a different kind of
predicate. Right now LLVM uses select to track predicate application.
It uses a "top-down" approach in that the root of an expression tree (a
select) applies the predicate and presumably everything under it
operates under that predicate. It also uses intrinsics for certain
operations (loads, stores, etc.) that absolutely must be predicated no
matter what for safety reasons. So it's sort of a hybrid approach, with
predicate application at the root, certain leaves and maybe even on
interior nodes (FP operations come to mind).
To my knowledge, there's nothing in LLVM that checks to make sure these
predicate applications are all consistent with one another. Someone
could do a load with predicate 0011 and then a "select div" with
predicate 1111, likely resulting in a runtime fault but nothing in LLVM
would assert on the predicate mismatch.
Predicates could also be applied only at the leaves and propagated up
the tree. IIRC, Dan Gohman proposed something like this years back when
the topic of predication came up. He called it "applymask" but
unfortunately the Google is failing to find it.
I *could* imagine using select to also convey application of vector
length but that seems odd and unnecessarily complex.
If vector length were applied at the leaves, it would take a bit of work
to get it through instruction selection. Target opcodes would be one
way to do it. I think it would be straightforward to walk the DAG and
change generic opcodes to target opcodes when necessary.
I don't think we should worry about taking IR with dynamic changes to VL
and trying to generate good code for any random target from it. Such IR
is very clearly tied to a specific kind of target and we shouldn't
bother pretending otherwise. The vectorizer should be aware of the
target's capabilities and generate code accordingly.
-David
Yes, that's what I was referring to as "not in the API", therefore "user error".
> The points where VL would be changed are limited and I think would
> require limited, straightforward additions on top of this proposal.
Indeed. I have a limited view on the spec and even more so on hardware
implementations, but it is my understanding that there is no attempt
to change VL mid-loop.
If we can assume VL will be "the same" (not constant) throughout every
self-contained sub-graph (from scalar|memory->vector to
vector->scalar|memory), then we should encode in the IR spec that
this is a hard requirement.
This seems consistent with your explanation of the Cray VL change as
well as Bruce's description of RISC-V (both seem very similar to me),
where VL can change between two loop iterations but not within the
same iteration.
We will still have to be careful with access safety (alias, loop
dependencies, etc), but that shouldn't be different than if VL was
required to be constant throughout the program.
> That's right. This proposal doesn't expose a way to change vscale, but
> I don't think it precludes a later addition to do so.
That was my point about this change being harder to do later than now.
I think no one wants to do that now, so we're all happy to pay the
price later, since that need will likely never come.
> I don't see why predicate values would be affected at all. If a machine
> with variable vector length has predicates, then typically the resulting
> operation would operate on the bitwise AND of the predicate and a
> conceptual all 1's predicate of length VL.
I think the problem is that SVE is fully predicated and Cray (RISC-V?)
is not, so mixing the two could lead to weird predication
situations.
So, if a high level optimisation pass assumes full predication and
change the loop accordingly, and another pass assumes no predication
and adds VL changes (say, loop tails), then we may end up with
incompatible IR that will be hard to select down in ISel.
Given that SVE has both predication and vscale change, this could
happen in practice. It wouldn't be necessarily wrong, but it would
have to be a conscious decision.
> Changing vscale would be no different than changing any other value in
> the program. The dataflow determines its possible values at various
> program points. vscale is an extra (implicit) operand to all vector
> operations with scalable type.
It is, but if I'm getting it right, changing vscale and predication are
similar transformations to achieve similar goals, but they will not be
represented the same way in IR.
Also, they're not always interchangeable, so that complicates the IR
matching in ISel as well as potential matching in optimisation passes.
> Why? If a user does asm or some other such trick to change what vscale
> means, that's on the user. If a machine has a VL that changes
> iteration-to-iteration, typically the compiler would be responsible for
> controlling it.
Not asm, sorry; inline asm is "user error".
I meant: make sure adding an IR-visible change in VL (say, an
intrinsic or instruction), within a self-contained block, becomes an
IR error.
> If the vendor provides some target intrinsics to let the user write
> low-level vector code that changes vscale in a high-level language, then
> the vendor would be responsible for adding the necessary bits to the
> frontend and LLVM. I would not recommend a vendor try to do this. :)
Not recommending by making it an explicit error. :)
It may sound harsh, but given that we're making some pretty liberal design
choices right now, which could have a long-lasting impact on the
stability and quality of LLVM's code generation, I'd say we need to be
as conservative as possible.
> I don't see why. Anyone adding ability to change vscale would need to
> add intrinsics and specify their semantics. That shouldn't change
> anything about this proposal and any such additions shouldn't be
> hampered by this proposal.
I don't think it would be hard to do, but it could have consequences
for the rest of the optimisation and code generation pipeline.
I do not claim to have a clear vision on any of this, but as I said
above, it will pay off long term if we start conservative.
> I don't think we should worry about taking IR with dynamic changes to VL
> and trying to generate good code for any random target from it. Such IR
> is very clearly tied to a specific kind of target and we shouldn't
> bother pretending otherwise.
We're preaching for the same goals. :)
But we're trying to represent slightly different techniques
(predication, vscale change) which need to be tied down to only
exactly what they do.
Being conservative and explicit on the semantics is, IMHO, the easiest
path to get it right. We can surely expand later.
--
cheers,
--renato
If this is orthogonal to the IR representation, i.e. it doesn't need
current instructions to *know* about it, but the sequence of IR
instructions will represent it, then it should be fine.
> I'm not sure whether it will end up being possible or not, but I did describe two situations where at least some RISC-V implementations might want to change VL within an iteration:
Apologies, I may have misinterpreted them.
> 1) a memory protection problem on some trailing part of a vector load or store, causing that iteration to operate only on the accessible part, and the next iteration to start from the first address in the non-accessible part (and actually take a fault)
SVE deals with those problems with predication and the FFR
(first-fault register), not by changing the VL, but I imagine they're
semantically similar.
> 2) an interrupt/task switch in the middle of a loop iteration. Some implementations may want to save/restore only the vector configuration, not the values of the vector registers.
I assume the architecture will have to continue the program in the
same state it was in when the interrupt occurred. How it does that
shouldn't concern code generation.
>> The points where VL would be changed are limited and I think would
>> require limited, straightforward additions on top of this proposal.
>
> Indeed. I have a limited view on the spec and even more so on hardware
> implementations, but it is my understanding that there is no attempt
> to change VL mid-loop.
What does "mid-loop" mean? On traditional vector architectures it was
very common to change VL for the last loop iteration. Otherwise you had
to have a remainder loop. It was much better to change VL.
> If we can assume VL will be "the same" (not constant) throughout every
> self-contained sub-graph (from scalar|memory->vector to
> vector->scalar|memory), then we should encode in the IR spec that
> this is a hard requirement.
>
> This seems consistent with your explanation of the Cray VL change as
> well as Bruce's description of RISC-V (both seem very similar to me),
> where VL can change between two loop iterations but not within the
> same iteration.
Ok, I think I am starting to grasp what you are saying. If a value
flows from memory or some scalar computation to vector and then back to
memory or scalar, VL should only ever be set at the start of the vector
computation until it finishes and the value is deposited in memory or
otherwise extracted. I think this is ok, but note that any vector
functions called may change VL for the duration of the call. The change
would not be visible to the caller.
Just thinking this through, a case where one might want to change VL
mid-stream is something like a half-length set of operations that feeds
a vector concat and then a full length set of operations following. But
again I think this would be a strange way to do things. If someone
really wants to do this they can predicate away the upper bits of the
half-length operations and maintain the same VL throughout the
computation. If predication isn't available then they've got more
serious problems vectorizing code. :)
> We will still have to be careful with access safety (alias, loop
> dependencies, etc), but that shouldn't be different than if VL was
> required to be constant throughout the program.
Yep.
>> That's right. This proposal doesn't expose a way to change vscale, but
>> I don't think it precludes a later addition to do so.
>
> That was my point about this change being harder to do later than now.
I guess I don't see why it would be any harder later.
> I think no one wants to do that now, so we're all happy to pay the
> price later, because that will likely never come.
I am not so sure about that. Power requirements may very well drive
more dynamic vector lengths. Even today some AVX 512 implementations
falter if there are "too many" 512-bit operations. Scaling back SIMD
width statically is very common today and doing so dynamically seems
like an obvious extension. I don't know of any efforts to do this so
it's all speculative at this point. But the industry has done it in the
past and we have a curious pattern of reinventing things we did before.
>> I don't see why predicate values would be affected at all. If a machine
>> with variable vector length has predicates, then typically the resulting
>> operation would operate on the bitwise AND of the predicate and a
>> conceptual all 1's predicate of length VL.
>
> I think the problem is that SVE is fully predicated and Cray (RISC-V?)
> is not, so mixing the two could lead into weird predication
> situations.
Cray vector ISAs were fully predicated and also used a vector length.
It didn't cause us any serious issues. In many ways having an
adjustable VL and predication makes things easier because you don't have
to regenerate predicates to switch to a shorter VL.
> So, if a high-level optimisation pass assumes full predication and
> changes the loop accordingly, and another pass assumes no predication
> and adds VL changes (say, loop tails), then we may end up with
> incompatible IR that will be hard to select down in ISel.
>
> Given that SVE has both predication and vscale change, this could
> happen in practice. It wouldn't be necessarily wrong, but it would
> have to be a conscious decision.
It seems strange to me for an optimizer to operate in such a way. The
optimizer should be fully aware of the target's capabilities and use
them accordingly. But let's say this happens. Pass 1 vectorizes the
loop with predication (for a conditional loop body) and creates a
remainder loop, which would also need to be predicated. Note that such
a remainder loop is not necessary with full predication support but for
the sake of argument let's say pass 1 is not too smart.
Pass 2 comes along and says, "hey, I have the ability to change VL so we
don't need a remainder loop." It rewrites the main loop to use dynamic
VL and removes the remainder loop. During that rewrite, pass 2 would
have to maintain predication. It can use the very same predicate values
pass 1 generated. There is no need to adjust them because the VL is
applied "on top of" the predicates.
Pass 2 effectively rewrites the code to what the vectorizer should have
emitted in the first place. I'm not seeing how ISel is any more
difficult. SVE has an implicit vscale operand on every instruction and
ARM seems to have no difficulty selecting instructions for it. Changing
the value of vscale shouldn't impact ISel at all. The same instructions
are selected.
>> Changing vscale would be no different than changing any other value in
>> the program. The dataflow determines its possible values at various
>> program points. vscale is an extra (implicit) operand to all vector
>> operations with scalable type.
>
> It is, but IIGIR, changing vscale and predicating are similar
> transformations to achieve similar goals, but will not be
> represented the same way in IR.
They probably will not be represented the same way, though I think they
could be (but probably shouldn't be).
> Also, they're not always interchangeable, so that complicates the IR
> matching in ISel as well as potential matching in optimisation passes.
I'm not sure it does but I haven't worked something all the way through.
>> Why? If a user does asm or some other such trick to change what vscale
>> means, that's on the user. If a machine has a VL that changes
>> iteration-to-iteration, typically the compiler would be responsible for
>> controlling it.
>
> Not asm, sorry. Inline asm is "user error".
Ok.
> I meant: make sure adding an IR visible change in VL (say, an
> intrinsic or instruction), within a self-contained block, becomes an
> IR error.
What do you mean by "self-contained block?" Assuming I understood it
correctly, the restriction you described at the top seems reasonable for
now.
>> If the vendor provides some target intrinsics to let the user write
>> low-level vector code that changes vscale in a high-level language, then
>> the vendor would be responsible for adding the necessary bits to the
>> frontend and LLVM. I would not recommend a vendor try to do this. :)
>
> Not recommending by making it an explicit error. :)
>
> It may sound harsh, but given we're making some pretty liberal design
> choices right now, which could have a long-lasting impact on the
> stability and quality of LLVM's code generation, I'd say we need to be
> as conservative as possible.
Ok, but would the optimizer be prevented from introducing VL changes?
>> I don't see why. Anyone adding ability to change vscale would need to
>> add intrinsics and specify their semantics. That shouldn't change
>> anything about this proposal and any such additions shouldn't be
>> hampered by this proposal.
>
> I don't think it would be hard to do, but it could have consequences
> for the rest of the optimisation and code generation pipeline.
It could. I don't think any of us has a clear idea of what those might
be.
> I do not claim to have a clear vision on any of this, but as I said
> above, it will pay off long term if we start conservative.
Being conservative is fine, but we should have a clear understanding of
exactly what that means. I would not want to prohibit all VL changes
now and forever, because I see that as unnecessarily restrictive and
possibly damaging to supporting future architectures.
If we don't want to provide intrinsics for changing VL right now, I'm
all in favor. There would be no reason to add error checks because
there would be no way within the IR to change VL.
But I don't want to preclude adding such intrinsics in the future.
>> I don't think we should worry about taking IR with dynamic changes to VL
>> and trying to generate good code for any random target from it. Such IR
>> is very clearly tied to a specific kind of target and we shouldn't
>> bother pretending otherwise.
>
> We're preaching for the same goals. :)
Good! :)
> But we're trying to represent slightly different techniques
> (predication, vscale change) which need to be tied down to only
> exactly what they do.
Wouldn't intrinsics to change vscale do exactly that?
> Being conservative and explicit on the semantics is, IMHO, the easiest
> path to get it right. We can surely expand later.
I'm all for being explicit. I think we're basically on the same page,
though there are a few things noted above where I need a little more
clarity.
-David
I'm starting to feel like I'm a broken record about this, but too much
of the discussion has been unclear about this and I think it causes a
fair amount of confusion, so I feel obligated to state it again as
clearly as I can: There are TWO independent notions of vector length
in this space! Namely:
1. How large are the machine's vector registers?
2. How many elements of a vector register are processed by an instruction?
This RFC addresses only the former, with the vscale concept. We have
been and still are discussing the latter in this email thread too,
sometimes under names such as "VL" or "active vector length", but
unfortunately also often as just plain "vector length". I think this
is very unfortunate: having two intermingled discussions about
different things which share a name is very confusing, especially
since I believe there is no need to discuss them together.
The active vector length can't be larger than the number of elements
in a vector register, but apart from that they are entirely separate
and whether an architecture has a fixed- or variable-size register is
completely orthogonal to whether it has a VL register. All
combinations make sense and exist in real architectures:
- SSE, NEON, etc. have fixed-size vector registers (e.g. 128 bit)
without any active vector length mechanism
- Classical Cray-style vector processors have fixed-size vector
registers (e.g., Cray-1 had 64x64bit) and an active vector length
mechanism
- SVE has variable-size vector registers and no active vector length
mechanism (loops are instead controlled by predication)
- The vector extension for RISC-V has variable-size vector registers
and an active vector length mechanism
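To make the distinction concrete: the register size (the first notion)
is all that this RFC's vscale describes. A minimal sketch of how IR
would query it, assuming the experimental vscale intrinsic from the RFC
(the exact name and signature are illustrative, not final):

  ; total element count of a <scalable 4 x i32> register: the runtime
  ; multiple returned by vscale times the minimum element count of 4
  %vscale = call i32 @llvm.experimental.vscale.i32()
  %nelems = mul i32 %vscale, 4

How many of those %nelems lanes a given operation actually processes is
the second notion, and it appears nowhere in the type.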
More importantly, the two mechanisms are *used* very differently and
place very different demands on a compiler. Therefore, any discussion
that conflates these two concerns is doomed from the start IMHO. I
have written a bit about these differences, but since I know many
people here only have so much time, I moved this to an "appendix"
after the end of this email and will now go straight to addressing
Hal's second concern with this distinction in mind.
Yes, changing vscale during program execution is necessary to some
degree for the RISC-V vector extension. Yes, doing this at arbitrary
program points is indeed extremely challenging for a compiler to
support, for the reason you describe. This is why I proposed a
tradeoff, which Graham incorporated into this RFC: vscale can only
change at function boundaries and is fixed between function entry and
exit. This restriction is OK (not ideal, but good enough IMO) for
RISC-V and it makes the problem much more manageable because most code
in LLVM operates only within one function at a time, so it never has
to encounter vscale changes. I also think this is the most we'll ever
be able to support -- the problem you describe isn't going away, and I
don't know of any major use cases that would require us to tackle this
difficult problem in its entirety. However, I might be unaware of
something people want to do with SVE that doesn't fit into this mould.
Despite also being relevant for RISC-V and being discussed extensively
in this thread, the active vector length is basically just a very
minor twist on predication, and therefore doesn't interact at all with
the type system changes proposed here. Like predication, it can just
be modelled by regular data flow between IR operations (as David
already said). As with predication, a smaller *active* vector length
(~= a mask with few elements enabled) doesn't mean vectors suddenly
have fewer elements, just that more of them are masked out while doing
calculations. While there's an interesting design space for how to
best represent this predication in the IR, it has entirely different
challenges and constraints than vscale changes. If anything, the
"active vector length" discussion has more in common with past
discussions about making predication for *fixed-length* vectors more
of a first-class citizen in LLVM IR.
So I think this RFC as-is solves the problem of changing vector
register sizes about as well as it can and needs to be solved, and in
a way that is entirely satisfactory for RISC-V (again, I can't speak
for SVE, I don't know the use cases there). While more work is needed
to deal with another aspect of the RISC-V vector architecture (the VL
register), that can and should be a separate discussion, the results
of which won't invalidate anything decided in this RFC.
Cheers,
Robin
## Appendix
The active vector length or VL register is a tool for loop control,
ensuring the vectorized loop does not run too far while still
maximizing use of the vector unit. As such, it is recomputed
frequently (at minimum once per loop iteration, possibly even within a
loop as Bruce explained) and can be seen as a particular kind of
predication. It applies to a particular operation, prevents it from
having unwanted side effects, and operates on a subset of a larger
vector. As SVE illustrates, one can use plain old masks in precisely
the same way to solve the same problem, constructing and maintaining
masks that enable "the first n elements" where n would be the active
vector length in a different architecture. Creating a special VL
register for this purpose is just an architectural accommodation for
this style of predication. While it may have significant impact on the
microarchitecture and suggest a different mental model to programmers,
it's basically just predication from a compiler's perspective.
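As a sketch of that equivalence (assuming the experimental stepvector
intrinsic and the splat idiom from this RFC; the intrinsic name is
illustrative), a "first n elements" mask is ordinary data flow:

  %step  = call <scalable 4 x i32> @llvm.experimental.stepvector.nxv4i32()
  %ins   = insertelement <scalable 4 x i32> undef, i32 %n, i32 0
  %n.spl = shufflevector <scalable 4 x i32> %ins, <scalable 4 x i32> undef,
                         <scalable 4 x i32> zeroinitializer
  %mask  = icmp ult <scalable 4 x i32> %step, %n.spl  ; lane i active iff i < %n

Any operation guarded by %mask behaves exactly as it would when executed
with an active vector length of %n.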
The vector register size, on the other hand, is not something you
change just like that. Changing it is, at best, like deciding to
switch from AVX exclusively (i.e., no xmm registers) to SSE
exclusively. It changes fundamental properties of your register file
and vector unit. While you can easily compile one part of your
application one way and another part differently if they don't
interact directly, once you try to do this e.g. in the middle of a
vectorized code region, it gets difficult even conceptually, to say
nothing of the compiler implementation. Furthermore, in the AVX->SSE
case you know how the vector length changes, and that might also apply
to SVE -- e.g. you could halve vscale and split all your existing
N-element vectors into two (N/2)-element vectors each -- but on
RISC-V, you probably won't be able to control the vector register size
directly, so you can't even do that much.
These differences also impact how to approach the two concepts in a
compiler. The active vector length -- very much like a mask -- is just
a piece of data that is computed and then used in various vector
operations as an extra operand, as David suggested in a recent email.
For this reason, I agree with his assessment that the active vector
length is "just data flow" and doesn't interact with the type system
changes discussed in this RFC.
vscale, on the other hand, is not easily handled as "just a piece of
data". The size of vector registers impacts many things besides
individual operations that are explicit in IR, and as such many parts
of the compiler have to be acutely aware of what it is and where it
might change. To give just one example, if you increase the size of
vector registers in the middle of a function, you need to reserve more
stack space for spilling -- if you just reserve stack space in the
prologue using the *initial* register size, you won't have enough
space to spill the larger vector values later on. There are myriad more
problems like this if you sit down and sift through IR transformations
and the CodeGen infrastructure (as I have been doing for RISC-V over
the last year). A change of vscale is best considered to be a massive
barrier to all code that is even remotely vector-related. In Hal's
terms, you really want to prevent anything contaminated with the
vector register size to cross over the point where you change the
vector register size.
Like Hal, I am very skeptical how, if at all, such a barrier could be
added to IR. And I've spent a lot of time trying to come up with a
solution as part of my RISC-V work. That is why my RFC back in April
proposed a trade-off, which has been incorporated by Graham into this
RFC: vscale can change between functions, but does not change within a
function. As an analogy, consider how LLVM supports different
subtargets (each with different registers, instructions and legal
types) on a per-function basis but doesn't allow e.g. making a
register class completely unavailable at a certain point in a
function.
> On 30 Jul 2018, at 11:34, Chandler Carruth <chan...@gmail.com> wrote:
>
> I strongly suspect that there remains widespread concern with the direction of this, I know I have them.
>
> I don't think that many of the people who have that concern have had time to come back to this RFC and make progress on it, likely because of other commitments or simply the amount of churn around SVE related patches and such. That is at least why I haven't had time to return to this RFC and try to write more detailed feedback.
The core IR patches (listed in the RFC) haven't really changed much and are ready for review. I appreciate that Sander has pushed a lot of SVE-related patches for MC through recently though.
> Certainly, I would want to see pretty clear and considered support for this change to the IR type system from Hal, Chris, Eric and/or other long time maintainers of core LLVM IR components before it moves forward, and I don't see that in this thread.
>
> Put differently: I don't think silence is assent here. You really need some clear signal of consensus.
Understood. Thankfully there seems to be more interest in this now... I guess people will be busy with the release in the near future but I can work on responding to all the new messages now. I'll try to log in to irc during the evenings (UK time) if that would help.
-Graham
> On 30 Jul 2018, at 18:37, David Greene via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> Chandler Carruth wrote:
>
>> I strongly suspect that there remains widespread concern with the
>> direction of this, I know I have them.
>>
>> I don't think that many of the people who have that concern have had
>> time to come back to this RFC and make progress on it, likely because
>> of other commitments or simply the amount of churn around SVE related
>> patches and such. That is at least why I haven't had time to return to
>> this RFC and try to write more detailed feedback.
>
> We believe ARM SVE will be an important architecture going forward. As
> such, it's important to us that these questions and concerns get posted
> and discussed, whatever the outcome may be. If there are objections,
> alternative proposals would be helpful.
Yes, pointing out alternatives we've missed would be helpful.
> I see a lot of SVE patches on Phab that are described as "not for
> review." I don't know how helpful that is. It would be more helpful to
> have actual patches intended for review/commit. It is difficult to know
> which is which in Phab. Could patches not intended for review either be
> removed if not needed, or their subjects updated to indicate they are
> not for review but for discussion purposes so that it's easier to filter
> search results?
All 14 patches listed at the bottom of the RFC are ready for an initial round of review, so I'll change the descriptions tomorrow to indicate that. I'll check to see if I have any older ones lying around and abandon them if so.
-Graham
>
> -David
Let me put the last two comments up:
> > But we're trying to represent slightly different techniques
> > (predication, vscale change) which need to be tied down to only
> > exactly what they do.
>
> Wouldn't intrinsics to change vscale do exactly that?
You're right. I've been using the same overloaded term and this is
probably what caused the confusion.
In some cases, predicating and shortening the vectors are semantically
equivalent. In this case, the IR should also be equivalent.
Instructions/intrinsics that handle predication could be used by the
backend to simply change VL instead, as long as it's guaranteed that
the semantics are identical. There are no problems here.
In other cases, for example widening or splitting the vector, or cases
we haven't thought of yet, the semantics are not the same, and having
them in IR would be bad. I think we're all in agreement on that.
All I'm asking is that we make a list of what we want to happen and
disallow everything else explicitly, until someone comes up with a strong
case for it. Makes sense?
> I'm all for being explicit. I think we're basically on the same page,
> though there are a few things noted above where I need a little more
> clarity.
Yup, I think we are. :)
> What does "mid-loop" mean? On traditional vector architectures it was
> very common to change VL for the last loop iteration. Otherwise you had
> to have a remainder loop. It was much better to change VL.
You got it below...
> Ok, I think I am starting to grasp what you are saying. If a value
> flows from memory or some scalar computation to vector and then back to
> memory or scalar, VL should only ever be set at the start of the vector
> computation until it finishes and the value is deposited in memory or
> otherwise extracted. I think this is ok, but note that any vector
> functions called may change VL for the duration of the call. The change
> would not be visible to the caller.
If a function is called and changes the length, does it restore it on return?
> I am not so sure about that. Power requirements may very well drive
> more dynamic vector lengths. Even today some AVX 512 implementations
> falter if there are "too many" 512-bit operations. Scaling back SIMD
> width statically is very common today and doing so dynamically seems
> like an obvious extension. I don't know of any efforts to do this so
> it's all speculative at this point. But the industry has done it in the
> past and we have a curious pattern of reinventing things we did before.
Right, so it's not as clear cut as I hoped. But we can start
implementing the basic idea and then expand as we go. I think trying
to hash out all potential scenarios now will drive us crazy.
> It seems strange to me for an optimizer to operate in such a way. The
> optimizer should be fully aware of the target's capabilities and use
> them accordingly.
Mid-end optimisers tend to be fairly agnostic. And when not, they
usually ask "is this supported" instead of "which one is better".
> ARM seems to have no difficulty selecting instructions for it. Changing
> the value of vscale shouldn't impact ISel at all. The same instructions
> are selected.
I may very well be getting lost in too many floating future ideas, atm. :)
> > It is, but IIGIR, changing vscale and predicating are similar
> > transformations to achieve similar goals, but will not be
> > represented the same way in IR.
>
> They probably will not be represented the same way, though I think they
> could be (but probably shouldn't be).
Maybe in the simple cases (like last iteration) they should be?
> Ok, but would the optimizer be prevented from introducing VL changes?
In the case where they're represented in similar ways in IR, it
wouldn't need to.
Otherwise, we'd have to teach IR optimisers two methods that are
virtually identical in semantics. It'd be left for the back end to
implement the last iteration notation as a predicate fill or a vscale
change.
> Being conservative is fine, but we should have a clear understanding of
> exactly what that means. I would not want to prohibit all VL changes
> now and forever, because I see that as unnecessarily restrictive and
> possibly damaging to supporting future architectures.
>
> If we don't want to provide intrinsics for changing VL right now, I'm
> all in favor. There would be no reason to add error checks because
> there would be no way within the IR to change VL.
Right, I think we're converging.
How about we don't forbid changes in vscale, but we find a common
notation for all the cases where predicating and changing vscale would
be semantically identical, and implement those in the same way.
Later on, if there are additional cases where changes in vscale would
be beneficial, we can discuss them independently.
Makes sense?
--
cheers,
--renato
On Tue, 31 Jul 2018 at 19:03, Robin Kruppe <robin....@gmail.com> wrote:
> 1. How large are the machine's vector registers?
This is the only one I'm talking about. :)
> Like Hal, I am very skeptical how, if at all, such a barrier could be
> added to IR. And I've spent a lot of time trying to come up with a
> solution as part of my RISC-V work. That is why my RFC back in April
> proposed a trade-off, which has been incorporated by Graham into this
> RFC: vscale can change between functions, but does not change within a
> function. As an analogy, consider how LLVM supports different
> subtargets (each with different registers, instructions and legal
> types) on a per-function basis but doesn't allow e.g. making a
> register class completely unavailable at a certain point in a
> function.
Cray seems to use changes in vscale the way we use predication for the
last loop iteration, while RISC-V uses them to give resources away to
different functions.
In the former case, they may want to change the vscale inside the same
function in the last iteration, but given that this is semantically
equivalent to shortening predicates, it could be a back-end decision
and not an IR one. We could have the same notation for both target
behaviours and not have to worry about the boundaries.
In the latter case, it's clear that functions are hard boundaries.
Provided, of course, that you either inline all functions called
before vectorisation, or, and only if there is a scalable vector PCS
ABI, make sure that all of them have the same length?
I haven't thought long enough about the latter, and that's why I was
proposing we take a conservative approach and restrict to what we can
actually reasonably do now.
I think this is what you and Graham are trying to do, right?
cheers,
--renato
> Hi David,
>
> Let me put the last two comments up:
>
>> > But we're trying to represent slightly different techniques
>> > (predication, vscale change) which need to be tied down to only
>> > exactly what they do.
>>
>> Wouldn't intrinsics to change vscale do exactly that?
>
> You're right. I've been using the same overloaded term and this is
> probably what caused the confusion.
Me too. Thanks Robin for clarifying this for all of us! I'll try to
follow this terminology:
  VL/active vector length - The software notion of how many elements to
                            operate on; a special case of predication
  vscale - The hardware notion of how big a vector register is
TL;DR - Changing VL in a function doesn't affect anything about this
        proposal, but changing vscale might. Changing VL shouldn't
        impact things like ISel at all but changing vscale might.
        Changing vscale is (much) more difficult than changing VL.
> In some cases, predicating and shortening the vectors are semantically
> equivalent. In this case, the IR should also be equivalent.
> Instructions/intrinsics that handle predication could be used by the
> backend to simply change VL instead, as long as it's guaranteed that
> the semantics are identical. There are no problems here.
Right. Changing VL is no problem. I think even reducing vscale is ok
from an IR perspective, if a little strange.
> In other cases, for example widening or splitting the vector, or cases
> we haven't thought of yet, the semantics are not the same, and having
> them in IR would be bad. I think we're all in agreement on that.
You mean going from a shorter active vector length to a longer active
vector length? Or smaller vscale to larger vscale? The latter would be
bad. The former seems ok if the dataflow is captured and the vectorizer
generates correct code to account for it. Presumably it would if it is
the thing changing the active vector length.
> All I'm asking is that we make a list of what we want to happen and
> disallow everything else explicitly, until someone comes up with a strong
> case for it. Makes sense?
Yes.
>> Ok, I think I am starting to grasp what you are saying. If a value
>> flows from memory or some scalar computation to vector and then back to
>> memory or scalar, VL should only ever be set at the start of the vector
>> computation until it finishes and the value is deposited in memory or
>> otherwise extracted. I think this is ok, but note that any vector
>> functions called may change VL for the duration of the call. The change
>> would not be visible to the caller.
>
> If a function is called and changes the length, does it restore it on return?
If a function changes VL, it would typically restore it before return.
This would be an ABI guarantee just like any other callee-save register.
If a function changes vscale, I don't know. The RISC-V people seem to
have thought the most about this. I have no point of reference here.
> Right, so it's not as clear cut as I hoped. But we can start
> implementing the basic idea and then expand as we go. I think trying
> to hash out all potential scenarios now will drive us crazy.
Sure.
>> It seems strange to me for an optimizer to operate in such a way. The
>> optimizer should be fully aware of the target's capabilities and use
>> them accordingly.
>
> Mid-end optimisers tend to be fairly agnostic. And when not, they
> usually ask "is this supported" instead of "which one is better".
Yes, the "is this supported" question is common. Isn't the whole point
of VPlan to get the "which one is better" question answered for
vectorization? That would be necessarily tied to the target. The
questions asked can be agnostic, like the target-agnostic bits of
codegen use, but the answers would be target-specific.
>> ARM seems to have no difficulty selecting instructions for it. Changing
>> the value of vscale shouldn't impact ISel at all. The same instructions
>> are selected.
>
> I may very well be getting lost in too many floating future ideas, atm. :)
Given our clearer terminology, my statement above is maybe not correct.
Changing vscale *would* impact the IR and codegen (stack allocation,
etc.). Changing VL would not, other than adding some Instructions to
capture the semantics. I suspect neither would change ISel (I know VL
would not) but as you say I don't think we need concern ourselves with
changing vscale right now, unless others have a dire need to support it.
>> > It is, but IIGIR, changing vscale and predicating are similar
>> transformations to achieve similar goals, but will not be
>> > represented the same way in IR.
>>
>> They probably will not be represented the same way, though I think they
>> could be (but probably shouldn't be).
>
> Maybe in the simple cases (like last iteration) they should be?
Perhaps changing VL could be modeled the same way but I have a feeling
it will be awkward. Changing vscale is something totally different and
likely should be represented differently if allowed at all.
>> Ok, but would the optimizer be prevented from introducing VL changes?
>
> In the case where they're represented in similar ways in IR, it
> wouldn't need to.
It would have to generate IR code to effect the software change in VL
somehow, by altering predicates or by using special intrinsics or some
other way.
> Otherwise, we'd have to teach IR optimisers two methods that are
> virtually identical in semantics. It'd be left for the back end to
> implement the last iteration notation as a predicate fill or a vscale
> change.
I suspect that is too late. The vectorizer needs to account for the
choice and pick the most profitable course. That's one of the reasons I
think modeling VL changes like predicates is maybe unnecessarily
complex. If VL is modeled as "just another predicate" then there's no
guarantee that ISel will honor the choices the vectorizer made to use VL
over predication. If it's modeled explicitly, ISel should have an
easier time generating the code the vectorizer expects.
VL changes aren't always on the last iteration. The Cray X1 had an
instruction (I would have to dust off old manuals to remember the
mnemonic) with somewhat strange semantics to get the desired VL for an
iteration. Code would look something like this:
    loop top:
      vl = getvl N          # N contains the number of iterations left
      <do computation>
      N = N - vl
      branch N > 0, loop top
The "getvl" instruction would usually return the full hardware vector
register length (MAXVL), except on the 2nd-to-last iteration if N was
larger than MAXVL but less than 2*MAXVL it would return something like
<N % 2 == 0 ? N/2 : N/2 + 1>, so in the range (0, MAXVL). The last
iteration would then run at the same VL or one less depending on whether
N was odd or even. So the last two iterations would often run at less
than MAXVL and often at different VLs from each other.
And no, I don't know why the hardware operated this way. :)
>> Being conservative is fine, but we should have a clear understanding of
>> exactly what that means. I would not want to prohibit all VL changes
>> now and forever, because I see that as unnecessarily restrictive and
>> possibly damaging to supporting future architectures.
>>
>> If we don't want to provide intrinsics for changing VL right now, I'm
>> all in favor. There would be no reason to add error checks because
>> there would be no way within the IR to change VL.
>
> Right, I think we're converging.
Agreed.
> How about we don't forbid changes in vscale, but we find a common
> notation for all the cases where predicating and changing vscale would
> be semantically identical, and implement those in the same way.
>
> Later on, if there are additional cases where changes in vscale would
> be beneficial, we can discuss them independently.
>
> Makes sense?
Again trying to use the VL/vscale terminology:
Changing vscale - no IR support currently and less likely in the future
Changing VL - no IR support currently but more likely in the future
The second seems like a straightforward extension to me. There will be
some questions about how to represent VL semantics in IR but those don't
impact the proposal under discussion at all.
The first seems much harder, at least within a function. It may or may
not impact the proposal under discussion. It sounds like the RISC-V
people have some use cases so those should probably be the focal point
of this discussion.
-David
+1
> TL;DR - Changing VL in a function doesn't affect anything about this
>         proposal, but changing vscale might. Changing VL shouldn't
>         impact things like ISel at all but changing vscale might.
>         Changing vscale is (much) more difficult than changing VL.
Absolutely agreed. :)
> Right. Changing VL is no problem. I think even reducing vscale is ok
> from an IR perspective, if a little strange.
Yup.
> You mean going from a shorter active vector length to a longer active
> vector length? Or smaller vscale to larger vscale? The latter would be
> bad.
The latter. Bad indeed.
> If a function changes vscale, I don't know. The RISC-V people seem to
> have thought the most about this. I have no point of reference here.
I think the consensus is that this would be bad. So we should maybe
encode it as an error.
> Yes, the "is this supported" question is common. Isn't the whole point
> of VPlan to get the "which one is better" question answered for
> vectorization?
Yes, but the cost is high. We can have that in the vectoriser, as it's
a heavy pass and we're conscious of the cost, but we shouldn't make all
the other passes "that smart".
> Changing vscale *would* impact the IR and codegen (stack allocation,
> etc.). Changing VL would not, other than adding some Instructions to
> capture the semantics. I suspect neither would change ISel (I know VL
> would not) but as you say I don't think we need concern ourselves with
> changing vscale right now, unless others have a dire need to support it.
Perfect! :)
> Perhaps changing VL could be modeled the same way but I have a feeling
> it will be awkward. Changing vscale is something totally different and
> likely should be represented differently if allowed at all.
Right, I was talking about vscale.
It would be awkward, but if this is the only thing the hardware
supports (i.e. no predication), then it's up to the back-end to lower
it as it sees fit.
In IR, we'd still see it as predication.
> Again trying to use the VL/vscale terminology:
>
> Changing vscale - no IR support currently and less likely in the future
> Changing VL - no IR support currently but more likely in the future
SGTM.
> The second seems like a straightforward extension to me. There will be
> some questions about how to represent VL semantics in IR but those don't
> impact the proposal under discussion at all.
Should be equivalent to predication, I imagine.
> The first seems much harder, at least within a function.
And it would require exposing the instruction to change it in IR.
> It may or may not impact the proposal under discussion.
As per Robin's email, it doesn't. Functions are vscale boundaries in
their current proposal.
--
cheers,
--renato
Great, seems like we're all in violent agreement that VL changes are a
non-issue for the discussion at hand.
Just like the old loop vectorizer, VPlan will need a cost model that
is based on properties of the target, exposed to the optimizer in the
form of e.g. TargetLowering hooks. But we should try really hard to
avoid having a hard distinction between e.g. predication- and VL-based
loops in the VPlan representation. Duplicating or triplicating
vectorization logic would be really bad, and there are a lot of
similarities that we can exploit to avoid that. For a simple example,
SVE and RVV both want the same basic loop skeleton: strip-mining with
predication of the loop body derived from the induction variable.
Hopefully we can have a 99% unified VPlan pipeline and most
differences can be delegated to the final VPlan->IR step and the
respective backends.
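For concreteness, a sketch of that common skeleton in the RFC's
experimental syntax (the intrinsic names, types and loop structure are
all illustrative assumptions, not settled design):

  vector.body:
    %i      = phi i64 [ 0, %entry ], [ %i.next, %vector.body ]
    ; predicate: lanes whose global index is still below the trip count %n
    %step   = call <scalable 2 x i64> @llvm.experimental.stepvector.nxv2i64()
    %i.ins  = insertelement <scalable 2 x i64> undef, i64 %i, i32 0
    %i.spl  = shufflevector <scalable 2 x i64> %i.ins, <scalable 2 x i64> undef,
                            <scalable 2 x i64> zeroinitializer
    %idx    = add <scalable 2 x i64> %i.spl, %step
    %n.ins  = insertelement <scalable 2 x i64> undef, i64 %n, i32 0
    %n.spl  = shufflevector <scalable 2 x i64> %n.ins, <scalable 2 x i64> undef,
                            <scalable 2 x i64> zeroinitializer
    %mask   = icmp ult <scalable 2 x i64> %idx, %n.spl
    ; ... masked loads, arithmetic and masked stores guarded by %mask ...
    %vscale = call i64 @llvm.experimental.vscale.i64()
    %vlen   = mul i64 %vscale, 2     ; elements per <scalable 2 x i64> vector
    %i.next = add i64 %i, %vlen
    %done   = icmp uge i64 %i.next, %n
    br i1 %done, label %exit, label %vector.body

An SVE backend would keep %mask as a hardware predicate; an RVV backend
could recognise the "first active lanes" pattern and turn it into a VL
computation instead -- exactly the kind of difference that can be
delegated to the final VPlan->IR step and the respective backends.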
+ Diego, Florian and others that have been discussing this previously
FWIW this is exactly how the RISC-V vector unit works --
unsurprisingly, since it owes a lot to Cray-style processors :)
> And no, I don't know why the hardware operated this way. :)
>
>>> Being conservative is fine, but we should have a clear understanding of
>>> exactly what that means. I would not want to prohibit all VL changes
>>> now and forever, because I see that as unnecessarily restrictive and
>>> possibly damaging to supporting future architectures.
>>>
>>> If we don't want to provide intrinsics for changing VL right now, I'm
>>> all in favor. There would be no reason to add error checks because
>>> there would be no way within the IR to change VL.
>>
>> Right, I think we're converging.
>
> Agreed.
+1, there is no need to deal with VL at all at this point. I would
even say there isn't any concept of VL in IR at all at this time.
At some point in the future I will propose something in this space to
support RISC-V vectors, but we'll cross that bridge when we come to
it.
>> How about we don't forbid changes in vscale, but we find a common
>> notation for all the cases where predicating and changing vscale would
>> be semantically identical, and implement those in the same way.
>>
>> Later on, if there are additional cases where changes in vscale would
>> be beneficial, we can discuss them independently.
>>
>> Makes sense?
>
> Again trying to use the VL/vscale terminology:
>
> Changing vscale - no IR support currently and less likely in the future
> Changing VL - no IR support currently but more likely in the future
>
> The second seems like a straightforward extension to me. There will be
> some questions about how to represent VL semantics in IR but those don't
> impact the proposal under discussion at all.
>
> The first seems much harder, at least within a function. It may or may
> not impact the proposal under discussion. It sounds like the RISC-V
> people have some use cases so those should probably be the focal point
> of this discussion.
Yes, for RISC-V we definitely need vscale to vary a bit, but are fine
with limiting that to function boundaries. The use case is *not*
"changing how large vectors are" in the middle of a loop or something
like that, which we all agree is very dubious at best. The RISC-V
vector unit is just very configurable (number of registers, vector
element sizes, etc.) and this configuration can impact how large the
vector registers are. For any given vectorized loop nest we want to
configure the vector unit to suit that piece of code and run the loop
with whatever register size that configuration yields. And when that
loop is done, we stop using the vector unit entirely and disable it,
so that the next loop can use it differently, possibly with a
different register size. For IR modeling purposes, I propose to
enlarge "loop nest" to "function" but the same principle applies, it
just means all vectorized loops in the function will have to share a
configuration.
Without getting too far into the details, does this make sense as a use case?
Cheers,
Robin
>> Yes, the "is this supported" question is common. Isn't the whole point
>> of VPlan to get the "which one is better" question answered for
>> vectorization? That would be necessarily tied to the target. The
> questions asked can be agnostic, like the target-agnostic bits of
>> codegen use, but the answers would be target-specific.
>
> Just like the old loop vectorizer, VPlan will need a cost model that
> is based on properties of the target, exposed to the optimizer in the
> form of e.g. TargetLowering hooks. But we should try really hard to
> avoid having a hard distinction between e.g. predication- and VL-based
> loops in the VPlan representation. Duplicating or triplicating
> vectorization logic would be really bad, and there are a lot of
> similarities that we can exploit to avoid that. For a simple example,
> SVE and RVV both want the same basic loop skeleton: strip-mining with
> predication of the loop body derived from the induction variable.
> Hopefully we can have a 99% unified VPlan pipeline and most
> differences can be delegated to the final VPlan->IR step and the
> respective backends.
>
> + Diego, Florian and others that have been discussing this previously
If VL and predication are represented the same way, how does VPlan
distinguish between the two? How does it cost code generation just
using predication vs. code generation using a combination of predication
and VL?
Assuming it can do that, do you envision vector codegen would emit
different IR for VL+predication (say, using intrinsics to set VL) vs. a
strictly predication-only-based plan? If not, how does the LLVM backend
know to emit code to manipulate VL in the former case?
I don't need answers to these questions right now as VL is a separate
issue and I don't want this thread to get bogged down in it. But these
are questions that will come up if/when we tackle VL.
> At some point in the future I will propose something in this space to
> support RISC-V vectors, but we'll cross that bridge when we come to
> it.
Sounds good.
> Yes, for RISC-V we definitely need vscale to vary a bit, but are fine
> with limiting that to function boundaries. The use case is *not*
> "changing how large vectors are" in the middle of a loop or something
> like that, which we all agree is very dubious at best. The RISC-V
> vector unit is just very configurable (number of registers, vector
> element sizes, etc.) and this configuration can impact how large the
> vector registers are. For any given vectorized loop nest we want to
> configure the vector unit to suit that piece of code and run the loop
> with whatever register size that configuration yields. And when that
> loop is done, we stop using the vector unit entirely and disable it,
> so that the next loop can use it differently, possibly with a
> different register size. For IR modeling purposes, I propose to
> enlarge "loop nest" to "function" but the same principle applies, it
> just means all vectorized loops in the function will have to share a
> configuration.
>
> Without getting too far into the details, does this make sense as a
> use case?
I think so. If changing vscale has some important advantage (saving
power?), I wonder how the compiler will deal with very large functions.
I have seen some truly massive Fortran subroutines with hundreds of loop
nests in them, possibly with very different iteration counts for each
one.
I have two concerns:
1. If we change vscale in the middle of a function, then we have no way
to introduce a dependence, or barrier, at the point where the change is
made. Transformations, GVN/PRE/etc. for example, can move code around
the place where the change is made and I suspect that we'll have no good
options to prevent it (this could include whole subloops, although we
might not do that today). In some sense, if you make vscale dynamic,
you've introduced dependent types into LLVM's type system, but you've
done it in an implicit manner. It's not clear to me that this works. If we
need dependent types, then an explicit dependence seems better. (e.g.,
<scalable <n> x %vscale_var x <type>>)
2. How would the function-call boundary work? Does the function itself
have intrinsics that change the vscale? If so, then it's not clear that
the function-call boundary makes sense unless you prevent inlining. If
you prevent inlining, when does that decision get made? Will the
vectorizer need to outline loops? If so, outlining can have a real cost
that's difficult to model. How do return types work?
Two other thoughts:
1. I can definitely see the use cases for changing vscale dynamically,
and so I do suspect that we'll want that support.
2. LLVM does not have loops as first-class constructs. We only have SSA
(and, thus, dominance), and when specifying restrictions on placement of
things in function bodies, we need to do so in terms of these constructs
that we have (which don't include loops).
Thanks again,
Hal
>
> -David
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
That's a shift from the current proposal and I think we can think
about it after the current changes. For now, both SVE and RISC-V are
proposing function boundaries for changes in vscale.
> 2. How would the function-call boundary work? Does the function itself
> have intrinsics that change the vscale?
Functions may not know what their vscale is until they're actually
executed. They could even have different vscales for different call
sites.
AFAIK, it's not up to the compiled program (i.e., via a function
attribute or an inline asm call) to change the vscale, but the
kernel/hardware can impose dynamic restrictions on the process. But,
for now, only at (binary object) function boundaries.
I don't know how that works at the kernel level (how to detect those
boundaries? instrument every branch?) but this is what I understood
from the current discussion.
> If so, then it's not clear that
> the function-call boundary makes sense unless you prevent inlining. If
> you prevent inlining, when does that decision get made? Will the
> vectorizer need to outline loops? If so, outlining can have a real cost
> that's difficult to model. How do return types work?
The dynamic nature is not part of the program, so inlining can happen
as always. Given that the vectors are agnostic of size and work
regardless of what the kernel provides (within safety boundaries), the
code generation shouldn't change too much.
We may have to create artefacts to restrict the maximum vscale (for
safety), but others are better equipped to answer that question.
> 1. I can definitely see the use cases for changing vscale dynamically,
> and so I do suspect that we'll want that support.
At a process/function level, yes. Within the same self-contained
sub-graph, I don't know.
> 2. LLVM does not have loops as first-class constructs. We only have SSA
> (and, thus, dominance), and when specifying restrictions on placement of
> things in function bodies, we need to do so in terms of these constructs
> that we have (which don't include loops).
That's why I was trying to define the "self-contained sub-graph" above
(there must be a better term for that). It has to do with data
dependencies (scalar|memory -> vector -> scalar|memory), i.e. make sure
side-effects don't leak out.
A loop iteration is usually such a block, but not all are and not all
such blocks are loops.
Changing vscale inside a function, but outside of those blocks would
be "fine", as long as we made sure code movement respects those
boundaries and that context would be restored correctly on exceptions.
But that's not part of the current proposal.
Changing vscale inside one of those blocks would be madness. :)
cheers,
--renato
I understand. I'm afraid that the function-boundary idea doesn't work
reasonably.
>
>
>> 2. How would the function-call boundary work? Does the function itself
>> have intrinsics that change the vscale?
> Functions may not know what their vscale is until they're actually
> executed. They could even have different vscales for different call
> sites.
>
> AFAIK, it's not up to the compiled program (ie via a function
> attribute or an inline asm call) to change the vscale, but the
> kernel/hardware can impose dynamic restrictions on the process. But,
> for now, only at (binary object) function boundaries.
I'm not sure if that's better or worse than the compiler putting in code
to indicate that the vscale might change. How do vector function
arguments work if vscale gets larger? Or smaller?
So, if I have some vectorized code, and we figure out that some of it is
cold, so we outline it, and then the kernel decides to decrease vscale
for that function, now I have broken the application? Storing a vector
argument in memory in that function now doesn't store as much data as it
would have in the caller?
>
> I don't know how that works at the kernel level (how to detect those
> boundaries? instrument every branch?) but this is what I understood
> from the current discussion.
Can we find out?
But I don't know how to implement that restriction without major changes
to the code base. Such a restriction doesn't follow from use/def chains,
and if we need a restriction that involves looking for non-SSA
dependencies (e.g., memory dependencies), then I think that we need
something different than the current proposal. Explicitly dependent
types might work, something like intrinsics might work, etc.
Thanks again,
Hal
>
> Changing vscale inside one of those blocks would be madness. :)
>
> cheers,
> --renato
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
> On 30 Jul 2018, at 20:10, Hal Finkel <hfi...@anl.gov> wrote:
>
>
> On 07/30/2018 05:34 AM, Chandler Carruth wrote:
>> I strongly suspect that there remains widespread concern with the direction of this, I know I have them.
>>
>> I don't think that many of the people who have that concern have had time to come back to this RFC and make progress on it, likely because of other commitments or simply the amount of churn around SVE related patches and such. That is at least why I haven't had time to return to this RFC and try to write more detailed feedback.
>>
>> Certainly, I would want to see pretty clear and considered support for this change to the IR type system from Hal, Chris, Eric and/or other long time maintainers of core LLVM IR components before it moves forward, and I don't see that in this thread.
>
> At a high level, I'm happy with this approach. I think it will be important for LLVM to support runtime-determined vector lengths - I see the customizability and power-efficiency constraints that motivate these designs continuing to increase in importance. I'm still undecided on whether this makes vector code nicer even for fixed-vector-length architectures, but some of the design decisions that it forces, such as having explicit intrinsics for reductions and other horizontal operations, seem like the right direction regardless.
Thanks, that's good to hear.
> 1.
>> This is a proposal for how to deal with querying the size of scalable types for
>> > analysis of IR. While it has not been implemented in full,
>
> Is this still true? The details here need to all work out, obviously, and we should make sure that any issues are identified.
Yes. I had hoped to get some more comments on the basic approach before progressing with the implementation, but if it makes more sense to have the implementation available to discuss then I'll start creating patches.
> 2. I know that there has been some discussion around support for changing the vector length during program execution (e.g., to account for some (proposed?) RISC-V feature), perhaps even during the execution of a single function. I'm very concerned about this idea because it is not at all clear to me how to limit information transfer contaminated with the vector size from propagating between different regions. As a result, I'm concerned about trying to add this on later, and so if this is part of the plan, I think that we need to think through the details up front because it could have a major impact on the design.
I think Robin's email yesterday covered it fairly nicely; this RFC proposes that the hardware length of vectors will be consistent throughout an entire function, so we don't need to limit information inside a function, just between them. For SVE, h/w vector length will likely be consistent across the whole program as well (assuming the programmer doesn't make a prctl call to the kernel to change it) so we could drop that limit too, but I thought it best to come up with a unified approach that would work for both architectures. The 'inherits_vscale' attribute would allow us to continue optimizing across functions for SVE where desired.
Modelling the dynamic vector length for RVV is something for Robin (or others) to tackle later, but can be though of (at a high level) as an implicit predicate on all operations.
-Graham
_______________________________________________
At least on this point, I think that we'll want to have the
implementation to help make sure there aren't important details we're
overlooking.
>
>> 2. I know that there has been some discussion around support for changing the vector length during program execution (e.g., to account for some (proposed?) RISC-V feature), perhaps even during the execution of a single function. I'm very concerned about this idea because it is not at all clear to me how to limit information transfer contaminated with the vector size from propagating between different regions. As a result, I'm concerned about trying to add this on later, and so if this is part of the plan, I think that we need to think through the details up front because it could have a major impact on the design.
> I think Robin's email yesterday covered it fairly nicely; this RFC proposes that the hardware length of vectors will be consistent throughout an entire function, so we don't need to limit information inside a function, just between them. For SVE, h/w vector length will likely be consistent across the whole program as well (assuming the programmer doesn't make a prctl call to the kernel to change it) so we could drop that limit too, but I thought it best to come up with a unified approach that would work for both architectures. The 'inherits_vscale' attribute would allow us to continue optimizing across functions for SVE where desired.
I think that this will likely work, although I think we want to invert
the sense of the attribute. vscale should be inherited by default, and
some attribute can say that this isn't so. That same attribute, I
imagine, will also forbid scalable vector function arguments and return
values on those functions. If we don't have inherited vscale as the
default, we place an implicit contract on any IR transformation hat
performs outlining that it needs to scan for certain kinds of vector
operations and add the special attribute, or just always add this
special attribute, and that just becomes another special case, which
will only actually manifest on certain platforms, that it's best to avoid.
>
> Modelling the dynamic vector length for RVV is something for Robin (or others) to tackle later, but can be though of (at a high level) as an implicit predicate on all operations.
My point is that, while there may be some sense in which the details can
be worked out later, we need to have a good-enough understanding of how
this will work now in order to make sure that we're not making design
decisions now that make handling the dynamic vscale in a reasonable way
later more difficult.
Thanks again,
Hal
>
> -Graham
>
>> Thanks again,
>> Hal
FWIW, I don't think dependent types really help with the code motion
problems. While using an SSA value in a type would presumably enforce
that instructions mentioning that type have to be dominated by the
definition of said value, the real problem is when you _stop_ using
one vscale (and presumably start using another). For example, we want
to rule out the following:
%vscale.1 = call i32 @change_vscale(...)
%v.1 = load <scalable 4 x %vscale.1 x i32> ...
%vscale.2 = call i32 @change_vscale(...)
%v.2 = load <scalable 4 x %vscale.1 x i32> ... ; vscale changed but
we're still doing things with the old one
And of course, actually introducing this notion of types mentioning
SSA values into LLVM would be an extraordinarily huge and difficult
step. I did actually consider something along these lines (and even
had a digression about it in drafts of my RFC, but I cut it in the
final version) but I don't think it's viable.
Tying some values to the function they're in, on the other hand, even
has precedent in current LLVM: tokens values must be confined to one
function (intrinsics are special, of course), so most of the
interprocedural passes already must be careful with moving certain
kinds of values between functions. It's ad-hoc and requires auditing
passes, yes, but it's something we know and have some experience with.
(The similarity to tokens is strong enough that my original proposal
heavily leaned on tokens to encode the restrictions on the optimizer
that are needed for different-vscale-per-function, but I've been
persuaded that it's more trouble than it's worth, hence the "implicit"
approach of this RFC.)
>>
>>
>>> 2. How would the function-call boundary work? Does the function itself
>>> have intrinsics that change the vscale?
>> Functions may not know what their vscale is until they're actually
>> executed. They could even have different vscales for different call
>> sites.
>>
>> AFAIK, it's not up to the compiled program (ie via a function
>> attribute or an inline asm call) to change the vscale, but the
>> kernel/hardware can impose dynamic restrictions on the process. But,
>> for now, only at (binary object) function boundaries.
>
> I'm not sure if that's better or worse than the compiler putting in code
> to indicate that the vscale might change. How do vector function
> arguments work if vscale gets larger? or smaller?
I don't see any way for the OS to change a running process's vscale
without a great amount of cooperation from the program and the
compiler. In general, the kernel has nowhere near enough information
to identify spots where it's safe to fiddle with vscale -- function
call boundaries aren't safe in general, as you point out. FWIW, in
the RISC-V vector task group we discussed migrating running processes
between cores in heterogenous architectures (e.g. think big.LITTLE)
that may have different vector register sizes. We quickly agreed that
there's no way to make that work and dismissed the idea. The current
thinking is, if you want to migrate a process that's currently using
the vector unit, you can only migrate it between cores that have the
same kind of register field.
For the RISC-V backend I don't want anything to do with OS
shenangians, I'm exclusively focused on codegen. The backend inserts
machine code in the prologue that configures the vector unit in
whatever way the backend considers best, and this configuration
determines vscale (and some other things that aren't visible to IR).
The caller saves their vector unit state before the call and restores
it after the call returns, so their vscale is not affected by the call
either.
For SVE, I could imagine a function attribute that indicates it's OK
to change vscale at this point (this will probably have to be a very
careful and deliberate decision by a programmer). The backend could
then change vscale in the prologie, either set it to a specific value
(e.g., requested by the attribute) or make a libcall asking the kernel
to adjust vscale if it wants to.
In both cases, the change happens after the caller saved all their
state and before any of the callee's code runs.
That leaves arguments and return values, and more generally any vector
values that are shared (e.g., in memory) between caller and callee.
Indeed it's not possible to share any vectors between two functions
that disagree on how large a vector is (sounds obvious when you put it
that way). If you need to pass vectors in any way, caller and callee
have to agree on vscale as part of the ABI, and the callee does *not*
change vscale but "inherits" it from the caller. On SVE that's the
default ABI, on RISC-V there will be one or multiple non-default
"vector call" ABIs (as Bruce mentioned in an earlier email).
In IR we could represent these different ABIs though calling
convention numbers, function attributes, or a combination thereof.
With ABIs where caller and callee don't necessarily agree on vscale,
it is simply impossible to pass vector values (and while you can e.g.
pass the caller's vscale value, it probably isn't meaningful to the
callee):
- it's a verifier error if such a function takes or returns scalable
vectors directly
- a source program that e.g. tries to smuggle a vector from one
function to another through heap memory is erroneous
- the optimizer must not introduce such errors in correct input programs
The last point means, for example, that partial inlining can't pull
the computation of a vector value into the caller and pass the result
as a new argument. Such optimizations wouldn't be correct anyway,
regardless of ABI concerns: the instructions that are affected all
depend on vscale and therefore moving them to a different function
changes their behavior. Of course, this doesn't mean all
interprocedural optimizations are invalid. *Complete* inlining, for
example, is always valid.
Of course, all this applies only if caller and callee don't agree on
vscale. With suitable ABIs, all existing optimizations can be applied
without problem.
Seconded, this is an extraordinarily difficult problem. I've spent
unreasonable amounts of time thinking about ways to model changing
vector sizes and sketching countless designs for it. Multiple times I
convinced myself some clever setup would work, and every time I later
discovered a fatal flaw. Until I settled on "only at funciton
boundaries", that is, and even that took a few iterations.
Cheers,
Robin
+1
>>
>>> 2. I know that there has been some discussion around support for changing the vector length during program execution (e.g., to account for some (proposed?) RISC-V feature), perhaps even during the execution of a single function. I'm very concerned about this idea because it is not at all clear to me how to limit information transfer contaminated with the vector size from propagating between different regions. As a result, I'm concerned about trying to add this on later, and so if this is part of the plan, I think that we need to think through the details up front because it could have a major impact on the design.
>> I think Robin's email yesterday covered it fairly nicely; this RFC proposes that the hardware length of vectors will be consistent throughout an entire function, so we don't need to limit information inside a function, just between them. For SVE, h/w vector length will likely be consistent across the whole program as well (assuming the programmer doesn't make a prctl call to the kernel to change it) so we could drop that limit too, but I thought it best to come up with a unified approach that would work for both architectures. The 'inherits_vscale' attribute would allow us to continue optimizing across functions for SVE where desired.
>
> I think that this will likely work, although I think we want to invert
> the sense of the attribute. vscale should be inherited by default, and
> some attribute can say that this isn't so. That same attribute, I
> imagine, will also forbid scalable vector function arguments and return
> values on those functions. If we don't have inherited vscale as the
> default, we place an implicit contract on any IR transformation hat
> performs outlining that it needs to scan for certain kinds of vector
> operations and add the special attribute, or just always add this
> special attribute, and that just becomes another special case, which
> will only actually manifest on certain platforms, that it's best to avoid.
It's a real relief to hear that you think this "will likely work".
Inverting the attribute seems good to me. I probably proposed not
inheriting by default because that's the default on RISC-V, but your
rationale is convincing.
>>
>> Modelling the dynamic vector length for RVV is something for Robin (or others) to tackle later, but can be though of (at a high level) as an implicit predicate on all operations.
>
> My point is that, while there may be some sense in which the details can
> be worked out later, we need to have a good-enough understanding of how
> this will work now in order to make sure that we're not making design
> decisions now that make handling the dynamic vscale in a reasonable way
> later more difficult.
Sorry if I'm a broken record, but I believe Graham was referring to
the _active vector length_ or VL here, which has nothing to do with
vscale, dynamic or not. I described earlier why I think the former
doesn't interact with the contents of this RFC in any interesting way.
If you think otherwise, could you elaborate on why you think that?
Cheers,
Robin
Yeah, many loops with different demands on the vector unit in one
function is a problem for the "one vscale per function" approach.
Though for the record, the differences that matter here are not trip
count, but things like register pressure and the bit widths of the
vector elements.
There are some (fragile) workarounds for this problem, such as
splitting up the function. There's also the possibility of optimizing
for this case in the backend: trying to recognize when you can use
different configurations/vscales for two loops without changing
observable behavior (no vector values live between the loops, vscale
doesn't escape, etc.). In general this is of course extremely
difficult, but I hope it'll work well enough in practice to mitigate
this problem somewhat. This is just an educated guess at this point,
we'll have to wait and see how big the impact is on real applications
and real hardware (or simulations thereof).
But at the end of the day, sure, maybe we'll generate sub-optimal code
for some applications. That's still better than making the problem
intractable by being too greedy and ending up with either a broken
compiler or one that can't vary vscale at all.
Cheers,
Robin
Good point.
...
Was it decided that this issue is equivalent to, or a subset of,
per-lane predication on load, stores, and similar? Or is it different?
Thanks again,
Hal
It is equivalent a subset. If there are k lanes, vector instructions
execute under a mask that enables the first VL lanes and disables the
remaining (k - VL) lanes. That same mask is computed by SVE
instructions such as whilelt. This style of predication can be
combined with a more conventional and more general one-bit-per-lane
mask, then the instruction executes under the conjunction of these two
masks.
Just a quick question about bitcode format changes; is there anything special I should be doing for that beyond ensuring the reader can still process older bitcode files correctly?
The code in the main patch will always emit 3 records for a vector type (as opposed to the current 2), but we could omit the third field for fixed-length vectors if that's preferable.
-Graham
Any reason not to omit the field? This can affect object-file size when
using LTO, etc.
-Hal
Sorry for the delay, but I now have an initial implementation of size queries
for scalable types on phabricator:
https://reviews.llvm.org/D53137 and https://reviews.llvm.org/D53138
This isn't complete (I haven't used the DataLayout version outside of the tests),
but I'd like to get some feedback before making further changes.
Some notes/questions:
1. I've changed Type::getPrimitiveSizeInBits to return 0 for vector types
and changed all uses of it with vector types to use
Type::getScalableSizeInBits instead, following the design in the RFC.
While this shows where getPrimitiveSizeInBits is used with vector types,
I think it would be better for now to change it back to avoid breaking
existing targets and put an assert in it to ensure that it's only used
on non-scalable vectors. We can revisit the decision later once
scalable vector support is more mature. Thoughts?
2. There are two implementations of ScalableSize; one in Type.h, and one
in DataLayout.h. I'd prefer to only have one, but the former reports
sizes as 32bits while the latter uses 64bits.
I think changing all size queries to use 64bits is the best way to
resolve it -- are there any significant problems with that approach
aside from lots of code churn?
It would also be possible to use templates and typedefs, but I figure
unifying size reporting would be better.
3. I have only implemented 'strict' comparisons for now, which leads to
some possibly-surprising results; {X, 0} compared with {0, X} will
return false for both '==' and '<' comparisons, but true for '<='.
I think that supporting 'maybe' results from overloaded operators
would be a bad idea, so if/when we find cases where they are needed
then I think new functions should be written to cover those cases
and only used where it matters. For simple things like stopping
casts between scalable and non-scalable vectors the strict
comparisons should suffice.
4. Alignment for structs containing scalable types is tricky. For now,
I've added an assert to force all structs containing scalable vectors
to be packed.
It won't be possible to calculate correct offsets at compile time if
the minimum size of a struct member isn't a multiple of the required
alignment for the subsequent element(s).
Instead, a runtime calculation will be required. This could arise in
SVE if a predicate register (minimum 2 bytes) were used followed by
an aligned data vector -- it could be aligned, but it could also
require adding up to 14 bytes of padding to reach minimum alignment
for data vectors.
The proposed ACLE does allow creating sizeless structs with both
predicate and data registers so we can't forbid such structs, but it
makes no guarantees about alignment -- it's implementation defined.
Do any of the other architectures with scalable vectors have any
particular requirements for this?
5. The comparison operators contain all cases within them. Would it be
preferable to just keep the initial case (scalable terms equal and
likely zero) in the header for inlining, and move all other cases
into another function elsewhere to reduce code bloat a bit?
6. Any suggestions for better names?
7. Would it be beneficial to put the RFC in a phabricator review to make
it easier to see changes?
8. I will be at the devmeeting next week, so if anyone wants to chat
about scalable vector support that would be very welcome.
-Graham
On Thu, 11 Oct 2018 at 15:14, Graham Hunter <Graham...@arm.com> wrote:
> 1. I've changed Type::getPrimitiveSizeInBits to return 0 for vector types
> and changed all uses of it with vector types to use
> Type::getScalableSizeInBits instead, following the design in the RFC.
>
> While this shows where getPrimitiveSizeInBits is used with vector types,
> I think it would be better for now to change it back to avoid breaking
> existing targets and put an assert in it to ensure that it's only used
> on non-scalable vectors. We can revisit the decision later once
> scalable vector support is more mature. Thoughts?
Another solution would be to make it return ScalableSize.Unscaled. At
least in a transition period.
> 2. There are two implementations of ScalableSize; one in Type.h, and one
> in DataLayout.h. I'd prefer to only have one, but the former reports
> sizes as 32bits while the latter uses 64bits.
>
> I think changing all size queries to use 64bits is the best way to
> resolve it -- are there any significant problems with that approach
> aside from lots of code churn?
>
> It would also be possible to use templates and typedefs, but I figure
> unifying size reporting would be better.
Agreed.
> 3. I have only implemented 'strict' comparisons for now, which leads to
> some possibly-surprising results; {X, 0} compared with {0, X} will
> return false for both '==' and '<' comparisons, but true for '<='.
>
> I think that supporting 'maybe' results from overloaded operators
> would be a bad idea, so if/when we find cases where they are needed
> then I think new functions should be written to cover those cases
> and only used where it matters. For simple things like stopping
> casts between scalable and non-scalable vectors the strict
> comparisons should suffice.
How do you differentiate between maybe and certain?
Asserts making sure you never compare scalable with non-scalable in
the wrong way would be heavy handed, but are the only sure way to
avoid this pitfall.
A handler to make those comparisons safe (for example, returning
safety breach via argument pointer) would be lighter, but require big
code changes and won't work with overloaded operators.
> 4. Alignment for structs containing scalable types is tricky. For now,
> I've added an assert to force all structs containing scalable vectors
> to be packed.
I take it by "alignment" you mean element size (== structure size),
not structure alignment, which IIUC, only depends on the ABI.
I remember vaguely that scalable vectors' alignment in memory is the
same as the unit vector's, and the unit vector is known at compile
time, just not the multiplicity.
Did I get that wrong?
> It won't be possible to calculate correct offsets at compile time if
> the minimum size of a struct member isn't a multiple of the required
> alignment for the subsequent element(s).
I assume this would be either an ABI decision or an extension to the
standard, but we can re-use C99's VLA concepts, only here it's the
element size that is unknown, not just the element count.
This would keep the costs of unknown offsets until runtime to a minimal.
It would also make sure undefined behaviour while accessing
out-of-bounds offsets in a structure with SVE types break consistently
and early. :)
cheers,
--renato
Thanks for taking a look.
> On 11 Oct 2018, at 15:57, Renato Golin <renato...@linaro.org> wrote:
>
> Hi Graham,
>
> On Thu, 11 Oct 2018 at 15:14, Graham Hunter <Graham...@arm.com> wrote:
>> 1. I've changed Type::getPrimitiveSizeInBits to return 0 for vector types
>> and changed all uses of it with vector types to use
>> Type::getScalableSizeInBits instead, following the design in the RFC.
>>
>> While this shows where getPrimitiveSizeInBits is used with vector types,
>> I think it would be better for now to change it back to avoid breaking
>> existing targets and put an assert in it to ensure that it's only used
>> on non-scalable vectors. We can revisit the decision later once
>> scalable vector support is more mature. Thoughts?
>
> Another solution would be to make it return ScalableSize.Unscaled. At
> least in a transition period.
True, though there are places in the code that expect a size of 0 to mean
"this is a pointer", so using scalable vectors with that could lead to
incorrect code being generated instead of an obvious ICE.
>> 2. There are two implementations of ScalableSize; one in Type.h, and one
>> in DataLayout.h. I'd prefer to only have one, but the former reports
>> sizes as 32bits while the latter uses 64bits.
>>
>> I think changing all size queries to use 64bits is the best way to
>> resolve it -- are there any significant problems with that approach
>> aside from lots of code churn?
>>
>> It would also be possible to use templates and typedefs, but I figure
>> unifying size reporting would be better.
>
> Agreed.
>
>
>> 3. I have only implemented 'strict' comparisons for now, which leads to
>> some possibly-surprising results; {X, 0} compared with {0, X} will
>> return false for both '==' and '<' comparisons, but true for '<='.
>>
>> I think that supporting 'maybe' results from overloaded operators
>> would be a bad idea, so if/when we find cases where they are needed
>> then I think new functions should be written to cover those cases
>> and only used where it matters. For simple things like stopping
>> casts between scalable and non-scalable vectors the strict
>> comparisons should suffice.
>
> How do you differentiate between maybe and certain?
This work is biased towards 'true' being valid if and only if the condition
holds for all possible values of vscale. This does mean that returning a
'false' in some cases may be incorrect, since the result could be true for
some (but not all) vscale values.
I don't know if 'maybe' results are useful on their own yet.
> Asserts making sure you never compare scalable with non-scalable in
> the wrong way would be heavy handed, but are the only sure way to
> avoid this pitfall.
>
> A handler to make those comparisons safe (for example, returning
> safety breach via argument pointer) would be lighter, but require big
> code changes and won't work with overloaded operators.
My initial intention was for most existing code (especially in target
specific code for targets without scalable vectors) to continue using
the unscaled-only interfaces; there's also common code which is guarded
by a check for scalar types before querying size. I haven't counted up
all the cases that would need to change, but the majority will be fine
as is.
Do you think that implementing the comparisons without operator overloading
would be preferable? I know that APInt does this, so it wouldn't be
unprecedented in the codebase -- I was just trying to fit the existing code
without changing too much, but maybe that's the wrong approach.
Either passing in a pointer as you suggest, or returning an 'ErrorOr<bool>'
as a result would allow appropriate boolean results through and require
the calling code to handle 'maybes' (which could just mean bailing out of
whatever transformation that was about to be performed).
I'll take a look through some uses of DataLayout to see how well that would
work.
>> 4. Alignment for structs containing scalable types is tricky. For now,
>> I've added an assert to force all structs containing scalable vectors
>> to be packed.
>
> I take it by "alignment" you mean element size (== structure size),
> not structure alignment, which IIUC, only depends on the ABI.
I mean alignment of elements within a struct, which does indeed determine
structure size.
> I remember vaguely that scalable vectors' alignment in memory is the
> same as the unit vector's, and the unit vector is known at compile
> time, just not the multiplicity.
>
> Did I get that wrong?
That's correct, but data vectors (Z registers) and predicate vectors
(P registers) have different unit vector sizes: 128bits vs 16bits,
respectively.
We could insist that predicate vectors take up the same space as data
vectors, but that will waste some space.
>> It won't be possible to calculate correct offsets at compile time if
>> the minimum size of a struct member isn't a multiple of the required
>> alignment for the subsequent element(s).
>
> I assume this would be either an ABI decision or an extension to the
> standard, but we can re-use C99's VLA concepts, only here it's the
> element size that is unknown, not just the element count.
>
> This would keep the costs of unknown offsets until runtime to a minimal.
Sure, it's something to handle at the ABI level, so I'd like to know if
RVV or NEC's vector architecture have any special requirements here.
I would hope that sufficient advice to the programmer would avoid this
being a common problem and predicate vectors were always placed after
data vectors, but we do need to make sure it will work the other way
round.
-Graham
I see.
> This work is biased towards 'true' being valid if and only if the condition
> holds for all possible values of vscale. This does mean that returning a
> 'false' in some cases may be incorrect, since the result could be true for
> some (but not all) vscale values.
I wonder is there's any case where returning a wrong false would be
problematic.
I can't think of anything, so I agree with your approach. :)
> Do you think that implementing the comparisons without operator overloading
> would be preferable? I know that APInt does this, so it wouldn't be
> unprecedented in the codebase -- I was just trying to fit the existing code
> without changing too much, but maybe that's the wrong approach.
No, I think keep the code changes to a minimum is important.
And the problems will only be on scalable vs. non-scalable, which is
non-existent today, so not expecting anything current to break.
> I'll take a look through some uses of DataLayout to see how well that would
> work.
Thanks! If we can solve that in a simple way, good. If not, I don't
see it as a big deal, for now.
> That's correct, but data vectors (Z registers) and predicate vectors
> (P registers) have different unit vector sizes: 128bits vs 16bits,
> respectively.
Ah, I see. I imagine P regs will need padding to the maximum alignment
in (almost?) all cases.
Following various discussions at the recent devmeeting, I've posted an RFC for
scalable vector IR type alone on phabricator: https://reviews.llvm.org/D53695
There's a couple of changes, and I posted that as a separate revision on top
of the previous text so changes are visible.
The main differences are:
- Size comparisons between unscaled and scaled vector types are considered
invalid for now, and will assert.
- Scalable vector types cannot be members of StructTypes or ArrayTypes. If
these are needed at the C level (e.g. the SVE ACLE C intrinsics), then
clang must perform lowering to a pointer + vscale-based arithmetic instead
of creating aggregates in IR.
I will update the IR type patch and size query patch soon.
-Graham
Sounds like a safe approach.
> - Scalable vector types cannot be members of StructTypes or ArrayTypes. If
> these are needed at the C level (e.g. the SVE ACLE C intrinsics), then
> clang must perform lowering to a pointer + vscale-based arithmetic instead
> of creating aggregates in IR.
Perhaps also mark them as noalias, given that the front-end has full
control over its lifetime?
> I will update the IR type patch and size query patch soon.
I'll have a look, thanks!
--renato
Generic code can enquire the size, dynamically allocate space, and
transparently save and restore the contents of a vector register or
registers.
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
JinGu:
I’m not Graham, but you might find the following link a good starting point.
The question you ask doesn’t have a short answer. The compiler and the instruction set design work together to allow programs to be compiled without knowing until run-time what the vector width is (within limits of min and max possible widths). One key restriction is that certain storage classes can’t contain scalable vector types, like statically allocated globals for example.
Joel Jones
From:
JinGu Kang <ji...@codeplay.com>
Date: Friday, May 24, 2019 at 11:28 AM
To: Chris Lattner <clat...@nondot.org>, Hal Finkel <hfi...@anl.gov>, "Jones, Joel" <Joel....@cavium.com>, "d...@cray.com" <d...@cray.com>, Renato Golin <renato...@linaro.org>, Kristof Beyls <Kristo...@arm.com>, Amara Emerson <aeme...@apple.com>,
Florian Hahn <Floria...@arm.com>, Sander De Smalen <Sander....@arm.com>, Robin Kruppe <robin....@gmail.com>, "llvm...@lists.llvm.org" <llvm...@lists.llvm.org>, "mku...@google.com" <mku...@google.com>, Sjoerd Meijer <Sjoerd...@arm.com>, Sam
Parker <Sam.P...@arm.com>, Graham Hunter <Graham...@arm.com>
Cc: nd <n...@arm.com>
Subject: [EXT] Re: [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
External Email