On Oct 30, 2020, at 10:08, Anna Sophia Welker via llvm-dev <llvm...@lists.llvm.org> wrote:

Hi all,
I am looking into the benefits of a VPlan-based cost model, and into how such a cost model should be implemented to make the most out of these benefits.
Over the last year I have been working with LLVM, mainly focused on the ARM backend, in the course of a one-year internship at Arm Ltd. My main project from December 2019 to August 2020 was to introduce gather/scatters for MVE auto-vectorization. One of the recurring challenges during this work was to get things right in the cost model.
For example, gathers can extend the data they load, while scatters can truncate their input, meaning that an extend following the load, or a truncate preceding the store, comes for free if it meets certain type conditions. As the current cost model is not designed for context-based analysis, this was a pain to model.
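To make the kind of context dependence concrete: when costing the extend, a target would like to be able to check something roughly like the following (this is only a sketch of the idea, not the actual ARM TTI code, and the width conditions are just an example):

// Illustrative only - not the real ARMTTIImpl logic.
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/PatternMatch.h"
using namespace llvm;
using namespace llvm::PatternMatch;

// True if the extend can be folded into the gather that feeds it,
// assuming (hypothetically) the target supports extending gathers
// for these element widths.
static bool extendFoldsIntoGather(ZExtInst *ZExt) {
  // The zext must consume a masked gather directly ...
  if (!match(ZExt->getOperand(0),
             m_Intrinsic<Intrinsic::masked_gather>()))
    return false;
  // ... and the widths must be ones the hardware can extend natively,
  // e.g. i8 -> i32 or i16 -> i32 for an MVE-style gather.
  unsigned SrcBits = ZExt->getSrcTy()->getScalarSizeInBits();
  unsigned DstBits = ZExt->getDestTy()->getScalarSizeInBits();
  return (SrcBits == 8 || SrcBits == 16) && DstBits == 32;
}

If a check like this succeeds, the extend should be costed as free; with the current interfaces it is awkward to pass that kind of context around.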
I have done some research and found out that one of the proposed benefits of VPlan is that a new cost model making use of it would be able to better support context-dependent decisions like this.
However, there is not yet much written down about what such a cost model should look like.
Also, I have read through the relevant code to understand how loop vectorization is currently done and how far the introduction of VPlan has progressed. I have realised that the only recipe that actually handles more than one instruction from the input IR is the one for interleaved groups. When the VPlan is generated on the VPlan-native path, every IR instruction is considered and transformed into a recipe separately, ignoring its context (to give a code location, I am looking at VPlanTransforms::VPInstructionsToVPRecipes).
Different architectures also do not all provide the same vector instructions, so a pattern that works great for one could be useless for others. So I am wondering: is there any plan to have target-dependent flavours of recipes, or how will those things be handled?
Right now it makes sense that nothing like this has been implemented yet, as there is no cost model that could guide the transformation. But if recipes are a general thing, should the cost model be the component actually providing the target-specific pattern for a recipe, together with its cost?
I am considering choosing a topic related to VPlan, possibly cost modelling, for my Master thesis, with the goal to present a solution and implement a prototype.
Hi Florian,
With VPlan, you could instead add a new VPInstruction opcode ZExtLoad and have a transformation that replaces all `zext (load)` pairs with the new ZExtLoad one. The cost model needs to be taught about this special instruction and how much it costs. Then you could apply this transformation to an existing plan and check if the overall cost improved.
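To sketch the shape of such a transformation (the types and names below are invented purely for illustration; the real VPlan classes and recipes look different):

// Purely illustrative sketch - these types are made up, not VPlan's.
#include <memory>
#include <vector>

enum class Op { Load, ZExt, ZExtLoad /* ... */ };

struct Node {
  Op Opcode;
  std::vector<Node *> Operands;
  std::vector<Node *> Users;
};

// Fold zext(load) pairs into a single combined node, so a
// target-aware cost model can price the pair as one instruction.
static void foldZExtIntoLoad(std::vector<std::unique_ptr<Node>> &Plan) {
  for (auto &N : Plan) {
    if (N->Opcode != Op::ZExt || N->Operands.empty())
      continue;
    Node *Load = N->Operands.front();
    if (Load->Opcode != Op::Load || Load->Users.size() != 1)
      continue;
    // Rewrite the zext in place as a combined ZExtLoad that takes
    // over the load's operands; the now-dead load is cleaned up later.
    N->Opcode = Op::ZExtLoad;
    N->Operands = Load->Operands;
  }
}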
Of course there’s also the issue of how to generalize the TTI APIs to allow for computing the cost of instructions with more context.
I think we probably want to keep things as generic as possible, try to model generic concepts and use TTI to decide whether to use it or not (e.g. see how masked loads/stores are handled).
Hm, so what you are saying is that for any instruction that could
be merged into another one on some architecture, there should be a
new VPInstruction (naturally based on the 'plain' VPInstruction for
that instruction) that symbolises this and would be assigned a cost
of 0 by the TTI - or whatever a new cost model would use - of the
respective architecture only. I had to think about this a bit, but
now it makes sense to me. Handling it in a smart way, such that
TTIs that do not know about this special case fall back to what
they do for the normal case, would also save us from generating
additional VPlans. And with the load being the ingredient of the
extend, and thus accessible, everything is available to do some
type checking and ensure the merge is doable for the architecture.
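Just to check that I picture the fallback correctly, what I have in
mind is roughly this (all names here are made up for illustration,
this is not a real interface):

// Hypothetical sketch of the fallback behaviour, nothing more.
#include <optional>

struct CombinedOp {
  unsigned Opcode; // e.g. the combined extend-plus-load from above
};

struct TargetCostInfo {
  // A target that knows the combined operation returns a cost for
  // it; a target that has never heard of it returns nothing.
  virtual std::optional<int> costOfCombined(const CombinedOp &) const {
    return std::nullopt;
  }
  virtual ~TargetCostInfo() = default;
};

static int costOf(const CombinedOp &Op, const TargetCostInfo &TCI,
                  int GenericLoadCost, int GenericExtendCost) {
  if (auto C = TCI.costOfCombined(Op))
    return *C; // e.g. 0: the extend is folded into the load
  // Fallback: targets that know nothing about the combined operation
  // cost it exactly as they would the separate load and extend.
  return GenericLoadCost + GenericExtendCost;
}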
This really helped me to understand how things are supposed to be
working, thanks a lot for explaining!
I do still have a few doubts as to whether this approach might bear
some risk of over-generating VPlans (see my reply to Renato's email,
which I'll link you to), but those might simply come from me being
too cautious, because I do not yet have a good overview of how many
(different) examples like the one I mentioned there are - that is,
how many special cases would have to get their own VPInstruction
flavour.
I am not sure what patterns specifically you are thinking about, but I think the cost model should just evaluate the cost of a given plan and not provide anything beyond that. Of course, this still can mean that there might be certain recipes/instructions that are not available/profitable on some targets and we decide to never generate them, based on the cost-information.
I was wondering how one would merge the extend from the example
above into the load without any changes to the VPlan
structure/classes, i.e. without the trick of adding special
instructions that you mentioned above. But that does not seem like
a nice way to do it.
> I am considering choosing a topic related to VPlan, possibly cost modelling, for my Master thesis, with the goal to present a solution and implement a prototype.
I am hoping to make some progress on this in the next few months (hopefully the work on modeling the def-use chains between recipes in VPlan will be wrapped up soon) and I expect there to be a few moving parts. I'm not sure what that means for a master thesis in this area.
> The cost model structure needs to be target independent but with heavy
> target-specific knowledge. The kind of specialisation we created for
> the previous vectoriser is too naive, and assumes the concepts are the
> same, just not necessarily the costs. That's not true at all for
> different hardware.
Yes, that's a big issue to be solved. I do like the idea from
Florian's reply: basically encoding an instruction's environment in
the VPInstruction that is generated, and making it such that
target-specific cost components override the cost calculations for
the general class if the architecture can treat them in a special
way.
Though currently the gather that is extended and the scatter whose
input is truncated are my only examples for this kind of thing,
which is why I am hoping that the community will have encountered
more examples like these and is willing to share. Maybe there are
cases that are more complex, that e.g. merge or depend on far more
than two instructions, in which case generating the right
VPInstructions might become tricky - especially if there are
multiple possible 'merge patterns' (for one or different
architectures) that overlap, which would require generating
alternative VPlans, whose number for long loops with many special
cases might grow exponentially to cover all possible combinations.
> There are probably many other ways to do this, I'm just thinking out
> loud to give an idea of the problems we faced earlier. I'm sure you'll
> find better ways to do this yourself once you start looking at it more.
>
> Regardless of how we do it, having the ability to say "this shuffle is
> free on this instruction" but "not with that other one" will make the
> cost model a lot more precise and need a lot less fudged heuristics.
Well thank you for thinking out loud, that is exactly what I was hoping
to get out of this RFC!
Yes, there needs to be a good way to say things like that, and
preferably 'good' should not only mean efficient, but also easy to
understand, maintain and extend - but that's just me dreaming :)
Are you thinking of a concrete example when mentioning those shuffles?
Because concrete examples would really help me a lot when thinking
further about this - if I decide to do this as a thesis, I might have to
constrain the actual implementation to (a subset of instructions for) a
single architecture, but for thinking about the issue itself any
concrete examples for any architecture would be a great help!
Hm, I hadn't yet gotten as far as how to use it, but you're right of
course - that will be another challenge. I know that there are many
things being planned and on the way for VPlan, but I'll willingly admit
that I am currently still having some trouble gaining an overview of
what is there, what is not, and what might be on its way. It's simply
hard and takes a lot of time to find documentation for all the new
things, but I'm sure I'll get there.
Thanks for your detailed reply and all the ideas, they do help a lot.
Best,
Anna
+ llvm-dev, which got accidentally dropped - sorry for any
duplication
Hi Sjoerd,
There was a VPlan round-table at the US LLVM dev conference just a few weeks ago, and I have linked in the people who were involved in this as cost-modeling was one of the topics that was discussed. You may have seen this already, but Dave started doing some work in this area recently:
Just some first high-level remarks on your proposal: I think work in this area would be very valuable and solve a real problem that we currently have. If you plan to choose this as a subject for a master's thesis, you probably need to think about how big of a task this is and having some motivating examples would be good too (but you mentioned some already). To me it sounds like a big task that needs a lot of plumbing in VPlan, but that's why having this discussion here is interesting and hopefully people with more VPlan experience can provide insights.
You are right and I am fully aware that building a fully
functional cost model for VPlan might be a very big task, well
beyond the scope of a master thesis. But right now there isn't
even clear documentation of what exactly such a cost model should
look like, and I think it is an interesting and important topic
to explore. If I end up with a good idea, a thesis that discusses
the topic so there finally is some documentation, and a
proof-of-concept implementation, I hope that would already be of
some value and interest to the community.
But yes, good examples are needed - that's one reason for this
RFC, as of course my experience in the matter is very localised
to a specific kind of instruction, and I am sure many members of
the community have met other issues that could help paint a
better picture of what should be done.
Best,
Anna
The grammar idea sounds interesting. My brain tends to just work in terms of code, but it might make a great master thesis :) The pattern matchers were modelled after the existing ones in LLVM that are used all over the place, especially in instcombine. It was meant to feel familiar, fitting in with what already existed.
I imagine a lot of people must have thought about the same kinds of ideas, and there must be some research out there on it. Whether it worked well or not, it would be good to look into what was already tried. I imagine there will be cases where it works well and times where it doesn't. The same concept of specifying patterns in some way that isn't C++ code could equally be applied to instcombine or DAG combining. The fact that no one ever did it is perhaps a bad sign? I know there are some tablegen-defined combines for GlobalISel now; they might give some ideas.
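For reference, the idiom looks roughly like this (written from memory as a sketch, not the actual code from the patch), here matching a vmulh-like pattern trunc(lshr(mul(sext(a), sext(b)), 8)):

#include "llvm/IR/PatternMatch.h"
using namespace llvm;
using namespace llvm::PatternMatch;

// Returns true if V is the root of a mulh-shaped pattern. The shift
// amount of 8 is just an example; the real check would depend on the
// element widths involved.
static bool looksLikeMulh(Value *V) {
  Value *A, *B;
  return match(V, m_Trunc(m_LShr(m_Mul(m_SExt(m_Value(A)),
                                       m_SExt(m_Value(B))),
                                 m_SpecificInt(8))));
}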
Combining a few instructions into a vmulh is simple enough, and I was hoping that there would not be a huge number of combines needed. There are bigger transforms that perhaps wouldn't fit that very well. Imagine trying to take code that does:
x, y = vld2
x *= -1
vst2 x, y
and turn that into
x = load
x *= [-1,1]
str x
That is like "de-interleaving" an interleaving group, much like SLP vectorization. Plus it would need to make that transform generic across all the different operations that could sit between the loads/store, and across the sizes of the interleaving groups. It could, of course, get quite complex.
Looking forward to seeing what you come up with.
Dave
--------------------------------------------
Sent: 03 November 2020 15:45
Subject: Re: [llvm-dev] [RFC] Goals for VPlan-based cost modelling
Hi Dave,
thanks for your detailed reply, the reductions you describe really
seem like a perfect example of what a new cost model and IR
transformation should be able to cope with, as do vmulh and the
other instructions you mentioned.
In particular, they do underline the importance of finding a solution
for patterns of arbitrary size, which is much harder than finding a
feasible solution for patterns of size two like the ones I mentioned.
A thought that has been sitting in my head for a while is whether
this part of transforming IR (or VP-) instructions into VPRecipes
could be handled by some kind of grammar, like the grammars used
for parsing in NLP, or in compilers. Patterns could be represented
as grammar rules, and if more than one pattern matches, the
analysis can fork and continue on two or more VPlans in parallel.
Recipes that can be used on many architectures would be in the main
grammar. It could be a probabilistic grammar with weights of either
0 or 1, such that architectures that do not implement a recipe can
set the probability of it ever being generated to 0. For very
target-specific patterns, target-specific sub-grammars could be
maintained.
On a totally crazy thought, a cost model could be integrated into
this grammar step, given that the patterns need legality (type)
checking anyway, like it is currently done in some TTI cost
functions. Then the recipes can be annotated with their cost at
generation time, we maintain a total cost for each VPlan, and we
can define a threshold on 'cost(VPlan) - minVPlanCost' above which
we abandon plans on the fly, thus dropping options that are
unlikely to be fruitful mid-analysis (I know that threshold would
be difficult to find, but at least it would reduce the problem
space a bit).
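To make that a bit more concrete, something along these lines is
what I am picturing (just a sketch with made-up names, not a
design):

// Sketch only; all names are invented to illustrate the idea.
#include <algorithm>
#include <string>
#include <vector>

struct GrammarRule {
  std::string Name;          // e.g. "zext(gather) -> extending gather"
  std::vector<unsigned> LHS; // the opcodes the rule consumes
  unsigned RHS;              // the recipe/VPInstruction it produces
  float Weight;              // per target; 0 = never generate this
};

struct CandidatePlan {
  float Cost = 0;            // running cost accumulated while parsing
};

// Drop candidate plans whose running cost is already far above the
// cheapest one, so the number of alternative plans stays manageable.
static void prune(std::vector<CandidatePlan> &Plans, float Slack) {
  if (Plans.empty())
    return;
  float Best = Plans.front().Cost;
  for (const CandidatePlan &P : Plans)
    Best = std::min(Best, P.Cost);
  Plans.erase(std::remove_if(Plans.begin(), Plans.end(),
                             [&](const CandidatePlan &P) {
                               return P.Cost > Best + Slack;
                             }),
              Plans.end());
}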
In an approach like that patterns could be arbitrarily large, but I do
have some fears that it might lead to over-generating VPlans. It would
also probably mean that a lot of recipes are needed to model every
instruction that can possibly be generated, but if the concrete costs
were calculated at parse time and the information from the matched
instructions were reduced to what will be needed to generate the output
IR, it might be possible to reduce that.
I think your example in D88152 goes in a similar direction, only it
transforms VPInstructions into another VPInstruction, not into a
recipe (the pattern matching does remind me a bit of TableGen, tbh).
What I wonder is: if it looks like leveraging VPlan requires some
kind of pattern matching anyway - be it in the cost model for
correct cost calculation, or to detect and transform patterns of
VPInstructions - wouldn't it conceptually make sense to explicitly
define something like a grammar, instead of doing it per
instruction/pattern and per application (costing/transformation)?
What are your thoughts on this?
Thanks,
Anna
On 03.11.20 11:55, David Green wrote:
> Hello
>
> Perhaps I can talk a little about the things I have been at least thinking about lately. Unfortunately a few other important things have come up and I have not had the time to do as much as I would have liked. What I have is mostly on the pragmatic end of the spectrum.
>
> So we've been thinking about this for a while but it really starts with vector reductions. Arm MVE has some instructions that make it beneficial to put the vector reduction instruction into the loop, instead of summing in the loop and reducing after it. So instead of:
> loop:
>   a = load v16i8
>   s = s + a
> reduce(s)
>
> We can efficiently do:
> loop:
>   a = load v16i8
>   s += reduce a
>
> That is all working now, and seems to be humming along nicely in the vectorizer. It even handles tail predicated loops. The real benefit from those instructions though comes from the fact that they can natively handle larger than legal reductions. So we have an instruction that can take 16i8's and sum them into an i32. That would look something like this in IR:
> loop:
>   a = load v16i8
>   e = sext a to v16i32
>   s += reduce e
>
> It can also do a mla at the same time, doing reduce(mul(sext, sext)) as a single instruction. We even have i64 variants of the instruction, even though there is no i64 vector add in MVE. Unfortunately all of this is both too large for the vectorizer to consider at the moment and looks like a set of very expensive operations. Extends especially are very expensive at the moment in MVE.
>
> We can already cost model sext(load(..)) well (along with other patterns like add(sext, ..) for aarch64). We have the context parameter (plus a parameter for the type of the load now). For 2-instruction patterns this mostly works OK, if not very elegantly, providing that the context instruction can be trusted (which, with vplan making more changes, becomes a big if). Any more than 2 starts to get very awkward though, trying to check that you are the middle mul in a reduce(mul(sext, sext)) pattern. Especially if the reduce looks like an add!
>
> I had a go at implementing some code to try and get the reductions more correct. One idea was for a better way of providing context. Something like an "InstructionView", in the same way as there is an ArrayView or StringView. It gives a way to query possibly fake instructions for the info the backend might need to make decisions and could be backed by real instructions or vplan nodes. Another simpler idea was to keep a list of "free instructions" that the backend could provide and iterate backwards through the costs instead of forwards, letting the backend greedily choose patterns. I didn't much like the changes they required though and didn't get very far with them, as they are all very large changes that do not play well into the cost model as it currently exists.
>
> So I knew that there were some other things that I wanted to start making better in the vectorizer/cost model; maybe investigating those would give a better view of what the best way to handle this would be. One is instructions like vmulh. They multiply and take the top half, and are common across a number of architectures. Things like vrmulh, vhadd, vrhadd, vqdmulh, vabsd, etc can all fit into the same bucket. They conceptually do something like:
> loop:
>   a = load v16i8
>   b = load v16i8
>   ae = sext a to v16i16
>   be = sext b to v16i16
>   m = mul ae, be
>   s = shr m, 8
>   t = trunc s to v16i8
>   store t
>
> So the same issue with sign extends being expensive exists, plus the cost considered for those instructions will be very wrong (especially if you consider an i32 extended to an i64 on an architecture with no 64 bit operations). That I put up as https://reviews.llvm.org/D88152, mostly as a proof of concept to show how it can be done. It recognizes the pattern in vplan and turns it into a single mulh VPInstruction simply enough. The obvious problem comes in where we are still costing the original instructions in the loop though. Hence the cost model really needs to be moved over to costing vplan nodes, not the original IR instructions (which leads us to https://reviews.llvm.org/D89322 making a start at that).
>
> D88152 obviously still needs some details filled in. The rough idea was to have a set of "VectorPatterns" that the backend could be queried for the cost of (like vmulh etc). In this case we are replacing 5 instructions with 1 at a lower bitwidth, so the change is going to be fairly obvious to make if it is legal, and can be done by replacing nodes. In other cases it may be better to clone the vplan and cost the copies against one another. It still seems simpler in these cases (and the concepts are general enough) that combining in the vectorizer seems like a good way forward. It doesn't really help cost modelling anywhere else in the compiler though, if that is something that becomes important.
>
> We would love to be able to start doing things like costing predicated vplans against unpredicated vplans, along with things like de-interleaving the interleaving groups (kind of like SLP vectorization), costing gathers vs interleaving loads, and even possibly things like getting the vectorizer to consider lane interleaving if we can. All of this requires us to be costing vplans though, at least as a first step. Doing some of these, like (what I have been calling) lane interleaving, certainly requires more global analysis than the local checks we can do in places like instcombine or ISel, to make sure they are really profitable. It's hard to guess at the costs of them.
>
> I also found it very easy to introduce regressions when trying to rewrite an entire cost model, as you might imagine. Simple enough things (and actually costing predicate instructions) can have severe effects on some code. It was something I was trying to go slowly on and hopefully do a bit at a time. We will see how that goes though.
>
> In short, I think I like the idea of vplan looking more like the target expects it to end up as, as opposed to trying to reverse engineer or guess that in the costmodel. None of this has been through review or anything like that. It was just the kinds of things we were trying to work on and hoping on improving. VPlan has a lot of potential I believe, and it would be great to see it start solving some real problems.
>
> I'm not sure if that helped at all. Maybe it at least gave some context, even if it was a bit rambley.
>
> Looking forward to seeing what your thoughts are,
> Dave