On Oct 30, 2020, at 10:08, Anna Sophia Welker via llvm-dev <llvm...@lists.llvm.org> wrote:

Hi all,
I am looking into the benefits of a VPlan-based cost model, and into how such a cost model should be implemented to make the most out of these benefits.
Over the last year I have been working with LLVM, mainly focused on the ARM backend, in the course of a one-year internship at Arm Ltd. My main project from December 2019 to August 2020 was to introduce gather/scatters for MVE auto-vectorization. One of the recurring challenges during this work was to get things right in the cost model.
For example, gathers can extend the data they load, while scatters can truncate their input, meaning that an extend following the load, or a truncate preceding the store, comes for free if it meets certain type conditions. As the current cost model is not designed for context-based analysis, this was a pain to model.
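To make the kind of context dependence concrete: when costing the extend, a target would like to be able to check something roughly like the following (this is only a sketch of the idea, not the actual ARM TTI code, and the width conditions are just an example):

// Illustrative only - not the real ARMTTIImpl logic.
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/PatternMatch.h"
using namespace llvm;
using namespace llvm::PatternMatch;

// True if the extend can be folded into the gather that feeds it,
// assuming (hypothetically) the target supports extending gathers
// for these element widths.
static bool extendFoldsIntoGather(ZExtInst *ZExt) {
  // The zext must consume a masked gather directly ...
  if (!match(ZExt->getOperand(0),
             m_Intrinsic<Intrinsic::masked_gather>()))
    return false;
  // ... and the widths must be ones the hardware can extend natively,
  // e.g. i8 -> i32 or i16 -> i32 for an MVE-style gather.
  unsigned SrcBits = ZExt->getSrcTy()->getScalarSizeInBits();
  unsigned DstBits = ZExt->getDestTy()->getScalarSizeInBits();
  return (SrcBits == 8 || SrcBits == 16) && DstBits == 32;
}

If a check like this succeeds, the extend should be costed as free; with the current interfaces it is awkward to pass that kind of context around.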
I have done some research and found out that one of the proposed benefits of VPlan is that a new cost model making use of it would be able to better support context-dependent decisions like this.
However, there is not yet much written down about what such a cost model should look like.
Also, I have read through the relevant code to understand how loop vectorization is currently done and how far the introduction of VPlan has progressed. I have realised that the only recipe that actually handles more than one instruction from the input IR is the one for interleaved groups. When the VPlan is generated on the VPlan-native path, every IR instruction is considered and transformed into a recipe separately, ignoring its context (to give a code location, I am looking at VPlanTransforms::VPInstructionsToVPRecipes).
Different architectures also do not all provide the same vector instructions, so a pattern that works great for one could be useless for others. So I am wondering: is there any plan to have target-dependent flavours of recipes, or how will those things be handled?
Right now it makes sense that nothing like this has been implemented yet, as there is no cost model that could guide the transformation. But if recipes are a general thing, should the cost model be the component actually providing the target-specific pattern for a recipe, together with its cost?
I am considering choosing a topic related to VPlan, possibly cost modelling, for my Master thesis, with the goal to present a solution and implement a prototype.
Hi Florian,
With VPlan, you could instead add a new VPInstruction opcode ZExtLoad and have a transformation that replaces all `zext (load)` pairs with the new ZExtLoad one. The cost model needs to be taught about this special instruction and how much it costs. Then you could apply this transformation to an existing plan and check if the overall cost improved.
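To sketch the shape of such a transformation (the types and names below are invented purely for illustration; the real VPlan classes and recipes look different):

// Purely illustrative sketch - these types are made up, not VPlan's.
#include <memory>
#include <vector>

enum class Op { Load, ZExt, ZExtLoad /* ... */ };

struct Node {
  Op Opcode;
  std::vector<Node *> Operands;
  std::vector<Node *> Users;
};

// Fold zext(load) pairs into a single combined node, so a
// target-aware cost model can price the pair as one instruction.
static void foldZExtIntoLoad(std::vector<std::unique_ptr<Node>> &Plan) {
  for (auto &N : Plan) {
    if (N->Opcode != Op::ZExt || N->Operands.empty())
      continue;
    Node *Load = N->Operands.front();
    if (Load->Opcode != Op::Load || Load->Users.size() != 1)
      continue;
    // Rewrite the zext in place as a combined ZExtLoad that takes
    // over the load's operands; the now-dead load is cleaned up later.
    N->Opcode = Op::ZExtLoad;
    N->Operands = Load->Operands;
  }
}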
Of course there’s also the issue of how to generalize the TTI APIs to allow for computing the cost of instructions with more context.
I think we probably want to keep things as generic as possible, try to model generic concepts and use TTI to decide whether to use it or not (e.g. see how masked loads/stores are handled).
Hm, so what you are saying is that for any instruction that could
be merged into another one on some architecture, there should be a
new VPInstruction (naturally based on the 'plain' VPInstruction for
that instruction) that symbolises this and would be assigned a cost
of 0 by the TTI - or whatever a new cost model would use - of the
respective architecture only. I had to think about this a bit, but
now it makes sense to me. Handling it in a smart way, such that
TTIs that do not know about this special case fall back to what
they do for the normal case, would also save us from generating
additional VPlans. And with the load being the ingredient of the
extend, and thus accessible, everything is available to do some
type checking and ensure the merge is doable for the architecture.
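Just to check that I picture the fallback correctly, what I have in
mind is roughly this (all names here are made up for illustration,
this is not a real interface):

// Hypothetical sketch of the fallback behaviour, nothing more.
#include <optional>

struct CombinedOp {
  unsigned Opcode; // e.g. the combined extend-plus-load from above
};

struct TargetCostInfo {
  // A target that knows the combined operation returns a cost for
  // it; a target that has never heard of it returns nothing.
  virtual std::optional<int> costOfCombined(const CombinedOp &) const {
    return std::nullopt;
  }
  virtual ~TargetCostInfo() = default;
};

static int costOf(const CombinedOp &Op, const TargetCostInfo &TCI,
                  int GenericLoadCost, int GenericExtendCost) {
  if (auto C = TCI.costOfCombined(Op))
    return *C; // e.g. 0: the extend is folded into the load
  // Fallback: targets that know nothing about the combined operation
  // cost it exactly as they would the separate load and extend.
  return GenericLoadCost + GenericExtendCost;
}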
This really helped me to understand how things are supposed to be
working, thanks a lot for explaining!
I do still have a few doubts as to whether this approach might bear
some risk of over-generating VPlans (see my reply to Renato's email,
which I'll link you to), but those might simply come from me being
too cautious, because I do not yet have a good overview of how many
(different) examples like the one I mentioned there are - that is,
how many special cases would have to get their own VPInstruction
flavour.
I am not sure what patterns specifically you are thinking about, but I think the cost model should just evaluate the cost of a given plan and not provide anything beyond that. Of course, this still can mean that there might be certain recipes/instructions that are not available/profitable on some targets and we decide to never generate them, based on the cost-information.
I was wondering how one would merge the extend from the example
above into the load without any changes to the VPlan
structure/classes, i.e. without the trick of adding special
instructions that you mentioned above. But that does not seem like
a nice way to do it.
> I am considering choosing a topic related to VPlan, possibly cost modelling, for my Master thesis, with the goal to present a solution and implement a prototype.
I am hoping to make some progress on this in the next few months (hopefully the work on modeling the def-use chains between recipes in VPlan will be wrapped up soon) and I expect there to be a few moving parts. I'm not sure what that means for a master thesis in this area.
> The cost model structure needs to be target independent but with heavy
> target-specific knowledge. The kind of specialisation we created for
> the previous vectoriser is too naive, and assumes the concepts are the
> same, just not necessarily the costs. That's not true at all for
> different hardware.
Yes, that's a big issue to be solved. I do like the idea from
Florian's reply: basically encoding an instruction's environment in
the VPInstruction that is generated, and making it such that
target-specific cost components override the cost calculations for
the general class if the architecture can treat them in a special
way.
Though currently the gather that is extended and the scatter whose
input is truncated are my only examples for this kind of thing,
which is why I am hoping that the community will have encountered
more examples like these and is willing to share. Maybe there are
cases that are more complex, that e.g. merge or depend on far more
than two instructions, in which case generating the right
VPInstructions might become tricky - especially if there are
multiple possible 'merge patterns' (for one or different
architectures) that overlap, which would require generating
alternative VPlans, whose number for long loops with many special
cases might grow exponentially to cover all possible combinations.
> There are probably many other ways to do this, I'm just thinking out
> loud to give an idea of the problems we faced earlier. I'm sure you'll
> find better ways to do this yourself once you start looking at it more.
>
> Regardless of how we do it, having the ability to say "this shuffle is
> free on this instruction" but "not with that other one" will make the
> cost model a lot more precise and need a lot less fudged heuristics.
Well thank you for thinking out loud, that is exactly what I was hoping
to get out of this RFC!
Yes, there needs to be a good way to say things like that, and
preferably 'good' should not only mean efficient, but also easy to
understand, maintain and extend - but that's just me dreaming :)
Are you thinking of a concrete example when mentioning those shuffles?
Because concrete examples would really help me a lot when thinking
further about this - if I decide to do this as a thesis, I might have to
constrain the actual implementation to (a subset of instructions for) a
single architecture, but for thinking about the issue itself any
concrete examples for any architecture would be a great help!
Hm, I hadn't yet gotten as far as how to use it, but you're right of
course - that will be another challenge. I know that there are many
things being planned and on the way for VPlan, but I'll willingly admit
that I am currently still having some trouble gaining an overview of
what is there, what is not, and what might be on its way. It's simply
hard and takes a lot of time to find documentation for all the new
things, but I'm sure I'll get there.
Thanks for your detailed reply and all the ideas, they do help a lot.
Best,
Anna
+ llvm-dev, which got accidentally dropped - sorry for any
duplication
Hi Sjoerd,
There was a VPlan round-table at the US LLVM dev conference just a few weeks ago, and I have linked in the people who were involved in this as cost-modeling was one of the topics that was discussed. You may have seen this already, but Dave started doing some work in this area recently:
Just some first high-level remarks on your proposal: I think work in this area would be very valuable and solve a real problem that we currently have. If you plan to choose this as a subject for a master's thesis, you probably need to think about how big of a task this is and having some motivating examples would be good too (but you mentioned some already). To me it sounds like a big task that needs a lot of plumbing in VPlan, but that's why having this discussion here is interesting and hopefully people with more VPlan experience can provide insights.
You are right and I am fully aware that building a fully
functional cost model for VPlan might be a very big task, well
beyond the scope of a master thesis. But right now there isn't
even clear documentation of what exactly such a cost model should
look like, and I think it is an interesting and important topic
to explore. If I end up with a good idea, a thesis that discusses
the topic so there finally is some documentation, and a
proof-of-concept implementation, I hope that would already be of
some value and interest to the community.
But yes, good examples are needed - that's one reason for this
RFC, as of course my experience in the matter is very localised
to a specific kind of instruction, and I am sure many members of
the community have met other issues that could help paint a
better picture of what should be done.
Best,
Anna
The grammar idea sounds interesting. My brain tends to just work in terms of code, but it might make a great master thesis :) The pattern matchers were modelled after the existing ones in LLVM that are used all over the place, especially in instcombine. It was meant to feel familiar, fitting in with what already existed.
I imagine a lot of people must have thought about the same kinds of ideas, and there must be some research out there on it. Whether it worked well or not, it would be good to look into what was already tried. I imagine there will be cases where it works well and times where it doesn't. The same concept of specifying patterns in some way that isn't C++ code could equally be applied to instcombine or DAG combining. The fact that no one ever did it is perhaps a bad sign? I know there are some tablegen-defined combines for GlobalISel now; they might give some ideas.
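For reference, the idiom looks roughly like this (written from memory as a sketch, not the actual code from the patch), here matching a vmulh-like pattern trunc(lshr(mul(sext(a), sext(b)), 8)):

#include "llvm/IR/PatternMatch.h"
using namespace llvm;
using namespace llvm::PatternMatch;

// Returns true if V is the root of a mulh-shaped pattern. The shift
// amount of 8 is just an example; the real check would depend on the
// element widths involved.
static bool looksLikeMulh(Value *V) {
  Value *A, *B;
  return match(V, m_Trunc(m_LShr(m_Mul(m_SExt(m_Value(A)),
                                       m_SExt(m_Value(B))),
                                 m_SpecificInt(8))));
}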
Combining a few instructions into a vmulh is simple enough, and I was hoping that there would not be a huge number of combines needed. There are bigger transforms that perhaps wouldn't fit that very well. Imagine trying to take code that does:
x, y = vld2
x *= -1
vst2 x, y
and turn that into
x = load
x *= [-1,1]
str x
That is like "de-interleaving" an interleaving group, much like SLP vectorization. Plus it would need to make that transform generic across all the different operations that could sit between the loads/store, and across the sizes of the interleaving groups. It could, of course, get quite complex.
Looking forward to seeing what you come up with.
Dave
--------------------------------------------
Sent: 03 November 2020 15:45
Subject: Re: [llvm-dev] [RFC] Goals for VPlan-based cost modelling
Hi Dave,
thanks for your detailed reply, the reductions you describe really
seem like a perfect example of what a new cost model and IR
transformation should be able to cope with, as do vmulh and the
other instructions you mentioned.
In particular, they do underline the importance of finding a solution
for patterns of arbitrary size, which is much harder than finding a
feasible solution for patterns of size two like the ones I mentioned.
A thought that has been sitting in my head for a while is whether
this part of transforming IR (or VP-) instructions into VPRecipes
could be handled by some kind of grammar, like the grammars used
for parsing in NLP, or in compilers. Patterns could be represented
as grammar rules, and if more than one pattern matches, the
analysis can fork and continue on two or more VPlans in parallel.
Recipes that can be used on many architectures would be in the main
grammar. It could be a probabilistic grammar with weights of either
0 or 1, such that architectures that do not implement a recipe can
set the probability of it ever being generated to 0. For very
target-specific patterns, target-specific sub-grammars could be
maintained.
On a totally crazy thought, a cost model could be integrated into
this grammar step, given that the patterns need legality (type)
checking anyway, like it is currently done in some TTI cost
functions. Then the recipes can be annotated with their cost at
generation time, we maintain a total cost for each VPlan, and we
can define a threshold on 'cost(VPlan) - minVPlanCost' above which
we abandon plans on the fly, thus dropping options that are
unlikely to be fruitful mid-analysis (I know that threshold would
be difficult to find, but at least it would reduce the problem
space a bit).
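To make that a bit more concrete, something along these lines is
what I am picturing (just a sketch with made-up names, not a
design):

// Sketch only; all names are invented to illustrate the idea.
#include <algorithm>
#include <string>
#include <vector>

struct GrammarRule {
  std::string Name;          // e.g. "zext(gather) -> extending gather"
  std::vector<unsigned> LHS; // the opcodes the rule consumes
  unsigned RHS;              // the recipe/VPInstruction it produces
  float Weight;              // per target; 0 = never generate this
};

struct CandidatePlan {
  float Cost = 0;            // running cost accumulated while parsing
};

// Drop candidate plans whose running cost is already far above the
// cheapest one, so the number of alternative plans stays manageable.
static void prune(std::vector<CandidatePlan> &Plans, float Slack) {
  if (Plans.empty())
    return;
  float Best = Plans.front().Cost;
  for (const CandidatePlan &P : Plans)
    Best = std::min(Best, P.Cost);
  Plans.erase(std::remove_if(Plans.begin(), Plans.end(),
                             [&](const CandidatePlan &P) {
                               return P.Cost > Best + Slack;
                             }),
              Plans.end());
}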
In an approach like that patterns could be arbitrarily large, but I do
have some fears that it might lead to over-generating VPlans. It would
also probably mean that a lot of recipes are needed to model every
instruction that can possibly be generated, but if the concrete costs
were calculated at parse time and the information from the matched
instructions were reduced to what will be needed to generate the output
IR, it might be possible to reduce that.
I think your example in D88152 goes in a similar direction, only it
transforms VPInstructions into another VPInstruction, not into a
recipe (the pattern matching does remind me a bit of TableGen, tbh).
What I wonder is: if it looks like leveraging VPlan requires some
kind of pattern matching anyway - be it in the cost model for
correct cost calculation, or to detect and transform patterns of
VPInstructions - wouldn't it conceptually make sense to explicitly
define something like a grammar, instead of doing it per
instruction/pattern and per application (costing/transformation)?
What are your thoughts on this?
Thanks,
Anna
On 03.11.20 11:55, David Green wrote:
> Hello
>
> Perhaps I can talk a little about the things I have been at least thinking about lately. Unfortunately a few other important things have come up and I have not had the time to do as much as I would have liked. What I have is mostly on the pragmatic end of the spectrum.
>
> So we've been thinking about this for a while but it really starts with vector reductions. Arm MVE has some instructions that make it beneficial to put the vector reduction instruction into the loop, instead of summing in the loop and reducing after it. So instead of:
> loop:
>   a = load v16i8
>   s = s + a
> reduce(s)
>
> We can efficiently do:
> loop:
>   a = load v16i8
>   s += reduce a
>
> That is all working now, and seems to be humming along nicely in the vectorizer. It even handles tail predicated loops. The real benefit from those instructions though comes from the fact that they can natively handle larger than legal reductions. So we have an instruction that can take 16i8's and sum them into an i32. That would look something like this in IR:
> loop:
>   a = load v16i8
>   e = sext a to v16i32
>   s += reduce e
>
> It can also do a mla at the same time, doing reduce(mul(sext, sext)) as a single instruction. We even have i64 variants of the instruction, even though there is no i64 vector add in MVE. Unfortunately all of this is both too large for the vectorizer to consider at the moment and looks like a set of very expensive operations. Extends especially are very expensive at the moment in MVE.
>
> We can already cost model sext(load(..)) well (along with other patterns like add(sext, ..) for aarch64). We have the context parameter (plus a parameter for the type of the load now). For 2-instruction patterns this mostly works OK, if not very elegantly, providing that the context instruction can be trusted (which, with vplan making more changes, becomes a big if). Any more than 2 starts to get very awkward though, trying to check that you are the middle mul in a reduce(mul(sext, sext)) pattern. Especially if the reduce looks like an add!
>
> I had a go at implementing some code to try and get the reductions more correct. One idea was for a better way of providing context. Something like an "InstructionView", in the same way as there is an ArrayView or StringView. It gives a way to query possibly fake instructions for the info the backend might need to make decisions and could be backed by real instructions or vplan nodes. Another simpler idea was to keep a list of "free instructions" that the backend could provide and iterate backwards through the costs instead of forwards, letting the backend greedily choose patterns. I didn't much like the changes they required though and didn't get very far with them, as they are all very large changes that do not play well into the cost model as it currently exists.
>
> So I knew that there were some other things that I wanted to start making better in the vectorizer/cost model; maybe investigating those would give a better view of what the best way to handle this would be. One is instructions like vmulh. They multiply and take the top half, and are common across a number of architectures. Things like vrmulh, vhadd, vrhadd, vqdmulh, vabsd, etc can all fit into the same bucket. They conceptually do something like:
> loop:
>   a = load v16i8
>   b = load v16i8
>   ae = sext a to v16i16
>   be = sext b to v16i16
>   m = mul ae, be
>   s = shr m, 8
>   t = trunc s to v16i8
>   store t
>
> So the same issue with sign extends being expensive exists, plus the cost considered for those instructions will be very wrong (especially if you consider an i32 extended to an i64 on an architecture with no 64 bit operations). That I put up as https://reviews.llvm.org/D88152, mostly as a proof of concept to show how it can be done. It recognizes the pattern in vplan and turns it into a single mulh VPInstruction simply enough. The obvious problem comes in where we are still costing the original instructions in the loop though. Hence the cost model really needs to be moved over to costing vplan nodes, not the original IR instructions (which leads us to https://reviews.llvm.org/D89322 making a start at that).
>
> D88152 obviously still needs some details filled in. The rough idea was to have a set of "VectorPatterns" that the backend could be queried for the cost of (like vmulh etc). In this case we are replacing 5 instructions with 1 at a lower bitwidth, so the change is going to be fairly obvious to make if it is legal, and can be done by replacing nodes. In other cases it may be better to clone the vplan and cost the copies against one another. It still seems simpler in these cases (and the concepts are general enough) that combining in the vectorizer seems like a good way forward. It doesn't really help cost modelling anywhere else in the compiler though, if that is something that becomes important.
>
> We would love to be able to start doing things like costing predicated vplans against unpredicated vplans, along with things like de-interleaving the interleaving groups (kind of like SLP vectorization), costing gathers vs interleaving loads, and even possibly things like getting the vectorizer to consider lane interleaving if we can. All of this requires us to be costing vplans though, at least as a first step. Doing some of these, like (what I have been calling) lane interleaving, certainly requires more global analysis than the local checks we can do in places like instcombine or ISel, to make sure they are really profitable. It's hard to guess at the costs of them.
>
> I also found it very easy to introduce regressions when trying to rewrite an entire cost model, as you might imagine. Simple enough things (and actually costing predicate instructions) can have severe effects on some code. It was something I was trying to go slowly on and hopefully do a bit at a time. We will see how that goes though.
>
> In short, I think I like the idea of vplan looking more like the target expects it to end up as, as opposed to trying to reverse engineer or guess that in the costmodel. None of this has been through review or anything like that. It was just the kinds of things we were trying to work on and hoping on improving. VPlan has a lot of potential I believe, and it would be great to see it start solving some real problems.
>
> I'm not sure if that helped at all. Maybe it at least gave some context, even if it was a bit rambley.
>
> Looking forward to seeing what your thoughts are,
> Dave