From: "Sean Silva via llvm-dev" <llvm...@lists.llvm.org>
To: "llvm-dev" <llvm...@lists.llvm.org>
Sent: Wednesday, June 8, 2016 6:19:03 AM
Subject: [llvm-dev] Intended behavior of CGSCC pass manager.

Hi Chandler, Philip, Mehdi, (and llvm-dev,)

(this is partially a summary of some discussions that happened at the last LLVM bay area social, and partially a discussion about the direction of the CGSCC pass manager)

At the last LLVM social we discussed the progress on the CGSCC pass manager. It seems like Chandler has a CGSCC pass manager working, but it is still unresolved exactly which semantics we want (more about this below) that are reasonably implementable. AFAICT, there has been no public discussion about what exact semantics we ultimately want to have. We should figure that out.

The main difficulty which Chandler described is the apparently quite complex logic surrounding the need to run function passes nested within an SCC pass manager, while providing some guarantees about exactly what order the function passes are run in. The existing CGSCC pass manager just punts on some of the problems that arise (look in CGPassManager::runOnModule, CGPassManager::RunAllPassesOnSCC, and CGPassManager::RunPassOnSCC in llvm/lib/Analysis/CallGraphSCCPass.cpp), and these are the problems that Chandler has been trying to solve.

(Why is this "function passes inside CGSCC passes" stuff interesting? Because LLVM can do inlining on an SCC (often just a single function) and then run function passes to simplify the function(s) in the SCC before it tries to inline into a parent SCC. (The SCC visitation order is post-order.) For example, we may inline a bunch of code, but after inlining we can tremendously simplify the function, and we want to do so before considering this function for inlining into its callers, so that we get an accurate evaluation of the inline cost.

Based on what Chandler said, it seems that LLVM is fairly unique in this regard and other compilers don't do this (which is why we can't just look at how other compilers solve this problem; they don't have this problem (maybe they should? or maybe we shouldn't?)). For example, he described that GCC uses different inlining "phases"; e.g. it does early inlining on the entire module, then does simplifications on the entire module, then does late inlining on the entire module; so it is not able to incrementally simplify as it inlines like LLVM does.)
2. What is the intended behavior of CGSCC passes when SCC's are split or merged? E.g. a CGSCC pass runs on an SCC (e.g. the inliner). Now we run some function passes nested inside the CGSCC pass manager (e.g. to simplify things after inlining). Consider:

a) These function passes are e.g. now able to devirtualize a call, adding an edge to the CG, forming a larger CG SCC. Do you re-run the CGSCC pass (say, the inliner) on this larger SCC?

b) These function passes are e.g. able to DCE a call, removing an edge from the CG. This converts, say, a CG SCC which is a cycle graph (like a->b->c->a) into a path graph (a->b->c, with no edge back to a). The inliner had already visited a, b, and c as a single SCC. Now does it have to re-visit c, then b, then a, as single-node SCC's?

btw: One way that I have found it useful to think about this is in terms of the visitation during Tarjan's SCC algorithm. I'll reference the pseudocode in https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm. Inside the "strongconnect" routine, when we have identified an SCC (the true branch of the `if (v.lowlink = v.index)` test) we can visit stack[v.index:stack.size()] as an SCC. This may or may not invalidate some things on the stack (the variable `S` in the pseudocode) and we may need to fix it up (e.g. inlining deleted a function, so we can't have an entry for it on the stack). Then, we can run function passes as we pop individual functions off the stack. This is easier to think about IMO than merging of SCC data structures: if we add edges to the CG then we have to do more DFS on the new edges, and if we delete edges then the DFS order of the stack gives us certain guarantees.

Personally I find this much easier to reason about than the description in terms of splitting and merging SCC's in the CG and ref graph (which the LazyCallGraph API forces one to think about, since it hides the underlying Tarjan's algorithm). The LazyCallGraph API makes the current loop in http://reviews.llvm.org/diffusion/L/browse/llvm/trunk/include/llvm/Analysis/CGSCCPassManager.h;272124$100 very clean, but at least for my thinking about the problem, it seems like the wrong abstraction (and most of the LazyCallGraph API seems to be unused, so it seems like it may be overly heavyweight).

E.g. I think that maybe the easiest thing to do is to turn the current approach inside out: instead of having the pass manager logic be the "normal code" and forcing the Tarjan algorithm to become a state machine of iterators, use an open-coded Tarjan algorithm with some callbacks and make the pass management logic be the state machine.

This will also open the door to avoiding the potentially quadratic size of the ref graph, since e.g. in the example I gave above, we can mark the `funcs` array itself as already having been visited during the walk. In the current LazyCallGraph, this would require adding some sort of notion of hyperedge.
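To make the "open-coded Tarjan with callbacks" idea concrete, here is a minimal, self-contained sketch: Tarjan's algorithm where the pass-management logic lives in a callback invoked each time an SCC is completed. The Graph representation and the OnSCC hook are made up for illustration; this is not LLVM's API.

  #include <algorithm>
  #include <functional>
  #include <vector>

  struct Graph {
    std::vector<std::vector<int>> Succ; // Succ[V] = nodes V calls
  };

  class TarjanSCCVisitor {
    const Graph &G;
    std::vector<int> Index, Lowlink, Stack;
    std::vector<bool> OnStack;
    int NextIndex = 0;
    std::function<void(const std::vector<int> &)> OnSCC;

    void strongconnect(int V) {
      Index[V] = Lowlink[V] = NextIndex++;
      Stack.push_back(V);
      OnStack[V] = true;
      for (int W : G.Succ[V]) {
        if (Index[W] < 0) { // tree edge: recurse
          strongconnect(W);
          Lowlink[V] = std::min(Lowlink[V], Lowlink[W]);
        } else if (OnStack[W]) { // edge into the current DFS stack
          Lowlink[V] = std::min(Lowlink[V], Index[W]);
        }
      }
      if (Lowlink[V] == Index[V]) {
        // An SCC is complete: pop it off and hand it to the pass logic.
        // Every SCC reachable from it has already been handed out, so the
        // callback sees SCCs bottom-up (post-order), callees first.
        std::vector<int> CurSCC;
        int W;
        do {
          W = Stack.back();
          Stack.pop_back();
          OnStack[W] = false;
          CurSCC.push_back(W);
        } while (W != V);
        OnSCC(CurSCC);
      }
    }

  public:
    TarjanSCCVisitor(const Graph &G,
                     std::function<void(const std::vector<int> &)> Callback)
        : G(G), Index(G.Succ.size(), -1), Lowlink(G.Succ.size(), -1),
          OnStack(G.Succ.size(), false), OnSCC(std::move(Callback)) {}

    void run() {
      for (int V = 0, E = static_cast<int>(G.Succ.size()); V != E; ++V)
        if (Index[V] < 0)
          strongconnect(V);
    }
  };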
Since this is such a high priority (due to blocking PGO inlining), I will probably try my hand at implementing the CGSCC pass manager sometime soon unless somebody beats me to it. (I'll probably try the "open-coded SCC visit" approach.)

Another possibility is implementing the new CGSCC pass manager so that it uses the same visitation semantics as the one in the old PM, and then refactoring that as needed. In fact, that may be the best approach, so that porting to the new PM is as NFC as possible and we can isolate the functional (i.e., needing benchmarks, measurements, ...) changes in separate commits.
Sorry for the wall of text.
-- Sean Silva
Sean Silva wrote:
> Hi Chandler, Philip, Mehdi, (and llvm-dev,)
>
> (this is partially a summary of some discussions that happened at the last LLVM bay area social, and partially a
> discussion about the direction of the CGSCC pass manager)
Thanks for writing this up! This sort of thing is very helpful for
people who don't attend the social, or had to leave early. :)
> Chandler described that he had a major breakthrough in that the CGSCC pass manager only had to deal with 3 classes of
> modifications that can occur:
> - a pass may e.g. propagate a load of a function pointer into an indirect call, turning it into a direct call. This
> requires adding an edge in the CG but not in the ref graph.
> - a pass may take a direct call and turn it into an indirect call. This requires removing an edge from the CG, but not
> in the ref graph.
> - a pass may delete a direct call. This removes an edge in the CG and also in the ref graph.
At what granularity are we modeling these things? E.g. if SimplifyCFG
deletes a basic block, will we remove call edges that start from that
block?
Note there is a fourth kind of modification that isn't modeled here:
devirtualization, in which we transform IR like

  %fnptr = compute_fnptr(%receiver, i32 <method id>)
  call %fnptr(%receiver)

to

  call some.specific.Class::method(%receiver)

(via a pass that understands the semantics of compute_fnptr).
However, at this time I don't think modeling "out of thin air"
devirtualization of this sort is important, since nothing upstream
does things like this (we do have ModulePasses downstream that do
this).
> From the perspective of the CGSCC pass manager, these operations can affect the SCC structure. Adding an edge might
> merge SCC's and deleting an edge might split SCC's. Chandler mentioned that apparently the issues of splitting and
> merging SCC's within the current infrastructure are actually quite challenging and lead to e.g. iterator invalidation
> issues, and that is what he is working on.
>
> (
> The ref graph is important to guide the overall SCC visitation order because it basically represents "the largest graph
> that the CG may turn into due to our static analysis of this module". I.e. no transformation we can statically make in
> the CGSCC passes can ever cause us to need to merge SCC's in the ref graph.
> )
Except in the above "out of thin air" devirtualization case.
I think cross-function store forwarding can also be problematic:
  void foo(fnptr* bar) {
    if (bar)
      (*bar)();
  }

  void baz() { foo(null); }

  void caller() {
    fnptr *t = malloc();
    *t = baz;
    foo(t);
  }

  // RefGraph is
  // caller -> baz
  // caller -> foo
  // baz -> foo
Now the RefSCCs are {foo}, {caller}, {baz} but if we forward the store
to *t in caller into the load in foo (and are smart about the null
check) we get:
  void foo(fnptr* bar) {
    if (bar)
      baz();
  }

  void baz() { foo(null); }

  void caller() {
    fnptr *t = malloc();
    *t = baz;
    foo(t);
  }

  // RefGraph is
  // foo -> baz
  // baz -> foo
  // caller -> foo
  // caller -> baz
and now the RefSCCs are {foo, baz}, {caller}
But again, I think this is fine since nothing upstream does this at
this time; and we can cross that bridge when we come to it.
> 2. What is the intended behavior of CGSCC passes when SCC's are split or merged? E.g. a CGSCC pass runs on an SCC (e.g.
> the inliner). Now we run some function passes nested inside the CGSCC pass manager (e.g. to simplify things after
> inlining). Consider:
>
> a) These function passes are e.g. now able to devirtualize a call, adding an edge to the CG, forming a larger CG SCC. Do
> you re-run the CGSCC pass (say, the inliner) on this larger SCC?
>
> b) These function passes are e.g. able to DCE a call, removing an edge from the CG. This converts, say, a CG SCC which
> is a cycle graph (like a->b->c->a) into a path graph (a->b->c, with no edge back to a). The inliner had already visited
> a, b, and c as a single SCC. Now does it have to re-visit c, then b, then a, as single-node SCC's?
Okay, so this is the same question I wrote above: "At what granularity
are we modeling these things?", phrased more eloquently. :)
> One way that I have found it useful to think about this is in terms of the visitation during Tarjan's SCC algorithm.
> I'll reference the pseudocode in https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm.
> Inside the "strongconnect" routine when we have identified an SCC (the true branch of `if (v.lowlink = v.index)` test )
> we can visit stack[v.index:stack.size()] as an SCC. This may or may not invalidate some things on the stack (the
> variable `S` in the pseudocode) and we may need to fix it up (e.g. inlining deleted a function, so we can't have an
> entry on the stack). Then, we can run function passes as we pop individual functions off the stack, but it is easier to
> think about IMO than merging of SCC data structures: if we add edges to the CG then we have to do more DFS on the new
> edges and if we delete edges then the DFS order of the stack gives us certain guarantees.
I'm not sure how this will be easier. E.g. consider
X -> A
A -> B
B -> C
C -> B
B -> A
Now you're going to pop A,B,C together, as an SCC; and you start
optimizing them, and the edge from B to A falls out. Don't you have
the same problem now (i.e. we've removed an edge from an SCC we're
iterating over)? Perhaps I misunderstood your scheme?
-- Sanjoy
Does it make sense to change RefSCCs to hold a list of
RefSCC-DAG-Roots that were split out of it because of edge deletion?
Then one way to phrase the inliner/function pass iteration would be
(assuming I understand the issues):
  Stack.push(RefSCC_Leaves);
  while (!Stack.empty()) {
    RefSCC = Stack.pop();
    InlineCallSites(RefSCC);
    if (!RefSCC.splitOutSCCs.empty())
      goto repush;
    for each func in RefSCC:
      FPM.run(func);
    if (!RefSCC.splitOutSCCs.empty())
      goto repush;
    continue;
  repush:
    for (refscc_dag_root in RefSCC.splitOutSCCs)
      // Here we don't want to push every leaf, but only leaves that have
      // functions that haven't had the FPM run on them (maybe we can do
      // this by maintaining a set?). If we don't push a leaf, we push its
      // parent (which we want to push even if we've run the FPM on it,
      // since we'd like to re-run the inliner on it).
      refscc_dag_root.push_leaves_to(Stack);
  }

(I know this isn't ideal, since now RefSCC is no longer "just a data
structure", but actually has incidental information.)
--
Sanjoy Das
http://playingwithpointers.com
I'm not sure what you mean above, but IIUC I think it is a bit more complex since the inliner runs at the same place you put FPM.run().
--
Mehdi
On Wed, Jun 8, 2016 at 4:19 AM, Sean Silva <chiso...@gmail.com> wrote:
> [snip]

I want to clarify a little more on GCC's behavior. GCC has two types of IPA passes: one is called a simple IPA pass and the other is called a regular IPA pass. A simple IPA pass does not transform IR by itself, but it can have function-level sub-passes that do transformations. If it has function sub-passes, the function passes will be executed in bottom-up order, just like LLVM's CGSCC pass. A regular IPA pass is somewhat like the module pass in LLVM.

GCC's pass pipeline is actually similar to LLVM's pipeline overall; it has the following components:
(1) lowering passes
(2) small/simple IPA passes including the early optimization pipeline (bottom-up)
(3) full IPA pipelines
(4) post-IPA optimization and code generation

The difference is that LLVM's (2) includes only instcombine and SimplifyCFG, but GCC's (2) is larger and also includes an early inlining pass.
GCC's (2) is similar to LLVM's CGSCC pass but with fewer passes. Its inliner also targets tiny functions where inlining reduces size overall. LLVM's LTO pipeline is kind of like this too.

GCC's (3) includes a full-blown regular inlining pass. This inliner uses a priority-order based algorithm, so bottom-up traversal does not make sense for it. It requires (2) to have a pass that collects summaries to drive the decisions (as well as profile data).
On Wed, Jun 8, 2016 at 10:57 AM, Xinliang David Li <dav...@google.com> wrote:
> [snip]
> The difference is that LLVM's (2) includes only instcombine and SimplifyCFG, but GCC's (2) is larger and also includes an early inlining pass.

So the early inlining is a function pass? In LLVM I think that the function pass contract would not allow inlining to occur in a function pass (the contract is theoretically designed to allow function passes to run concurrently on functions within a module).
> [snip] Another possibility is implementing the new CGSCC pass manager that uses the same visitation semantics as the one in the old PM, and then we can refactor that as needed. In fact, that may be the best approach so that porting to the new PM is as NFC as possible and we can isolate the functional (i.e., need benchmarks, measurements ...) changes in separate commits.

A very high level comment: why do we need to update the callgraph on the fly? Can we have more general support for iterative SCC pass invocation? Something like:

1) build the callgraph
2) cache the post-order traversal order
3) if the order list is empty -- done
4) traversal: invoke function passes for each function in the order (from step 2 or 5). The call graph gets updated on the fly (with new edges, or new nodes for cloned functions)
5) update the function traversal order from new nodes and new edges created in 4)
6) go to step 3)
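A minimal sketch of what this iterative scheme might look like, with a toy worklist standing in for the real pass manager. The integer node IDs and the RunPasses callback are illustrative assumptions, not existing LLVM APIs; RunPasses returns the callers/ancestors that must be revisited because the node's transformations exposed new call edges.

  #include <deque>
  #include <functional>
  #include <vector>

  void iterativeVisit(const std::vector<int> &CachedPostOrder,
                      const std::function<std::vector<int>(int)> &RunPasses) {
    // Steps 1-2: the post-order is computed once and cached by the caller.
    std::deque<int> Worklist(CachedPostOrder.begin(), CachedPostOrder.end());
    while (!Worklist.empty()) { // step 3: done when the order list is empty
      int N = Worklist.front();
      Worklist.pop_front();
      // Step 4: run passes; the call graph may gain edges/nodes here.
      // Steps 5-6: only the affected ancestors get re-queued for a revisit.
      for (int Caller : RunPasses(N))
        Worklist.push_back(Caller);
    }
  }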
There may be missed optimizations in some cases, though. I think the current scheme catches most cases, and when it does not, we are just missing potential inlining. The question may be how many (more) cases we really need to catch with the new pass manager?
-- Sanjoy
On Wed, Jun 8, 2016 at 4:20 PM, Xinliang David Li <xinli...@gmail.com> wrote:
> Is it in the category of 'invalidating the iterator while iterating', which
> feels very wrong to me. We should avoid going there and find better ways to
> solve the motivating problems (perhaps defining them more clearly first?).
I'm not 100% sure of what you meant by that, so I'll try to give a
general answer, and hope that it covers the points you wanted to
raise. :)
In the scheme above I'm not trying to solve iterator invalidation --
I'm trying to solve the following problem: the CGSCC pass manager ran
the inliner and a set of function passes on a function, and they did
something to invalidate the RefSCC we were iterating over[0]. How do
we _continue_ our iteration over this now non-existent RefSCC?
The solution I'm trying to propose is this: The only possibility for
invalidation is that the RefSCC was broken up into a forest of
RefSCC-DAGs[1]. This means if we had a way to get to the leaves of
this forest of RefSCCs, we could restart our iteration from there
(I've tacitly assumed we're interested in a bottom-up SCC order).
This may be difficult in general, but my idea was to "cheat" and
explicitly remember the Ref-SCC-forest a RefSCC was broken down into
when we do that invalidation. Then once an RefSCC is split, we can
pick up the forest from the original RefSCC* (which is otherwise
useless now), gather the leaves, and re-start our iteration from those
leaves.
This leaves the question of what to do with the SCC DAG nested inside
the RefSCC. I'm not sure what Chandler has in mind regarding how much
influence these should have over the iteration order, but if we wanted
to iterate over the SCC-DAG in bottom up order as well (as we iterated
over a single RefSCC), we could have the same scheme to handle
SCC-splits, and a similar scheme to handle SCC-merges (when you merge
an SCC, the SCC that gets cleaned out gets a pointer to the SCC where
all the functions went, and if the SCC you were iterating over gets
such a pointer after running the inliner/FPM you chase that pointer
(possibly multiple times, if more than one SCC was merged) and
re-start iteration over that SCC).
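A tiny sketch of the forwarding-pointer part of this scheme; the SCC struct here is a stand-in, not LazyCallGraph's SCC. A merged-away SCC leaves behind a pointer to its successor, and the iteration logic chases the chain (union-find style) to find the live SCC:

  // A merged-away SCC keeps a breadcrumb to where its functions went.
  struct SCC {
    SCC *MergedInto = nullptr; // non-null once this SCC has been merged away
    // ... functions, cached analyses, etc.
  };

  // Merging Src into Dst leaves the breadcrumb for any in-flight iteration.
  void mergeInto(SCC &Src, SCC &Dst) {
    // ... move Src's functions and state into Dst ...
    Src.MergedInto = &Dst;
  }

  // Chase forwarding pointers (possibly several, if merges cascaded) to
  // reach the SCC that now holds our functions, then resume iteration there.
  SCC *resolve(SCC *C) {
    while (C->MergedInto)
      C = C->MergedInto;
    return C;
  }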
By "incident data structure" I meant that with these additions the
RefSCC or SCC is no longer a "pure" function of the structure of the
module, but has state that is a function of what the pass manager did
which is not ideal. That is, in theory this isn't significantly
cleaner than the passes reaching out into and changing the CGSCC pass
manager's state, but perhaps we are okay with this kind of design for
practicality's sake?
[0]: One question I don't know the answer to -- how will we detect
that something has removed a call or ref edge? Will we rescan
functions to verify that edges we thought existed still exist? Or
will we have a ValueHandle-like scheme?
[1]: As Sean mentioned, by design nothing in the function pass
manager pipeline could have invalidated the RefSCC by merging it with
other RefSCCs.
-- Sanjoy
On Wed, Jun 8, 2016 at 5:35 PM, Sean Silva <chiso...@gmail.com> wrote:
> In both of these examples you gave (out-of-thin-air devirtualization and
> forwarding into a callee) is the contract of a CGSCC pass being violated? (I
> believe the contract is that a CGSCC pass cannot invalidate the analyses of
> a different SCC
> (https://github.com/llvm-mirror/llvm/blob/master/include/llvm/Analysis/CGSCCPassManager.h#L104)
> )
I'm not sure -- in both the cases I'm introducing a call to a
different SCC. Obviously, if your analysis state is "cache of all
incoming call edges at the function granularity" then it is violated.
But, OTOH, if your analysis state is "a count of all the CallInsts
coming into the function" then even basic things like unrolling loops
will invalidate that.
Speaking more broadly about the algorithm you just described, did you intend to omit an SCC visitation step?
The goal of the CGSCC pass manager is the ability to visit an SCC (e.g. inliner visits an SCC), then immediately run function passes to simplify the result of inlining.
Only after the simplification has occurred do we visit the parent SCC. By running the simplifications in lock-step with the inliner SCC visitation we give parent SCC's a more accurate view of the real cost of inlining a function from a child SCC.
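Schematically, the lock-step structure being described looks something like the following sketch. All the helper declarations are hypothetical placeholders (the real pass managers are far more involved); the point is the ordering: each SCC is inlined and then simplified before any parent SCC computes inline costs against it.

  #include <vector>

  struct Function { /* ... */ };
  struct SCC { std::vector<Function *> Funcs; };

  std::vector<SCC> postOrderSCCs(); // callees-first SCC order (assumed)
  void inlineInto(SCC &C);          // CGSCC pass: the inliner (assumed)
  void simplify(Function &F);       // nested function passes (assumed)

  void runCGSCCPipeline() {
    for (SCC &C : postOrderSCCs()) {
      inlineInto(C); // inline within/into the SCC
      for (Function *F : C.Funcs)
        simplify(*F); // simplify immediately, so parent SCCs see the
                      // post-cleanup cost of inlining from this SCC
    }
  }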
Issues similar to 2.a) (i.e. adding edges to the CG) also affect the visitation order of SCC's (not just functions). For example, we visit an SCC with a CGSCC pass (e.g. inliner). Then we run the first function pass on that SCC, which may add an edge (e.g. promote indirect call to direct call) that may enlarge the SCC. Do we continue running the remaining function passes? Do we re-visit this new enlarged SCC? (if so, when?) These are the types of questions that motivated this post (hence the name "Intended behavior of CGSCC pass manager.").
> Speaking more broadly about the algorithm you just described, did you intend to omit an SCC visitation step?

It is independent of this decision. However, we can actually go back one step and ask the question: is adding this layer really necessary? Why not just traverse the cgraph nodes in reverse topo-order (after removing the cycles)?

> The goal of the CGSCC pass manager is the ability to visit an SCC (e.g. inliner visits an SCC), then immediately run function passes to simplify the result of inlining.

This is how the current implementation works. Is it a fundamental requirement?
> Only after the simplification has occurred do we visit the parent SCC. By running the simplifications in lock-step with the inliner SCC visitation we give parent SCC's a more accurate view of the real cost of inlining a function from a child SCC.

This works for any bottom-up scheme, regardless of whether an SCC layer is added.

> Issues similar to 2.a) (i.e. adding edges to the CG) also affect the visitation order of SCC's (not just functions). For example, we visit an SCC with a CGSCC pass (e.g. inliner). Then we run the first function pass on that SCC, which may add an edge (e.g. promote indirect call to direct call) that may enlarge the SCC. Do we continue running the remaining function passes? Do we re-visit this new enlarged SCC? (if so, when?) These are the types of questions that motivated this post (hence the name "Intended behavior of CGSCC pass manager.").

Yes, I understand the motivation -- that is why I propose the algorithm that uses a cached iteration order + a worklist-based iterative approach. In your example, when a new edge is exposed via devirtualization, only its caller/ancestor nodes need to be revisited in the next iteration.
From: "Sean Silva via llvm-dev" <llvm...@lists.llvm.org>
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
To clarify, we're trying to provide this invariant on the "ref" graph or on the graph with direct calls only? I think the invariant need only apply to the former
if we're relying on this for correctness (i.e. an analysis must visit all callees before visiting the callers).
From: "Xinliang David Li" <dav...@google.com>
To: "Hal Finkel" <hfi...@anl.gov>
Cc: "Sean Silva" <chiso...@gmail.com>, "llvm-dev" <llvm...@lists.llvm.org>
Sent: Thursday, June 16, 2016 12:45:50 PM
Subject: Re: [llvm-dev] Intended behavior of CGSCC pass manager.
> To clarify, we're trying to provide this invariant on the "ref" graph or on the graph with direct calls only? I think the invariant need only apply to the former

More clarification needed :) What do you mean by 'invariant need only apply to the former'? ;)

> if we're relying on this for correctness (i.e. an analysis must visit all callees before visiting the callers).

Not necessarily. Due to lost edges (from caller to indirect callees), a callee node may be visited later. The analysis will just have to punt when a special edge to the 'external' node is seen.
On Thu, Jun 16, 2016 at 4:48 AM, Sean Silva via llvm-dev
<llvm...@lists.llvm.org> wrote:
> One question is what invariants we want to provide for the visitation.
>
> For example, should a CGSCC pass be able to assume that all "child" SCC's
> (SCC's it can reach via direct calls emanating from the SCC being visited)
> have already been visited? Currently I don't think it can, and IIRC from the
> discussion at the social this is one thing that Chandler is hoping to fix.
> The "ref edge" notion in LazyCallGraph ensures that we cannot create a call
> edge (devirtualizing a ref edge) that will point at an SCC that has not yet
> been visited.
>
> E.g. consider this graph:
>
> digraph G {
>   A -> B; B -> A; // SCC {A,B}
>   S -> T; T -> S; // SCC {S,T}
>   X -> Y; Y -> X; // SCC {X,Y}
>
>   B -> X;
>   B -> S;
>   T -> Y [label="Ref edge that is devirtualized\nwhen visiting SCC {S,T}",constraint=false,style=dashed]
> }
>
> (can visualize conveniently at http://www.webgraphviz.com/ or I have put an
> image at http://reviews.llvm.org/F2073104)
>
> If we do not consider the ref graph, then it is possible for SCC {S,T} to be
I'm not sure why you wouldn't consider the ref graph? I think the
general idea is to visit RefSCCs in bottom up order, and when visiting
a RefSCC, visiting the SCC's inside the RefSCC in bottom up order.
So in your example, given the edges you've shown, we will visit {X,Y}
before visiting {S,T}.
> A more complicated case is when SCC {S,T} and SCC {X,Y} both call into each
> other via function pointers. So eventually after devirtualizing the calls in
> both directions there will be a single SCC {S,T,X,Y}.
>
> digraph G {
>   A -> B; B -> A; // SCC {A,B}
>   S -> T; T -> S; // SCC {S,T}
>   X -> Y; Y -> X; // SCC {X,Y}
>
>   B -> X;
>   B -> S;
>   T -> Y [label="Ref edge that is devirtualized\nwhen visiting SCC {S,T}",constraint=false,style=dashed]
>   X -> S [label="Ref edge that is devirtualized\nwhen visiting SCC {X,Y}",constraint=false,style=dashed]
> }
>
> (rendering at: http://reviews.llvm.org/F2073479)
>
> Due to the cyclic dependence there is no SCC visitation order that can
> directly provide the invariant above. Indeed, the problem of maintaining the
I think the workflow in the above will (roughly) be:

  Visit the RefSCC {X,Y,S,T}
    Visit the SCC {X,Y} // arbitrary
      Optimize({X,Y})
      // Now there's an edge to {S,T}, invalidate
      // the analyses cached for {X,Y} and visit {S,T}
    Visit the SCC {S,T}
      Optimize({S,T})
      // Now {X,Y,S,T} collapses to form a single SCC
    Visit the SCC {S,T,X,Y}
      Optimize({S,T,X,Y})

The difficult bit is to make the inner "// Now.*" bits work well.
-- Sanjoy
From: "Xinliang David Li" <dav...@google.com>;)
To: "Hal Finkel" <hfi...@anl.gov>
Cc: "Sean Silva" <chiso...@gmail.com>, "llvm-dev" <llvm...@lists.llvm.org>
Sent: Thursday, June 16, 2016 12:45:50 PM
Subject: Re: [llvm-dev] Intended behavior of CGSCC pass manager.To clarify, we're trying to provide this invariant on the "ref" graph or on the graph with direct calls only? I think the invariant need only apply to the formerMore clarification needed :) What do you mean by 'invariant need only apply to the former'?
I mean that we only need to visit children in the "ref" graph before their parents. Furthermore, I'm not even sure that we need an invariant at the SCC level, but rather on the functions themselves. Meaning that I don't think we need to specify an invariant that requires revisiting once we split an SCC (it might be useful to do so, but nothing comes to mind that would require that for correctness).

>> if we're relying on this for correctness (i.e. an analysis must visit all callees before visiting the callers).
> Not necessarily. Due to lost edges (from caller to indirect callees), a callee node may be visited later. The analysis will just have to punt when a special edge to the 'external' node is seen.

Yes, but my impression is that the "ref" graph has no lost edges (it is the conservative over-approximation). Is that right?
> Visit the SCC {X,Y} // arbitrary
>   Optimize({X,Y})
>   // Now there's an edge to {S,T}, invalidate
>   // the analyses cached for {X,Y} and visit {S,T}

I am not sure if this makes sense. If, dynamically, the call edge from {X,Y} to {S,T} does exist but is not discovered by the analysis, then the cached {X,Y} will still be invalid, but who is going to invalidate it?
On Thu, Jun 16, 2016 at 9:53 PM, Xinliang David Li <dav...@google.com> wrote:
>> I think the workflow in the above will (roughly) be:
>>
>> Visit the RefSCC {X,Y,S,T}
>
>
> Are we sure RefSCC has ref edges between {X, Y} and {S, T} in this case? I
> may miss the code handling it.
I was going by the diagram -- the diagram explicitly has ref edges
between {X,Y} and {S,T}.
>> Visit the SCC {X,Y} // arbitrary
>> Optimize({X,Y})
>> // Now there's an edge to {S,T}, invalidate
>> // the analyses cached for {X,Y} and visit {S,T}
>
>
> I am not sure if this makes sense. If dynamically, the call edge from {X,
> Y} to {S, T} does exist, but not discovered by the analysis, then the cached
> {X, Y} will still be invalid, but who is going to invalidate it?
I cannot answer this with authority, since I'm not the one working on
the callgraph, but I'll jot down what I think is the case:
Whatever you analyze on {X,Y} will have to be conservative around
indirect calls that haven't yet been devirtualized. Say you're trying
to prove that an SCC is readnone. With the SCC iteration order,
you'll have to do:
  for every call site in CurrentSCC:
    if the call is indirect, then ReadWrite; break out of loop
    if the call is to SCC X, then
      CurrentSCC.MemoryEffect.unionWith(X.MemoryEffect) // L1
To avoid re-walking the call-sites in common cases like the above,
we'll have to add a "HasNonAnalyzedCalls" bit on SCCs that we'll set
when building the call graph (Chandler had promised this a few socials
ago). That would let us directly walk the outgoing call edges (the
bit would be the moral equivalent of having a call edge to
"external").
The invariant provided by the bottom up SCC iteration order, as I
understand it, assures us that in line L1, X.MemoryEffect is as
precise as it can be. When we analyze a call site and the call target
is in a ReadOnly SCC, we are assured that the call target SCC could
not have been proved ReadNone -- we've already tried our best. So in
a way the bottom up order gives us precision, not correctness.
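A toy, self-contained version of this bottom-up computation; the SCCNode type and the little effect lattice are illustrative stand-ins, not LLVM types. Any not-yet-devirtualized indirect call (the "HasNonAnalyzedCalls" bit, the moral equivalent of an edge to "external") pins the SCC at ReadWrite:

  #include <algorithm>
  #include <vector>

  enum MemEffect { ReadNone = 0, ReadOnly = 1, ReadWrite = 2 };

  struct SCCNode {
    bool HasNonAnalyzedCalls = false; // unresolved indirect calls?
    MemEffect OwnEffect = ReadNone;   // effect of the SCC's own code
    std::vector<SCCNode *> Callees;   // direct-call edges out of the SCC
    MemEffect Effect = ReadNone;      // computed summary
  };

  // Precondition: callees were already visited (bottom-up SCC order), so
  // Callee->Effect is as precise as it will ever get (the "L1" step above).
  void computeEffect(SCCNode &C) {
    if (C.HasNonAnalyzedCalls) {
      C.Effect = ReadWrite; // punt: an unknown target could do anything
      return;
    }
    C.Effect = C.OwnEffect;
    for (SCCNode *Callee : C.Callees)
      C.Effect = std::max(C.Effect, Callee->Effect); // the unionWith step
  }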
>> Visit the SCC {X,Y} // arbitrary
>>   Optimize({X,Y})
>>   // Now there's an edge to {S,T}, invalidate
>>   // the analyses cached for {X,Y} and visit {S,T}
>
> I am not sure if this makes sense. If, dynamically, the call edge from {X,Y} to {S,T} does exist but is not discovered by the analysis, then the cached {X,Y} will still be invalid, but who is going to invalidate it?

I assume that if dynamically there was a call from {X,Y} to {S,T}, then the analysis would have observed an indirect call and would have behaved conservatively.
On Thu, Jun 16, 2016 at 5:13 PM, Sean Silva <chiso...@gmail.com> wrote:
> The simple answer is that this is the current state of things. The SCC
> visitation logic in the old PM does not consider the ref graph.
>
> So in some sense the question is why *should* we consider the ref graph?
> What is it buying us? Presumably this will take the form of some invariant
> on the `run(SCC &)` calls. But I have yet to see any explicit statement of
> an invariant that it gives us.
I'll have to think harder to give a deeper answer, but trivially the
invariant you have is that run(SCCA) is called before run(SCCB) if SCCB
refs SCCA but not the other way around. This is just a fancy way of
re-stating that we'll iterate over RefSCC's in bottom up order, of
course; but it already helps in cases like your first example -- if we
don't iterate in bottom up RefSCC order there is no ordering between
{X,Y} and {S,T}, but we would like to visit {X,Y} before {S,T}.
> For example, the examples I gave show that without bailing out in the middle
I'm not sure what design point Chandler is pursuing, but bailing out
in the middle of a pass manager because the data structure it was
operating on is gone is not new to LLVM. The LoopPassManager does
exactly this when a loop is deleted (due to full unrolling, for
instance).
> of a cgscc pass manager (e.g. after the `function(...simplifications that
> can devirtualize...)`) then we cannot even guarantee that the thing passed
> to the `run(SCC &)` function is actually an SCC.
>
> But consider that Optimize({S,T}) might be of the form:
> `cgscc(function(...simplifications that can
> devirtualize...),foo-cgscc-pass)`.
> After running `function(...simplifications that can devirtualize...)` we
> would end up running `foo-cgscc-pass` on {S,T} which is no longer an SCC
> anymore.
> What is the invariant here? What do we actually guarantee for the `run(SCC
> &)` function?
As I said, one possibility is to bail out of the current pipeline, and
re-start from the new leaves (similar to what we do for the loop pass
manager today).
-- Sanjoy
Another point is that it may not be practical to model edges to indirect targets. For virtual calls, without CHA, each virtual callsite will end up referencing all potentially address-taken functions.
On Thu, Jun 16, 2016 at 11:38 PM, Xinliang David Li <dav...@google.com> wrote:
> yes, if RefSCC has such special edge to 'unknown' node to model icall, there
> is no problem, or the analysis still has to walk through the IR?
I don't think RefSCC has such a bit right now -- I believe it is in
the "coming soon" stage. :)
Again, I'm not doing the work or the planning; you'll have to ask
Chandler for specifics.
> There is a disadvantage of setting a bit on the SCC compared with a special call
> edge -- the latter can be per-callsite, so elimination of the last such edge
> automatically makes the caller 'clean'. With the special bit, it is not so easy
> to get rid of it.
Yes, but I suppose we can keep a list of unanalyzable calls instead of
a single bit to make it easier to update.
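A small sketch of the difference being discussed; CallSite here is an opaque stand-in, not LLVM's CallSite. Tracking the unanalyzable call sites themselves means erasing the last one automatically makes the node "clean" again, with no separate bit to remember to clear:

  #include <set>

  struct CallSite;

  struct GraphNode {
    std::set<CallSite *> UnanalyzableCalls;

    // Clean iff every call has a known, analyzed target.
    bool isClean() const { return UnanalyzableCalls.empty(); }

    // Devirtualizing a call site just erases it; there is no sticky
    // per-node bit that someone has to remember to recompute and reset.
    void onDevirtualized(CallSite *CS) { UnanalyzableCalls.erase(CS); }
  };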
>> The invariant provided by the bottom up SCC iteration order, as I
>> understand it, assures us that in line L1, X.MemoryEffect is as
>> precise as it can be. When we analyze a call site and the call target
>> is in a ReadOnly SCC, we are assured that the call target SCC could
>> not have been proved ReadNone -- we've already tried our best. So in
>> a way the bottom up order gives us precision, not correctness.
>
>
> you mean 'correctness' not 'precision'?
I did mean "precision, not correctness", but I phrased it badly. I
meant the motivation for bottom up strategy is that we're more
precise. We're also correct, but the bottom up strategy has no direct
hand it in that; we have to write our SCC passes in a way that they're
conservative around the cases they should be conservative in anyway.
Xinliang David Li wrote:
>> I believe it is primarily used for ordering the visitation of CallSCC's (i.e. SCC's in the "call graph").
> This is what it can do -- but what benefit does it provide?
One benefit is that once you get to a function F that constructs an
instance of a class with virtual functions and then calls a virtual
function on the instance, then the virtual function being called and
the constructor will have been maximally simplified (F refs the
constructor, and the constructor refs all the virtual functions), and
you're more likely to inline the constructor and devirtualize the
call. I don't have any real data to back up that this will materially
help, though.
Sean Silva wrote:
>
>
> On Sun, Jun 19, 2016 at 12:01 AM, Sanjoy Das <san...@playingwithpointers.com> wrote:
>
> Hi David,
>
> Xinliang David Li wrote:
> > > I believe it is primarily used for ordering the visitation of CallSCC's (i.e. SCC's in the "call graph").
> > This is what it can do -- but what benefit does it provide?
>
> One benefit is that once you get to a function F that constructs an
> instance of a class with virtual functions and then calls a virtual
> function on the instance, then the virtual function being called and
> the constructor will have been maximally simplified (F refs the
> constructor, and the constructor refs all the virtual functions), and
> you're more likely to inline the constructor and devirtualize the
> call.
>
>
> That is true for a graph like http://reviews.llvm.org/F2073104 but not one like http://reviews.llvm.org/F2073479
I agree with this ^
> That is, there is no real guarantee.
But not with this ^ :)
The *guarantee*, as I understand it, is bottom up order on the RefSCC
DAG, and once inside a RefSCC bottom up on the SCC DAG contained in
it. This guarantee (the part about RefSCCs) helps more in cases like
http://reviews.llvm.org/F2073104 and the situation David Li described,
and does not quite help as much on http://reviews.llvm.org/F2073479.
In cases like http://reviews.llvm.org/F2073479 the "bottom up on SCCs
inside a RefSCC" part of the guarantee helps more since we still get
to see (depending on whether we first picked {S,T} or {X,Y}) a direct
call from {X,Y} (when iterating over {X,Y}) to {S,T} with {S,T} fully
simplified.
> I don't have any real data to back up that this will materially
> help, though.
>
> And we haven't had an RFC for any of this...
Yes, an RFC would have helped here.
Xinliang David Li wrote:
> [snip]
>
> However, in real applications, what I see is the following pattern (for
> instance, LLVM's Pass):
>
> Caller() {
>   Base *B = Factory::create(...);
>   Stash(B); // store the object in some container to be retrieved later
>   ...
> }
>
> SomeTask() {
>   Base *B = findObject(...);
>   B->vCall(); // do the work
> }
>
> Driver() {
>   Caller(); // create objects ...
>   SomeTask();
> }
>
> Set aside the fact that it is usually much harder to do
> devirtualization in this case, assuming the virtual call in
> SomeTask can be devirtualized. What we need is that the virtual
> functions are processed before the SomeTask node, but this is not guaranteed
> unless we also model the call edge ordering imposed by control flow.
I think the thesis here is you cannot devirtualize the call in
`SomeTask` without also looking at `Caller` [0]. So the flow is:
- Optimize Caller, SomeTask independently as much as you want.
  * Caller -refs-> Factory::create, which -refs-> the constructors,
    which -refs-> the various implementations of virtual functions
    (based on my current understanding of how C++ vtables are
    lowered); so these implementations should have been simplified by
    the time we look at Caller.
- Then look at Driver. Caller, SomeTask are all maximally
simplified. We now (presumably) inline Caller and SomeTask,
devirtualize the B->vCall (as you said: theoretically possible, but
if findObject etc. are complex then practically maybe not), and now
inline the maximally simplified devirtualized call targets.
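Spelling out the ref edges this flow relies on (a sketch based on the
vtable-lowering assumption above, using the names from David's
example):

  Driver -refs-> Caller   -refs-> Factory::create -refs-> the
                                  constructors -refs-> the virtual
                                  function bodies (via the vtables)
  Driver -refs-> SomeTask -refs-> findObject

  A post-order walk therefore reaches the virtual function bodies
  before Caller, and reaches both Caller and SomeTask before Driver.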
> However, this forces virtual methods to be processed before their
> objects' creators. Are there other, simpler ways to achieve the same
> effect (if we have data to justify it)?
Honestly: I'll have to think about it. It is entirely possible that a
(much?) simpler design will catch 99% (or even better) of the
idiomatic cases; I just don't have a good mental model for what those
cases are.
At this point I'm waiting for Chandler to upload his patch so that we
can have this discussion on the review thread. :)
[0]: This breaks down when we allow "out of thin air"
devirtualizations (I'm stealing this term from memory models, but I
think it is appropriate here :) ), where you look at a call site and
"magically" (i.e. in a way not expressible in terms of "normal"
optimizations like store forwarding, PRE, GVN, etc.) are able to
devirtualize the call site. We do this all the time in Java (we'll
look at the type of the receiver object, look at the current class
hierarchy, and directly mandate that a certain call site has to have a
certain target), but the RefSCC call graph does not allow for that.
These kinds of out-of-thin-air devirtualizations will have to be
modeled as ModulePasses, IIUC.
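For illustration, a hypothetical class-hierarchy-based ("out of thin
air") devirtualization expressed in C++ terms -- a sketch of the
concept, not a transform the thread proposes:

  struct Base { virtual int run() = 0; virtual ~Base() = default; };
  struct Impl : Base { int run() override { return 1; } };

  int invoke(Base *B) {
    // With the whole-program fact that Impl is the only class
    // implementing Base, a module-level pass may rewrite this into a
    // direct call to Impl::run. Nothing local to this call site
    // (store forwarding, GVN, ...) predicts that target, so no ref
    // edge anticipates it.
    return B->run();
  }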
-- Sanjoy
From: "Sean Silva via llvm-dev" <llvm...@lists.llvm.org>
To: "llvm-dev" <llvm...@lists.llvm.org>
Sent: Wednesday, June 8, 2016 6:19:03 AM
Subject: [llvm-dev] Intended behavior of CGSCC pass manager.Hi Chandler, Philip, Mehdi, (and llvm-dev,)(this is partially a summary of some discussions that happened at the last LLVM bay area social, and partially a discussion about the direction of the CGSCC pass manager)A the last LLVM social we discussed the progress on the CGSCC pass manager. It seems like Chandler has a CGSCC pass manager working, but it is still unresolved exactly which semantics we want (more about this below) that are reasonably implementable.AFAICT, there has been no public discussion about what exact semantics we ultimately want to have. We should figure that out.The main difficulty which Chandler described is the apparently quite complex logic surrounding needing to run function passes nested within an SCC pass manager, while providing some guarantees about exactly what order the function passes are run. The existing CGSCC pass manager just punts on some of the problems that arise (look in CGPassManager::runOnModule, CGPassManager::RunAllPassesOnSCC, and CGPassManager::RunPassOnSCC in llvm/lib/Analysis/CallGraphSCCPass.cpp), and these are the problems that Chandler has been trying to solve.(Why is this "function passes inside CGSCC passes" stuff interesting? Because LLVM can do inlining on an SCC (often just a single function) and then run function passes to simplify the function(s) in the SCC before it tries to inline into a parent SCC. (the SCC visitation order is post-order)For example, we may inline a bunch of code, but after inlining we can tremendously simplify the function, and we want to do so before considering this function for inlining into its callers so that we get an accurate evaluation of the inline cost.Based on what Chandler said, it seems that LLVM is fairly unique in this regard and other compilers don't do this (which is why we can't just look at how other compilers solve this problem; they don't have this problem (maybe they should? or maybe we shouldn't?)). For example, he described that GCC uses different inlining "phases"; e.g. it does early inlining on the entire module, then does simplifications on the entire module, then does late inlining on the entire module; so it is not able to incrementally simplify as it inlines like LLVM does.
This incremental simplification is an important feature of our inliner, and one we should endeavor to keep. We might also want different phases at some point (e.g. a top-down and a bottom-up phase), but that's another story.
> 2. What is the intended behavior of CGSCC passes when SCC's are split
> or merged? E.g. a CGSCC pass runs on an SCC (e.g. the inliner). Now we
> run some function passes nested inside the CGSCC pass manager (e.g. to
> simplify things after inlining). Consider:
This is not how I thought the current scheme worked ;) -- I was under the impression that we had a call graph with conservatively-connected dummy nodes for external/indirect functions. As a result, there is no semantics-preserving optimization that will merge SCCs, only split them. In that case, I'd expect that once an SCC is split, we re-run the CGSCC passes over the newly-separated SCCs. But this corresponds to running over the "ref graph", as you describe it. I don't understand why we want the non-conservative graph.
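A sketch of the conservative scheme described here, with a hypothetical
<external> dummy node standing in for unknown/indirect callees (my
reconstruction, not the actual CallGraph node names):

  every indirect call site:        caller -> <external>
  every address-taken function f:  <external> -> f

  Devirtualizing caller's call to f then merely replaces the path
  caller -> <external> -> f with the direct edge caller -> f;
  reachability never grows, so optimization can only split SCCs,
  never merge them.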
> [snip]
FWIW, I see no purpose in abstracting one algorithm, especially if that makes things algorithmically harder. Also, the LazyCallGraph abstraction and the iterator abstraction seem like separate issues. Iterator abstractions are often useful because you can use them in generic algorithms, etc.
> Since this is such a high priority (due to blocking PGO inlining), I
> will probably try my hand at implementing the CGSCC pass manager
> sometime soon unless somebody beats me to it. (I'll probably try the
> "open-coded SCC visit" approach).
>
> Another possibility is implementing the new CGSCC pass manager that
> uses the same visitation semantics as the one in the old PM, and then
> we can refactor that as needed. In fact, that may be the best approach
> so that porting to the new PM is as NFC as possible and we can isolate
> the functional (i.e., need benchmarks, measurements ...) changes in
> separate commits.
I'm in favor of this approach for exactly the reason you mention. Being able to bisect regressions to the algorithmic change, separate from the infrastructure change, will likely make things easier in the long run (and will avoid the problem, to the extent possible, of performance regressions blocking the pass-manager work).
> Sorry for the wall of text.
No problem; much appreciated.
-Hal
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev