From: "Sean Silva via llvm-dev" <llvm...@lists.llvm.org>
To: "llvm-dev" <llvm...@lists.llvm.org>
Sent: Wednesday, June 8, 2016 6:19:03 AM
Subject: [llvm-dev] Intended behavior of CGSCC pass manager.

Hi Chandler, Philip, Mehdi, (and llvm-dev,)

(this is partially a summary of some discussions that happened at the last LLVM bay area social, and partially a discussion about the direction of the CGSCC pass manager)

At the last LLVM social we discussed the progress on the CGSCC pass manager. It seems like Chandler has a CGSCC pass manager working, but it is still unresolved exactly which semantics we want (more about this below) that are reasonably implementable. AFAICT, there has been no public discussion about what exact semantics we ultimately want to have. We should figure that out.

The main difficulty which Chandler described is the apparently quite complex logic surrounding the need to run function passes nested within an SCC pass manager, while providing some guarantees about exactly what order the function passes are run in. The existing CGSCC pass manager just punts on some of the problems that arise (look in CGPassManager::runOnModule, CGPassManager::RunAllPassesOnSCC, and CGPassManager::RunPassOnSCC in llvm/lib/Analysis/CallGraphSCCPass.cpp), and these are the problems that Chandler has been trying to solve.

(Why is this "function passes inside CGSCC passes" stuff interesting? Because LLVM can do inlining on an SCC (often just a single function) and then run function passes to simplify the function(s) in the SCC before it tries to inline into a parent SCC. (The SCC visitation order is post-order.) For example, we may inline a bunch of code, but after inlining we can tremendously simplify the function, and we want to do so before considering this function for inlining into its callers, so that we get an accurate evaluation of the inline cost.

Based on what Chandler said, it seems that LLVM is fairly unique in this regard and other compilers don't do this (which is why we can't just look at how other compilers solve this problem; they don't have this problem (maybe they should? or maybe we shouldn't?)). For example, he described that GCC uses different inlining "phases"; e.g. it does early inlining on the entire module, then does simplifications on the entire module, then does late inlining on the entire module; so it is not able to incrementally simplify as it inlines like LLVM does.)
2. What is the intended behavior of CGSCC passes when SCC's are split or merged? E.g. a CGSCC pass runs on an SCC (e.g. the inliner). Now we run some function passes nested inside the CGSCC pass manager (e.g. to simplify things after inlining). Consider:

a) These function passes are e.g. now able to devirtualize a call, adding an edge to the CG, forming a larger CG SCC. Do you re-run the CGSCC pass (say, the inliner) on this larger SCC?

b) These function passes are e.g. able to DCE a call, removing an edge from the CG. This converts, say, a CG SCC which is a cycle graph (like a->b->c->a) into a path graph (a->b->c, with no edge back to a). The inliner had already visited a, b, and c as a single SCC. Now does it have to re-visit c, then b, then a, as single-node SCC's?

btw: One way that I have found it useful to think about this is in terms of the visitation during Tarjan's SCC algorithm. I'll reference the pseudocode in https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm. Inside the "strongconnect" routine, when we have identified an SCC (the true branch of the `if (v.lowlink = v.index)` test) we can visit stack[v.index:stack.size()] as an SCC. This may or may not invalidate some things on the stack (the variable `S` in the pseudocode) and we may need to fix it up (e.g. inlining deleted a function, so we can't have an entry for it on the stack). Then, we can run function passes as we pop individual functions off the stack. This is easier to think about IMO than merging of SCC data structures: if we add edges to the CG then we have to do more DFS on the new edges, and if we delete edges then the DFS order of the stack gives us certain guarantees.

Personally I find this much easier to reason about than the description in terms of splitting and merging SCC's in the CG and ref graph (which the LazyCallGraph API forces one to think about, since it hides the underlying Tarjan's algorithm). The LazyCallGraph API makes the current loop in http://reviews.llvm.org/diffusion/L/browse/llvm/trunk/include/llvm/Analysis/CGSCCPassManager.h;272124$100 very clean, but at least for my thinking about the problem, it seems like the wrong abstraction (and most of the LazyCallGraph API seems to be unused, so it seems like it may be overly heavyweight).

E.g. I think that maybe the easiest thing to do is to turn the current approach inside out: instead of having the pass manager logic be the "normal code" and forcing the Tarjan algorithm to become a state machine of iterators, use an open-coded Tarjan algorithm with some callbacks and make the pass management logic be the state machine.

This will also open the door to avoiding the potentially quadratic size of the ref graph, since e.g. in the example I gave above, we can mark the `funcs` array itself as already having been visited during the walk. In the current LazyCallGraph, this would require adding some sort of notion of hyperedge.
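To make the "open-coded Tarjan with callbacks" idea concrete, here is a minimal, self-contained sketch: Tarjan's algorithm where the pass-management logic lives in a callback invoked each time an SCC is completed. The Graph representation and the OnSCC hook are made up for illustration; this is not LLVM's API.

  #include <algorithm>
  #include <functional>
  #include <vector>

  struct Graph {
    std::vector<std::vector<int>> Succ; // Succ[V] = nodes V calls
  };

  class TarjanSCCVisitor {
    const Graph &G;
    std::vector<int> Index, Lowlink, Stack;
    std::vector<bool> OnStack;
    int NextIndex = 0;
    std::function<void(const std::vector<int> &)> OnSCC;

    void strongconnect(int V) {
      Index[V] = Lowlink[V] = NextIndex++;
      Stack.push_back(V);
      OnStack[V] = true;
      for (int W : G.Succ[V]) {
        if (Index[W] < 0) { // tree edge: recurse
          strongconnect(W);
          Lowlink[V] = std::min(Lowlink[V], Lowlink[W]);
        } else if (OnStack[W]) { // edge into the current DFS stack
          Lowlink[V] = std::min(Lowlink[V], Index[W]);
        }
      }
      if (Lowlink[V] == Index[V]) {
        // An SCC is complete: pop it off and hand it to the pass logic.
        // Every SCC reachable from it has already been handed out, so the
        // callback sees SCCs bottom-up (post-order), callees first.
        std::vector<int> CurSCC;
        int W;
        do {
          W = Stack.back();
          Stack.pop_back();
          OnStack[W] = false;
          CurSCC.push_back(W);
        } while (W != V);
        OnSCC(CurSCC);
      }
    }

  public:
    TarjanSCCVisitor(const Graph &G,
                     std::function<void(const std::vector<int> &)> Callback)
        : G(G), Index(G.Succ.size(), -1), Lowlink(G.Succ.size(), -1),
          OnStack(G.Succ.size(), false), OnSCC(std::move(Callback)) {}

    void run() {
      for (int V = 0, E = static_cast<int>(G.Succ.size()); V != E; ++V)
        if (Index[V] < 0)
          strongconnect(V);
    }
  };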
Since this is such a high priority (due to blocking PGO inlining), I will probably try my hand at implementing the CGSCC pass manager sometime soon unless somebody beats me to it. (I'll probably try the "open-coded SCC visit" approach.)

Another possibility is implementing the new CGSCC pass manager so that it uses the same visitation semantics as the one in the old PM, and then refactoring that as needed. In fact, that may be the best approach, so that porting to the new PM is as NFC as possible and we can isolate the functional (i.e., needing benchmarks, measurements, ...) changes in separate commits.
Sorry for the wall of text.
-- Sean Silva
Sean Silva wrote:
> Hi Chandler, Philip, Mehdi, (and llvm-dev,)
>
> (this is partially a summary of some discussions that happened at the last LLVM bay area social, and partially a
> discussion about the direction of the CGSCC pass manager)
Thanks for writing this up! This sort of thing is very helpful for
people who don't attend the social, or had to leave early. :)
> Chandler described that he had a major breakthrough in that the CGSCC pass manager only had to deal with 3 classes of
> modifications that can occur:
> - a pass may e.g. propagate a load of a function pointer into an indirect call, turning it into a direct call. This
> requires adding an edge in the CG but not in the ref graph.
> - a pass may take a direct call and turn it into an indirect call. This requires removing an edge from the CG, but not
> in the ref graph.
> - a pass may delete a direct call. This removes an edge in the CG and also in the ref graph.
At what granularity are we modeling these things? E.g. if SimplifyCFG
deletes a basic block, will we remove call edges that start from that
block?
Note there is a fourth kind of modification that isn't modeled here:
devirtualization, in which we transform IR like

  %fnptr = compute_fnptr(%receiver, i32 <method id>)
  call %fnptr(%receiver)

to

  call some.specific.Class::method(%receiver)

(via a pass that understands the semantics of compute_fnptr).
However, at this time I don't think modeling "out of thin air"
devirtualization of this sort is important, since nothing upstream
does things like this (we do have ModulePasses downstream that do
this).
> From the perspective of the CGSCC pass manager, these operations can affect the SCC structure. Adding an edge might
> merge SCC's and deleting an edge might split SCC's. Chandler mentioned that apparently the issues of splitting and
> merging SCC's within the current infrastructure are actually quite challenging and lead to e.g. iterator invalidation
> issues, and that is what he is working on.
>
> (
> The ref graph is important to guide the overall SCC visitation order because it basically represents "the largest graph
> that the CG may turn into due to our static analysis of this module". I.e. no transformation we can statically make in
> the CGSCC passes can ever cause us to need to merge SCC's in the ref graph.
> )
Except in the above "out of thin air" devirtualization case.
I think cross-function store forwarding can also be problematic:
  void foo(fnptr* bar) {
    if (bar)
      (*bar)();
  }

  void baz() { foo(null); }

  void caller() {
    fnptr *t = malloc();
    *t = baz;
    foo(t);
  }

  // RefGraph is
  // caller -> baz
  // caller -> foo
  // baz -> foo
Now the RefSCCs are {foo}, {caller}, {baz} but if we forward the store
to *t in caller into the load in foo (and are smart about the null
check) we get:
  void foo(fnptr* bar) {
    if (bar)
      baz();
  }

  void baz() { foo(null); }

  void caller() {
    fnptr *t = malloc();
    *t = baz;
    foo(t);
  }

  // RefGraph is
  // foo -> baz
  // baz -> foo
  // caller -> foo
  // caller -> baz
and now the RefSCCs are {foo, baz}, {caller}
But again, I think this is fine since nothing upstream does this at
this time; and we can cross that bridge when we come to it.
> 2. What is the intended behavior of CGSCC passes when SCC's are split or merged? E.g. a CGSCC pass runs on an SCC (e.g.
> the inliner). Now we run some function passes nested inside the CGSCC pass manager (e.g. to simplify things after
> inlining). Consider:
>
> a) These function passes are e.g. now able to devirtualize a call, adding an edge to the CG, forming a larger CG SCC. Do
> you re-run the CGSCC pass (say, the inliner) on this larger SCC?
>
> b) These function passes are e.g. able to DCE a call, removing an edge from the CG. This converts, say, a CG SCC which
> is a cycle graph (like a->b->c->a) into a path graph (a->b->c, with no edge back to a). The inliner had already visited
> a, b, and c as a single SCC. Now does it have to re-visit c, then b, then a, as single-node SCC's?
Okay, so this is the same question I wrote above: "At what granularity
are we modeling these things?", phrased more eloquently. :)
> One way that I have found it useful to think about this is in terms of the visitation during Tarjan's SCC algorithm.
> I'll reference the pseudocode in https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm.
> Inside the "strongconnect" routine when we have identified an SCC (the true branch of `if (v.lowlink = v.index)` test )
> we can visit stack[v.index:stack.size()] as an SCC. This may or may not invalidate some things on the stack (the
> variable `S` in the pseudocode) and we may need to fix it up (e.g. inlining deleted a function, so we can't have an
> entry on the stack). Then, we can run function passes as we pop individual functions off the stack, but it is easier to
> think about IMO than merging of SCC data structures: if we add edges to the CG then we have to do more DFS on the new
> edges and if we delete edges then the DFS order of the stack gives us certain guarantees.
I'm not sure how this will be easier. E.g. consider
X -> A
A -> B
B -> C
C -> B
B -> A
Now you're going to pop A,B,C together, as an SCC; and you start
optimizing them, and the edge from B to A falls out. Don't you have
the same problem now (i.e. we've removed an edge from an SCC we're
iterating over)? Perhaps I misunderstood your scheme?
-- Sanjoy
Does it make sense to change RefSCCs to hold a list of
RefSCC-DAG-Roots that were split out of it because of edge deletion?
Then one way to phrase the inliner/function pass iteration would be
(assuming I understand the issues):
  Stack.push(RefSCC_Leaves);
  while (!Stack.empty()) {
    RefSCC = Stack.pop();
    InlineCallSites(RefSCC);
    if (!RefSCC.splitOutSCCs.empty())
      goto repush;
    for each func in RefSCC:
      FPM.run(func);
    if (!RefSCC.splitOutSCCs.empty())
      goto repush;
    continue;
  repush:
    for (refscc_dag_root in RefSCC.splitOutSCCs)
      // Here we don't want to push every leaf, but only leaves that have
      // functions that haven't had the FPM run on them (maybe we can do
      // this by maintaining a set?). If we don't push a leaf, we push its
      // parent (which we want to push even if we've run the FPM on it,
      // since we'd like to re-run the inliner on it).
      refscc_dag_root.push_leaves_to(Stack);
  }

(I know this isn't ideal, since now RefSCC is no longer "just a data
structure", but actually has incidental information.)
--
Sanjoy Das
http://playingwithpointers.com
I'm not sure what you mean above, but IIUC I think it is a bit more complex since the inliner runs at the same place you put FPM.run().
--
Mehdi
On Wed, Jun 8, 2016 at 4:19 AM, Sean Silva <chiso...@gmail.com> wrote:
> [snip]

I want to clarify a little more on GCC's behavior. GCC has two types of IPA passes: one is called a simple IPA pass and the other is called a regular IPA pass. A simple IPA pass does not transform IR by itself, but it can have function-level sub-passes that do transformations. If it has function sub-passes, the function passes will be executed in bottom-up order, just like LLVM's CGSCC pass. A regular IPA pass is somewhat like the module pass in LLVM.

GCC's pass pipeline is actually similar to LLVM's pipeline overall; it has the following components:
(1) lowering passes
(2) small/simple IPA passes including the early optimization pipeline (bottom-up)
(3) full IPA pipelines
(4) post-IPA optimization and code generation

The difference is that LLVM's (2) includes only instcombine and SimplifyCFG, but GCC's (2) is larger and also includes an early inlining pass.
GCC's (2) is similar to LLVM's CGSCC pass but with fewer passes. Its inliner also targets tiny functions where inlining reduces size overall. LLVM's LTO pipeline is kind of like this too.

GCC's (3) includes a full-blown regular inlining pass. This inliner uses a priority-order based algorithm, so bottom-up traversal does not make sense for it. It requires (2) to have a pass that collects summaries to drive the decisions (as well as profile data).
On Wed, Jun 8, 2016 at 10:57 AM, Xinliang David Li <dav...@google.com> wrote:
> [snip]
> The difference is that LLVM's (2) includes only instcombine and SimplifyCFG, but GCC's (2) is larger and also includes an early inlining pass.

So the early inlining is a function pass? In LLVM I think that the function pass contract would not allow inlining to occur in a function pass (the contract is theoretically designed to allow function passes to run concurrently on functions within a module).
> [snip] Another possibility is implementing the new CGSCC pass manager that uses the same visitation semantics as the one in the old PM, and then we can refactor that as needed. In fact, that may be the best approach so that porting to the new PM is as NFC as possible and we can isolate the functional (i.e., need benchmarks, measurements ...) changes in separate commits.

A very high level comment: why do we need to update the callgraph on the fly? Can we have more general support for iterative SCC pass invocation? Something like:

1) build the callgraph
2) cache the post-order traversal order
3) if the order list is empty -- done
4) traversal: invoke function passes for each function in the order (from step 2 or 5). The call graph gets updated on the fly (with new edges, or new nodes for cloned functions)
5) update the function traversal order from new nodes and new edges created in 4)
6) go to step 3)
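A minimal sketch of what this iterative scheme might look like, with a toy worklist standing in for the real pass manager. The integer node IDs and the RunPasses callback are illustrative assumptions, not existing LLVM APIs; RunPasses returns the callers/ancestors that must be revisited because the node's transformations exposed new call edges.

  #include <deque>
  #include <functional>
  #include <vector>

  void iterativeVisit(const std::vector<int> &CachedPostOrder,
                      const std::function<std::vector<int>(int)> &RunPasses) {
    // Steps 1-2: the post-order is computed once and cached by the caller.
    std::deque<int> Worklist(CachedPostOrder.begin(), CachedPostOrder.end());
    while (!Worklist.empty()) { // step 3: done when the order list is empty
      int N = Worklist.front();
      Worklist.pop_front();
      // Step 4: run passes; the call graph may gain edges/nodes here.
      // Steps 5-6: only the affected ancestors get re-queued for a revisit.
      for (int Caller : RunPasses(N))
        Worklist.push_back(Caller);
    }
  }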
There may be missed optimizations in some cases, though. I think the current scheme catches most cases, and when it does not, we are just missing potential inlining. The question may be how many (more) cases we really need to catch with the new pass manager?
-- Sanjoy
On Wed, Jun 8, 2016 at 4:20 PM, Xinliang David Li <xinli...@gmail.com> wrote:
> Is it in the category of 'invalidating the iterator while iterating', which
> feels very wrong to me. We should avoid going there and find better ways to
> solve the motivating problems (perhaps defining them more clearly first?).
I'm not 100% sure of what you meant by that, so I'll try to give a
general answer, and hope that it covers the points you wanted to
raise. :)
In the scheme above I'm not trying to solve iterator invalidation --
I'm trying to solve the following problem: the CGSCC pass manager ran
the inliner and a set of function passes on a function, and they did
something to invalidate the RefSCC we were iterating over[0]. How do
we _continue_ our iteration over this now non-existent RefSCC?
The solution I'm trying to propose is this: The only possibility for
invalidation is that the RefSCC was broken up into a forest of
RefSCC-DAGs[1]. This means if we had a way to get to the leaves of
this forest of RefSCCs, we could restart our iteration from there
(I've tacitly assumed we're interested in a bottom-up SCC order).
This may be difficult in general, but my idea was to "cheat" and
explicitly remember the Ref-SCC-forest a RefSCC was broken down into
when we do that invalidation. Then once an RefSCC is split, we can
pick up the forest from the original RefSCC* (which is otherwise
useless now), gather the leaves, and re-start our iteration from those
leaves.
This leaves the question of what to do with the SCC DAG nested inside
the RefSCC. I'm not sure what Chandler has in mind regarding how much
influence these should have over the iteration order, but if we wanted
to iterate over the SCC-DAG in bottom up order as well (as we iterated
over a single RefSCC), we could have the same scheme to handle
SCC-splits, and a similar scheme to handle SCC-merges (when you merge
an SCC, the SCC that gets cleaned out gets a pointer to the SCC where
all the functions went, and if the SCC you were iterating over gets
such a pointer after running the inliner/FPM you chase that pointer
(possibly multiple times, if more than one SCC was merged) and
re-start iteration over that SCC).
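A tiny sketch of the forwarding-pointer part of this scheme; the SCC struct here is a stand-in, not LazyCallGraph's SCC. A merged-away SCC leaves behind a pointer to its successor, and the iteration logic chases the chain (union-find style) to find the live SCC:

  // A merged-away SCC keeps a breadcrumb to where its functions went.
  struct SCC {
    SCC *MergedInto = nullptr; // non-null once this SCC has been merged away
    // ... functions, cached analyses, etc.
  };

  // Merging Src into Dst leaves the breadcrumb for any in-flight iteration.
  void mergeInto(SCC &Src, SCC &Dst) {
    // ... move Src's functions and state into Dst ...
    Src.MergedInto = &Dst;
  }

  // Chase forwarding pointers (possibly several, if merges cascaded) to
  // reach the SCC that now holds our functions, then resume iteration there.
  SCC *resolve(SCC *C) {
    while (C->MergedInto)
      C = C->MergedInto;
    return C;
  }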
By "incident data structure" I meant that with these additions the
RefSCC or SCC is no longer a "pure" function of the structure of the
module, but has state that is a function of what the pass manager did
which is not ideal. That is, in theory this isn't significantly
cleaner than the passes reaching out into and changing the CGSCC pass
manager's state, but perhaps we are okay with this kind of design for
practicality's sake?
[0]: One question I don't know the answer to -- how will we detect
that something has removed a call or ref edge? Will we rescan
functions to verify that edges we thought existed still exist? Or
will we have a ValueHandle-like scheme?
[1]: As Sean mentioned, by design nothing in the function pass
manager pipeline could have invalidated the RefSCC by merging it with
other RefSCCs.
-- Sanjoy
On Wed, Jun 8, 2016 at 5:35 PM, Sean Silva <chiso...@gmail.com> wrote:
> In both of these examples you gave (out-of-thin-air devirtualization and
> forwarding into a callee) is the contract of a CGSCC pass being violated? (I
> believe the contract is that a CGSCC pass cannot invalidate the analyses of
> a different SCC
> (https://github.com/llvm-mirror/llvm/blob/master/include/llvm/Analysis/CGSCCPassManager.h#L104)
> )
I'm not sure -- in both the cases I'm introducing a call to a
different SCC. Obviously, if your analysis state is "cache of all
incoming call edges at the function granularity" then it is violated.
But, OTOH, if your analysis state is "a count of all the CallInsts
coming into the function" then even basic things like unrolling loops
will invalidate that.
Speaking more broadly about the algorithm you just described, did you intend to omit an SCC visitation step?
The goal of the CGSCC pass manager is the ability to visit an SCC (e.g. inliner visits an SCC), then immediately run function passes to simplify the result of inlining.
Only after the simplification has occurred do we visit the parent SCC. By running the simplifications in lock-step with the inliner SCC visitation we give parent SCC's a more accurate view of the real cost of inlining a function from a child SCC.
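Schematically, the lock-step structure being described looks something like the following sketch. All the helper declarations are hypothetical placeholders (the real pass managers are far more involved); the point is the ordering: each SCC is inlined and then simplified before any parent SCC computes inline costs against it.

  #include <vector>

  struct Function { /* ... */ };
  struct SCC { std::vector<Function *> Funcs; };

  std::vector<SCC> postOrderSCCs(); // callees-first SCC order (assumed)
  void inlineInto(SCC &C);          // CGSCC pass: the inliner (assumed)
  void simplify(Function &F);       // nested function passes (assumed)

  void runCGSCCPipeline() {
    for (SCC &C : postOrderSCCs()) {
      inlineInto(C); // inline within/into the SCC
      for (Function *F : C.Funcs)
        simplify(*F); // simplify immediately, so parent SCCs see the
                      // post-cleanup cost of inlining from this SCC
    }
  }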
Issues similar to 2.a) (i.e. adding edges to the CG) also affect the visitation order of SCC's (not just functions). For example, we visit an SCC with a CGSCC pass (e.g. inliner). Then we run the first function pass on that SCC, which may add an edge (e.g. promote indirect call to direct call) that may enlarge the SCC. Do we continue running the remaining function passes? Do we re-visit this new enlarged SCC? (if so, when?) These are the types of questions that motivated this post (hence the name "Intended behavior of CGSCC pass manager.").
> Speaking more broadly about the algorithm you just described, did you intend to omit an SCC visitation step?

It is independent of this decision. However, we can actually go back one step and ask the question: is adding this layer really necessary? Why not just traverse the cgraph nodes in reverse topo-order (after removing the cycles)?

> The goal of the CGSCC pass manager is the ability to visit an SCC (e.g. inliner visits an SCC), then immediately run function passes to simplify the result of inlining.

This is how the current implementation works. Is it a fundamental requirement?
> Only after the simplification has occurred do we visit the parent SCC. By running the simplifications in lock-step with the inliner SCC visitation we give parent SCC's a more accurate view of the real cost of inlining a function from a child SCC.

This works for any bottom-up scheme, regardless of whether an SCC layer is added.

> Issues similar to 2.a) (i.e. adding edges to the CG) also affect the visitation order of SCC's (not just functions). For example, we visit an SCC with a CGSCC pass (e.g. inliner). Then we run the first function pass on that SCC, which may add an edge (e.g. promote indirect call to direct call) that may enlarge the SCC. Do we continue running the remaining function passes? Do we re-visit this new enlarged SCC? (if so, when?) These are the types of questions that motivated this post (hence the name "Intended behavior of CGSCC pass manager.").

Yes, I understand the motivation -- that is why I propose the algorithm that uses a cached iteration order + a worklist-based iterative approach. In your example, when a new edge is exposed via devirtualization, only its caller/ancestor nodes need to be revisited in the next iteration.
From: "Sean Silva via llvm-dev" <llvm...@lists.llvm.org>
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
To clarify, we're trying to provide this invariant on the "ref" graph or on the graph with direct calls only? I think the invariant need only apply to the former
if we're relying on this for correctness (i.e. an analysis must visit all callees before visiting the callers).
From: "Xinliang David Li" <dav...@google.com>
To: "Hal Finkel" <hfi...@anl.gov>
Cc: "Sean Silva" <chiso...@gmail.com>, "llvm-dev" <llvm...@lists.llvm.org>
Sent: Thursday, June 16, 2016 12:45:50 PM
Subject: Re: [llvm-dev] Intended behavior of CGSCC pass manager.
> To clarify, we're trying to provide this invariant on the "ref" graph or on the graph with direct calls only? I think the invariant need only apply to the former

More clarification needed :) What do you mean by 'invariant need only apply to the former'? ;)

> if we're relying on this for correctness (i.e. an analysis must visit all callees before visiting the callers).

Not necessarily. Due to lost edges (from caller to indirect callees), a callee node may be visited later. The analysis will just have to punt when a special edge to the 'external' node is seen.
On Thu, Jun 16, 2016 at 4:48 AM, Sean Silva via llvm-dev
<llvm...@lists.llvm.org> wrote:
> One question is what invariants we want to provide for the visitation.
>
> For example, should a CGSCC pass be able to assume that all "child" SCC's
> (SCC's it can reach via direct calls emanating from the SCC being visited)
> have already been visited? Currently I don't think it can, and IIRC from the
> discussion at the social this is one thing that Chandler is hoping to fix.
> The "ref edge" notion in LazyCallGraph ensures that we cannot create a call
> edge (devirtualizing a ref edge) that will point at an SCC that has not yet
> been visited.
>
> E.g. consider this graph:
>
> digraph G {
>   A -> B; B -> A; // SCC {A,B}
>   S -> T; T -> S; // SCC {S,T}
>   X -> Y; Y -> X; // SCC {X,Y}
>
>   B -> X;
>   B -> S;
>   T -> Y [label="Ref edge that is devirtualized\nwhen visiting SCC {S,T}",constraint=false,style=dashed]
> }
>
> (can visualize conveniently at http://www.webgraphviz.com/ or I have put an
> image at http://reviews.llvm.org/F2073104)
>
> If we do not consider the ref graph, then it is possible for SCC {S,T} to be
I'm not sure why you wouldn't consider the ref graph? I think the
general idea is to visit RefSCCs in bottom up order, and when visiting
a RefSCC, visiting the SCC's inside the RefSCC in bottom up order.
So in your example, given the edges you've shown, we will visit {X,Y}
before visiting {S,T}.
> A more complicated case is when SCC {S,T} and SCC {X,Y} both call into each
> other via function pointers. So eventually after devirtualizing the calls in
> both directions there will be a single SCC {S,T,X,Y}.
>
> digraph G {
>   A -> B; B -> A; // SCC {A,B}
>   S -> T; T -> S; // SCC {S,T}
>   X -> Y; Y -> X; // SCC {X,Y}
>
>   B -> X;
>   B -> S;
>   T -> Y [label="Ref edge that is devirtualized\nwhen visiting SCC {S,T}",constraint=false,style=dashed]
>   X -> S [label="Ref edge that is devirtualized\nwhen visiting SCC {X,Y}",constraint=false,style=dashed]
> }
>
> (rendering at: http://reviews.llvm.org/F2073479)
>
> Due to the cyclic dependence there is no SCC visitation order that can
> directly provide the invariant above. Indeed, the problem of maintaining the
I think the workflow in the above will (roughly) be:

  Visit the RefSCC {X,Y,S,T}
    Visit the SCC {X,Y} // arbitrary
      Optimize({X,Y})
      // Now there's an edge to {S,T}, invalidate
      // the analyses cached for {X,Y} and visit {S,T}
    Visit the SCC {S,T}
      Optimize({S,T})
      // Now {X,Y,S,T} collapses to form a single SCC
    Visit the SCC {S,T,X,Y}
      Optimize({S,T,X,Y})

The difficult bit is to make the inner "// Now.*" bits work well.
-- Sanjoy
From: "Xinliang David Li" <dav...@google.com>;)
To: "Hal Finkel" <hfi...@anl.gov>
Cc: "Sean Silva" <chiso...@gmail.com>, "llvm-dev" <llvm...@lists.llvm.org>
Sent: Thursday, June 16, 2016 12:45:50 PM
Subject: Re: [llvm-dev] Intended behavior of CGSCC pass manager.To clarify, we're trying to provide this invariant on the "ref" graph or on the graph with direct calls only? I think the invariant need only apply to the formerMore clarification needed :) What do you mean by 'invariant need only apply to the former'?
I mean that we only need to visit children in the "ref" graph before their parents. Furthermore, I'm not even sure that we need an invariant at the SCC level, but rather on the functions themselves. Meaning that I don't think we need to specify an invariant that requires revisiting once we split an SCC (it might be useful to do so, but nothing comes to mind that would require that for correctness).

>> if we're relying on this for correctness (i.e. an analysis must visit all callees before visiting the callers).
> Not necessarily. Due to lost edges (from caller to indirect callees), a callee node may be visited later. The analysis will just have to punt when a special edge to the 'external' node is seen.

Yes, but my impression is that the "ref" graph has no lost edges (it is the conservative over-approximation). Is that right?
> Visit the SCC {X,Y} // arbitrary
>   Optimize({X,Y})
>   // Now there's an edge to {S,T}, invalidate
>   // the analyses cached for {X,Y} and visit {S,T}

I am not sure if this makes sense. If, dynamically, the call edge from {X,Y} to {S,T} does exist but is not discovered by the analysis, then the cached {X,Y} will still be invalid, but who is going to invalidate it?
On Thu, Jun 16, 2016 at 9:53 PM, Xinliang David Li <dav...@google.com> wrote:
>> I think the workflow in the above will (roughly) be:
>>
>> Visit the RefSCC {X,Y,S,T}
>
>
> Are we sure RefSCC has ref edges between {X, Y} and {S, T} in this case? I
> may miss the code handling it.
I was going by the diagram -- the diagram explicitly has ref edges
between {X,Y} and {S,T}.
>> Visit the SCC {X,Y} // arbitrary
>> Optimize({X,Y})
>> // Now there's an edge to {S,T}, invalidate
>> // the analyses cached for {X,Y} and visit {S,T}
>
>
> I am not sure if this makes sense. If dynamically, the call edge from {X,
> Y} to {S, T} does exist, but not discovered by the analysis, then the cached
> {X, Y} will still be invalid, but who is going to invalidate it?
I cannot answer this with authority, since I'm not the one working on
the callgraph, but I'll jot down what I think is the case:
Whatever you analyze on {X,Y} will have to be conservative around
indirect calls that haven't yet been devirtualized. Say you're trying
to prove that an SCC is readnone. With the SCC iteration order,
you'll have to do:
  for every call site in CurrentSCC:
    if the call is indirect, then ReadWrite; break out of loop
    if the call is to SCC X, then
      CurrentSCC.MemoryEffect.unionWith(X.MemoryEffect) // L1
To avoid re-walking the call-sites in common cases like the above,
we'll have to add a "HasNonAnalyzedCalls" bit on SCCs that we'll set
when building the call graph (Chandler had promised this a few socials
ago). That would let us directly walk the outgoing call edges (the
bit would be the moral equivalent of having a call edge to
"external").
The invariant provided by the bottom up SCC iteration order, as I
understand it, assures us that in line L1, X.MemoryEffect is as
precise as it can be. When we analyze a call site and the call target
is in a ReadOnly SCC, we are assured that the call target SCC could
not have been proved ReadNone -- we've already tried our best. So in
a way the bottom up order gives us precision, not correctness.
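A toy, self-contained version of this bottom-up computation; the SCCNode type and the little effect lattice are illustrative stand-ins, not LLVM types. Any not-yet-devirtualized indirect call (the "HasNonAnalyzedCalls" bit, the moral equivalent of an edge to "external") pins the SCC at ReadWrite:

  #include <algorithm>
  #include <vector>

  enum MemEffect { ReadNone = 0, ReadOnly = 1, ReadWrite = 2 };

  struct SCCNode {
    bool HasNonAnalyzedCalls = false; // unresolved indirect calls?
    MemEffect OwnEffect = ReadNone;   // effect of the SCC's own code
    std::vector<SCCNode *> Callees;   // direct-call edges out of the SCC
    MemEffect Effect = ReadNone;      // computed summary
  };

  // Precondition: callees were already visited (bottom-up SCC order), so
  // Callee->Effect is as precise as it will ever get (the "L1" step above).
  void computeEffect(SCCNode &C) {
    if (C.HasNonAnalyzedCalls) {
      C.Effect = ReadWrite; // punt: an unknown target could do anything
      return;
    }
    C.Effect = C.OwnEffect;
    for (SCCNode *Callee : C.Callees)
      C.Effect = std::max(C.Effect, Callee->Effect); // the unionWith step
  }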
>> Visit the SCC {X,Y} // arbitrary
>>   Optimize({X,Y})
>>   // Now there's an edge to {S,T}, invalidate
>>   // the analyses cached for {X,Y} and visit {S,T}
>
> I am not sure if this makes sense. If, dynamically, the call edge from {X,Y} to {S,T} does exist but is not discovered by the analysis, then the cached {X,Y} will still be invalid, but who is going to invalidate it?

I assume that if dynamically there was a call from {X,Y} to {S,T}, then the analysis would have observed an indirect call and would have behaved conservatively.
On Thu, Jun 16, 2016 at 5:13 PM, Sean Silva <chiso...@gmail.com> wrote:
> The simple answer is that this is the current state of things. The SCC
> visitation logic in the old PM does not consider the ref graph.
>
> So in some sense the question is why *should* we consider the ref graph?
> What is it buying us? Presumably this will take the form of some invariant
> on the `run(SCC &)` calls. But I have yet to see any explicit statement of
> an invariant that it gives us.
I'll have to think harder to give a deeper answer, but trivially the
invariant you have is that run(SCCA) is called before run(SCCB) if SCCB
refs SCCA but not the other way around. This is just a fancy way of
re-stating that we'll iterate over RefSCC's in bottom up order, of
course; but it already helps in cases like your first example -- if we
don't iterate in bottom up RefSCC order there is no ordering between
{X,Y} and {S,T}, but we would like to visit {X,Y} before {S,T}.
> For example, the examples I gave show that without bailing out in the middle
I'm not sure what design point Chandler is pursuing, but bailing out
in the middle of a pass manager because the data structure it was
operating on is gone is not new to LLVM. The LoopPassManager does
exactly this when a loop is deleted (due to full unrolling, for
instance).
> of a cgscc pass manager (e.g. after the `function(...simplifications that
> can devirtualize...)`) then we cannot even guarantee that the thing passed
> to the `run(SCC &)` function is actually an SCC.
>
> But consider that Optimize({S,T}) might be of the form:
> `cgscc(function(...simplifications that can
> devirtualize...),foo-cgscc-pass)`.
> After running `function(...simplifications that can devirtualize...)` we
> would end up running `foo-cgscc-pass` on {S,T} which is no longer an SCC
> anymore.
> What is the invariant here? What do we actually guarantee for the `run(SCC
> &)` function?
As I said, one possibility is to bail out of the current pipeline, and
re-start from the new leaves (similar to what we do for the loop pass
manager today).
-- Sanjoy
Another point is that it may not be practical to model edges to indirect targets. For virtual calls, without CHA, each virtual callsite will end up referencing all potentially address-taken functions.
On Thu, Jun 16, 2016 at 11:38 PM, Xinliang David Li <dav...@google.com> wrote:
> yes, if RefSCC has such special edge to 'unknown' node to model icall, there
> is no problem, or the analysis still has to walk through the IR?
I don't think RefSCC has such a bit right now -- I believe it is in
the "coming soon" stage. :)
Again, I'm not doing the work or the planning; you'll have to ask
Chandler for specifics.
> There is a disadvantage of setting a bit on the SCC compared with a special call
> edge -- the latter can be per-callsite, so elimination of the last such edge
> automatically makes the caller 'clean'. With the special bit, it is not so easy
> to get rid of it.
Yes, but I suppose we can keep a list of unanalyzable calls instead of
a single bit to make it easier to update.
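A small sketch of the difference being discussed; CallSite here is an opaque stand-in, not LLVM's CallSite. Tracking the unanalyzable call sites themselves means erasing the last one automatically makes the node "clean" again, with no separate bit to remember to clear:

  #include <set>

  struct CallSite;

  struct GraphNode {
    std::set<CallSite *> UnanalyzableCalls;

    // Clean iff every call has a known, analyzed target.
    bool isClean() const { return UnanalyzableCalls.empty(); }

    // Devirtualizing a call site just erases it; there is no sticky
    // per-node bit that someone has to remember to recompute and reset.
    void onDevirtualized(CallSite *CS) { UnanalyzableCalls.erase(CS); }
  };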
>> The invariant provided by the bottom up SCC iteration order, as I
>> understand it, assures us that in line L1, X.MemoryEffect is as
>> precise as it can be. When we analyze a call site and the call target
>> is in a ReadOnly SCC, we are assured that the call target SCC could
>> not have been proved ReadNone -- we've already tried our best. So in
>> a way the bottom up order gives us precision, not correctness.
>
>
> you mean 'correctness' not 'precision'?
I did mean "precision, not correctness", but I phrased it badly. I
meant the motivation for bottom up strategy is that we're more
precise. We're also correct, but the bottom up strategy has no direct
hand it in that; we have to write our SCC passes in a way that they're
conservative around the cases they should be conservative in anyway.
Xinliang David Li wrote:
>> I believe it is primarily used for ordering the visitation of CallSCC's (i.e. SCC's in the "call graph").
> This is what it can do -- but what benefit does it provide?
One benefit is that once you get to a function F that constructs an
instance of a class with virtual functions and then calls a virtual
function on the instance, then the virtual function being called and
the constructor will have been maximally simplified (F refs the
constructor, and the constructor refs all the virtual functions), and
you're more likely to inline the constructor and devirtualize the
call. I don't have any real data to back up that this will materially
help, though.
Sean Silva wrote:
>
>
> On Sun, Jun 19, 2016 at 12:01 AM, Sanjoy Das <san...@playingwithpointers.com> wrote:
>
> Hi David,
>
> Xinliang David Li wrote:
> > > I believe it is primarily used for ordering the visitation of CallSCC's (i.e. SCC's in the "call graph").
> > This is what it can do -- but what benefit does it provide?
>
> One benefit is that once you get to a function F that constructs an
> instance of a class with virtual functions and then calls a virtual
> function on the instance, then the virtual function being called and
> the constructor will have been maximally simplified (F refs the
> constructor, and the constructor refs all the virtual functions), and
> you're more likely to inline the constructor and devirtualize the
> call.
>
>
> That is true for a graph like http://reviews.llvm.org/F2073104 but not one like http://reviews.llvm.org/F2073479
I agree with this ^
> That is, there is no real guarantee.
But not with this ^ :)
The *guarantee*, as I understand it, is bottom up order on the RefSCC
DAG, and once inside a RefSCC bottom up on the SCC DAG contained in
it. This guarantee (the part about RefSCCs) helps more in cases like
http://reviews.llvm.org/F2073104 and the situation David Li described,
and does not quite help as much on http://reviews.llvm.org/F2073479.
In cases like http://reviews.llvm.org/F2073479 the "bottom up on SCCs
inside a RefSCC" part of the guarantee helps more since we still get
to see (depending on whether we first picked {S,T} or {X,Y}) a direct
call from {X,Y} (when iterating over {X,Y}) to {S,T} with {S,T} fully
simplified.
> I don't have any real data to back up that this will materially
> help, though.
>
> And we haven't had an RFC for any of this...
Yes, an RFC would have helped here.
Xinliang David Li wrote:
> [snip]
>
> However, in real applications, what I see is the following pattern (for
> instance, LLVM's Pass):
>
> Caller() {
>   Base *B = Factory::create(...);
>   Stash(B); // store the object in some container to be retrieved later
>   ...
> }
>
> SomeTask() {
>   Base *B = findObject(...);
>   B->vCall(); // do the work
> }
>
> Driver() {
>   Caller(); // create objects ...
>   SomeTask();
> }
>
> Set aside the fact that it is usually much harder to do
> devirtualization in this case, assuming the virtual call in
> SomeTask can be devirtualized. What we need is that the virtual
> functions are processed before the SomeTask node, but this is not guaranteed
> unless we also model the call edge ordering imposed by control flow.
I think the thesis here is you cannot devirtualize the call in
`SomeTask` without also looking at `Caller` [0]. So the flow is:
- Optimize Caller, SomeTask independently as much as you want.
  * Caller -refs-> Factory::create, which -refs-> the constructors,
    which -refs-> the various implementations of virtual functions
    (based on my current understanding of how C++ vtables are
    lowered); so these implementations should have been simplified by
    the time we look at Caller.
- Then look at Driver. Caller, SomeTask are all maximally
simplified. We now (presumably) inline Caller and SomeTask,
devirtualize the B->vCall (as you said: theoretically possible, but
if findObject etc. are complex then practically maybe not), and now
inline the maximally simplified devirtualized call targets.
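Spelling out the ref edges this flow relies on (a sketch based on the
vtable-lowering assumption above, using the names from David's
example):

  Driver -refs-> Caller   -refs-> Factory::create -refs-> the
                                  constructors -refs-> the virtual
                                  function bodies (via the vtables)
  Driver -refs-> SomeTask -refs-> findObject

  A post-order walk therefore reaches the virtual function bodies
  before Caller, and reaches both Caller and SomeTask before Driver.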
> However, this forces virtual methods to be processed before their
> objects' creators. Are there other, simpler ways to achieve the same
> effect (if we have data to justify it)?
Honestly: I'll have to think about it. It is entirely possible that a
(much?) simpler design will catch 99% (or even better) of the
idiomatic cases; I just don't have a good mental model for what those
cases are.
At this point I'm waiting for Chandler to upload his patch so that we
can have this discussion on the review thread. :)
[0]: This breaks down when we allow "out of thin air"
devirtualizations (I'm stealing this term from memory models, but I
think it is appropriate here :) ), where you look at a call site and
"magically" (i.e. in a way not expressible in terms of "normal"
optimizations like store forwarding, PRE, GVN, etc.) are able to
devirtualize the call site. We do this all the time in Java (we'll
look at the type of the receiver object, look at the current class
hierarchy, and directly mandate that a certain call site has to have a
certain target), but the RefSCC call graph does not allow for that.
These kinds of out-of-thin-air devirtualizations will have to be
modeled as ModulePasses, IIUC.
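For illustration, a hypothetical class-hierarchy-based ("out of thin
air") devirtualization expressed in C++ terms -- a sketch of the
concept, not a transform the thread proposes:

  struct Base { virtual int run() = 0; virtual ~Base() = default; };
  struct Impl : Base { int run() override { return 1; } };

  int invoke(Base *B) {
    // With the whole-program fact that Impl is the only class
    // implementing Base, a module-level pass may rewrite this into a
    // direct call to Impl::run. Nothing local to this call site
    // (store forwarding, GVN, ...) predicts that target, so no ref
    // edge anticipates it.
    return B->run();
  }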
-- Sanjoy
From: "Sean Silva via llvm-dev" <llvm...@lists.llvm.org>
To: "llvm-dev" <llvm...@lists.llvm.org>
Sent: Wednesday, June 8, 2016 6:19:03 AM
Subject: [llvm-dev] Intended behavior of CGSCC pass manager.Hi Chandler, Philip, Mehdi, (and llvm-dev,)(this is partially a summary of some discussions that happened at the last LLVM bay area social, and partially a discussion about the direction of the CGSCC pass manager)A the last LLVM social we discussed the progress on the CGSCC pass manager. It seems like Chandler has a CGSCC pass manager working, but it is still unresolved exactly which semantics we want (more about this below) that are reasonably implementable.AFAICT, there has been no public discussion about what exact semantics we ultimately want to have. We should figure that out.The main difficulty which Chandler described is the apparently quite complex logic surrounding needing to run function passes nested within an SCC pass manager, while providing some guarantees about exactly what order the function passes are run. The existing CGSCC pass manager just punts on some of the problems that arise (look in CGPassManager::runOnModule, CGPassManager::RunAllPassesOnSCC, and CGPassManager::RunPassOnSCC in llvm/lib/Analysis/CallGraphSCCPass.cpp), and these are the problems that Chandler has been trying to solve.(Why is this "function passes inside CGSCC passes" stuff interesting? Because LLVM can do inlining on an SCC (often just a single function) and then run function passes to simplify the function(s) in the SCC before it tries to inline into a parent SCC. (the SCC visitation order is post-order)For example, we may inline a bunch of code, but after inlining we can tremendously simplify the function, and we want to do so before considering this function for inlining into its callers so that we get an accurate evaluation of the inline cost.Based on what Chandler said, it seems that LLVM is fairly unique in this regard and other compilers don't do this (which is why we can't just look at how other compilers solve this problem; they don't have this problem (maybe they should? or maybe we shouldn't?)). For example, he described that GCC uses different inlining "phases"; e.g. it does early inlining on the entire module, then does simplifications on the entire module, then does late inlining on the entire module; so it is not able to incrementally simplify as it inlines like LLVM does.
This incremental simplification is an important feature of our inliner, and one we should endeavor to keep. We might also want different phases at some point (e.g. a top-down and a bottom-up phase), but that's another story.
> 2. What is the intended behavior of CGSCC passes when SCC's are split
> or merged? E.g. a CGSCC pass runs on an SCC (e.g. the inliner). Now we
> run some function passes nested inside the CGSCC pass manager (e.g. to
> simplify things after inlining). Consider:
This is not how I thought the current scheme worked ;) -- I was under the impression that we had a call graph with conservatively-connected dummy nodes for external/indirect functions. As a result, there is no semantics-preserving optimization that will merge SCCs, only split them. In that case, I'd expect that once an SCC is split, we re-run the CGSCC passes over the newly-separated SCCs. But this corresponds to running over the "ref graph", as you describe it. I don't understand why we want the non-conservative graph.
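A sketch of the conservative scheme described here, with a hypothetical
<external> dummy node standing in for unknown/indirect callees (my
reconstruction, not the actual CallGraph node names):

  every indirect call site:        caller -> <external>
  every address-taken function f:  <external> -> f

  Devirtualizing caller's call to f then merely replaces the path
  caller -> <external> -> f with the direct edge caller -> f;
  reachability never grows, so optimization can only split SCCs,
  never merge them.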
> [snip]
FWIW, I see no purpose in abstracting one algorithm, especially if that makes things algorithmically harder. Also, the LazyCallGraph abstraction and the iterator abstraction seem like separate issues. Iterator abstractions are often useful because you can use them in generic algorithms, etc.
> Since this is such a high priority (due to blocking PGO inlining), I
> will probably try my hand at implementing the CGSCC pass manager
> sometime soon unless somebody beats me to it. (I'll probably try the
> "open-coded SCC visit" approach).
>
> Another possibility is implementing the new CGSCC pass manager that
> uses the same visitation semantics as the one in the old PM, and then
> we can refactor that as needed. In fact, that may be the best approach
> so that porting to the new PM is as NFC as possible and we can isolate
> the functional (i.e., need benchmarks, measurements ...) changes in
> separate commits.
I'm in favor of this approach for exactly the reason you mention. Being able to bisect regressions to the algorithmic change, separate from the infrastructure change, will likely make things easier in the long run (and will avoid the problem, to the extent possible, of performance regressions blocking the pass-manager work).
> Sorry for the wall of text.
No problem; much appreciated.
-Hal
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev