We (Google) have started to look more closely at the profiling infrastructure in LLVM. Internally, we have a large dependency on PGO to get peak performance in generated code.
Some of the dependencies we have on profiling are still not present in LLVM (e.g., the inliner) but we will still need to incorporate changes to support our work on these optimizations. Some of the changes may be addressed as individual bug fixes on the existing profiling infrastructure. Other changes may be better implemented as either new extensions or as replacements of existing code.
I think we will try to minimize infrastructure replacement at least in the short/medium term. After all, it doesn't make too much sense to replace infrastructure that is broken for code that doesn't exist yet.
David Li and I are preparing a document where we describe the major issues that we'd like to address. The document is a bit on the lengthy side, so it may be easier to start with an email discussion.
This is a summary of the main changes we are looking at:
- Need to faithfully represent the execution count taken from dynamic profiles. Currently, MD_prof does not really represent an execution count. This makes things like comparing hotness across functions hard or impossible. We need a concept of global hotness.
- When the CFG or callgraph change, there needs to exist an API for incrementally updating/scaling counts -- for instance, when a function is inlined or partially inlined, or when the CFG is modified. These counts need to be updated incrementally (or perhaps re-computed, as a first step in that direction).
- The inliner (and other optimizations) needs to use profile information and update it accordingly. This is predicated on Chandler's work on the pass manager, of course.
- Need to represent global profile summary data. For example, for global hotness determination, it is useful to compute additional global summary info, such as a histogram of counts that can be used to determine hotness and working set size estimates for a large percentage of the profiled execution.
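The first point above -- that branch weights alone cannot express global hotness -- can be illustrated with a small sketch. This is hypothetical illustrative code, not LLVM code; the numbers are made up:

```python
# Hypothetical sketch (not LLVM code): why MD_prof branch weights alone
# cannot compare hotness across functions, but entry counts can.

# Two functions with identical branch weights on their hottest branch:
foo_weights = (3, 1)   # foo's branch taken 75% of the time
bar_weights = (3, 1)   # bar's branch also taken 75% of the time

# Weights only give probabilities, so both branches look equally "hot":
prob = lambda w: w[0] / sum(w)
assert prob(foo_weights) == prob(bar_weights) == 0.75

# With per-function entry counts from the dynamic profile, a global
# execution count can be derived, and the difference becomes visible:
foo_entry_count = 1_000_000
bar_entry_count = 10

foo_hotness = prob(foo_weights) * foo_entry_count   # 750000.0
bar_hotness = prob(bar_weights) * bar_entry_count   # 7.5
assert foo_hotness > bar_hotness
```
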
There are other changes that we will need to incorporate. David, Teresa, Chandler, please add anything large that I missed.
My main question at the moment is what would be the best way of addressing them. Some seem to require new concepts to be implemented (e.g., execution counts). Others could be addressed as simple bugs to be fixed in the current framework.
Would it make sense to present everything in a unified document and discuss that? I've got some reservations about that approach because we will end up discussing everything at once and it may not lead to concrete progress. Another approach would be to present each issue individually either as patches or RFCs or bugs.
I will be taking on the implementation of several of these issues. Some of them involve the SamplePGO harness that I added last year. I would also like to know what other bugs or problems people have in mind that I could also roll into this work.
Thanks. Diego.
On 02/25/2015 12:11 PM, Teresa Johnson wrote:
> On Wed, Feb 25, 2015 at 10:52 AM, Philip Reames <list...@philipreames.com> wrote:
>> Other than the inliner, can you list the passes you think are profitable to teach about profiling data? My list so far is: PRE (particularly of loads!), the vectorizer (i.e. duplicate work down both a hot and cold path when it can be vectorized on the hot path), LoopUnswitch, IRCE, & LoopUnroll (avoiding code size explosion in cold code). I'm much more interested in sources of improved performance than I am simply code size reduction. (Reducing code size can improve performance of course.)
>
> Also, code layout (bb layout, function layout, function splitting).

Correct me if I'm wrong, but we already have "function layout" (i.e. basic block placement). It may need improvement, but I've found it to be reasonably effective.

What do you mean by "bb layout"?

I'm assuming you're referring to a form of outlining as "function splitting". This seems like a clearly good idea.
Ah, got it. That makes sense for a C/C++ use case.
> Need to represent global profile summary data. For example, for global hotness determination, it is useful to compute additional global summary info, such as a histogram of counts that can be used to determine hotness and working set size estimates for a large percentage of the profiled execution.

Er, not clear what you're trying to say here?

The idea is to get a sense of a good global profile count threshold to use given an application's profile, i.e. when determining whether a profile count is hot in the given profile. For example, what is the minimum profile count contributing to the hottest 99% of the application's profile?
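The "minimum count contributing to the hottest 99%" idea can be sketched in a few lines. This is an illustrative sketch, not LLVM code; the function name and data are made up:

```python
# Hypothetical sketch: deriving a "hot" threshold from a profile summary.
# Given all block counts from a profile, find the minimum count among the
# hottest blocks that together account for 99% of total execution.

def hot_threshold(counts, fraction=0.99):
    total = sum(counts)
    running = 0
    for c in sorted(counts, reverse=True):
        running += c
        if running >= fraction * total:
            return c  # minimum count within the hottest `fraction`
    return 0

# 990 of the 1000 execution units come from the two hot blocks;
# the cold blocks contribute the remaining 10.
counts = [900, 90, 4, 3, 2, 1]
assert hot_threshold(counts) == 90
```

Any count at or above the returned threshold would then be considered "hot" for this profile.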
In my use case, we're only compiling the hottest methods. As such, I care mostly about relative hotness within a method or a small set of methods.
On Feb 24, 2015, at 3:31 PM, Diego Novillo <dnov...@google.com> wrote:
Yes -- there are two aspects of the problems.
1) raw profile data representation in IR and
2) the profile count data represented for CFG.
What you said applies to 2), which is one of the possibilities. There is a
third issue that is also going to be covered in more detail -- the
block frequency propagation algorithm is limited (leading to
information loss). When profile counts are available, block frequency
data can be computed directly via simple normalization and scaling.
This requires the raw edge count data in 1) to be represented
truthfully.
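The "simple normalization and scaling" step can be sketched as follows. This is a hypothetical illustration, not LLVM's implementation; the entry-frequency constant and block names are made up:

```python
# Hypothetical sketch: with real execution counts, block frequencies
# need no iterative propagation -- they are just the raw counts scaled
# so that the entry block gets the canonical entry frequency.

ENTRY_FREQ = 8  # assumed canonical entry frequency, for illustration

def freqs_from_counts(block_counts, entry):
    scale = ENTRY_FREQ / block_counts[entry]
    return {bb: c * scale for bb, c in block_counts.items()}

counts = {"entry": 4, "loop.body": 400, "exit": 4}
freqs = freqs_from_counts(counts, "entry")
assert freqs["entry"] == 8
assert freqs["loop.body"] == 800
```
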
>
> When the CFG or callgraph change, there need to exist an API for
> incrementally updating/scaling counts. For instance, when a function is
> inlined or partially inlined, when the CFG is modified, etc. These counts
> need to be updated incrementally (or perhaps re-computed as a first step
> into that direction).
>
> One of the main reasons that we chose to use branch weights to represent
> profile information within functions is that it makes this problem easier.
> Of course, we still need to update the branch weights when transforming the
> CFG, but I believe most of that work has already been done. Are you
> suggesting that we should work on incremental BlockFrequencyInfo updates? We
> have discussed that in the past, but so far, it has worked reasonably well
> to just re-run that analysis. (I wouldn’t be surprised if we’re missing some
> places where the analysis needs to be invalidated so that it gets re-run.)
Diego is going to share the proposal in more detail. I will give a
brief summary to answer your question:
1) Making the raw profile data (in MD_prof) truthful does not change
its original meaning -- the branch count is still a branch weight -- but
it also happens to be an execution count, which carries more information.
2) At the CFG level (BranchProbabilityInfo), using real branch probabilities
does not change the way the branch weight is used either -- the branch
probability is still a branch weight, just normalized so that 0 <= prob
<= 1. Benefits and reasons behind that will be provided.
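The normalization in 2) amounts to dividing each weight by the sum of the weights. A minimal sketch (illustrative only, not LLVM code):

```python
# Branch probabilities are just branch weights normalized into [0, 1].

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

# A two-way branch with weights {100, 300}:
probs = normalize([100, 300])
assert probs == [0.25, 0.75]
assert sum(probs) == 1.0
```
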
The infrastructure is ready to update the raw MD_prof data during
intra-procedural transformations when branch instructions are cloned,
but not the CFG-level branch probabilities or block frequency/block count
updates. This is especially important for inter-procedural transformations
like the inliner (i.e., updating the profile data associated with an inline
instance in the caller's context). Recomputing via frequency propagation is
not only imprecise (as mentioned above), but also very expensive in
compile time.
thanks,
David
- Bug 22716: Need a mechanism to represent global profile counts in CFG and MachineCFG
- Bug 22718: Information loss and performance issues in branch probability representation
- Bug 22719: Improvements in raw branch profile data representation
I've created a few bugzilla issues with details of some of the things I'll be looking into. I'm not yet done wordsmithing the overall design document. I'll try to finish it by early next week at the latest.
Diego is planning to give a more detailed reply next Monday. There seems
to be some misunderstanding about the proposals, so I will just give
some highlights here:
1) The proposal is not intended to fundamentally change the current
framework, but to enhance the framework so that
   a) more profile information is preserved
   b) block/edge count/frequency becomes faster to compute
   c) profile information becomes faster to access and update
(inter-procedurally)
2) Changes to profile APIs and profile client code will be minimized,
except that we will add IPA clients (once Chandler's pass manager
change is ready)
3) The proposed change does *not* give up relative hotness as
mentioned by Philip. All clients that rely on relative hotness are
unaffected -- except that the data is better and more reliable
4) With real profile data available, the current infrastructure does *not*
provide reasonable hotness (e.g., try comparing BBs that execute the
same number of times, but sit in loops of different depths in the same
function, and see how big the difference is), let alone fast updating.
I am reasonably confident that the proposal
1) does not affect compilations using a static profile (with branch prediction), and
2) is strictly better for -fprofile-instr-use optimizations.
The area I am not so sure about is the JIT, but I am really interested in
knowing the details and proposing solutions for you if the current
proposal does not work for you (which I doubt -- because if the
current framework works, the new one should work too :) ).
I am looking forward to more detailed discussions next week! We shall
sit down together and discuss changes, rationale, concerns one by one
-- with concrete examples.
thanks,
David
(Sorry for the delay responding; I've been on holiday.)
There are two things going on here.
Firstly, the loop scales are being capped at 4096. I propagated this
approximation from the previous version of BFI. If it's causing a
problem (which it looks like it is), we should drop it and fix what
breaks. You can play around with this by commenting out the `if`
statement at the end of `computeLoopScale()` in
BlockFrequencyInfoImpl.cpp.
For example, without that logic this testcase gives:
Printing analysis 'Block Frequency Analysis' for function 'main':
block-frequency-info: main
- entry: float = 1.0, int = 8
- for.cond: float = 51.5, int = 411
- for.body: float = 50.5, int = 403
- for.cond1: float = 5051.0, int = 40407
- for.body3: float = 5000.5, int = 40003
- for.cond4: float = 505001.0, int = 4040007
- for.body6: float = 500000.5, int = 4000003
- for.inc: float = 500000.5, int = 4000003
- for.end: float = 5000.5, int = 40003
- for.inc7: float = 5000.5, int = 40003
- for.end9: float = 50.5, int = 403
- for.inc10: float = 50.5, int = 403
- for.end12: float = 1.0, int = 8
- for.cond13: float = 51.5, int = 411
- for.body15: float = 50.5, int = 403
- for.cond16: float = 500051.0, int = 4000407
- for.body18: float = 500000.5, int = 4000003
- for.inc19: float = 500000.5, int = 4000003
- for.end21: float = 50.5, int = 403
- for.inc22: float = 50.5, int = 403
- for.end24: float = 1.0, int = 8
- for.cond26: float = 500001.5, int = 4000011
- for.body28: float = 500000.5, int = 4000003
- for.inc29: float = 500000.5, int = 4000003
- for.end31: float = 1.0, int = 8
(Now we get 500000.5 for all the inner loop bodies.)
Secondly, instrumentation-based profiling intentionally fuzzes the
profile data in the frontend using Laplace's Rule of Succession (look at
`scaleBranchWeight()` in CodeGenPGO.cpp).
For example, "loop 1" (which isn't affected by the 4096 cap) should give
a loop scale of 500000.5, not 1000000. (The profile data says
1000000/10000 for the inner loop, 10000/100 for the middle, and 100/1
for the outer loop. Laplace says that we should fuzz these branch
weights to 1000001/10001, 10001/101, and 101/2, which works out to
1000001/2 == 500000.5 total.)
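The fuzzing arithmetic above can be reproduced in a few lines. This is a sketch of the add-one smoothing as described in the message, not the actual CodeGenPGO implementation:

```python
from fractions import Fraction

# Laplace "rule of succession" fuzzing as described above: each branch
# weight gets 1 added before the ratio is taken. The weight pairs below
# are the profiled (backedge, exit) counts quoted in the message for the
# inner, middle, and outer loops.
profiled = [(1000000, 10000), (10000, 100), (100, 1)]

fuzzed = [(taken + 1, other + 1) for taken, other in profiled]
assert fuzzed == [(1000001, 10001), (10001, 101), (101, 2)]

# The nested loop scales multiply, and the fuzzed ratios telescope:
scale = Fraction(1)
for taken, other in fuzzed:
    scale *= Fraction(taken, other)
assert scale == Fraction(1000001, 2)   # i.e. 500000.5
assert float(scale) == 500000.5
```
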
On Thu, Mar 5, 2015 at 11:29 AM, Bob Wilson <bob.w...@apple.com> wrote:
On Mar 2, 2015, at 4:19 PM, Diego Novillo <dnov...@google.com> wrote:
On Thu, Feb 26, 2015 at 6:54 PM, Diego Novillo <dnov...@google.com> wrote:
I've created a few bugzilla issues with details of some of the things I'll be looking into. I'm not yet done wordsmithing the overall design document. I'll try to finish it by early next week at the latest.
The document is available at
There are several topics covered. Ideally, I would prefer that we discuss each topic separately. The main ones I will start working on are the ones described in the bugzilla links we have in the doc.
This is just a starting point for us. I am not at all concerned with implementing exactly what is proposed in the document. In fact, if we can get the same value using the existing support, all the better.
OTOH, any other ideas that folks may have that work better than this are more than welcome. I don't have really strong opinions on the matter. I am fine with whatever works.
Thanks for the detailed write-up on this. Some of the issues definitely need to be addressed. I am concerned, though, that some of the ideas may be leading toward a scenario where we have essentially two completely different ways of representing profile information in LLVM IR. It is great to have two complementary approaches to collecting profile data, but two representations in the IR would not make sense.
Yeah, I don't think I'll continue to push for a new MD_count attribute. If we were to make MD_prof a "real" execution count, that would be enough. Note that by re-using MD_prof we are not changing its meaning at all. The execution count is still a weight, and the ratio is still a branch probability. All we are changing are the absolute values of the numbers, and we are widening the data type to remove the 32-bit limitation.
The first issue raised is that profile execution counts are not represented in the IR. This was a very intentional decision. I know it goes against what other compilers have done in the past. It took me a while to get used to the idea when Andy first suggested it, so I know it seems awkward at first. The advantage is that branch probabilities are much easier to keep updated in the face of compiler transformations, compared to execution counts.
Sorry. I don't follow. Updating counts as the CFG is transformed is not difficult at all. What examples do you have in mind? The big advantage of making MD_prof an actual execution count is that it is a meaningful metric wrt scaling and transformation.
Say, for instance, that we have a branch instruction with two targets with counts {100, 300} inside a function 'foo' that has entry count 2. The edge probability for the first edge (count 100) is 100/(100+300) = 25%.
If we inline foo() inside another function bar() at a callsite with profile count == 1, the cloned branch instruction gets its counters scaled with the callsite count. So the new branch has counts {100 * 1 / 2, 300 * 1 / 2} = {50, 150}. But the branch probability did not change. Currently, we are cloning the branch without changing the edge weights.
This scaling is not difficult at all and can be done incrementally, very quickly. We cannot afford to recompute all frequencies on the fly because it would be detrimental to compile time: if foo() itself has N callees inlined into it, then when foo() is inlined into bar(), the frequencies would need to be recomputed for foo() and all N callees inlined into it.
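The worked example from the previous paragraphs can be checked directly. An illustrative sketch (not LLVM code):

```python
# Inlining foo() (entry count 2, branch counts {100, 300}) into bar()
# at a callsite with profile count 1: the cloned branch's counts are
# scaled by callsite_count / callee_entry_count.

def scale_counts(counts, callsite_count, callee_entry_count):
    return [c * callsite_count // callee_entry_count for c in counts]

foo_branch = [100, 300]
inlined = scale_counts(foo_branch, callsite_count=1, callee_entry_count=2)
assert inlined == [50, 150]

# The branch probability is unchanged by the scaling:
prob = lambda c: c[0] / sum(c)
assert prob(foo_branch) == prob(inlined) == 0.25
```
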
We are definitely missing the per-function execution counts that are needed to be able to compare relative “hotness” across functions, and I think that would be a good place to start making improvements. In the long term, we should keep our options open to making major changes, but before we go there, we should try to make incremental improvements to fix the existing infrastructure.
Right, and that's the core of our proposal. We don't really want to make major infrastructure changes at this point. The only thing I'd like to explore is making MD_prof a real count. This will be useful for the inliner changes and it would also make incremental updates easier, because the scaling that needs to be done is very straightforward and quick.
Note that this change should not modify the current behaviour we get from profile analysis. Things that were hot before should continue to be hot now.
Many of the other issues you raise seem like they could also be addressed without major changes to the existing infrastructure. Let’s try to fix those first.
That's exactly the point of the proposal. We definitely don't want to make major changes to the infrastructure at first. My thinking is to start working on making MD_prof a real count. One of the things that are happening is that the combination of real profile data plus the frequency propagation that we are currently doing is misleading the analysis.
Thanks. Diego.
1) keep MD_prof definition as it is today; also keep using the
frequency propagation as it is (assuming programs with irreducible
loops are not common and not important. If it turns out to be
otherwise, we will revisit this).
2) fix all problems that lead to wrong 'frequency/count' computed from
the frequency propagation algorithm
2.1) relax 32bit limit
2.2) remove 4096 iteration count cap
2.3) remove the 'laplace rule of succession' which can be very harmful
2.4) Fix the bug in 'BranchProbabilityInfo::calcMetadataWeights'
that does capping without scaling
2.5) Fix a similar bug in 'MachineBranchProbabilityInfo::getSumForBlock('
.. other bugs found
3) introduce new metadata to represent function entry global
hotness (the implementation can choose to use the entry execution count)
4) implement frequency/profile update interfaces for inline
transformations -- i.e., allowing a callee's profile info to be imported
into the caller, with a scaling factor computed from the callsite count
and the callee entry count
5) implement a mechanism to compute, record and use profile program
summary data (this will be discussed separately).
thanks,
David
No, we won't be using profile counts in the absolute sense, e.g. "if
(count == 100000)". The counts are used as global hotness relative to
others, and together with the program summary data.
>
> I have no objection to adding a mechanism for expressing an entry count. I
> am still very hesitant about the proposals with regards to redefining the
> current MD_prof.
For now, we won't touch MD_prof's definition.
>
> I'd encourage you to post a patch for the entry count mechanism, but not tie
> its semantics to exact execution count. (Something like "the value provided
> must correctly describe the relative hotness of this routine against others
> in the program annoatated with the same metadata. It is the relative
> scaling that is important, not the absolute value. In particular, the value
> need not be an exact execution count.")
>
Defining the entry count in a more flexible way is fine -- as long as the
implementation can choose to use the execution count :)
thanks,
David
I have not seen any consensus on this. I’m not at all convinced this would be a good idea. From what I’ve seen, using LaPlace’s rule avoids really bad behavior in cases where the counts are very small and has almost no impact when the counts are large.
Duncan and I chatted briefly about this on IRC. I'm going to
experiment with only applying laplace smoothing if any scaled weights
are 0. AFAIU, usage of this rule is really just trying to avoid having
downstream users deal with 0 weights. Duncan, please whack me with a
clue stick if I totally misrepresented our conclusions.
Zero is the extreme case, but it’s also important for other small counts. I’d like to see some specific examples of why you think the current behavior is harmful.
> On Mar 24, 2015, at 10:27 AM, Xinliang David Li <dav...@google.com> wrote:
>
> Diego and I have discussed this according to the feedback received. We
> have revised plan for this (see Diego's last reply). Here is a more
> detailed re-cap:
>
> 1) keep MD_prof definition as it is today; also keep using the
> frequency propagation as it is (assuming programs with irreducible
> loops are not common and not important. If it turns out to be
> otherwise, we will revisit this).
> 2) fix all problems that lead to wrong 'frequency/count' computed from
> the frequency propagation algorithm
> 2.1) relax 32bit limit
> 2.2) remove 4096 iteration count cap
> 2.3) remove the 'laplace rule of succession' which can be very harmful
I have not seen any consensus on this. I’m not at all convinced this would be a good idea. From what I’ve seen, using LaPlace’s rule avoids really bad behavior in cases where the counts are very small and has almost no impact when the counts are large.
On Tue, Mar 24, 2015 at 11:29 AM, Chandler Carruth <chan...@google.com> wrote:
> Sorry I haven't responded earlier, but one point here still doesn't make sense to me: I still don't understand why this is important or useful.... Maybe I'm just missing something. Given the current meaning of MD_prof, it seems like the result of limiting this to 32-bits is that the maximum relative ratio of probabilities between two successors of a basic block with N successors is (2 billion / N):1 -- what is the circumstance that makes this resolution insufficient? It also doesn't seem *bad* per-se, I just don't see what it improves, and it does cost memory...

Right -- there is some ambiguity here. Removing the limit would be needed if we were to change MD_prof's definition to represent branch counts. However, with the new plan, the removal of the limit only applies to the planned function entry count representation.
Capping also leads to other kinds of problems -- e.g., the sum of the incoming edge counts (in the callgraph) does not match the callee entry count, etc.
The design is basically to augment the existing frequency data with one 64-bit datum, which is the global hotness of the function entry (e.g. the entry execution count). With the execution count available, the BB count (or global hotness if you will) is simply:

  count(BB) = freq(BB) * count(ENTRY) / freq(ENTRY)

You can view count(ENTRY) as an extension to the current 'hot'/'cold' attribute.
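The count(BB) formula is simple enough to check numerically. An illustrative sketch with made-up numbers:

```python
# count(BB) = freq(BB) * count(ENTRY) / freq(ENTRY):
# a block's global count follows from its frequency plus the function's
# 64-bit entry count.

def block_count(freq_bb, entry_count, entry_freq):
    return freq_bb * entry_count / entry_freq

# Entry executes 1000 times; a loop body has 50x the entry frequency
# (entry frequency 8, loop body frequency 400):
assert block_count(freq_bb=400, entry_count=1000, entry_freq=8) == 50000
```
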
Note that for IPA, callsite count is obtained from enclosing BB's count.
Example. Assuming the cap is 'C':

void bar() {
  // ENTRY count is 4*C; after capping it becomes 'C'
  ...
}
void test() {
  bar();  // BB1: count(BB1) = C
  bar();  // BB2: count(BB2) = C
}
void test2() {
  bar();  // BB3: count(BB3) = C
  bar();  // BB4: count(BB4) = C
}

What would the inliner see here? When it sees callsite 1, it might mistakenly think that it is the only dominating callsite to 'bar'.
On Tue, Mar 24, 2015 at 1:03 PM, Xinliang David Li <xinli...@gmail.com> wrote:
> The design is basically to augment the existing frequency data with one 64-bit datum, which is the global hotness of the function entry (e.g. the entry execution count). With the execution count available, the BB count (or global hotness if you will) is simply: count(BB) = freq(BB) * count(ENTRY) / freq(ENTRY). You can view count(ENTRY) as an extension to the current 'hot'/'cold' attribute.

Yes, this is what I'm saying the current design was aiming for. Is this what you're now suggesting, or are you suggesting something different?
Note that for IPA, callsite count is obtained from enclosing BB's count.
Yes, there should clearly be a relationship here.
One way I would like to see this tested is to look at the relative error at each level -- if we compute the global "count" as you describe for a callsite's BB, and aggregate it with all other callsite BBs, we should get something close to function's global value. If we get wildly different results, there is probably some flaw in the algorithm used.
On Tue, Mar 24, 2015 at 1:08 PM, Xinliang David Li <xinli...@gmail.com> wrote:
> Example. Assuming the cap is 'C': in bar(), the ENTRY count is 4*C, which after capping becomes 'C'. Each of the two callsites to bar() in test() then has count C, as do the two callsites in test2(). What would the inliner see here? When it sees callsite 1, it might mistakenly think that it is the only dominating callsite to 'bar'.
I don't understand, why would it assume this?
I was suggesting that each function would need to be associated with some global weight, and the global weight ratio between test and bar should provide the needed information here?
Thanks for working on this!
I remain skeptical. If the loop only ran once in your profile run, how much confidence do you have in the trip count? The LaPlace rule will tend to make the compiler more conservative in exactly the cases where there is insufficient information to be confident about what to do.
> 2) result in bad inlining decisions. For instance:
> for (...)
> bar(); // (1)
>
> where (1) is the only callsite to bar(). Using the rule, BB count
> enclosing the call to bar() can be as low as half of the entry count
> of bar(). Inliner will get confused and think there are more hot
> callsites to 'bar' and make suboptimal decisions ..
>
> Also if bar has calls to other functions, those callsites will look
> hotter than the call to 'bar' …
Your own proposal for recording entry counts is to record "relative hotness", not absolute profile counts. On the caller's side, we've got a branch weight influenced by LaPlace's rule that is then used to compute BlockFrequency, and you're concerned about a mismatch between that and the "relative hotness" recorded for the callee??
>
> The attached are two cases as well as the frequency graph computed
> today (with the laplace rule) and the correct frequency expected.
I’d be a lot more interested to see a real-world example.
Training runs are expensive, so it is common for PGO users to reduce
the workload as much as possible without sacrificing its
representativeness.
Workload 1: loop run once, iterating 100 times in total
workload 2: loop run twice, iterating 200 times in total
Which workload will user prefer to use?
Without the rule, the two workloads at least produce consistent
profile data. With the Laplace rule, you get 50 in one case and 66
in the other.
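The inconsistency can be reproduced with a small sketch. The exact smoothing formula below is an assumption chosen to match the 50/66 figures in the message, not the actual CodeGenPGO code:

```python
# True average trip count is 100 in both workloads, but smoothing the
# loop-entry weight distorts the answer differently depending on how
# many times the loop was entered.

def smoothed_trip_count(total_iterations, loop_entries):
    # Assumed smoothing (illustrative): the entry weight is incremented
    # by one, deflating the average trip count.
    return total_iterations / (loop_entries + 1)

# Workload 1: loop entered once, 100 iterations total.
# Workload 2: loop entered twice, 200 iterations total.
assert smoothed_trip_count(100, 1) == 50.0
assert int(smoothed_trip_count(200, 2)) == 66

# The unfuzzed data is consistent across both workloads:
assert 100 / 1 == 200 / 2 == 100
```
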
Having some technology to improve confidence in the profile data is
fine, but I don't see
1) how the Laplace rule is good for that, or
2) why this cannot be done on the consumer side (i.e., faithfully
record the profile data).
>
>> 2) result in bad inlining decisions. For instance:
>> for (...)
>> bar(); // (1)
>>
>> where (1) is the only callsite to bar(). Using the rule, BB count
>> enclosing the call to bar() can be as low as half of the entry count
>> of bar(). Inliner will get confused and think there are more hot
>> callsites to 'bar' and make suboptimal decisions ..
>>
>> Also if bar has calls to other functions, those callsites will look
>> hotter than the call to 'bar' …
>
> Your own proposal for recording entry counts is to record “relative hotness”, not absolute profile counts.
The proposal is to record 'global hotness' that can be used to compare
relative hotness across procedure boundaries (e.g., callsites in
different callers). Profile counts satisfy this condition.
>On the caller’s side, we’ve got a branch weight influenced by LaPlace’s rule that is then used to compute BlockFrequency and you’re concerned about a mismatch between that the “relative hotness” recorded for the callee??
Basically, say the caller is test():
bar() {
  // ENTRY count = 100 (from profile data)
  // ENTRY freq = 1
  // BB2: Freq(BB2) = 1, count = 100
  foo();  // (2)
}
test() {
  // ENTRY count = 1 (from profile data)
  // ENTRY freq = 1
  for (i = 0; i < 100; i++) {
    // BB1: Freq(BB1) = 50 due to Laplace rule
    bar();  // Freq = 50, count = 50  (1)
  }
}
With the Laplace rule, the block frequency computed for bar's
enclosing BB will be wrong; as a result, that BB's count will be wrong
too: 50*1/1 = 50.
The global hotness of callsites (1) and (2) should be the same, but it
is distorted when the Laplace rule is used.
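To make the distortion concrete, here is the count recomputation from the example above, sketched as a hypothetical helper (the real computation lives in LLVM's block frequency machinery, not in a function like this):

```python
def recomputed_count(entry_count, bb_freq, entry_freq=1.0):
    # count(BB) = entry_count * freq(BB) / freq(entry)
    return entry_count * bb_freq / entry_freq

# test(): entry count 1, Laplace-distorted loop-body frequency 50
print(recomputed_count(1, 50.0))    # 50.0 at callsite (1), should be 100
# bar(): recorded entry count 100, callsite (2) at frequency 1
print(recomputed_count(100, 1.0))   # 100.0 at callsite (2)
```

Callsite (1) is the sole caller of bar(), yet its recomputed count is only half of bar's recorded entry count of 100.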
Yes, we care about using PGO across routine boundaries for IPO.
>
>>
>> The attached are two cases as well as the frequency graph computed
>> today (with the laplace rule) and the correct frequency expected.
>
> I’d be a lot more interested to see a real-world example.
See my reply above. On the other hand, I'd like to see examples where
the Laplace rule actually improves profile data quality.
thanks,
David
> Which workload is better? I don’t at all trust users to get this right, at
> least for real, non-benchmark code.
We do have a lot of users (real-world apps) who can get this right --
I am not joking ;)
>
>
> Without the rule, the two workload at least produces consistent
> profile data. With the Laplace rule, you get 50 in one case, and 66
> in the other.
>
>
> Yes, but you’ve got more information in one case than the other. This is a
> feature IMO, not a bug. It’s entirely possible that with workload 2, the
> loop may have executed for a drastically different number of iterations. The
> fact that it did not, i.e., that it was consistent with workload 1, is more
> information that you did not have before. It makes sense for the compiler to
> be more aggressive when it has more data.
But the decision made by the compiler is arbitrary and not necessarily
correct. For instance, the single run used in training may actually
have executed far fewer iterations than average; with the Laplace
rule, the iteration count becomes even smaller. My point is that the
compiler has no way to tell how good the data is, nor is it in a good
position to make that judgement. By doing so, users who carefully
prune their workload to reduce runtime get punished for no reason.
>
>
> Having some technology to improve confidence of the profile data is
> fine, but I don't see
> 1) how laplace rule is good for it
> What do you not understand about it? As the counts get larger, LaPlace’s
> rule fades into the noise. It only makes a difference for cases where some
> of the counts are *very* small, and in those cases, it very simply adjust
> the weights to make optimizations less aggressive.
Strictly speaking, in a loop context it just makes optimizations
assume shorter trip counts.
>
> 2) why this can not be done in the consumer side (i.e., faithfully
> record the profile data).
>
>
> What does this have to do with how faithfully the profile is recorded? We’ve
> got fully accurate data, but if the profiling inputs are too small or not
> representative, you may still get poor optimization choices.
The point is that there is no need to adjust the weights. It is very
easy to check the loop header's profile count to determine how much
confidence to give it (possibly controlled with a flag). Control at
that level is more fine-grained than blindly changing the weights
right after reading the profile data.
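A consumer-side confidence check along these lines might look like the following sketch (hypothetical helper name and threshold, chosen purely for illustration):

```python
def trusted_trip_count(header_count, preheader_count, min_samples=16):
    """Return the profiled average trip count, or None when the loop
    was executed too few times for the estimate to be trusted."""
    if preheader_count == 0 or header_count < min_samples:
        return None  # insufficient data: fall back to static heuristics
    return header_count / preheader_count

print(trusted_trip_count(100, 1))  # 100.0 -> trust the single hot run
print(trusted_trip_count(3, 1))    # None  -> too few samples
```

The raw counts stay faithful; each consumer decides how much confidence the data deserves, instead of the producer smoothing the weights for everyone.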
I am not sure what you mean by 'hypothetical comparison of block
frequencies vs entry counts', but it does not seem to be what I mean.
What I mean is that:
1) We need a metric to represent global hotness. Profile (execution)
counts fit the bill.
2) There are two ways to compute profile counts for BBs:
a) Compute them directly from the edge counts recorded in the profile
data (BB frequency can then be scaled directly from them), but this
change requires slightly changing MD_prof's meaning or introducing
MD_count to record edge counts without capping/scaling.
b) Record only the entry profile count (a minimal change) and do not
change MD_prof. This approach reuses block frequency propagation, but
the latter relies on unaltered branch probabilities/weights in order
to recompute the counts precisely (combined with the entry count).
Since people have concerns about a), we chose b). For b), I merely
pointed out in the example above that with the Laplace rule, the
recomputed profile count at the only callsite of 'bar' can differ
greatly from bar's recorded entry profile count. The profile
distribution across incoming callsites can be a good signal for
inlining decisions, so such a difference will be bad.
>
> I don’t see how you can argue that LaPlace’s rule is bad because it could
> affect an apples vs. oranges comparison of something that does not even
> exist yet.
>
Of course, PGO support for IPA is exactly the missing (and very
important) piece we plan to add -- if it already existed, there would
be no problem.
thanks,
David
>
>
>
> The attached are two cases as well as the frequency graph computed
> today (with the laplace rule) and the correct frequency expected.
>
>
> I’d be a lot more interested to see a real-world example.
>
>
> See my reply above. On the other hand, I'd like to see examples where
> LaPlace Rule can actually help improve the profile data quality.
>
>
> It’s not about improving the results — it’s about preventing clang from
> being overly aggressive about optimizing based on limited profile data.
Also, do you generate GCOV files, some YAML, or is this a separate format?
And finally, in the meeting you described how you assign counters to
the blocks -- an algorithm to minimize the number of insertions. Is
that algorithm a well-known one or a custom one? Is it described
somewhere?
Yes, that is the approach we’re using.
>
> Also, do you generate GCOV files, some yaml, or is this a separate format?
It is a separate format. The code for reading/writing profile data is in compiler-rt/lib/profile/.
There is also a separate format for mapping the profile data back to source locations for code coverage testing. Details here: http://www.llvm.org/docs/CoverageMappingFormat.html
>
> And finally, in the meeting you described how you assign counters to
> the blocks -- an algorithm to minimize the number of insertions. Is
> that algorithm a well-known one or a custom one? Is it described
> somewhere?
It is a custom approach. I don’t think we have a written description, but the code is pretty straightforward. Look at ComputeRegionCounts in clang’s lib/CodeGen/CodeGenPGO.cpp source file.
We are still collecting more data, but since the question was asked, I
will share our plan now.
Instrumentation in the Clang AST is too slow for large C++
applications (the data and explanation will be sent out in a different
thread). To solve that problem, we have introduced a 'late'
instrumentation pass that is done in LLVM. The late instrumentation
uses the exact same runtime support as the current frontend-based
instrumentation, but assigns edge counters based on minimal spanning
trees. We have seen very promising results. Note that this is not
intended to replace the current frontend-based instrumentation, but to
be an alternative under an option.
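The spanning-tree idea is a classic edge-profiling technique often attributed to Knuth: instrument only the edges outside a spanning tree of the CFG, then recover the uninstrumented counts by flow conservation at each block. A sketch of the recovery step (hypothetical data layout, not the actual LLVM pass):

```python
def recover_counts(nodes, edges, tree_edges, measured):
    """edges: list of (src, dst); tree_edges: spanning-tree subset that
    was NOT instrumented; measured: {edge: count} for instrumented edges.
    Recover every edge count via flow conservation at each block."""
    counts = dict(measured)
    unknown = [e for e in edges if e in tree_edges]
    while unknown:
        progress = False
        for n in nodes:
            ins = [e for e in edges if e[1] == n]
            outs = [e for e in edges if e[0] == n]
            unk = [e for e in ins + outs if e not in counts]
            if len(unk) != 1:
                continue  # need exactly one unknown to solve at this node
            e = unk[0]
            known_in = sum(counts[x] for x in ins if x in counts)
            known_out = sum(counts[x] for x in outs if x in counts)
            counts[e] = known_out - known_in if e in ins else known_in - known_out
            unknown.remove(e)
            progress = True
        assert progress, "tree_edges must form a spanning tree"
    return counts

# Diamond CFG plus a virtual exit->entry edge so flow balances everywhere.
nodes = ["entry", "A", "B", "exit"]
edges = [("entry", "A"), ("entry", "B"), ("A", "exit"),
         ("B", "exit"), ("exit", "entry")]
tree = [("entry", "A"), ("entry", "B"), ("A", "exit")]   # uninstrumented
measured = {("B", "exit"): 30, ("exit", "entry"): 100}   # instrumented
print(recover_counts(nodes, edges, tree, measured))
# tree-edge counts recovered: entry->A = 70, entry->B = 30, A->exit = 70
```

Only two counters are needed here instead of four, which is the source of the instrumentation-overhead savings.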
thanks,
David
Bob, was GCOV skipped for any specific reason?
David, when you say you have a late instrumentation pass, is that done
at the IR level, which is already optimized and therefore changed? How
are you mapping back to the source code?
Yes, at the IR level, but only after some early optimizations.
> How are you mapping back to the source code?
It uses CFG-based matching. It is not used for code coverage purposes,
so source mapping is less important.
David
Yes, basically. There are two ways to do PGO with clang: using
sample-based profiling or instrumentation-based profiling. The on-disk
formats for these two types of profiling have nothing to do with each
other, and the instrumentation-based profiling is also useful for
coverage.
> In which case I would recommend to add a note to the "Profiling with
> Instrumentation" section as well, to state that the format produced by
> "llvm-profdata merge" is not the same as the one detailed just above that
> section.
It sounds to me like this documentation just needs some cleanup to be
clearer up front about what's going on. If nobody else gets to it
first, I'll take a shot at improving the situation when I have a
chance.