[LLVMdev] RFC - Improvements to PGO profile support


Diego Novillo

Feb 24, 2015, 6:33:48 PM
to LLVM Developers Mailing List, Xinliang David Li, Chandler Carruth, Teresa Johnson

We (Google) have started to look more closely at the profiling infrastructure in LLVM. Internally, we have a large dependency on PGO to get peak performance in generated code.

Some of the dependencies we have on profiling are still not present in LLVM (e.g., the inliner), but we will still need to incorporate changes to support our work on these optimizations. Some of the changes may be addressed as individual bug fixes on the existing profiling infrastructure. Other changes may be better implemented as either new extensions or as replacements of existing code.

I think we will try to minimize infrastructure replacement at least in the short/medium term. After all, it doesn't make too much sense to replace infrastructure that is broken for code that doesn't exist yet.

David Li and I are preparing a document where we describe the major issues that we'd like to address. The document is a bit on the lengthy side, so it may be easier to start with an email discussion. This is a summary of the main changes we are looking at:
  1. Need to faithfully represent the execution count taken from dynamic profiles. Currently, MD_prof does not really represent an execution count. This makes things like comparing hotness across functions hard or impossible. We need a concept of global hotness.
  2. When the CFG or callgraph change, there needs to exist an API for incrementally updating/scaling counts. For instance, when a function is inlined or partially inlined, when the CFG is modified, etc. These counts need to be updated incrementally (or perhaps re-computed as a first step in that direction).
  3. The inliner (and other optimizations) needs to use profile information and update it accordingly. This is predicated on Chandler's work on the pass manager, of course.
  4. Need to represent global profile summary data. For example, for global hotness determination, it is useful to compute additional global summary info, such as a histogram of counts that can be used to determine hotness and working set size estimates for a large percentage of the profiled execution.
There are other changes that we will need to incorporate. David, Teresa, Chandler, please add anything large that I missed.

My main question at the moment is what would be the best way of addressing them. Some seem to require new concepts to be implemented (e.g., execution counts). Others could be addressed as simple bugs to be fixed in the current framework.

Would it make sense to present everything in a unified document and discuss that? I've got some reservations about that approach because we will end up discussing everything at once and it may not lead to concrete progress. Another approach would be to present each issue individually either as patches or RFCs or bugs.

I will be taking on the implementation of several of these issues. Some of them involve the SamplePGO harness that I added last year. I would also like to know what other bugs or problems people have in mind that I could also roll into this work.


Thanks. Diego.

Xinliang David Li

Feb 24, 2015, 6:51:06 PM
to Diego Novillo, LLVM Developers Mailing List
I don't mind whatever ways we take to upstream -- all these are
generally good problems to solve. I expect discussions more on
approaches to tackle the problem, not the problems themselves.

David

Justin Bogner

Feb 24, 2015, 8:09:07 PM
to Diego Novillo, Xinliang David Li, LLVM Developers Mailing List
Diego Novillo <dnov...@google.com> writes:
> David Li and I are preparing a document where we describe the major issues
> that we'd like to address. The document is a bit on the lengthy side, so it
> may be easier to start with an email discussion. This is a summary of the main
> changes we are looking at:
>
> 1. Need to faithfully represent the execution count taken from dynamic
> profiles. Currently, MD_prof does not really represent an execution count.
> This makes things like comparing hotness across functions hard or
> impossible. We need a concept of global hotness.
> 2. When the CFG or callgraph change, there need to exist an API for
> incrementally updating/scaling counts. For instance, when a function is
> inlined or partially inlined, when the CFG is modified, etc. These counts
> need to be updated incrementally (or perhaps re-computed as a first step
> into that direction).
> 3. The inliner (and other optimizations) needs to use profile information and
> update it accordingly. This is predicated on Chandler's work on the pass
> manager, of course.
> 4. Need to represent global profile summary data. For example, for global
> hotness determination, it is useful to compute additional global summary
> info, such as a histogram of counts that can be used to determine hotness
> and working set size estimates for a large percentage of the profiled
> execution.

Great! Looking forward to hearing more.

> My main question at the moment is what would be the best way of addressing
> them. Some seem to require new concepts to be implemented (e.g., execution
> counts). Others could be addressed as simple bugs to be fixed in the current
> framework.
>
> Would it make sense to present everything in a unified document and discuss
> that? I've got some reservations about that approach because we will end up
> discussing everything at once and it may not lead to concrete progress.
> Another approach would be to present each issue individually either as patches
> or RFCs or bugs.

While a unified document is likely to lead to an unfocused discussion,
talking about the problems piecemeal will make it harder to think about
the big picture. I suspect that an overview of the issues and some brief
notes on how you're thinking of approaching each will be useful. We can
break out separate discussions from there on any points that are
contentious or otherwise need to be discussed in further detail.

Philip Reames

Feb 25, 2015, 1:54:36 PM
to Diego Novillo, LLVM Developers Mailing List, Xinliang David Li, Chandler Carruth, Teresa Johnson
On 02/24/2015 03:31 PM, Diego Novillo wrote:

We (Google) have started to look more closely at the profiling infrastructure in LLVM. Internally, we have a large dependency on PGO to get peak performance in generated code.

Some of the dependencies we have on profiling are still not present in LLVM (e.g., the inliner), but we will still need to incorporate changes to support our work on these optimizations. Some of the changes may be addressed as individual bug fixes on the existing profiling infrastructure. Other changes may be better implemented as either new extensions or as replacements of existing code.

I think we will try to minimize infrastructure replacement at least in the short/medium term. After all, it doesn't make too much sense to replace infrastructure that is broken for code that doesn't exist yet.

David Li and I are preparing a document where we describe the major issues that we'd like to address. The document is a bit on the lengthy side, so it may be easier to start with an email discussion.
I would personally be interested in seeing a copy of that document, but it might be more appropriate for a blog post than a discussion on llvm-dev.  I worry that we'd end up with a very unfocused discussion.  It might be better to frame this as your plan of attack and reserve discussion on llvm-dev for things that are being proposed semi near term.  Just my 2 cents.


This is a summary of the main changes we are looking at:
  1. Need to faithfully represent the execution count taken from dynamic profiles. Currently, MD_prof does not really represent an execution count. This makes things like comparing hotness across functions hard or impossible. We need a concept of global hotness.
What does MD_prof actually represent when used from Clang?  I know I've been using it for execution counters in my frontend.  Am I approaching that wrong?

As a side comment: I'm a bit leery of the notion of consistent hotness based on counters across functions.  These counters are almost always approximate in practice and counting problems run rampant.  I'd almost rather see a consistent count inferred from data that's assumed to be questionable than make the frontend try to generate consistent profiling metadata.  I think either approach could be made to work; we just need to think about it carefully.
  2. When the CFG or callgraph change, there needs to exist an API for incrementally updating/scaling counts. For instance, when a function is inlined or partially inlined, when the CFG is modified, etc. These counts need to be updated incrementally (or perhaps re-computed as a first step in that direction).
Agreed.  Do you have a sense of how much of an issue this is in practice?  I haven't seen it kick in much, but it's also not something I've been looking for.
  3. The inliner (and other optimizations) needs to use profile information and update it accordingly. This is predicated on Chandler's work on the pass manager, of course.
It's worth noting that the inliner work can be done independently of the pass manager work.  We can always explicitly recompute relevant analyses in the inliner if needed.  This will cost compile time, so we might need to make this an off-by-default option.  (Maybe -O3 only?)  Being able to work on the inliner independently of the pass management structure is valuable enough that we should probably consider doing this.

PGO inlining is an area I'm very interested in.  I'd really encourage you to work incrementally in tree.  I'm likely to start putting non-trivial amounts of time into this topic in the next few weeks.  I just need to clear a few things off my plate first. 

Other than the inliner, can you list the passes you think are profitable to teach about profiling data?  My list so far is: PRE (particularly of loads!), the vectorizer (i.e. duplicate work down both a hot and cold path when it can be vectorized on the hot path), LoopUnswitch, IRCE, & LoopUnroll (avoiding code size explosion in cold code).  I'm much more interested in sources of improved performance than in simple code size reduction.  (Reducing code size can improve performance of course.)

  4. Need to represent global profile summary data. For example, for global hotness determination, it is useful to compute additional global summary info, such as a histogram of counts that can be used to determine hotness and working set size estimates for a large percentage of the profiled execution.
Er, not clear what you're trying to say here?

There are other changes that we will need to incorporate. David, Teresa, Chandler, please add anything large that I missed.

My main question at the moment is what would be the best way of addressing them. Some seem to require new concepts to be implemented (e.g., execution counts). Others could be addressed as simple bugs to be fixed in the current framework.

Would it make sense to present everything in a unified document and discuss that? I've got some reservations about that approach because we will end up discussing everything at once and it may not lead to concrete progress. Another approach would be to present each issue individually either as patches or RFCs or bugs.
See above. 

I will be taking on the implementation of several of these issues. Some of them involve the SamplePGO harness that I added last year. I would also like to know what other bugs or problems people have in mind that I could also roll into this work.


Thanks. Diego.


Teresa Johnson

Feb 25, 2015, 3:16:07 PM
to Philip Reames, Xinliang David Li, LLVM Developers Mailing List
Also, code layout (bb layout, function layout, function splitting).

>
> Need to represent global profile summary data. For example, for global
> hotness determination, it is useful to compute additional global summary
> info, such as a histogram of counts that can be used to determine hotness
> and working set size estimates for a large percentage of the profiled
> execution.
>
> Er, not clear what you're trying to say here?

The idea is to get a sense of a good global profile count threshold to
use given an application's profile, i.e. when determining whether a
profile count is hot in the given profile. For example, what is the
minimum profile count contributing to the hottest 99% of the
application's profile.
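
To make that concrete, here is a minimal sketch of the summary
computation (illustrative only -- the helper name and representation
are made up, not an existing LLVM API):

#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Smallest count still inside the hottest `Percentile` fraction of the
// total profiled execution (e.g. 0.99 for the hottest 99%).
static uint64_t hotCountThreshold(std::vector<uint64_t> Counts,
                                  double Percentile) {
  std::sort(Counts.begin(), Counts.end(), std::greater<uint64_t>());
  uint64_t Total = 0;
  for (uint64_t C : Counts)
    Total += C;
  uint64_t Accum = 0;
  for (uint64_t C : Counts) {
    Accum += C;
    if (Accum >= Percentile * Total)
      return C; // minimum count contributing to the hottest Percentile
  }
  return 0;
}

A profile count at or above hotCountThreshold(AllCounts, 0.99) would
then be considered hot for that application's profile.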

Teresa

>
> There are other changes that we will need to incorporate. David, Teresa,
> Chandler, please add anything large that I missed.
>
> My main question at the moment is what would be the best way of addressing
> them. Some seem to require new concepts to be implemented (e.g., execution
> counts). Others could be addressed as simple bugs to be fixed in the current
> framework.
>
> Would it make sense to present everything in a unified document and discuss
> that? I've got some reservations about that approach because we will end up
> discussing everything at once and it may not lead to concrete progress.
> Another approach would be to present each issue individually either as
> patches or RFCs or bugs.
>
> See above.
>
>
> I will be taking on the implementation of several of these issues. Some of
> them involve the SamplePGO harness that I added last year. I would also like
> to know what other bugs or problems people have in mind that I could also
> roll into this work.
>
>
> Thanks. Diego.
>
>
>
>



--
Teresa Johnson | Software Engineer | tejo...@google.com | 408-460-2413

Xinliang David Li

Feb 25, 2015, 3:49:57 PM
to Philip Reames, LLVM Developers Mailing List
On Wed, Feb 25, 2015 at 10:52 AM, Philip Reames
<list...@philipreames.com> wrote:
Having representative training runs is a prerequisite for using FDO/PGO.

> I'd
> almost rather see a consistent count inferred from data that's assumed to be
> questionable than
>make the frontend try to generate consistent profiling
> metadata.

Frontend does not generate profile data -- it is just a messenger that
should pass the data faithfully to the middle end. That messenger
(profile reader) can be in middle end too.
PGO is very effective in code size reduction. In reality, a large
percentage of functions are globally cold.

David

Philip Reames

Feb 25, 2015, 5:06:12 PM
to Teresa Johnson, Xinliang David Li, LLVM Developers Mailing List

On 02/25/2015 12:11 PM, Teresa Johnson wrote:
> On Wed, Feb 25, 2015 at 10:52 AM, Philip Reames
> <list...@philipreames.com> wrote:
>> Other than the inliner, can you list the passes you think are profitable to
>> teach about profiling data? My list so far is: PRE (particularly of
>> loads!), the vectorizer (i.e. duplicate work down both a hot and cold path
>> when it can be vectorized on the hot path), LoopUnswitch, IRCE, & LoopUnroll
>> (avoiding code size explosion in cold code). I'm much more interested in
>> sources of improved performance than I am simply code size reduction.
>> (Reducing code size can improve performance of course.)
> Also, code layout (bb layout, function layout, function splitting).
Correct me if I'm wrong, but we already have "function layout" (i.e.
basic block placement). It may need improvement, but I've found it to be
reasonably effective.

What do you mean by "bb layout"?

I'm assuming you're referring to a form of outlining as "function
splitting". This seems like a clearly good idea.
>
>> Need to represent global profile summary data. For example, for global
>> hotness determination, it is useful to compute additional global summary
>> info, such as a histogram of counts that can be used to determine hotness
>> and working set size estimates for a large percentage of the profiled
>> execution.
>>
>> Er, not clear what you're trying to say here?
> The idea is to get a sense of a good global profile count threshold to
> use given an application's profile, i.e. when determining whether a
> profile count is hot in the given profile. For example, what is the
> minimum profile count contributing to the hottest 99% of the
> application's profile.
Ah, got it. That makes sense for a C/C++ use case.

In my use case, we're only compiling the hottest methods. As such, I
care mostly about relative hotness within a method or a small set of
methods.

Philip Reames

Feb 25, 2015, 5:19:41 PM
to Xinliang David Li, LLVM Developers Mailing List

On 02/25/2015 12:40 PM, Xinliang David Li wrote:
> On Wed, Feb 25, 2015 at 10:52 AM, Philip Reames
> <list...@philipreames.com> wrote:
>> On 02/24/2015 03:31 PM, Diego Novillo wrote:
>>
>> Need to faithfully represent the execution count taken from dynamic
>> profiles. Currently, MD_prof does not really represent an execution count.
>> This makes things like comparing hotness across functions hard or
>> impossible. We need a concept of global hotness.
>>
>> What does MD_prof actually represent when used from Clang? I know I've been
>> using it for execution counters in my frontend. Am I approaching that
>> wrong?
>>
>> As a side comment: I'm a bit leery of the notion of a consistent notion of
>> hotness based on counters across functions. These counters are almost
>> always approximate in practice and counting problems run rampant.
> Having representative training runs is pre-requisite for using FDO/PGO.
Representativeness is not the issue I'm raising. Profiling systems
(particularly instrumentation-based ones) have systemic biases. Not
accounting for that can lead to some very odd results. As an example:
void foo() {
  if (member)
    for (int i = 0; i < 100000; i++)
      if (member2)
        bar();
}

With multiple threads in play, it's entirely possible that the sum of
the absolute weights on the second branch is lower than the sum of the
absolute counts on the first branch (i.e. due to racy updating). While
you can avoid this by using race-free updates, I know of very few
systems that actually do.
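
As an illustration, here is a sketch of the two counter-update styles
(this is not LLVM's actual instrumentation code):

#include <atomic>
#include <cstdint>

uint64_t Counter; // plain profile counter, the common fast form

void increment() {
  Counter = Counter + 1; // racy load/add/store: concurrent increments
                         // can be lost
}

std::atomic<uint64_t> AtomicCounter; // race-free, but slower per update

void incrementAtomic() {
  AtomicCounter.fetch_add(1, std::memory_order_relaxed);
}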

If your optimization is radically unstable in such scenarios, that's a
serious problem. Pessimization is bad enough (if tolerable); incorrect
transforms are not. It's very easy to write a transform that implicitly
assumes the counts for the first branch must be less than the counts for
the second.

>
>> I'd
>> almost rather see a consistent count inferred from data that's assumed to be
>> questionable than
>> make the frontend try to generate consistent profiling
>> metadata.
> Frontend does not generate profile data -- it is just a messenger that
> should pass the data faithfully to the middle end. That messenger
> (profile reader) can be in middle end too.
Er, we may be arguing terminology here. I was including the profiling
system as part of the "frontend" - I'm working with a JIT - whereas
you're assuming a separate collection system. It doesn't actually
matter which terms we use. My point was that assuming clean profiling
data is just not reasonable in practice. At minimum, some type of
normalization step is required.
>
>> Other than the inliner, can you list the passes you think are profitable to
>> teach about profiling data? My list so far is: PRE (particularly of
>> loads!), the vectorizer (i.e. duplicate work down both a hot and cold path
>> when it can be vectorized on the hot path), LoopUnswitch, IRCE, & LoopUnroll
>> (avoiding code size explosion in cold code). I'm much more interested in
>> sources of improved performance than I am simply code size reduction.
>> (Reducing code size can improve performance of course.)
> PGO is very effective in code size reduction. In reality, large
> percentage of functions are globally cold.
For a traditional C++ application, yes. For a JIT which is only
compiling warm code paths in hot methods, not so much. It's still
helpful, but the impact is much smaller.

Philip

Teresa Johnson

Feb 25, 2015, 5:22:56 PM
to Philip Reames, Xinliang David Li, LLVM Developers Mailing List
On Wed, Feb 25, 2015 at 2:03 PM, Philip Reames <list...@philipreames.com> wrote:

On 02/25/2015 12:11 PM, Teresa Johnson wrote:
On Wed, Feb 25, 2015 at 10:52 AM, Philip Reames
<list...@philipreames.com> wrote:
Other than the inliner, can you list the passes you think are profitable to
teach about profiling data?  My list so far is: PRE (particularly of
loads!), the vectorizer (i.e. duplicate work down both a hot and cold path
when it can be vectorized on the hot path), LoopUnswitch, IRCE, & LoopUnroll
(avoiding code size explosion in cold code).  I'm much more interested in
sources of improved performance than I am simply code size reduction.
(Reducing code size can improve performance of course.)
Also, code layout (bb layout, function layout, function splitting).
Correct me if I'm wrong, but we already have "function layout" (i.e. basic block placement).  It may need improvement, but I've found it to be reasonably effective.

What do you mean by "bb layout"?

By bb layout I was referring to basic block placement - I am not overly familiar with LLVM's implementation, but I know that this typically benefits from profile information.

By function layout, I meant layout of functions within the module and then the executable. This could simply be marking/separating hot vs cold functions, or could be fancier via a linker plugin to use profile data to colocate functions with affinity.


I'm assuming you're referring to a form of outlining as "function splitting".  This seems like a clearly good idea.

Right.

Thanks,
Teresa
 


Need to represent global profile summary data. For example, for global
hotness determination, it is useful to compute additional global summary
info, such as a histogram of counts that can be used to determine hotness
and working set size estimates for a large percentage of the profiled
execution.

Er, not clear what you're trying to say here?
The idea is to get a sense of a good global profile count threshold to
use given an application's profile, i.e. when determining whether a
profile count is hot in the given profile. For example, what is the
minimum profile count contributing to the hottest 99% of the
application's profile.
Ah, got it.  That makes sense for a C/C++ use case.

In my use case, we're only compiling the hottest methods.  As such, I care mostly about relative hotness within a method or a small set of methods.

Xinliang David Li

Feb 25, 2015, 5:33:34 PM
to Philip Reames, LLVM Developers Mailing List
On Wed, Feb 25, 2015 at 2:14 PM, Philip Reames
Are you speculating, or do you have data to show it? We have large
programs that run with hundreds of threads; race conditions contribute
only very small count variations -- and there are ways to smooth out the
difference.

>
> If your optimization is radically unstable in such scenarios, that's a
> serious problem. Pessimization is bad enough (if tolerable), incorrect
> transforms are not.

This has never been our experience using PGO in the past. We also
have tools to compare profile consistency from one training run to
another.

If you experience such problems in real apps, can you file a bug?

> It's very easy to write a transform that implicitly
> assumes the counts for the first branch must be less than the counts for the
> second.

The compiler can detect an insane profile -- it can either ignore it,
correct it, or use it with warnings, depending on options.

>
>>
>>> I'd
>>> almost rather see a consistent count inferred from data that's assumed to
>>> be
>>> questionable than
>>> make the frontend try to generate consistent profiling
>>> metadata.
>>
>> Frontend does not generate profile data -- it is just a messenger that
>> should pass the data faithfully to the middle end. That messenger
>> (profile reader) can be in middle end too.
>
> Er, we may be arguing terminology here. I was including the profiling
> system as part of the "frontend" - I'm working with a JIT - whereas you're
> assuming a separate collection system. It doesn't actually matter which
> terms we use. My point was that assuming clean profiling data is just not
> reasonable in practice. At minimum, some type of normalization step is
> required.

If you are talking about making a slightly inconsistent profile flow
consistent, yes, there are mechanisms to do that.
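
The basic invariant those mechanisms enforce is simple flow
conservation. A sketch of the check, with made-up data structures:

#include <cstdint>
#include <vector>

struct Block {
  uint64_t Count;
  std::vector<uint64_t> IncomingEdgeCounts;
};

// In a flow-consistent profile, a block's count equals the sum of its
// incoming edge counts; a fixup pass would redistribute any difference.
static bool isFlowConsistent(const Block &B) {
  uint64_t Sum = 0;
  for (uint64_t E : B.IncomingEdgeCounts)
    Sum += E;
  return Sum == B.Count;
}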


David

Bob Wilson

Feb 26, 2015, 4:09:10 PM
to Diego Novillo, Xinliang David Li, LLVM Developers Mailing List
On Feb 24, 2015, at 3:31 PM, Diego Novillo <dnov...@google.com> wrote:


We (Google) have started to look more closely at the profiling infrastructure in LLVM. Internally, we have a large dependency on PGO to get peak performance in generated code.

Some of the dependencies we have on profiling are still not present in LLVM (e.g., the inliner), but we will still need to incorporate changes to support our work on these optimizations. Some of the changes may be addressed as individual bug fixes on the existing profiling infrastructure. Other changes may be better implemented as either new extensions or as replacements of existing code.

I think we will try to minimize infrastructure replacement at least in the short/medium term. After all, it doesn't make too much sense to replace infrastructure that is broken for code that doesn't exist yet.

David Li and I are preparing a document where we describe the major issues that we'd like to address. The document is a bit on the lengthy side, so it may be easier to start with an email discussion. This is a summary of the main changes we are looking at:
  1. Need to faithfully represent the execution count taken from dynamic profiles. Currently, MD_prof does not really represent an execution count. This makes things like comparing hotness across functions hard or impossible. We need a concept of global hotness.

The plan that we have discussed in the past (I don’t remember when) was to record simple function entry execution counts. Those could be combined with the BlockFrequencyInfo to compare “hotness” across functions.
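
As a sketch of that combination (hypothetical helper, not an existing
API), a block's global hotness would be the function entry count scaled
by the block's frequency relative to the entry block:

#include <cstdint>

// BlockFrequencyInfo frequencies are relative to the function entry, so
// an absolute block count is entry-count * (block freq / entry freq).
static uint64_t globalBlockCount(uint64_t FunctionEntryCount,
                                 uint64_t BlockFreq, uint64_t EntryFreq) {
  return EntryFreq ? FunctionEntryCount * BlockFreq / EntryFreq
                   : FunctionEntryCount;
}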
  2. When the CFG or callgraph change, there needs to exist an API for incrementally updating/scaling counts. For instance, when a function is inlined or partially inlined, when the CFG is modified, etc. These counts need to be updated incrementally (or perhaps re-computed as a first step in that direction).
One of the main reasons that we chose to use branch weights to represent profile information within functions is that it makes this problem easier. Of course, we still need to update the branch weights when transforming the CFG, but I believe most of that work has already been done. Are you suggesting that we should work on incremental BlockFrequencyInfo updates? We have discussed that in the past, but so far, it has worked reasonably well to just re-run that analysis. (I wouldn’t be surprised if we’re missing some places where the analysis needs to be invalidated so that it gets re-run.)
  3. The inliner (and other optimizations) needs to use profile information and update it accordingly. This is predicated on Chandler's work on the pass manager, of course.
  4. Need to represent global profile summary data. For example, for global hotness determination, it is useful to compute additional global summary info, such as a histogram of counts that can be used to determine hotness and working set size estimates for a large percentage of the profiled execution.
There are other changes that we will need to incorporate. David, Teresa, Chandler, please add anything large that I missed.

My main question at the moment is what would be the best way of addressing them. Some seem to require new concepts to be implemented (e.g., execution counts). Others could be addressed as simple bugs to be fixed in the current framework.

Would it make sense to present everything in a unified document and discuss that? I've got some reservations about that approach because we will end up discussing everything at once and it may not lead to concrete progress. Another approach would be to present each issue individually either as patches or RFCs or bugs.

I will be taking on the implementation of several of these issues. Some of them involve the SamplePGO harness that I added last year. I would also like to know what other bugs or problems people have in mind that I could also roll into this work.


Thanks. Diego.

Xinliang David Li

Feb 26, 2015, 4:44:17 PM
to Bob Wilson, LLVM Developers Mailing List

Yes -- there are two aspects of the problem:
1) the raw profile data representation in the IR, and
2) the profile count data represented for the CFG.

What you said is for 2), which is one of the possibilities. There is a
third issue that is also going to be covered in more detail -- namely,
that the block frequency propagation algorithm is limited (leading to
information loss). When profile counts are available, block frequency
data can be directly computed via simple normalization and scaling.
This requires the raw edge count data to be represented truthfully
in 1).


>
> When the CFG or callgraph change, there need to exist an API for
> incrementally updating/scaling counts. For instance, when a function is
> inlined or partially inlined, when the CFG is modified, etc. These counts
> need to be updated incrementally (or perhaps re-computed as a first step
> into that direction).
>
> One of the main reasons that we chose to use branch weights to represent
> profile information within functions is that it makes this problem easier.
> Of course, we still need to update the branch weights when transforming the
> CFG, but I believe most of that work has already been done. Are you
> suggesting that we should work on incremental BlockFrequencyInfo updates? We
> have discussed that in the past, but so far, it has worked reasonably well
> to just re-run that analysis. (I wouldn’t be surprised if we’re missing some
> places where the analysis needs to be invalidated so that it gets re-run.)

Diego is going to share the proposal in more detail. I will give a
brief summary to answer your question:

1) Making the raw profile data (in MD_prof) truthful does not change
its original meaning -- the branch count is still a branch weight -- but
it happens to also be an execution count, which carries more information.
2) At the CFG level (BranchProbabilityInfo), using real branch
probabilities does not change the way branch weights are used either --
the branch probability is still a branch weight (but normalized to
0 <= prob <= 1). Benefits and reasons behind that will be provided.
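
As a small illustration of 2) -- just the arithmetic, not the
BranchProbabilityInfo implementation:

#include <cstdint>
#include <vector>

// Normalize raw branch weights into probabilities in [0, 1].
static std::vector<double>
weightsToProbabilities(const std::vector<uint64_t> &Weights) {
  uint64_t Sum = 0;
  for (uint64_t W : Weights)
    Sum += W;
  std::vector<double> Probs;
  for (uint64_t W : Weights)
    Probs.push_back(Sum ? double(W) / double(Sum)
                        : 1.0 / double(Weights.size()));
  return Probs;
}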

The infrastructure is ready to update the raw MD_prof data during
intra-procedural transformations when branch instructions are cloned,
but not the CFG-level branch probability and block frequency/block count
updates. This is especially important for inter-procedural
transformations like the inliner (i.e., updating the profile data
associated with the inlined instance in the caller context). Recomputing
via frequency propagation is not only imprecise (as mentioned above),
but also very costly in compile time.

thanks,

David

Diego Novillo

Feb 26, 2015, 6:57:18 PM
to LLVM Developers Mailing List, Xinliang David Li, Chandler Carruth, Teresa Johnson
Folks,

I've created a few bugzilla issues with details of some of the things I'll be looking into. I'm not yet done wordsmithing the overall design document. I'll try to finish it by early next week at the latest.

In the meantime, these are the specific bugzilla issues I've opened:

I'm hoping the descriptions in the bugs make sense on their own. If not, please use the bugs to beat me with a clue stick and I'll clarify.


Thanks.  Diego.

Diego Novillo

Mar 2, 2015, 7:44:41 PM
to LLVM Developers Mailing List, Xinliang David Li, Chandler Carruth, Teresa Johnson
On Thu, Feb 26, 2015 at 6:54 PM, Diego Novillo <dnov...@google.com> wrote:

I've created a few bugzilla issues with details of some of the things I'll be looking into. I'm not yet done wordsmithing the overall design document. I'll try to finish it by early next week at the latest.

The document is available at
https://docs.google.com/document/d/15VNiD-TmHqqao_8P-ArIsWj1KdtU-ElLFaYPmZdrDMI/edit?usp=sharing

There are several topics covered. Ideally, I would prefer that we discuss each topic separately. The main ones I will start working on are the ones described in the bugzilla links we have in the doc.

This is just a starting point for us. I am not at all concerned with implementing exactly what is proposed in the document. In fact, if we can get the same value using the existing support, all the better.

OTOH, any other ideas that folks may have that work better than this are more than welcome. I don't have really strong opinions on the matter. I am fine with whatever works.


Thanks.  Diego.

Bob Wilson

Mar 5, 2015, 11:37:37 AM
to Diego Novillo, Xinliang David Li, LLVM Developers Mailing List
Thanks for the detailed write-up on this. Some of the issues definitely need to be addressed. I am concerned, though, that some of the ideas may be leading toward a scenario where we have essentially two completely different ways of representing profile information in LLVM IR. It is great to have two complementary approaches to collecting profile data, but two representations in the IR would not make sense.

The first issue raised is that profile execution counts are not represented in the IR. This was a very intentional decision. I know it goes against what other compilers have done in the past. It took me a while to get used to the idea when Andy first suggested it, so I know it seems awkward at first. The advantage is that branch probabilities are much easier to keep updated in the face of compiler transformations, compared to execution counts. We are definitely missing the per-function execution counts that are needed to be able to compare relative “hotness” across functions, and I think that would be a good place to start making improvements. In the long term, we should keep our options open to making major changes, but before we go there, we should try to make incremental improvements to fix the existing infrastructure.

Many of the other issues you raise seem like they could also be addressed without major changes to the existing infrastructure. Let’s try to fix those first.

Ivan Baev

Mar 5, 2015, 7:38:06 PM
to Bob Wilson, Xinliang David Li, llv...@cs.uiuc.edu
Relative hotness across functions is a major requirement for most
profile-based inter-procedural optimizations - e.g. profile-based
inlining, optimizations for code size, function placement.

The proposed CFG edge/BB profile execution counts will overcome this
issue. In addition to Bob's arguments, one challenge for edge/BB execution
counts would be scaling. For example, if we reroll a loop and the execution
counts for BBs/edges in the loop exceed the limit, then scaling is needed.
However, this scaling will necessitate scaling BB/edge execution counts
for the entire module. Another similar example is function merging.

On the other hand, how could we incrementally extend the current approach
of branch metadata/branch probabilities/block frequencies to provide
support for relative hotness across functions? One idea is to
introduce a metadata for the function entry count (FEC) - which is
currently available as region counter 0 in clang - and properly update it
through transformations. Take, for example, the inliner. When we inline
callee B into function A, we need to decrease FEC(B) by the BB execution
count of the call site, and (incrementally) recompute BlockFrequencyInfo
for function A. One issue here is the lack of function-level metadata.
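
A sketch of that bookkeeping, with hypothetical names:

#include <algorithm>
#include <cstdint>

// When callee B is inlined into A at a call site that executes
// CallSiteCount times, B's remaining out-of-line entry count FEC(B)
// drops by that amount.
static void updateEntryCountOnInline(uint64_t &CalleeEntryCount,
                                     uint64_t CallSiteCount) {
  CalleeEntryCount -= std::min(CalleeEntryCount, CallSiteCount);
  // The caller's BlockFrequencyInfo would then be recomputed (or
  // incrementally updated) separately.
}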

Thanks,
Ivan


> Date: Thu, 5 Mar 2015 08:29:52 -0800
> From: Bob Wilson <bob.w...@apple.com>
> To: Diego Novillo <dnov...@google.com>
> Cc: Xinliang David Li <dav...@google.com>, LLVM Developers Mailing
> List <llv...@cs.uiuc.edu>
> Subject: Re: [LLVMdev] RFC - Improvements to PGO profile support
> Message-ID: <AC046380-247D-4AFA...@apple.com>
> Content-Type: text/plain; charset="utf-8"
>
>> On Mar 2, 2015, at 4:19 PM, Diego Novillo <dnov...@google.com> wrote:
>>
>> On Thu, Feb 26, 2015 at 6:54 PM, Diego Novillo <dnov...@google.com> wrote:
>>
>> I've created a few bugzilla issues with details of some of the things
>> I'll be looking into. I'm not yet done wordsmithing the overall design
>> document. I'll try to finish it by early next week at the latest.
>>
>> The document is available at
>> https://docs.google.com/document/d/15VNiD-TmHqqao_8P-ArIsWj1KdtU-ElLFaYPmZdrDMI/edit?usp=sharing
>>
>> There are several topics covered. Ideally, I would prefer that we
>> discuss each topic separately. The main ones I will start working on
>> are the ones described in the bugzilla links we have in the doc.
>>
>> This is just a starting point for us. I am not at all concerned with
>> implementing exactly what is proposed in the document. In fact, if we
>> can get the same value using the existing support, all the better.
>>
>> OTOH, any other ideas that folks may have that work better than this
>> are more than welcome. I don't have really strong opinions on the
>> matter. I am fine with whatever works.
>
> Thanks for the detailed write-up on this. Some of the issues definitely
> need to be addressed. I am concerned, though, that some of the ideas may
> be leading toward a scenario where we have essentially two completely
> different ways of representing profile information in LLVM IR. It is
> great to have two complementary approaches to collecting profile data,
> but two representations in the IR would not make sense.
>
> The first issue raised is that profile execution counts are not
> represented in the IR. This was a very intentional decision. I know it
> goes against what other compilers have done in the past. It took me a
> while to get used to the idea when Andy first suggested it, so I know it
> seems awkward at first. The advantage is that branch probabilities are
> much easier to keep updated in the face of compiler transformations,
> compared to execution counts. We are definitely missing the per-function
> execution counts that are needed to be able to compare relative
> "hotness" across functions, and I think that would be a good place to
> start making improvements. In the long term, we should keep our options
> open to making major changes, but before we go there, we should try to
> make incremental improvements to fix the existing infrastructure.
>
> Many of the other issues you raise seem like they could also be
> addressed without major changes to the existing infrastructure. Let's
> try to fix those first.







Philip Reames

Mar 6, 2015, 8:52:52 PM
to Bob Wilson, Diego Novillo, Xinliang David Li, LLVM Developers Mailing List
After reading the document, I agree with Bob's perspective here. 

I would strongly recommend that you start with the optimizations that can be implemented within the current framework.  The current infrastructure gives a fairly reasonable idea of relative hotness within a function.  There's a lot to be done to exploit that information (even in the inliner!) without resorting to cross-function analysis.  If, after most of those have been implemented, we need more fundamental changes, we could consider them.  Starting with a fundamental rewrite of the profiling system within LLVM seems like a mistake.

At a meta level, as someone who uses LLVM for JITing, I would be opposed to a system that assumed consistent profiling counts across function boundaries and gave up on relative hotness information.  At least if I'm understanding your proposal, this would *completely break* a multi-tiered JIT.  In practice, you generally stop collecting instrumentation profiling once something is compiled at a high enough tier.  When compiling its caller, you'll get very deceptive results if you rely on the execution counts to line up across functions.  On the other hand, merging two relative hotness profiles by scaling based on the hotness of the callsite works out quite well in practice.  You can use some information about global hotness to make decisions, but those decisions need to be resilient to such systematic under-counting.

Philip

Xinliang David Li

Mar 6, 2015, 10:07:19 PM
to Philip Reames, LLVM Developers Mailing List
Bob, Philip, thanks for the feedback.

Diego is planning to give a more detailed reply next Monday. There seems
to be some misunderstanding about the proposals, so I will just give
some highlights here:

1) The proposal is not intended to fundamentally change the current
framework, but to enhance the framework so that
a) more profile information is preserved
b) block/edge count/frequency becomes faster to compute
c) profile information becomes faster to access and update
(inter-procedurally)

2) Changes to profile APIs and profile client code will be minimized,
except that we will add IPA clients (once Chandler's pass manager
change is ready)

3) The proposed change does *not* give up relative hotness as
mentioned by Philip. All clients that rely on relative hotness are
unaffected -- except that the data is better and more reliable.

4) With real profile data available, the current infrastructure does *not*
provide reasonable hotness (e.g., you can try comparing BBs that
execute the same number of times, but in loops with different depths in
the same function, and see how big the diff is), let alone fast
updating.

I am reasonably confident that the proposal
1) does not affect compilations using a static profile (with branch prediction)
2) is strictly better for -fprofile-instr-use optimizations.

The area I am not so sure about is the JIT, but I am really interested in
knowing the details and proposing solutions for you if the current
proposal does not work for you (which I doubt -- because if the
current framework works, the new one should work too :) ).

I am looking forward to more detailed discussions next week! We shall
sit down together and discuss changes, rationale, concerns one by one
-- with concrete examples.

thanks,

David


Diego Novillo

Mar 10, 2015, 1:20:53 PM
to Bob Wilson, Xinliang David Li, LLVM Developers Mailing List
On Thu, Mar 5, 2015 at 11:29 AM, Bob Wilson <bob.w...@apple.com> wrote:

On Mar 2, 2015, at 4:19 PM, Diego Novillo <dnov...@google.com> wrote:

On Thu, Feb 26, 2015 at 6:54 PM, Diego Novillo <dnov...@google.com> wrote:

I've created a few bugzilla issues with details of some of the things I'll be looking into. I'm not yet done wordsmithing the overall design document. I'll try to finish it by early next week at the latest.

The document is available at
https://docs.google.com/document/d/15VNiD-TmHqqao_8P-ArIsWj1KdtU-ElLFaYPmZdrDMI/edit?usp=sharing

There are several topics covered. Ideally, I would prefer that we discuss each topic separately. The main ones I will start working on are the ones described in the bugzilla links we have in the doc.

This is just a starting point for us. I am not at all concerned with implementing exactly what is proposed in the document. In fact, if we can get the same value using the existing support, all the better.

OTOH, any other ideas that folks may have that work better than this are more than welcome. I don't have really strong opinions on the matter. I am fine with whatever works.

Thanks for the detailed write-up on this. Some of the issues definitely need to be addressed. I am concerned, though, that some of the ideas may be leading toward a scenario where we have essentially two completely different ways of representing profile information in LLVM IR. It is great to have two complementary approaches to collecting profile data, but two representations in the IR would not make sense.

Yeah, I don't think I'll continue to push for a new MD_count attribute. If we were to make MD_prof be a "real" execution count, that would be enough. Note that by re-using MD_prof we are not changing its meaning at all. The execution count is still a weight and the ratio is still a branch probability. All that we are changing is the absolute value of the numbers, and we are increasing the data type width to remove the 32-bit limitation.



The first issue raised is that profile execution counts are not represented in the IR. This was a very intentional decision. I know it goes against what other compilers have done in the past. It took me a while to get used to the idea when Andy first suggested it, so I know it seems awkward at first. The advantage is that branch probabilities are much easier to keep updated in the face of compiler transformations, compared to execution counts.

Sorry. I don't follow. Updating counts as the CFG is transformed is not difficult at all. What examples do you have in mind?  The big advantage of making MD_prof an actual execution count is that it is a meaningful metric wrt scaling and transformation.

Say, for instance, that we have a branch instruction with two targets with counts {100, 300} inside a function 'foo' that has entry count 2. The edge probability for the first edge (count 100) is 100/(100+300) = 25%.

If we inline foo() inside another function bar() at a callsite with profile count == 1, the cloned branch instruction gets its counters scaled with the callsite count. So the new branch has counts {100 * 1 / 2, 300 * 1 / 2} = {50, 150}.  But the branch probability did not change. Currently, we are cloning the branch without changing the edge weights.

This scaling is not difficult at all and can be done incrementally very quickly. We cannot afford to recompute all frequencies on the fly because it would be detrimental to compile time. If foo() itself has N callees inlined into it, each inlined callee needs to trigger a re-computation. When foo() is inlined into bar(), the frequencies will need to be recomputed for foo() and all N callees inlined into foo().
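
To make the arithmetic concrete, a sketch of the scaling (hypothetical
helper, not current LLVM code):

#include <cstdint>
#include <vector>

// Scale the branch counts of an inlined copy by the ratio of the call
// site count to the callee entry count; the ratios between the counts
// (the branch probabilities) are unchanged.
static std::vector<uint64_t>
scaleBranchCounts(const std::vector<uint64_t> &Counts,
                  uint64_t CallSiteCount, uint64_t CalleeEntryCount) {
  std::vector<uint64_t> Scaled;
  Scaled.reserve(Counts.size());
  for (uint64_t C : Counts)
    Scaled.push_back(CalleeEntryCount ? C * CallSiteCount / CalleeEntryCount
                                      : C);
  return Scaled;
}

From the example above, scaleBranchCounts({100, 300}, 1, 2) yields
{50, 150}, and the edge probability stays at 25%.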
 
 
We are definitely missing the per-function execution counts that are needed to be able to compare relative “hotness” across functions, and I think that would be a good place to start making improvements. In the long term, we should keep our options open to making major changes, but before we go there, we should try to make incremental improvements to fix the existing infrastructure.

Right, and that's the core of our proposal. We don't really want to make major infrastructure changes at this point. The only thing I'd like to explore is making MD_prof a real count. This will be useful for the inliner changes and it would also make incremental updates easier, because the scaling that needs to be done is very straightforward and quick.

Note that this change should not modify the current behaviour we get from profile analysis. Things that were hot before should continue to be hot now.


Many of the other issues you raise seem like they could also be addressed without major changes to the existing infrastructure. Let’s try to fix those first.

That's exactly the point of the proposal.  We definitely don't want to make major changes to the infrastructure at first. My thinking is to start working on making MD_prof a real count. One of the things that are happening is that the combination of real profile data plus the frequency propagation that we are currently doing is misleading the analysis.

For example (thanks David for the code and data). In the following code:

int g;
__attribute__((noinline)) void bar() {
 g++;
}

extern int printf(const char*, ...);

int main()
{
  int i, j, k;

  g = 0;

  // Loop 1.
  for (i = 0; i < 100; i++)
    for (j = 0; j < 100; j++)
       for (k = 0; k < 100; k++)
           bar();

  printf ("g = %d\n", g);
  g = 0;

  // Loop 2.
  for (i = 0; i < 100; i++)
    for (j = 0; j < 10000; j++)
        bar();

  printf ("g = %d\n", g);
  g = 0;


  // Loop 3.
  for (i = 0; i < 1000000; i++)
    bar();

  printf ("g = %d\n", g);
  g = 0;
}

When compiled with profile instrumentation, frequency propagation is distorting the real profile because it gives different frequencies to the calls to bar() in the 3 different loops. All 3 loops execute 1,000,000 times, but after frequency propagation, the first call to bar() gets a weight of 520,202 in loop #1, 210,944 in loop #2, and 4,096 in loop #3. In reality, every call to bar() should have a weight of 1,000,000.


Thanks.  Diego.

Ivan Baev

Mar 10, 2015, 9:33:28 PM
to Diego Novillo, Bob Wilson, Xinliang David Li, llv...@cs.uiuc.edu
Hi Diego,

I work for Qualcomm and we are also interested in improvements to PGO
profile support, especially for supporting relative hotness across
functions. We are willing to contribute in this area so please keep me in
this discussion and work distribution. Some questions and comments below.


> Date: Tue, 10 Mar 2015 13:14:22 -0400
> From: Diego Novillo <dnov...@google.com>
> To: Bob Wilson <bob.w...@apple.com>
> Cc: Xinliang David Li <dav...@google.com>, LLVM Developers Mailing
> List <llv...@cs.uiuc.edu>
> Subject: Re: [LLVMdev] RFC - Improvements to PGO profile support
> Message-ID:
> <CAD_=9DSzNSqpku0FjpeLyugaxshcxY9fLyyZ5EmD=i5TV...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
[ivan] What new data type do you suggest? Increasing it to 64 bit for
64-bit architectures would be reasonable, and would reduce the need of
scaling counts.
[ivan] Scaling branch counts when inlining looks good, but the current PGO
support doesn't maintain function entry counts at the LLVM IR level. How do
you plan to maintain these in LLVM IR?

>
>> We are definitely missing the per-function execution counts that are
>> needed to be able to compare relative ?hotness? across functions, and I
>> think that would be a good place to start making improvements. In the
>> long
>> term, we should keep our options open to making major changes, but
>> before
>> we go there, we should try to make incremental improvements to fix the
>> existing infrastructure.
>>
>
> Right, and that's the core of our proposal. We don't really want to make
> major infrastructure changes at this point. The only thing I'd like to
> explore is making MD_prof a real count. This will be useful for the
> inliner
> changes and it would also make incremental updates easier, because the
> scaling that needs to be done is very straightforward and quick.
>
> Note that this change should not modify the current behaviour we get from
> profile analysis. Things that were hot before should continue to be hot
> now.
>
>
>> Many of the other issues you raise seem like they could also be
>> addressed
>> without major changes to the existing infrastructure. Let?s try to fix
[ivan] I ran the example with
> opt -analyze -block-freq test1.ll

Printing analysis 'Block Frequency Analysis' for function 'main':
block-frequency-info: main
- entry: float = 1.0, int = 8
- for.cond: float = 101.0, int = 807
- for.cond1: float = 10100.0, int = 80799
- for.cond4: float = 1010000.0, int = 8079999
- for.body6: float = 1000000.0, int = 7999999
- for.inc7: float = 10000.0, int = 79999
- for.inc10: float = 100.0, int = 799
- for.end12: float = 1.0, int = 8
- for.cond13: float = 101.0, int = 807
- for.cond16: float = 409600.0, int = 3276799
- for.body18: float = 409559.0441, int = 3276472
- for.inc22: float = 100.0, int = 799
- for.end24: float = 1.0, int = 8
- for.cond26: float = 4096.0, int = 32768
- for.body28: float = 4095.995904, int = 32767
- for.end31: float = 1.0, int = 8

As shown above, the three calls to bar() get float BBFreq of 1000000 (in
for.body6), 409559.0441 (in for.body18), and 4095.995904 (in for.body28).

When the example is modified for trip counts of 10, 100, and 1000
> opt -analyze -block-freq test1.10.ll

Printing analysis 'Block Frequency Analysis' for function 'main':
block-frequency-info: main
- entry: float = 1.0, int = 8
- for.cond: float = 11.0, int = 87
- for.cond1: float = 110.0, int = 879
- for.cond4: float = 1100.0, int = 8799
- for.body6: float = 1000.0, int = 7999
- for.inc7: float = 100.0, int = 799
- for.inc10: float = 10.0, int = 79
- for.end12: float = 1.0, int = 8
- for.cond13: float = 11.0, int = 87
- for.cond16: float = 1010.0, int = 8079
- for.body18: float = 1000.0, int = 7999
- for.inc22: float = 10.0, int = 79
- for.end24: float = 1.0, int = 8
- for.cond26: float = 1001.0, int = 8007
- for.body28: float = 1000.0, int = 7999
- for.end31: float = 1.0, int = 8

the three calls to bar() get the correct count of 1000.
We should try to remove the limitation of 4096 in the block frequency
propagation algorithm if possible.

Ivan

Xinliang David Li

Mar 11, 2015, 12:49:43 AM
to Ivan Baev, <llvmdev@cs.uiuc.edu> List
> [ivan] What new data type do you suggest? Increasing it to 64 bit for
> 64-bit architectures would be reasonable, and would reduce the need of
> scaling counts.

64 bit int in MD_prof.
MD_prof is the only thing that needs to be maintained at the IR level. For
the in-memory CFG profile representation, the function entry count is needed.

This means that when CFG profile data is incrementally updated during
inlining, the MD_prof weight/count for the cloned branches needs to
be scaled in the same way.
The 4096 cap is needed for static profile estimation -- there is no need to
remove the limit when real profile data is available -- the solution
is to simply not use the frequency propagation at all.

For your example, the values 11, 1010, etc. are all bogus. When the loop
trip count is small, the ratio of the loop body count and the preheader
count (the average trip count) can be very different from the real trip
count (see PR22719). In the same bug, there is another example (a
multi-entry loop) showing the weakness.

Storing profile count in MD_prof allows the compiler to skip the freq
propagation completely. It also allows faster incremental update
during inlining via simple scaling.

David

Diego Novillo

Mar 11, 2015, 3:53:12 PM
to Ivan Baev, Bob Wilson, Xinliang David Li, llv...@cs.uiuc.edu


On 03/10/15 21:30, Ivan Baev wrote:
>
> [ivan] What new data type do you suggest? Increasing it to 64 bit for
> 64-bit architectures would be reasonable, and would reduce the need of
> scaling counts.

Yes. That's what we have in mind. MD_prof would eventually become a
64-bit integer.
>>
>> [ivan] Scaling branch counts when inlining looks good, but the
>> current PGO support doesn't maintain function entry counts at LLVM
>> IR. How do you plan to maintain these at LLVM IR?
Well, we'd have to incrementally update the entry counts as inlining
happens while scaling the MD_prof counts in the IR. This is a bit far
out right now. The inliner is not even capable of reading profiles
today. This is the part of the plan that will happen much further down
the road.

>
>>
>> opt -analyze -block-freq test1.10.ll
> Printing analysis 'Block Frequency Analysis' for function 'main':
> block-frequency-info: main
> - entry: float = 1.0, int = 8
> - for.cond: float = 11.0, int = 87
> - for.cond1: float = 110.0, int = 879
> - for.cond4: float = 1100.0, int = 8799
> - for.body6: float = 1000.0, int = 7999
> - for.inc7: float = 100.0, int = 799
> - for.inc10: float = 10.0, int = 79
> - for.end12: float = 1.0, int = 8
> - for.cond13: float = 11.0, int = 87
> - for.cond16: float = 1010.0, int = 8079
> - for.body18: float = 1000.0, int = 7999
> - for.inc22: float = 10.0, int = 79
> - for.end24: float = 1.0, int = 8
> - for.cond26: float = 1001.0, int = 8007
> - for.body28: float = 1000.0, int = 7999
> - for.end31: float = 1.0, int = 8
>
> the three calls to bar() get the correct count of 1000.
> We should try to remove the limitation of 4096 in the block frequency
> propagation algorithm if possible.
Long term, the presence of real profile data would make the frequency
propagator unnecessary. In the shorter term, coming up with some
intermediate fix to this is certainly possible. I'd rather continue this
discussion in the bug itself (https://llvm.org/bugs/show_bug.cgi?id=22719)


Diego.

Duncan P. N. Exon Smith

Mar 12, 2015, 5:48:04 PM
to Diego Novillo, Xinliang David Li, LLVM Developers Mailing List

(Sorry for the delay responding; I've been on holiday.)

There are two things going on here.

Firstly, the loop scales are being capped at 4096. I propagated this
approximation from the previous version of BFI. If it's causing a
problem (which it looks like it is), we should drop it and fix what
breaks. You can play around with this by commenting out the `if`
statement at the end of `computeLoopScale()` in
BlockFrequencyInfoImpl.cpp.

For example, without that logic this testcase gives:

Printing analysis 'Block Frequency Analysis' for function 'main':
block-frequency-info: main
- entry: float = 1.0, int = 8

- for.cond: float = 51.5, int = 411
- for.body: float = 50.5, int = 403
- for.cond1: float = 5051.0, int = 40407
- for.body3: float = 5000.5, int = 40003
- for.cond4: float = 505001.0, int = 4040007
- for.body6: float = 500000.5, int = 4000003
- for.inc: float = 500000.5, int = 4000003
- for.end: float = 5000.5, int = 40003
- for.inc7: float = 5000.5, int = 40003
- for.end9: float = 50.5, int = 403
- for.inc10: float = 50.5, int = 403

- for.end12: float = 1.0, int = 8

- for.cond13: float = 51.5, int = 411
- for.body15: float = 50.5, int = 403
- for.cond16: float = 500051.0, int = 4000407
- for.body18: float = 500000.5, int = 4000003
- for.inc19: float = 500000.5, int = 4000003
- for.end21: float = 50.5, int = 403
- for.inc22: float = 50.5, int = 403

- for.end24: float = 1.0, int = 8

- for.cond26: float = 500001.5, int = 4000011
- for.body28: float = 500000.5, int = 4000003
- for.inc29: float = 500000.5, int = 4000003

- for.end31: float = 1.0, int = 8

(Now we get 500000.5 for all the inner loop bodies.)

Secondly, instrumentation-based profiling intentionally fuzzes the
profile data in the frontend using Laplace's Rule of Succession (look at
`scaleBranchWeight()` in CodeGenPGO.cpp).

For example, "loop 1" (which isn't affected by the 4096 cap) should give
a loop scale of 500000.5, not 1000000. (The profile data says
1000000/10000 for the inner loop, 10000/100 for the middle, and 100/1
for the outer loop. Laplace says that we should fuzz these branch
weights to 1000001/10001, 10001/101, and 101/2, which works out to
1000001/2 == 500000.5 total.)
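
A minimal sketch of that fuzzing (the real logic is scaleBranchWeight()
in CodeGenPGO.cpp; the exact form here is an approximation reconstructed
from the description above):

#include <cstdint>

// Scale a profiled count into 32 bits and add one. The +1 is Laplace's
// Rule of Succession: no resulting branch weight can be zero. With
// Scale == 1, the profiled 1000000/10000 pair becomes 1000001/10001,
// matching the numbers above.
uint32_t fuzzBranchWeight(uint64_t Count, uint64_t Scale) {
  return static_cast<uint32_t>(Count / Scale) + 1;
}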

Diego Novillo

unread,
Mar 16, 2015, 10:25:57 AM3/16/15
to Duncan P. N. Exon Smith, Xinliang David Li, LLVM Developers Mailing List
Thanks, Duncan. I've started working on this as a first step. Long term, the capping done during propagation is not much of a concern because we would probably not be using propagation if real profile information is available (at least that's the initial thinking).

Thanks for all the feedback, folks. I'll start sending out patches soon that address specific issues. We can continue the discussion there.


Diego.

Philip Reames

unread,
Mar 24, 2015, 1:02:57 PM3/24/15
to Diego Novillo, Bob Wilson, Xinliang David Li, LLVM Developers Mailing List
On 03/10/2015 10:14 AM, Diego Novillo wrote:


On Thu, Mar 5, 2015 at 11:29 AM, Bob Wilson <bob.w...@apple.com> wrote:

On Mar 2, 2015, at 4:19 PM, Diego Novillo <dnov...@google.com> wrote:

On Thu, Feb 26, 2015 at 6:54 PM, Diego Novillo <dnov...@google.com> wrote:

I've created a few bugzilla issues with details of some of the things I'll be looking into. I'm not yet done wordsmithing the overall design document. I'll try to finish it by early next week at the latest.

The document is available at


There are several topics covered. Ideally, I would prefer that we discuss each topic separately. The main ones I will start working on are the ones described in the bugzilla links we have in the doc.

This is just a starting point for us. I am not at all concerned with implementing exactly what is proposed in the document. In fact, if we can get the same value using the existing support, all the better.

OTOH, any other ideas that folks may have that work better than this are more than welcome. I don't have really strong opinions on the matter. I am fine with whatever works.

Thanks for the detailed write-up on this. Some of the issues definitely need to be addressed. I am concerned, though, that some of the ideas may be leading toward a scenario where we have essentially two completely different ways of representing profile information in LLVM IR. It is great to have two complementary approaches to collecting profile data, but two representations in the IR would not make sense.

Yeah, I don't think I'll continue to push for a new MD_count attribute. If we were to make MD_prof be a "real" execution count, that would be enough. Note that by re-using MD_prof we are not changing its meaning at all. The execution count is still a weight and the ratio is still the branch probability. All we are changing is the absolute value of the numbers, and we are increasing the data type width to remove the 32-bit limitation.
Independent of everything else, relaxing the 32-bit restriction is clearly a good idea.  This would make a great standalone patch. 



The first issue raised is that profile execution counts are not represented in the IR. This was a very intentional decision. I know it goes against what other compilers have done in the past. It took me a while to get used to the idea when Andy first suggested it, so I know it seems awkward at first. The advantage is that branch probabilities are much easier to keep updated in the face of compiler transformations, compared to execution counts.

Sorry. I don't follow. Updating counts as the CFG is transformed is not difficult at all. What examples do you have in mind?  The big advantage of making MD_prof an actual execution count is that it is a meaningful metric wrt scaling and transformation.

Say, for instance, that we have a branch instruction with two targets with counts {100, 300} inside a function 'foo' that has entry count 2. The edge probability for the first edge (count 100) is 100/(100+300) = 25%.

If we inline foo() inside another function bar() at a callsite with profile count == 1, the cloned branch instruction gets its counters scaled with the callsite count. So the new branch has counts {100 * 1 / 2, 300 * 1 / 2} = {50, 150}.  But the branch probability did not change. Currently, we are cloning the branch without changing the edge weights.
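
A minimal sketch of that scaling, with illustrative names (this is not an
existing LLVM API):

#include <cstdint>
#include <vector>

// When inlining a callee with entry count CalleeEntryCount at a callsite
// with count CallSiteCount, each cloned branch weight is scaled by
// CallSiteCount / CalleeEntryCount. For the example above, {100, 300}
// scaled by 1/2 gives {50, 150}; the 25%/75% probabilities are unchanged.
std::vector<uint64_t> scaleClonedWeights(const std::vector<uint64_t> &Weights,
                                         uint64_t CallSiteCount,
                                         uint64_t CalleeEntryCount) {
  std::vector<uint64_t> Scaled;
  Scaled.reserve(Weights.size());
  for (uint64_t W : Weights)
    Scaled.push_back(W * CallSiteCount / CalleeEntryCount);
  return Scaled;
}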

This scaling is not difficult at all and can be done incrementally very quickly. We cannot afford to recompute all frequencies on the fly because it would be detrimental to compile time. If foo() itself has N callees inlined into it, each inlined callee needs to trigger a re-computation. When foo() is inlined into bar(), the frequencies will need to be recomputed for foo() and all N callees inlined into foo().
It really sounds like your proposal is to essentially eagerly compute scaling rather than lazily compute it on demand.  Assuming perfect implementations for both (with no rounding losses), the results should be the same.  Is that a correct restatement?  I'm going to hold off on responding to why that's a bad idea until you confirm, because I'm not sure I follow what you're trying to say.  :)

Also, trusting exact entry counts is going to be somewhat suspect.  These are *highly* susceptible to racy updates, overflow, etc...  Anything which puts too much implicit trust in these numbers is going to be problematic. 
 
 
We are definitely missing the per-function execution counts that are needed to be able to compare relative “hotness” across functions, and I think that would be a good place to start making improvements. In the long term, we should keep our options open to making major changes, but before we go there, we should try to make incremental improvements to fix the existing infrastructure.

Right, and that's the core of our proposal. We don't really want to make major infrastructure changes at this point. The only thing I'd like to explore is making MD_prof a real count. This will be useful for the inliner changes and it would also make incremental updates easier, because the scaling that needs to be done is very straightforward and quick.

Note that this change should not modify the current behaviour we get from profile analysis. Things that were hot before should continue to be hot now.
I have no objection to adding a mechanism for expressing an entry count.  I am still very hesitant about the proposals with regards to redefining the current MD_prof. 

I'd encourage you to post a patch for the entry count mechanism, but not tie its semantics to exact execution count.  (Something like "the value provided must correctly describe the relative hotness of this routine against others in the program annotated with the same metadata.  It is the relative scaling that is important, not the absolute value.  In particular, the value need not be an exact execution count.")



Many of the other issues you raise seem like they could also be addressed without major changes to the existing infrastructure. Let’s try to fix those first.

That's exactly the point of the proposal.  We definitely don't want to make major changes to the infrastructure at first. My thinking is to start working on making MD_prof a real count. One of the things that are happening is that the combination of real profile data plus the frequency propagation that we are currently doing is misleading the analysis.
I consider this a major change.  You're trying to redefine a major part of the current system. 

Multiple people have spoken up and objected to this change (as currently described).  Please start somewhere else.   
Duncan responded to this.  My conclusion from his response: this is a bug, not a fundamental issue.  Remove the max scaling factor, switch the counts to 64 bits and everything should be fine.  If you disagree, let's discuss. 


Thanks.  Diego.

Philip Reames

unread,
Mar 24, 2015, 1:16:11 PM3/24/15
to Diego Novillo, Bob Wilson, Xinliang David Li, LLVM Developers Mailing List
Sorry, "objected" is too strong a word here.  I should have said "have reservations about". 

My point here - which I'm not sure I expressed well - is that I'm concerned we're going to bog down on a controversial change rather than make progress on the parts we agree on.  I'm not saying that we should never redefine things in the way you're proposing, but I would like to see the parts that work under both schemes be addressed first.  We can continue this discussion in parallel. 

Diego Novillo

unread,
Mar 24, 2015, 1:21:09 PM3/24/15
to Philip Reames, Xinliang David Li, LLVM Developers Mailing List
On Tue, Mar 24, 2015 at 1:13 PM, Philip Reames
<list...@philipreames.com> wrote:

> Sorry, "objected" is too strong a word here. I should have said "have
> reservations about".
>
> My point here - which I'm not sure I expressed well - is that I'm concerned
> we're going to bog down on a controversial change rather than make progress
> on the parts we agree on. I'm not saying that we should never redefine
> things in the way your proposing, but I would like to see the parts that
> work under both schemes be addressed first. We can continue this discussion
> in parallel.

No worries. As I stated upthread, my plan is to address all the issues
bottom up, starting with the more immediately obvious (no changes for
change's sake). If we can get the usefulness we are looking for out of
the current harness, then no big changes will be needed.

If we run into some brick wall down the road (say when we're doing the
profile-based inliner changes), then we'll see what other options we
have available.

Xinliang David Li

unread,
Mar 24, 2015, 1:30:17 PM3/24/15
to Philip Reames, LLVM Developers Mailing List
Diego and I have discussed this in light of the feedback received. We
have a revised plan for this (see Diego's last reply). Here is a more
detailed recap:

1) keep MD_prof definition as it is today; also keep using the
frequency propagation as it is (assuming programs with irreducible
loops are not common and not important. If it turns out to be
otherwise, we will revisit this).
2) fix all problems that lead to wrong 'frequency/count' computed from
the frequency propagation algorithm
2.1) relax 32bit limit
2.2) remove 4096 iteration count cap
2.3) remove the 'laplace rule of succession' which can be very harmful
2.4) Fix the bug in 'BranchProbabilityInfo::calcMetadataWeights'
that does capping without scaling
2.5) Fix a similar bug in 'MachineBranchProbabilityInfo::getSumForBlock('
.. other bugs found
3) introduce a new meta data to represent function entry global
hotness (the implementation can choose to use entry execution count)
4) implement frequency/profile update interfaces for Inline
transformations -- i.e., allowing importing callee's profile info into
caller with scaling factor computed from callsite count and callee
entry count
5) Implement mechanism to compute, record and use profile program
summary data (This will be discussed separately).


thanks,

David

Xinliang David Li

unread,
Mar 24, 2015, 1:37:02 PM3/24/15
to Philip Reames, LLVM Developers Mailing List
> I follow what you're trying to say. :)
>
> Also, trusting exact entry counts is going to be somewhat suspect. These
> are *highly* susceptible to racy updates, overflow, etc... Anything which
> puts too much implicit trust in these numbers is going to be problematic.

No, we won't be using profile counts in the absolute sense, a.k.a. if
(count == 100000). The counts are used as global hotness relative to
others. They are also used together with program summary data.

>

> I have no objection to adding a mechanism for expressing an entry count. I
> am still very hesitant about the proposals with regards to redefining the
> current MD_prof.

For now, we won't touch MD_prof's definition.

>
> I'd encourage you to post a patch for the entry count mechanism, but not tie
> its semantics to exact execution count. (Something like "the value provided
> must correctly describe the relative hotness of this routine against others
> in the program annoatated with the same metadata. It is the relative
> scaling that is important, not the absolute value. In particular, the value
> need not be an exact execution count.")
>

Defining entry count in a more flexible way is fine -- as long as the
implementation can choose to use 'execution count' :)

thanks,

David

Bob Wilson

unread,
Mar 24, 2015, 1:55:15 PM3/24/15
to Xinliang David Li, LLVM Developers Mailing List

> On Mar 24, 2015, at 10:27 AM, Xinliang David Li <dav...@google.com> wrote:
>
> Diego and I have discussed this according to the feedback received. We
> have revised plan for this (see Diego's last reply). Here is a more
> detailed re-cap:
>
> 1) keep MD_prof definition as it is today; also keep using the
> frequency propagation as it is (assuming programs with irreducible
> loops are not common and not important. If it turns out to be
> otherwise, we will revisit this).
> 2) fix all problems that lead to wrong 'frequency/count' computed from
> the frequency propagation algorithm
> 2.1) relax 32bit limit
> 2.2) remove 4096 iteration count cap
> 2.3) remove the 'laplace rule of succession' which can be very harmful

I have not seen any consensus on this. I’m not at all convinced this would be a good idea. From what I’ve seen, using LaPlace’s rule avoids really bad behavior in cases where the counts are very small and has almost no impact when the counts are large.

Diego Novillo

unread,
Mar 24, 2015, 2:07:58 PM3/24/15
to Bob Wilson, Duncan P. N. Exon Smith, Xinliang David Li, LLVM Developers Mailing List
On Tue, Mar 24, 2015 at 1:48 PM, Bob Wilson <bob.w...@apple.com> wrote:
>
>> On Mar 24, 2015, at 10:27 AM, Xinliang David Li <dav...@google.com> wrote:
>>
>> 2.3) remove the 'laplace rule of succession' which can be very harmful
>
> I have not seen any consensus on this. I’m not at all convinced this would be a good idea. From what I’ve seen, using LaPlace’s rule avoids really bad behavior in cases where the counts are very small and has almost no impact when the counts are large.

Duncan and I chatted briefly about this on IRC. I'm going to
experiment with only applying laplace smoothing if any scaled weights
are 0. AFAIU, usage of this rule is really just trying to avoid having
downstream users deal with 0 weights. Duncan, please whack me with a
clue stick if I totally misrepresented our conclusions.

Bob Wilson

unread,
Mar 24, 2015, 2:12:16 PM3/24/15
to Diego Novillo, Xinliang David Li, LLVM Developers Mailing List

> On Mar 24, 2015, at 10:53 AM, Diego Novillo <dnov...@google.com> wrote:
>
> On Tue, Mar 24, 2015 at 1:48 PM, Bob Wilson <bob.w...@apple.com> wrote:
>>
>>> On Mar 24, 2015, at 10:27 AM, Xinliang David Li <dav...@google.com> wrote:
>>>
>>> 2.3) remove the 'laplace rule of succession' which can be very harmful
>>
>> I have not seen any consensus on this. I’m not at all convinced this would be a good idea. From what I’ve seen, using LaPlace’s rule avoids really bad behavior in cases where the counts are very small and has almost no impact when the counts are large.
>
> Duncan and I chatted briefly about this on IRC. I'm going to
> experiment with only applying laplace smoothing if any scaled weights
> are 0. AFAIU, usage of this rule is really just trying to avoid having
> downstream users deal with 0 weights. Duncan, please whack me with a
> clue stick if I totally misrepresented our conclusions.

Zero is the extreme case, but it’s also important for other small counts. I’d like to see some specific examples of why you think the current behavior is harmful.

Xinliang David Li

unread,
Mar 24, 2015, 2:19:26 PM3/24/15
to Bob Wilson, Xinliang David Li, LLVM Developers Mailing List
On Tue, Mar 24, 2015 at 10:48 AM, Bob Wilson <bob.w...@apple.com> wrote:

> On Mar 24, 2015, at 10:27 AM, Xinliang David Li <dav...@google.com> wrote:
>
> Diego and I have discussed this according to the feedback received. We
> have revised plan for this (see Diego's last reply).  Here is a more
> detailed re-cap:
>
> 1) keep MD_prof definition as it is today; also keep using the
> frequency propagation as it is (assuming programs with irreducible
> loops are not common and not important. If it turns out to be
> otherwise, we will revisit this).
> 2) fix all problems that lead to wrong 'frequency/count' computed from
> the frequency propagation algorithm
>   2.1) relax 32bit limit
>   2.2) remove 4096 iteration count cap
>   2.3) remove the 'laplace rule of succession' which can be very harmful

I have not seen any consensus on this. I’m not at all convinced this would be a good idea. From what I’ve seen, using LaPlace’s rule avoids really bad behavior in cases where the counts are very small and has almost no impact when the counts are large.


This rule has a big impact on loops. On the other hand, if the count is small, it is usually not hot and not interesting. However, if this is considered important, we can introduce a flag to control the behavior.

thanks,

David

Chandler Carruth

unread,
Mar 24, 2015, 2:33:39 PM3/24/15
to Xinliang David Li, LLVM Developers Mailing List
Sorry I haven't responded earlier, but one point here still doesn't make sense to me:

On Tue, Mar 24, 2015 at 10:27 AM, Xinliang David Li <dav...@google.com> wrote:
Diego and I have discussed this according to the feedback received. We
have revised plan for this (see Diego's last reply).  Here is a more
detailed re-cap:

1) keep MD_prof definition as it is today; also keep using the
frequency propagation as it is (assuming programs with irreducible
loops are not common and not important. If it turns out to be
otherwise, we will revisit this).
2) fix all problems that lead to wrong 'frequency/count' computed from
the frequency propagation algorithm
   2.1) relax 32bit limit

I still don't understand why this is important or useful.... Maybe I'm just missing something.

Given the current meaning of MD_prof, it seems like the result of limiting this to 32-bits is that the maximum relative ratio of probabilities between two successors of a basic block with N successors is (2 billion / N):1 -- what is the circumstance that makes this resolution insufficient?

It also doesn't seem *bad* per-se, I just don't see what it improves, and it does cost memory...

Xinliang David Li

unread,
Mar 24, 2015, 2:49:12 PM3/24/15
to Chandler Carruth, Xinliang David Li, LLVM Developers Mailing List
Right -- there is some ambiguity here -- it would be needed if we were to change MD_prof's definition to represent branch counts. However, with the new plan, the removal of the limit only applies to the planned function entry count representation. 

David

Xinliang David Li

unread,
Mar 24, 2015, 3:11:08 PM3/24/15
to Bob Wilson, LLVM Developers Mailing List
On Tue, Mar 24, 2015 at 10:54 AM, Bob Wilson <bob.w...@apple.com> wrote:
>
>> On Mar 24, 2015, at 10:53 AM, Diego Novillo <dnov...@google.com> wrote:
>>
>> On Tue, Mar 24, 2015 at 1:48 PM, Bob Wilson <bob.w...@apple.com> wrote:
>>>
>>>> On Mar 24, 2015, at 10:27 AM, Xinliang David Li <dav...@google.com> wrote:
>>>>
>>>> 2.3) remove the 'laplace rule of succession' which can be very harmful
>>>
>>> I have not seen any consensus on this. I’m not at all convinced this would be a good idea. From what I’ve seen, using LaPlace’s rule avoids really bad behavior in cases where the counts are very small and has almost no impact when the counts are large.
>>
>> Duncan and I chatted briefly about this on IRC. I'm going to
>> experiment with only applying laplace smoothing if any scaled weights
>> are 0. AFAIU, usage of this rule is really just trying to avoid having
>> downstream users deal with 0 weights. Duncan, please whack me with a
>> clue stick if I totally misrepresented our conclusions.
>
> Zero is the extreme case, but it’s also important for other small counts. I’d like to see some specific examples of why you think the current behavior is harmful.

Using the rule has at least the following bad side effects:
1) wrong estimation of average loop trip count -- possibly leading to
wrong unroll decisions
2) result in bad inlining decisions. For instance:

for (...)
  bar(); // (1)

where (1) is the only callsite to bar(). Using the rule, the count of the
BB enclosing the call to bar() can be as low as half of the entry count
of bar(). The inliner will get confused, think there are more hot
callsites to 'bar', and make suboptimal decisions ..

Also if bar has calls to other functions, those callsites will look
hotter than the call to 'bar' ...

The attached are two cases as well as the frequency graph computed
today (with the laplace rule) and the correct frequency expected.

David
loops2.c
loops2.dot
loops2.prof
loops2_correct.dot
loops3.c
loops3.dot
loops3.prof
loops3_correct.dot

Chandler Carruth

unread,
Mar 24, 2015, 3:48:25 PM3/24/15
to Xinliang David Li, Xinliang David Li, LLVM Developers Mailing List

On Tue, Mar 24, 2015 at 11:46 AM, Xinliang David Li <xinli...@gmail.com> wrote:
On Tue, Mar 24, 2015 at 11:29 AM, Chandler Carruth <chan...@google.com> wrote:
Sorry I haven't responded earlier, but one point here still doesn't make sense to me:

On Tue, Mar 24, 2015 at 10:27 AM, Xinliang David Li <dav...@google.com> wrote:
Diego and I have discussed this according to the feedback received. We
have revised plan for this (see Diego's last reply).  Here is a more
detailed re-cap:

1) keep MD_prof definition as it is today; also keep using the
frequency propagation as it is (assuming programs with irreducible
loops are not common and not important. If it turns out to be
otherwise, we will revisit this).
2) fix all problems that lead to wrong 'frequency/count' computed from
the frequency propagation algorithm
   2.1) relax 32bit limit

I still don't understand why this is important or useful.... Maybe I'm just missing something.

Given the current meaning of MD_prof, it seems like the result of limiting this to 32-bits is that the maximum relative ratio of probabilities between two successors of a basic block with N successors is (2 billion / N):1 -- what is the circumstance that makes this resolution insufficient?

It also doesn't seem *bad* per-se, I just don't see what it improves, and it does cost memory...

right -- there is some ambiguity here -- it is needed if we were to change MD_prof's definition to represent branch count.  However, with the new plan, the removal of the limit only applies to the function entry count representation planned. 

Ah, ok, that makes more sense.

I'm still curious, is the ratio of 2 billion : 1 insufficient between the hottest basic block in the inner most loop and the entry block? My intuition is that this ratio encapsulates all the information we could meaningfully make decisions based upon, and I don't have any examples where it falls over, but perhaps you have some examples?

(Note, the 4096 scaling limit thing is completely separate, that makes perfect sense to me.)

Xinliang David Li

unread,
Mar 24, 2015, 3:52:01 PM3/24/15
to Chandler Carruth, Xinliang David Li, LLVM Developers Mailing List
The ratio is not the problem. The problem is that we can no longer effectively differentiate hot functions. 2 billion vs 4 billion will look the same with the small capping.

David

Chandler Carruth

unread,
Mar 24, 2015, 3:57:18 PM3/24/15
to Xinliang David Li, Xinliang David Li, LLVM Developers Mailing List
The current design for the entry frequency is that it should be interpreted relative to the global hotness of the function. Today, we only have attributes "hot" and "cold" which are terrible models of this, but if you imagine having a function-level metadata signifying the detailed function profile weight, then you could interpret the basic block frequencies between two functions only after normalizing with these function-level weights.

Does that make sense?

(Note, I'm not actually saying that I think this is necessarily the right design, just that I believe it is the intent of the current design, and if it is flawed I think that flaw hasn't yet been effectively shown.)

Xinliang David Li

unread,
Mar 24, 2015, 3:58:45 PM3/24/15
to Chandler Carruth, Xinliang David Li, LLVM Developers Mailing List
Capping also leads to other kinds of problems -- e.g., the sum of incoming edge counts (in the callgraph) does not match the callee entry count, etc.

David

Chandler Carruth

unread,
Mar 24, 2015, 4:00:45 PM3/24/15
to Xinliang David Li, Xinliang David Li, LLVM Developers Mailing List

On Tue, Mar 24, 2015 at 12:53 PM, Xinliang David Li <xinli...@gmail.com> wrote:
Capping also leads to other kinds of problems -- e.g., sum of incoming edge count (callgraph) does not match the callee entry count etc.

Can you explain these problems in more detail? I think that's essential for understanding why you think the design should be work in this way.

Xinliang David Li

unread,
Mar 24, 2015, 4:06:35 PM3/24/15
to Chandler Carruth, Xinliang David Li, LLVM Developers Mailing List
The design is basically to augment the existing frequency data with one 64-bit value which is the global hotness of the function entry (e.g., the entry execution count). With the execution count available, the BB count (or global hotness, if you will) is simply:

  count(BB) = freq(BB) * count(ENTRY) / freq(ENTRY)

You can view count(ENTRY) as an extension to the current 'hot'/'cold' attribute.

Note that for IPA, callsite count is obtained from enclosing BB's count.
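
A minimal sketch of that recovery, with illustrative names:

#include <cstdint>

// Given a recorded entry count and the relative block frequencies, any
// block's global hotness falls out of the formula above. A callsite's
// count is then just the count of its enclosing BB.
uint64_t blockCount(uint64_t BlockFreq, uint64_t EntryFreq,
                    uint64_t EntryCount) {
  return BlockFreq * EntryCount / EntryFreq;
}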

David

Chandler Carruth

unread,
Mar 24, 2015, 4:10:00 PM3/24/15
to Xinliang David Li, Xinliang David Li, LLVM Developers Mailing List
On Tue, Mar 24, 2015 at 1:03 PM, Xinliang David Li <xinli...@gmail.com> wrote:
The design is basically to augment the existing frequency data with one 64bit data which is the global hotness of the function entry (e.g. it be the entry execution count). With the execution count available, the BB  count (or global hotness if you will) is simply:

  count(BB)  = freq (BB) * count(ENTRY)/freq(ENTRY)

You can view count(ENTRY) as an extension to the current 'hot'/'cold' attribute

Yes, this is what I'm saying the current design was aiming for. Is this what you're now suggesting, or are you suggesting something different?
 

Note that for IPA, callsite count is obtained from enclosing BB's count.

Yes, there should clearly be a relationship here. One way I would like to see this tested is to look at the relative error at each level -- if we compute the global "count" as you describe for a callsite's BB, and aggregate it with all other callsite BBs, we should get something close to the function's global value. If we get wildly different results, there is probably some flaw in the algorithm used.

Xinliang David Li

unread,
Mar 24, 2015, 4:11:19 PM3/24/15
to Chandler Carruth, Xinliang David Li, LLVM Developers Mailing List
Example. Assuming the cap is 'C'

void bar()
{
    // ENTRY count is 4*C, after capping it becomes 'C'
    ...
}

void test()
{
  // BB1:   count(BB1) = C
  bar();

  // BB2:   count(BB2) = C
  bar();

}

void test2()
{
  // BB3:   count(BB3) = C
  bar();

  // BB4:   count(BB4) = C
  bar();
}

What would the inliner see here? When it sees callsite 1, it might mistakenly conclude that it is the only dominating callsite to 'bar'.

David

Chandler Carruth

unread,
Mar 24, 2015, 4:15:57 PM3/24/15
to Xinliang David Li, Xinliang David Li, LLVM Developers Mailing List

On Tue, Mar 24, 2015 at 1:08 PM, Xinliang David Li <xinli...@gmail.com> wrote:
Example. Assuming the cap is 'C'

void bar()
{
    // ENTRY count is 4*C, after capping it becomes 'C'
    ...
}

void test()
{
  // BB1:   count(BB1) = C
  bar();

  // BB2:   count(BB2) = C
  bar();

}

void test2()
{
  // BB3:   count(BB3) = C
  bar();

  // BB4:   count(BB4) = C
  bar();
}

What would the inliner see here? When it sees callsite 1, it might mistakenly conclude that it is the only dominating callsite to 'bar'.

I don't understand, why would it assume this?

I was suggesting that each function would need to be associated with some global weight, and the global weight ratio between test and bar should provide the needed information here?

Xinliang David Li

unread,
Mar 24, 2015, 4:28:10 PM3/24/15
to Chandler Carruth, Xinliang David Li, LLVM Developers Mailing List
On Tue, Mar 24, 2015 at 1:07 PM, Chandler Carruth <chan...@google.com> wrote:

On Tue, Mar 24, 2015 at 1:03 PM, Xinliang David Li <xinli...@gmail.com> wrote:
The design is basically to augment the existing frequency data with one 64bit data which is the global hotness of the function entry (e.g. it be the entry execution count). With the execution count available, the BB  count (or global hotness if you will) is simply:

  count(BB)  = freq (BB) * count(ENTRY)/freq(ENTRY)

You can view count(ENTRY) as an extension to the current 'hot'/'cold' attribute

Yes, this is what I'm saying the current design was aiming for. Is this what you're now suggesting, or are you suggesting something different?
 

I am not sure what current design you are referring to.  What I am saying is that we need to represent 'count(ENTRY)' for IPA purposes, which is currently missing.
 

Note that for IPA, callsite count is obtained from enclosing BB's count.

Yes, there should clearly be a relationship here.

By the definition of 'count', it can be used for IPA. This is the key difference to 'Freq'.
 
One way I would like to see this tested is to look at the relative error at each level -- if we compute the global "count" as you describe for a callsite's BB, and aggregate it with all other callsite BBs, we should get something close to the function's global value. If we get wildly different results, there is probably some flaw in the algorithm used.

We don't need to 'compute' the global count -- with PGO, it is already there. We just need to keep it without dropping it.

With PGO, the current block frequency computation flow is roughly like this:

1) read the raw profile counts from the profile data file.
2) compute the true/false branch target weights from the raw profile data, and create MD_prof. Note that with capping and other things, the data associated with MD_prof can only be used as branch probabilities.
3) invoke the BranchProbabilityInfo analysis pass, converting MD_prof into CFG-level branch probability data (weights), with more capping.
4) invoke the BlockFrequencyInfo analysis pass -- this pass relies solely on branch probabilities and assumes Freq(ENTRY) = 1 (for the fractional representation).
5) (roughly) scale up the Freq(BB) data from 4) so that the BB with the lowest fractional frequency gets an integer frequency of 1.


With the proposed change, we just need to add one more step
2.1) Record the count(ENTRY) for the function as metadata.

David


 

Xinliang David Li

unread,
Mar 24, 2015, 4:36:49 PM3/24/15
to Chandler Carruth, Xinliang David Li, LLVM Developers Mailing List
On Tue, Mar 24, 2015 at 1:13 PM, Chandler Carruth <chan...@google.com> wrote:

On Tue, Mar 24, 2015 at 1:08 PM, Xinliang David Li <xinli...@gmail.com> wrote:
Example. Assuming the cap is 'C'

void bar()
{
    // ENTRY count is 4*C, after capping it becomes 'C'
    ...
}

void test()
{
  // BB1:   count(BB1) = C
  bar();

  // BB2:   count(BB2) = C
  bar();

}

void test2()
{
  // BB3:   count(BB3) = C
  bar();

  // BB4:   count(BB4) = C
  bar();
}

What would the inliner see here? When it sees callsite 1, it might mistakenly conclude that it is the only dominating callsite to 'bar'.

I don't understand, why would it assume this?

because

  count(ENTRY_bar) = SUM over all i of count(callsite_i(bar))

If the inliner sees that the first callsite has the same count as bar's entry count, what assumption can it make (especially when the module has only one callsite to look at)? With more callsites visible, it can detect the insanity, but it still makes the data less useful.
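
A minimal sketch of the invariant being violated here, with illustrative
names:

#include <cstdint>
#include <vector>

// A callee's entry count should equal the sum of the counts of all its
// callsites. With a cap C, four callsites of count C each would face a
// capped entry count of C, so this check fails and the data misleads.
bool entryCountConsistent(uint64_t CalleeEntryCount,
                          const std::vector<uint64_t> &CallsiteCounts) {
  uint64_t Sum = 0;
  for (uint64_t Count : CallsiteCounts)
    Sum += Count;
  return Sum == CalleeEntryCount;
}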
 

I was suggesting that each function would need to be associated with some global weight, and the global weight ration between test and bar should provide the needed information here?

See my previous reply. The data is there. We just need to preserve it.

David
 

Philip Reames

unread,
Mar 24, 2015, 4:58:22 PM3/24/15
to Xinliang David Li, LLVM Developers Mailing List

On 03/24/2015 10:34 AM, Xinliang David Li wrote:
>> I follow what you're trying to say. :)
>>
>> Also, trusting exact entry counts is going to be somewhat suspect. These
>> are *highly* susceptible to racy updates, overflow, etc... Anything which
>> puts too much implicit trust in these numbers is going to be problematic.
> No, we won't be using profile counts in the absolute sense, a.k.a. if
> (count == 100000). The counts are used as global hotness relative to
> others. They are also used together with program summary data.
>
>> I have no objection to adding a mechanism for expressing an entry count. I
>> am still very hesitant about the proposals with regards to redefining the
>> current MD_prof.
> For now, we won't touch MD_prof's definition.
>
>> I'd encourage you to post a patch for the entry count mechanism, but not tie
>> its semantics to exact execution count. (Something like "the value provided
>> must correctly describe the relative hotness of this routine against others
> in the program annotated with the same metadata. It is the relative
>> scaling that is important, not the absolute value. In particular, the value
>> need not be an exact execution count.")
>>
> Defining entry count in a more flexible way is fine -- as long as the
> implementation can choose to use 'execution count' :)
Absolutely agreed here. :)

Philip Reames

unread,
Mar 24, 2015, 5:00:22 PM3/24/15
to Xinliang David Li, LLVM Developers Mailing List
This sounds like a very workable approach to me. I think 2.3 needs a
bit more discussion, but given it doesn't really directly affect me, I'm
probably going to stay out of it for now.

Thanks for working on this!

Bob Wilson

unread,
Mar 24, 2015, 5:51:21 PM3/24/15
to Xinliang David Li, LLVM Developers Mailing List

> On Mar 24, 2015, at 12:08 PM, Xinliang David Li <dav...@google.com> wrote:
>
> On Tue, Mar 24, 2015 at 10:54 AM, Bob Wilson <bob.w...@apple.com> wrote:
>>
>>> On Mar 24, 2015, at 10:53 AM, Diego Novillo <dnov...@google.com> wrote:
>>>
>>> On Tue, Mar 24, 2015 at 1:48 PM, Bob Wilson <bob.w...@apple.com> wrote:
>>>>
>>>>> On Mar 24, 2015, at 10:27 AM, Xinliang David Li <dav...@google.com> wrote:
>>>>>
>>>>> 2.3) remove the 'laplace rule of succession' which can be very harmful
>>>>
>>>> I have not seen any consensus on this. I’m not at all convinced this would be a good idea. From what I’ve seen, using LaPlace’s rule avoids really bad behavior in cases where the counts are very small and has almost no impact when the counts are large.
>>>
>>> Duncan and I chatted briefly about this on IRC. I'm going to
>>> experiment with only applying laplace smoothing if any scaled weights
>>> are 0. AFAIU, usage of this rule is really just trying to avoid having
>>> downstream users deal with 0 weights. Duncan, please whack me with a
>>> clue stick if I totally misrepresented our conclusions.
>>
>> Zero is the extreme case, but it’s also important for other small counts. I’d like to see some specific examples of why you think the current behavior is harmful.
>
> Using the rule has at least the following bad side effects:
> 1) wrong estimation of average loop trip count -- possibly leading to
> wrong unroll decisions

I remain skeptical. If the loop only ran once in your profile run, how much confidence do you have in the trip count? The LaPlace rule will tend to make the compiler more conservative in exactly the cases where there is insufficient information to be confident about what to do.

> 2) result in bad inlining decisions. For instance:
> for (...)
> bar(); // (1)
>
> where (1) is the only callsite to bar(). Using the rule, BB count
> enclosing the call to bar() can be as low as half of the entry count
> of bar(). Inliner will get confused and think there are more hot
> callsites to 'bar' and make suboptimal decisions ..
>
> Also if bar has calls to other functions, those callsites will look

> hotter than the call to 'bar' …

Your own proposal for recording entry counts is to record “relative hotness”, not absolute profile counts. On the caller’s side, we’ve got a branch weight influenced by LaPlace’s rule that is then used to compute BlockFrequency, and you’re concerned about a mismatch between that and the “relative hotness” recorded for the callee??

>
> The attached are two cases as well as the frequency graph computed
> today (with the laplace rule) and the correct frequency expected.

I’d be a lot more interested to see a real-world example.

Xinliang David Li

unread,
Mar 24, 2015, 6:24:46 PM3/24/15
to Bob Wilson, LLVM Developers Mailing List
On Tue, Mar 24, 2015 at 2:47 PM, Bob Wilson <bob.w...@apple.com> wrote:
>
>> On Mar 24, 2015, at 12:08 PM, Xinliang David Li <dav...@google.com> wrote:
>>
>> On Tue, Mar 24, 2015 at 10:54 AM, Bob Wilson <bob.w...@apple.com> wrote:
>>>
>>>> On Mar 24, 2015, at 10:53 AM, Diego Novillo <dnov...@google.com> wrote:
>>>>
>>>> On Tue, Mar 24, 2015 at 1:48 PM, Bob Wilson <bob.w...@apple.com> wrote:
>>>>>
>>>>>> On Mar 24, 2015, at 10:27 AM, Xinliang David Li <dav...@google.com> wrote:
>>>>>>
>>>>>> 2.3) remove the 'laplace rule of succession' which can be very harmful
>>>>>
>>>>> I have not seen any consensus on this. I’m not at all convinced this would be a good idea. From what I’ve seen, using LaPlace’s rule avoids really bad behavior in cases where the counts are very small and has almost no impact when the counts are large.
>>>>
>>>> Duncan and I chatted briefly about this on IRC. I'm going to
>>>> experiment with only applying laplace smoothing if any scaled weights
>>>> are 0. AFAIU, usage of this rule is really just trying to avoid having
>>>> downstream users deal with 0 weights. Duncan, please whack me with a
>>>> clue stick if I totally misrepresented our conclusions.
>>>
>>> Zero is the extreme case, but it’s also important for other small counts. I’d like to see some specific examples of why you think the current behavior is harmful.
>>
>> Using the rule has at least the following bad side effects:
>> 1) wrong estimation of average loop trip count -- possibly leading to
>> wrong unroll decisions
>
> I remain skeptical. If the loop only ran once in your profile run, how much confidence do you have in the trip count? The LaPlace rule will tend to make the compiler more conservative in exactly the cases where there is insufficient information to be confident about what to do.

Training runs are expensive; it is common for PGO users to reduce the
workload as much as possible without sacrificing its
representativeness.

Workload 1: loop run once, iterating 100 times in total
workload 2: loop run twice, iterating 200 times in total

Which workload will user prefer to use?

Without the rule, the two workloads at least produce consistent
profile data. With the Laplace rule, you get 50 in one case and 66
in the other.
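
The arithmetic behind those numbers, as a sketch (the exact rounding in
the real computation may differ slightly):

#include <cstdint>

// Average trip count estimated as back-edge weight over exit weight,
// with Laplace's +1 applied to each weight.
double laplaceTripCount(uint64_t BackEdgeCount, uint64_t ExitCount) {
  return double(BackEdgeCount + 1) / double(ExitCount + 1);
}
// Workload 1: laplaceTripCount(100, 1) == 50.5  (the "50" above)
// Workload 2: laplaceTripCount(200, 2) == 67.0  (the "66" above, up to rounding)
// Without the +1 fuzz, both workloads would give 100.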

Having some technology to improve confidence in the profile data is
fine, but I don't see
1) how the Laplace rule is good for it
2) why this cannot be done on the consumer side (i.e., faithfully
record the profile data).


>
>> 2) result in bad inlining decisions. For instance:
>> for (...)
>> bar(); // (1)
>>
>> where (1) is the only callsite to bar(). Using the rule, BB count
>> enclosing the call to bar() can be as low as half of the entry count
>> of bar(). Inliner will get confused and think there are more hot
>> callsites to 'bar' and make suboptimal decisions ..
>>
>> Also if bar has calls to other functions, those callsites will look
>> hotter than the call to 'bar' …
>
> Your own proposal for recording entry counts is to record “relative hotness”, not absolute profile counts.

The proposal is to record 'global hotness' that can be used to compare
relative hotness across procedural boundaries (e.g. callsites in
different callers). Profile counts satisfy this condition.

>On the caller’s side, we’ve got a branch weight influenced by LaPlace’s rule that is then used to compute BlockFrequency, and you’re concerned about a mismatch between that and the “relative hotness” recorded for the callee??

Basically, say the caller is test():

bar() {
  // ENTRY count = 100 (from profile data)
  // ENTRY freq = 1

  // BB2: Freq(BB2) = 1, count = 100
  foo();  // (2)
}

test() {
  // ENTRY count = 1 (from profile data)
  // ENTRY freq = 1
  for (i = 0; i < 100; i++) {
    // BB1: Freq(BB1) = 50 due to Laplace rule
    bar();  // Freq = 50, count = 50  (1)
  }
}

With the Laplace rule, the block freq computed for the BB enclosing the
call to bar() will be wrong -- and as a result, that BB's count will be
wrong too: 50 * 1 / 1 = 50.

The global hotness of callsites (1) and (2) should be the same, but it
is distorted when the Laplace rule is used.

Yes, we care about using PGO across routine boundaries for IPO.

>
>>
>> The attached are two cases as well as the frequency graph computed
>> today (with the laplace rule) and the correct frequency expected.
>
> I’d be a lot more interested to see a real-world example.

See my reply above. On the other hand, I'd like to see examples where
LaPlace Rule can actually help improve the profile data quality.

thanks,

David

Ivan Baev

unread,
Mar 25, 2015, 8:59:02 PM3/25/15
to Xinliang David Li, LLVM Developers Mailing List
Hi David,
The plan makes sense. A question on:

>> 4) implement frequency/profile update interfaces for Inline
>> transformations -- i.e., allowing importing callee's profile info into
>> caller with scaling factor computed from callsite count and callee
>> entry count

Do you envision the following implementation for it? Say we've decided to
inline callee B at callsite C in caller A. In addition to the current
inlining update work, we need to:

1. get block frequency BF of the basic block containing the callsite C by
invoking BlockFrequency on function A;
2. scale BF with the function entry count of A, to get scaledBF
3. decrease function entry count for B with scaledBF
4. compute and scale block frequencies of the basic blocks of B inlined
into A, or re-invoke BlockFrequency on the modified function A.

Thanks,
Ivan

>
> Date: Tue, 24 Mar 2015 13:57:30 -0700
> From: Philip Reames <list...@philipreames.com>
> To: Xinliang David Li <dav...@google.com>
> Cc: LLVM Developers Mailing List <llv...@cs.uiuc.edu>
> Subject: Re: [LLVMdev] RFC - Improvements to PGO profile support
>
>
> On 03/24/2015 10:27 AM, Xinliang David Li wrote:
>> Diego and I have discussed this according to the feedback received. We
>> have revised plan for this (see Diego's last reply). Here is a more
>> detailed re-cap:
>>
>> 1) keep MD_prof definition as it is today; also keep using the
>> frequency propagation as it is (assuming programs with irreducible
>> loops are not common and not important. If it turns out to be
>> otherwise, we will revisit this).
>> 2) fix all problems that lead to wrong 'frequency/count' computed from
>> the frequency propagation algorithm
>> 2.1) relax 32bit limit
>> 2.2) remove 4096 iteration count cap
>> 2.3) remove the 'laplace rule of succession' which can be very
>> harmful
>> 2.4) Fix the bug in 'BranchProbabilityInfo::calcMetadataWeights'
>> that does capping without scaling
>> 2.5) Fix a similar bug in
>> 'MachineBranchProbabilityInfo::getSumForBlock('
>> .. other bugs found
>> 3) introduce a new meta data to represent function entry global
>> hotness (the implementation can choose to use entry execution count)
>> 4) implement frequency/profile update interfaces for Inline
>> transformations -- i.e., allowing importing callee's profile info into
>> caller with scaling factor computed from callsite count and callee
>> entry count
>> 5) Implement mechanism to compute, record and use profile program
>> summary data (This will be discussed separately).
> This sounds like a very workable approach to me. I think 2.3 needs a
> bit more discussion, but given it doesn't really directly affect me, I'm
> probably going to stay out of it for now.
>
> Thanks for working on this!
>>
>>

Bob Wilson

unread,
Mar 26, 2015, 12:37:25 AM3/26/15
to Xinliang David Li, LLVM Developers Mailing List
Which workload is better? I don’t at all trust users to get this right, at least for real, non-benchmark code.


Without the rule, the two workloads at least produce consistent
profile data. With the Laplace rule, you get 50 in one case and 66
in the other.

Yes, but you’ve got more information in one case than the other. This is a feature IMO, not a bug. It’s entirely possible that with workload 2, the loop may have executed for a drastically different number of iterations. The fact that it did not, i.e., that it was consistent with workload 1, is more information that you did not have before. It makes sense for the compiler to be more aggressive when it has more data.


Having some technology to improve confidence of the profile data is
fine, but I don't see
1) how laplace rule is good for it

What do you not understand about it? As the counts get larger, LaPlace’s rule fades into the noise. It only makes a difference for cases where some of the counts are *very* small, and in those cases, it very simply adjust the weights to make optimizations less aggressive.

2) why this cannot be done on the consumer side (i.e., faithfully
record the profile data).

What does this have to do with how faithfully the profile is recorded? We’ve got fully accurate data, but if the profiling inputs are too small or not representative, you may still get poor optimization choices.
I understand the issue, but my point was that you should simply not do that. You’re objecting to LaPlace’s rule based on a hypothetical comparison of block frequencies vs. entry counts. There is nothing in LLVM that does that now. We don’t even have entry counts.

I don’t see how you can argue that LaPlace’s rule is bad because it could affect an apples vs. oranges comparison of something that does not even exist yet.




The attached are two cases as well as the frequency graph computed
today (with the laplace rule) and the correct frequency expected.

I’d be a lot more interested to see a real-world example.

See my reply above. On the other hand, I'd like to see examples where
LaPlace Rule can actually help improve the profile data quality.

It’s not about improving the results — it’s about preventing clang from being overly aggressive about optimizing based on limited profile data.

Xinliang David Li

unread,
Mar 26, 2015, 12:49:06 AM3/26/15
to Ivan Baev, LLVM Developers Mailing List
On Wed, Mar 25, 2015 at 5:54 PM, Ivan Baev <ib...@codeaurora.org> wrote:
> Hi David,
> The plan makes sense. A question on:
>
>>> 4) implement frequency/profile update interfaces for Inline
>>> transformations -- i.e., allowing importing callee's profile info into
>>> caller with scaling factor computed from callsite count and callee
>>> entry count
>
> Do you envision the following implementation for it? Say we've decided to
> inline callee B at callsite C in caller A. In addition to the current
> inlining update work, we need to:
>
> 1. get block frequency BF of the basic block containing the callsite C by
> invoking BlockFrequency on function A;
> 2. scale BF with the function entry count of A, to get scaledBF
> 3. decrease function entry count for B with scaledBF
> 4. compute and scale block frequencies of the basic blocks of B inlined
> into A, or re-invoke BlockFrequency on the modified function A.
>

The update should work for both PGO and non-PGO. Without profile
feedback, the callee's BF also needs to be re-scaled and merged with
the caller's BF (since we cannot afford to recompute BF for the caller
on every inline).

The assumption is that with the new pass manager ready, BF info for
caller A and callee B can co-exist.

Assuming BB is a basic block in callee B, BB' is the cloned block of
BB in the inline instance of B. Freq_A represents block frequency in
caller A, and Freq_B represents block frequency in B.

Without PGO, the incremental update is:

  Freq_A(BB') = Freq_B(BB) * Freq_A(C) / Freq_B(Entry_B)

With PGO, the frequency update is the same, but with an additional
update to the callee's entry count:

  Count(Entry_B) = Count(Entry_B) - Count(Callsite_C)
                 = Count(Entry_B) - Count(Entry_A) * Freq_A(Callsite_C) / Freq_A(Entry_A)
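
A minimal sketch of these two updates, with illustrative names (not an
existing LLVM API):

#include <cstdint>

// Frequency of a cloned block BB' in caller A, derived from callee B's
// relative frequencies and the callsite's frequency in A.
uint64_t clonedBlockFreq(uint64_t FreqB_BB, uint64_t FreqA_Callsite,
                         uint64_t FreqB_Entry) {
  return FreqB_BB * FreqA_Callsite / FreqB_Entry;
}

// With PGO, the callee's entry count drops by the callsite's count,
// which is itself recovered from the caller's entry count and freqs.
uint64_t updatedCalleeEntryCount(uint64_t CountB_Entry, uint64_t CountA_Entry,
                                 uint64_t FreqA_Callsite,
                                 uint64_t FreqA_Entry) {
  uint64_t CallsiteCount = CountA_Entry * FreqA_Callsite / FreqA_Entry;
  return CountB_Entry > CallsiteCount ? CountB_Entry - CallsiteCount : 0;
}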

David

Xinliang David Li

unread,
Mar 26, 2015, 1:50:20 AM3/26/15
to Bob Wilson, LLVM Developers Mailing List
Bob,

> Which workload is better? I don’t at all trust users to get this right, at
> least for real, non-benchmark code.

We do have a lot of users (real world apps) who can get this right --
I am not joking ;)

>
>
> Without the rule, the two workload at least produces consistent
> profile data. With the Laplace rule, you get 50 in one case, and 66
> in the other.
>
>
> Yes, but you’ve got more information in one case than the other. This is a
> feature IMO, not a bug. It’s entirely possible that with workload 2, the
> loop may have executed for a drastically different number of iterations. The
> fact that it did not, i.e., that it was consistent with workload 1, is more
> information that you did not have before. It makes sense for the compiler to
> be more aggressive when it has more data.

But the decision by the compiler is arbitrary and not necessarily
correct. For instance, the single run used in training may have
actually executed far fewer iterations than average. With the
Laplace rule, the iteration count becomes even smaller. My point is
that there is no way for the compiler to tell how good the data is, nor
is the compiler in a good position to make that judgement. By so doing,
users who carefully prune their workload to reduce runtime get
punished for no reason.


>
>
> Having some technology to improve confidence of the profile data is
> fine, but I don't see
> 1) how laplace rule is good for it
>>
> What do you not understand about it? As the counts get larger, LaPlace’s
> rule fades into the noise. It only makes a difference for cases where some
> of the counts are *very* small, and in those cases, it very simply adjust
> the weights to make optimizations less aggressive.

Strictly speaking, in a loop context, it just makes optimizations
assume shorter trip counts.

>
> 2) why this can not be done in the consumer side (i.e., faithfully
> record the profile data).
>
>
> What does this have to do with how faithfully the profile is recorded? We’ve
> got fully accurate data, but if the profiling inputs are too small or not
> representative, you may still get poor optimization choices.

The point is that there is no need to adjust the weights. It is very
easy to check the loop header's profile count to determine how much
confidence you want to give it (possibly controlled with a flag). Control
at that level is more fine-grained than blindly changing the
weights right after reading the profile data.

I am not sure what you mean by 'hypothetical comparison of block
frequencies vs entry counts', but it does not seem to be what I mean.
What I mean is that

1) We need a metric to represent global hotness. Profile (execution)
count fits the bill.
2) There are two ways to compute profile counts for BBs:
  a) Directly compute them from the edge counts recorded in the profile
data (BB frequency can then be directly scaled from them), but this
requires slightly changing MD_prof's meaning or introducing MD_count
to record edge counts without capping/scaling.

  b) Just record the entry profile count (a minimal change), but do
not change MD_prof. This approach reuses block frequency
propagation, but the latter relies on unaltered branch
probabilities/weights in order to precisely recompute the counts
(combined with the entry count).

Since people have concerns about a), we chose b). For b), I merely
pointed out in the above example that with the Laplace rule, the
recomputed profile count at the only callsite of 'Bar' can be greatly
different from Bar's recorded entry profile count. The incoming
callsites' profile distribution can be a good signal for inlining
decisions, so such a difference is bad.

>
> I don’t see how you can argue that LaPlace’s rule is bad because it could
> affect an apples vs. oranges comparison of something that does not even
> exist yet.
>

Of course, PGO support for IPA is exactly the missing (and very
important) piece we plan to add -- if it already existed, there would
be no problems.

thanks,

David


>
>
>
> The attached are two cases as well as the frequency graph computed
> today (with the laplace rule) and the correct frequency expected.
>
>
> I’d be a lot more interested to see a real-world example.
>
>
> See my reply above. On the other hand, I'd like to see examples where
> LaPlace Rule can actually help improve the profile data quality.
>
>
> It’s not about improving the results — it’s about preventing clang from
> being overly aggressive about optimizing based on limited profile data.


Hayden Livingston

unread,
May 7, 2015, 3:58:09 AM5/7/15
to Xinliang David Li, LLVM Developers Mailing List
Can you tell us if you're continuing to use the same approach as
described in one of the LLVM meetings, i.e. instrument at the clang
AST level?

Also, do you generate GCOV files, some yaml, or is this a separate format?

And finally, in the meeting you described how you assign counters to
the blocks -- an algorithm to minimize the number of insertions. Is that
algorithm a well-known one or a custom one? Is it described
somewhere?

Bob Wilson

unread,
May 7, 2015, 11:47:21 AM5/7/15
to Hayden Livingston, Xinliang David Li, LLVM Developers Mailing List

> On May 7, 2015, at 12:55 AM, Hayden Livingston <halivi...@gmail.com> wrote:
>
> Can you tell us if you're continuing to use the same approach as
> described in one of the LLVM meetings, i.e. instrument at the clang
> AST level?

Yes, that is the approach we’re using.

>
> Also, do you generate GCOV files, some yaml, or is this a separate format?

It is a separate format. The code for reading/writing profile data is in compiler-rt/lib/profile/.

There is also a separate format for mapping the profile data back to source locations for code coverage testing. Details here: http://www.llvm.org/docs/CoverageMappingFormat.html

>
> And finally in the meeting you had given how you assign counters to
> the blocks, an algorithm to minimize the number of insertions. Is that
> algorithm a well-known one or a custom one? Is that described
> somewhere?

It is a custom approach. I don’t think we have a written description but the code is pretty straightforward. Look at ComputeRegionCounts in clang’s lib/CodeGen/CodeGenPGO.cpp source file.

Xinliang David Li

unread,
May 7, 2015, 12:17:54 PM5/7/15
to Bob Wilson, LLVM Developers Mailing List
On Thu, May 7, 2015 at 8:43 AM, Bob Wilson <bob.w...@apple.com> wrote:
>
>> On May 7, 2015, at 12:55 AM, Hayden Livingston <halivi...@gmail.com> wrote:
>>
>> Can you tell us if you're continuing to use the same approach as
>> described in one of the LLVM meetings, i.e. instrument at the clang
>> AST level?
>
> Yes, that is the approach we’re using.
>

We are still collecting more data, but since the question is asked, I
will share our plan now.

Instrumentation at the Clang AST level is too slow for large C++
applications (the data and explanation will be sent out in a different
thread). To solve that problem, we have introduced a 'late'
instrumentation pass that is done in LLVM. The late instrumentation
uses the exact same runtime support as the current front-end based
instrumentation, but assigns edge counters based on minimal spanning
trees. We have seen very promising results. Note that this is not
intended to replace the current front-end based instrumentation; it is
an alternative, enabled under an option.
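
The spanning-tree trick is the classic one from edge profiling (often
credited to Knuth, and used by Ball and Larus for optimal edge
profiling): only the edges off a spanning tree of the CFG get counters,
and the remaining edge counts are recovered afterwards by flow
conservation. Below is a toy sketch of the recovery step, with made-up
data structures rather than the actual pass:

  #include <cstdint>
  #include <cstdio>
  #include <vector>

  struct Edge { int Src, Dst; bool Known; uint64_t Count; };

  int main() {
    // Diamond CFG: 0->1, 0->2, 1->3, 2->3. The spanning tree covers
    // 0->1, 0->2 and 1->3, so only 2->3 carries a counter (it read 40);
    // the function entry count (100) comes from a function counter.
    std::vector<Edge> E = {{0, 1, false, 0}, {0, 2, false, 0},
                           {1, 3, false, 0}, {2, 3, true, 40}};
    const int NumNodes = 4, Entry = 0, Exit = 3;
    const uint64_t EntryCount = 100;

    // Flow conservation: at each node, incoming flow equals outgoing
    // flow. Whenever a node has exactly one unknown incident edge,
    // that edge's count is forced; iterate until all are solved.
    for (bool Changed = true; Changed;) {
      Changed = false;
      for (int N = 0; N < NumNodes; ++N) {
        uint64_t InSum = (N == Entry) ? EntryCount : 0;
        uint64_t OutSum = (N == Exit) ? EntryCount : 0;
        Edge *Unknown = nullptr;
        int NumUnknown = 0;
        bool UnknownIsIn = false;
        for (Edge &Ed : E) {
          if (Ed.Dst == N) {
            if (Ed.Known) InSum += Ed.Count;
            else { ++NumUnknown; Unknown = &Ed; UnknownIsIn = true; }
          }
          if (Ed.Src == N) {
            if (Ed.Known) OutSum += Ed.Count;
            else { ++NumUnknown; Unknown = &Ed; UnknownIsIn = false; }
          }
        }
        if (NumUnknown == 1) {
          Unknown->Count = UnknownIsIn ? OutSum - InSum : InSum - OutSum;
          Unknown->Known = true;
          Changed = true;
        }
      }
    }
    // Prints 0->1: 60, 0->2: 40, 1->3: 60, 2->3: 40.
    for (const Edge &Ed : E)
      std::printf("%d->%d: %llu\n", Ed.Src, Ed.Dst,
                  (unsigned long long)Ed.Count);
    return 0;
  }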

thanks,

david

Hayden Livingston

unread,
May 7, 2015, 1:01:37 PM5/7/15
to Xinliang David Li, LLVM Developers Mailing List
Thanks Bob and David.

Bob, was GCOV skipped for any specific reason?

David, when you say you have a late instrumentation pass, is that done
at the IR level, which has already been optimized and therefore
changed? How are you mapping back to the source code?

Xinliang David Li

unread,
May 7, 2015, 1:05:28 PM5/7/15
to Hayden Livingston, LLVM Developers Mailing List
On Thu, May 7, 2015 at 9:57 AM, Hayden Livingston
<halivi...@gmail.com> wrote:
> Thanks Bob and David.
>
> Bob, was GCOV skipped for any specific reason?
>
> David, when you say you have a late instrumentation pass, is that done
> at the IR level, which has already been optimized and therefore changed?

Yes, at the IR level, but only after some early optimizations.

> How are you mapping back to the source code?

It uses CFG-based matching. It is not used for code coverage
purposes, so source mapping is less important.

David

Dario Domizioli

unread,
May 22, 2015, 11:20:56 AM5/22/15
to Bob Wilson, Xinliang David Li, LLVM Developers Mailing List
Hi all,

I am a bit confused about the documentation of the format of the profile data file.

The Clang user guide here describes it as an ASCII text file:
http://clang.llvm.org/docs/UsersManual.html#sample-profile-format

Whereas the posts above and the referenced link describe it as a stream of bytes containing LEB128s:
http://www.llvm.org/docs/CoverageMappingFormat.html

From experimenting with the latest trunk I can see the latter is correct (well, at least the file I get is not ASCII text).
Should we update the Clang user guide documentation?
Or am I just getting confused? Are there two formats, one used for coverage and one used for PGO?

Cheers,
    Dario Domizioli
    SN Systems - Sony Computer Entertainment Group



Diego Novillo

unread,
May 22, 2015, 12:00:40 PM5/22/15
to Dario Domizioli, Xinliang David Li, LLVM Developers Mailing List
On Fri, May 22, 2015 at 11:16 AM, Dario Domizioli
<dario.d...@gmail.com> wrote:
> Hi all,
>
> I am a bit confused about the documentation of the format of the profile
> data file.
>
> The Clang user guide here describes it as an ASCII text file:
> http://clang.llvm.org/docs/UsersManual.html#sample-profile-format
>
> Whereas the posts above and the referenced link describe it as a stream of
> bytes containing LEB128s:
> http://www.llvm.org/docs/CoverageMappingFormat.html
>
> From experimenting with the latest trunk I can see the latter is correct
> (well, at least the file I get is not ASCII text).
> Should we update the Clang user guide documentation?
> Or am I just getting confused? Are there two formats, one used for coverage
> and one used for PGO?

You are looking at two unrelated formats. The first URL describes the
sampling profiling format. That is not used for coverage, only
optimization.

There are two main profilers in LLVM. The sampling profiler uses
external profilers (e.g., Linux Perf) to produce sample information
that is then matched to the user code. There is no option to use the
sampling profiler for coverage (it would be a very poor match).

The instrumentation profiler causes Clang to inject tracking code into
the user program. This is the one used for coverage. If you are
interested in coverage, you should read the second URL.
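
To make the distinction concrete, a sampling-based workflow looks
roughly like this (a sketch assuming a Linux host with perf and the
create_llvm_prof converter from the AutoFDO tools; 'perf record -b'
needs branch-record support on the CPU, and details vary by setup):

$ clang++ -O2 -gline-tables-only code.cc -o code
$ perf record -b ./code
$ create_llvm_prof --binary=./code --out=code.prof
$ clang++ -O2 -gline-tables-only -fprofile-sample-use=code.prof code.cc -o code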

I will clarify the documentation for sampling profiles.


Diego.

Dario Domizioli

unread,
May 28, 2015, 1:21:02 PM5/28/15
to Diego Novillo, Xinliang David Li, LLVM Developers Mailing List
Hi Diego,

thanks for clarifying the difference between the two formats. I have noticed the new note in the "Sample Profile Format" section of the Clang guide clarifying that it is different from the coverage format.

So, my further question is... Am I right in understanding that both formats can be used for PGO purposes then?
I have tried the following, as in the Clang user guide:

$ clang++ -O2 -fprofile-instr-generate code.cc -o code
$ LLVM_PROFILE_FILE="code-%p.profraw" ./code
$ llvm-profdata merge -output=code.profdata code-*.profraw
$ clang++ -O2 -fprofile-instr-use=code.profdata code.cc -o code

This produces a PGO-optimized executable which performs differently (in fact, better!) than a normal -O2 build, so I think the "code.profdata" file produced by the commands above is valid.

If I look inside "code.profdata" with a text editor, the file is most definitely not the ASCII-based sampling profile file format. Now I know that this is to be expected because I have used the infrastructure designed for coverage to generate the file.

So, if I understand correctly:
- If you want to do PGO with a sampling profile file that you have somehow generated from data collected by an external profiler, then the format must be the ASCII text one described in the Clang guide.
- However you can also use the infrastructure for coverage, and the file produced by such infrastructure, as an input to PGO (without caring too much about the format at this point, as you don't need to look inside the file).

Is my understanding correct?
In which case I would recommend to add a note to the "Profiling with Instrumentation" section as well, to state that the format produced by "llvm-profdata merge" is not the same as the one detailed just above that section.
I now understand the difference, but I believe a reader approaching this for the first time could misinterpret the guide and assume the instrumentation approach also produces a sampling profile file in the ASCII format.

Cheers,
    Dario Domizioli
    SN Systems - Sony Computer Entertainment Group



Justin Bogner

unread,
May 28, 2015, 2:19:29 PM5/28/15
to Dario Domizioli, Xinliang David Li, LLVM Developers Mailing List

Yes, basically. There are two ways to do PGO with clang: using either
sample-based profiling or instrumentation-based profiling. The on-disk
formats for these two types of profiling have nothing to do with each
other, and the instrumentation-based profiling is also useful for
coverage.
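
Concretely, the two paths are selected with different flags at compile
time (illustrative, reusing the file names from earlier in the thread):

$ clang++ -O2 -fprofile-instr-use=code.profdata code.cc -o code
$ clang++ -O2 -fprofile-sample-use=code.prof code.cc -o code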

> In which case I would recommend to add a note to the "Profiling with
> Instrumentation" section as well, to state that the format produced by
> "llvm-profdata merge" is not the same as the one detailed just above that
> section.

It sounds to me like this documentation just needs some clean up to be
clearer up front about what's going on. If nobody else gets to it first,
I'll take a shot at improving the situation when I have a chance.

Diego Novillo

unread,
May 28, 2015, 2:56:24 PM5/28/15
to Dario Domizioli, Xinliang David Li, LLVM Developers Mailing List
On Thu, May 28, 2015 at 1:10 PM, Dario Domizioli
<dario.d...@gmail.com> wrote:
> Hi Diego,
>
> thanks for clarifying the difference between the two formats. I have noticed
> the new note in the "Sample Profile Format" section of the Clang guide
> clarifying that it is different from the coverage format.
>
> So, my further question is... Am I right in understanding that both formats
> can be used for PGO purposes then?
> I have tried the following, as in the Clang user guide:
>
> $ clang++ -O2 -fprofile-instr-generate code.cc -o code
> $ LLVM_PROFILE_FILE="code-%p.profraw" ./code
> $ llvm-profdata merge -output=code.profdata code-*.profraw
> $ clang++ -O2 -fprofile-instr-use=code.profdata code.cc -o code
>
> This produces a PGO-optimized executable which performs differently (in fact,
> better!) than a normal -O2 build, so I think the "code.profdata" file
> produced by the commands above is valid.
>
> If I look inside "code.profdata" with a text editor, the file is most
> definitely not the ASCII-based sampling profile file format. Now I know that
> this is to be expected because I have used the infrastructure designed for
> coverage to generate the file.
>
> So, if I understand correctly:
> - If you want to do PGO with a sampling profile file that you have somehow
> generated from data collected by an external profiler, then the format must
> be the ASCII text one described in the Clang guide.

Right. Note that this ASCII text format is just one of three formats
accepted by the sampling profiler. There is a more compact binary
representation, and a (yet unsubmitted) GCOV variant that's used by
GCC's sampling profiler.
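
For illustration, the text format is a series of function records: a
header line of the form 'function_name:total_samples:total_head_samples'
followed by indented 'line_offset[.discriminator]: sample_count
[callee:call_count ...]' body lines. A made-up example:

  foo:12345:10
   1: 1000
   2: 9000
   3: 2335 _Z3bari:2000
   5: 10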

However, the fundamental difference is still the same. Regardless of
what file format you use for the sampling profiler, that data is not
suitable for coverage. Only the instrumentation generated with
-fprofile-instr-generate can be used for coverage.

> - However you can also use the infrastructure for coverage, and the file
> produced by such infrastructure, as an input to PGO (without caring too much
> about the format at this point, as you don't need to look inside the file).

Well, you never need to care about the format for inspection. All the
formats are read by llvm-profdata. All you need to care about is that
the data generated by sampling profilers is not really useful for
coverage. Note that it would be ~trivial to use, but the results would
be awful. Sampling is a pretty lossy approach.
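
(For inspection, assuming the current llvm-profdata options, which may
change over time:

$ llvm-profdata show code.profdata          # instrumentation profile
$ llvm-profdata show -sample code.prof      # sampling profile

llvm-profdata show prints a summary of the counters in either case.)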

> In which case I would recommend to add a note to the "Profiling with
> Instrumentation" section as well, to state that the format produced by
> "llvm-profdata merge" is not the same as the one detailed just above that
> section.
> I now understand the difference, but I believe a reader approaching
> this for the first time could misinterpret the guide and assume the
> instrumentation approach also produces a sampling profile file in the
> ASCII format.

I agree. Thanks for pointing this out. I'll re-work this section.

Justin Bogner

unread,
May 28, 2015, 5:59:47 PM5/28/15
to Diego Novillo, Xinliang David Li, LLVM Developers Mailing List
Diego Novillo <dnov...@google.com> writes:
> On Thu, May 28, 2015 at 2:09 PM, Justin Bogner <ma...@justinbogner.com> wrote:
>
>> It sounds to me like this documentation just needs some clean up to be
>> clearer up front about what's going on. If nobody else gets to it first,
>> I'll take a shot at improving the situation when I have a chance.
>
> I just reworked this section a bit in r238504. Dario, does this help?
>
> Justin, could you check me for consistency?

Looks like a nice improvement. Thanks!

Diego Novillo

unread,
May 28, 2015, 6:56:11 PM5/28/15
to Justin Bogner, Xinliang David Li, LLVM Developers Mailing List
On Thu, May 28, 2015 at 2:09 PM, Justin Bogner <ma...@justinbogner.com> wrote:

> It sounds to me like this documentation just needs some clean up to be
> clearer up front about what's going on. If nobody else gets to it first,
> I'll take a shot at improving the situation when I have a chance.

I just reworked this section a bit in r238504. Dario, does this help?

Justin, could you check me for consistency?


Thanks. Diego.

Dario Domizioli

unread,
May 29, 2015, 9:17:29 AM5/29/15
to Justin Bogner, Xinliang David Li, LLVM Developers Mailing List
Hi Diego, Justin.

The reworked section looks much better to me now. It is definitely clearer.

Thanks a lot!

    Dario Domizioli
    SN Systems - Sony Computer Entertainment Group


