[LLVMdev] RFC - Profile Guided Optimization in LLVM


Diego Novillo

Jun 12, 2013, 5:23:14 PM
to llv...@cs.uiuc.edu

I have started looking at the state of PGO (Profile Guided Optimization) in LLVM.
  I want to discuss my high-level plan and make sure I'm not leaving anything interesting out.  I appreciate any feedback on this: pointers to existing work, patches, and anything else related to PGO in LLVM.

I will be keeping changes to this plan in this web document

https://docs.google.com/document/d/1b2XFuOkR2K-Oao4u5fR3a9Ok83IB_W4EJWVmNak4GRE/pub

At a high-level, I would like the PGO harness to contain the following modules:


Profile generators

These modules represent sources of profile.  Mostly, they work by instrumenting the user program to make it produce profile information.  However, other sources of profile information (e.g., samples, hardware counters, static predictors) would be supported.
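The instrumentation-based generators can be pictured with a minimal sketch: the compiler gives each branch edge a counter, the instrumented program bumps the counters as it runs, and the counts are dumped at exit. The counter table and the edge numbering below are hypothetical, purely for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-edge counter table the compiler would emit for this
// function; a real generator numbers edges systematically.
static uint64_t EdgeCounters[2];

static int absValue(int x) {
  if (x < 0) {
    ++EdgeCounters[0]; // edge: entry -> negative case
    return -x;
  }
  ++EdgeCounters[1];   // edge: entry -> non-negative case
  return x;
}
```

At program exit the counter table would be serialized to a profile file, which the analysis side of the harness later reads back.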


Profile Analysis Oracles

Profile information is loaded into the compiler and translated into analysis data which the optimizers can use.  These oracles become the one and only source of profile information used by transformations.  Direct access to the raw profile data generated externally is not allowed.


Translation from profile information into analysis can be done by adding IR metadata or altering compiler internal data structures directly.  I prefer IR metadata because it simplifies debugging, unit testing and bug reproduction.
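As a hypothetical illustration of the IR-metadata approach, branch profile data could surface as a !prof annotation attaching weights to a conditional branch. The weights below are made up, and the exact metadata syntax varies across LLVM versions:

```llvm
define i32 @abs(i32 %x) {
entry:
  %cmp = icmp slt i32 %x, 0
  br i1 %cmp, label %neg, label %done, !prof !0
neg:
  %sub = sub i32 0, %x
  br label %done
done:
  %r = phi i32 [ %sub, %neg ], [ %x, %entry ]
  ret i32 %r
}
; hypothetical counts: the negative branch was taken 4 times, not taken 124
!0 = !{!"branch_weights", i32 4, i32 124}
```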


Analyses should be narrow in the specific type of information they provide (e.g., branch probability) and there should not be two different analyses that provide overlapping information.  We could later provide broader analysis types by aggregating the existing ones.


Transformations

Transformations should naturally take advantage of profile information by consulting the analyses.  The better information they get from the analysis oracles, the better their decisions.


My plan is to start by making sure that the infrastructure exists and provides the basic analyses.

I have two primary goals in this first phase:


  1. Augment the PGO infrastructure where required.

  2. Fix existing transformations that are not taking advantage of profile data.



In evaluating and triaging the existing infrastructure, I will use test cases taken from GCC’s own testsuite, a collection of Google’s internal applications and any other code base folks consider useful.


In using GCC’s testsuite, my goal is not to mimic how GCC does its work, but make sure that the two compilers implement functionally equivalent transformations.  That is, make sure that LLVM is not leaving optimization opportunities behind.


This may require implementing missing profile functionality. From a brief inspection of the code, most of the major ones seem to be there (edge, path, block).  But I don’t know what state they are in.


Some of the properties I would like to maintain or add to the current framework:


  • Profile data is never accessed directly by analyses and transformations.  Rather, it is translated into IR metadata.

  • Graceful degradation in the presence of stale profiles.  Old profile data should only result in degraded optimization opportunities.  It should neither confuse the compiler nor cause erroneous code generation.


After the basic profile-based transformations are working, I would like to add new sources of profile.  Mainly, I am thinking of implementing Auto FDO. FDO stands for Feedback Directed Optimization (both PGO and FDO tend to be used interchangeably in the GCC community).  In this scheme, the compiler does not instrument the code.  Rather, it uses an external sample collection tool (e.g., perf) to collect samples from the program’s execution.  These samples are then converted to the format that the instrumented program would’ve emitted.
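The conversion step could be sketched as a toy aggregation (not the real tool): samples, mapped back to source lines via debug info, are accumulated into the same kind of counts an instrumented build would have produced. The line numbers used below are made up.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Toy sketch of the AutoFDO conversion step.  A real converter reads
// perf output and debug info; here samples arrive pre-mapped to lines.
struct SampleProfile {
  std::map<unsigned, uint64_t> LineCounts; // source line -> sample count

  void addSample(unsigned Line) { ++LineCounts[Line]; }

  uint64_t countFor(unsigned Line) const {
    auto It = LineCounts.find(Line);
    return It == LineCounts.end() ? 0 : It->second;
  }
};
```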


In terms of optimizations, our (Google) experience is that inlining is the key beneficiary of profile information. Particularly, in big C++ applications. I expect to focus most of my attention on the inliner.



Thanks.  Diego.

Diego Novillo

Jun 12, 2013, 5:32:46 PM
to llv...@cs.uiuc.edu
On 2013-06-12 17:23 , Diego Novillo wrote:

I have started looking at the state of PGO (Profile Guided Optimization) in LLVM.
  I want to discuss my high-level plan and make sure I'm not leaving anything interesting out.  I appreciate any feedback on this: pointers to existing work, patches, and anything else related to PGO in LLVM.

Good grief.  A whole lot of fail in my cut-n-paste job.  Apologies. 

I will be keeping changes to this plan in this web document

https://docs.google.com/document/d/1b2XFuOkR2K-Oao4u5fR3a9Ok83IB_W4EJWVmNak4GRE/pub

You can read from the above or this:


At a high-level, I would like the PGO harness to contain the following modules:

Profile generators

These modules represent sources of profile.  Mostly, they work by instrumenting the user program to make it produce profile information.  However, other sources of profile information (e.g., samples, hardware counters, static predictors) would be supported.


Profile Analysis Oracles

Profile information is loaded into the compiler and translated into analysis data which the optimizers can use.  These oracles become the one and only source of profile information used by transformations.  Direct access to the raw profile data generated externally is not allowed.

Translation from profile information into analysis can be done by adding IR metadata or altering compiler internal data structures directly.  I prefer IR metadata because it simplifies debugging, unit testing and bug reproduction.

Analyses should be narrow in the specific type of information they provide (e.g., branch probability) and there should not be two different analyses that provide overlapping information.  We could later provide broader analysis types by aggregating the existing ones.



Transformations

Transformations should naturally take advantage of profile information by consulting the analyses.  The better information they get from the analysis oracles, the better their decisions.


My plan is to start by making sure that the infrastructure exists and provides the basic analyses.
I have two primary goals in this first phase:

1- Augment the PGO infrastructure where required.
2- Fix existing transformations that are not taking advantage of profile data.



In evaluating and triaging the existing infrastructure, I will use test cases taken from GCC’s own testsuite, a collection of Google’s internal applications and any other code base folks consider useful.

In using GCC’s testsuite, my goal is not to mimic how GCC does its work, but make sure that the two compilers implement functionally equivalent transformations.  That is, make sure that LLVM is not leaving optimization opportunities behind.

This may require implementing missing profile functionality. From a brief inspection of the code, most of the major ones seem to be there (edge, path, block).  But I don’t know what state they are in.

Some of the properties I would like to maintain or add to the current framework:

* Profile data is never accessed directly by analyses and transformations.  Rather, it is translated into IR metadata.

* Graceful degradation in the presence of stale profiles.  Old profile data should only result in degraded optimization opportunities.  It should neither confuse the compiler nor cause erroneous code generation.

Jakob Stoklund Olesen

Jun 12, 2013, 5:55:26 PM
to Diego Novillo, llv...@cs.uiuc.edu

On Jun 12, 2013, at 2:23 PM, Diego Novillo <dnov...@google.com> wrote:

> In terms of optimizations, our (Google) experience is that
> inlining is the key beneficiary of profile information.
> Particularly, in big C++ applications. I expect to focus most
> of my attention on the inliner.

That sounds plausible to me. It seems like we might need a way of representing call graph profiling in addition to the existing branch probabilities?

FWIW, the greedy register allocator’s live range splitting algorithm is designed to consume profile information so it can push spill code into cold blocks. The primary interface is SpillPlacement::getBlockFrequency() which currently returns an estimate based on loop depth only.
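A loop-depth-only estimate of the kind described might look roughly like the sketch below. The 10x-per-nesting-level factor is an assumption for illustration, not the exact formula the allocator uses.

```cpp
#include <cassert>

// Hypothetical static block frequency estimate from loop depth alone:
// each level of loop nesting is assumed to make a block ~10x hotter.
static float estimateBlockFrequency(unsigned LoopDepth) {
  float Freq = 1.0f;
  for (unsigned i = 0; i < LoopDepth; ++i)
    Freq *= 10.0f;
  return Freq;
}
```

An estimate like this knows nothing about cold inner loops or branch hints, which is exactly the gap real profile data fills.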

/jakob


_______________________________________________
LLVM Developers mailing list
LLV...@cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Chandler Carruth

Jun 12, 2013, 6:05:14 PM
to Jakob Stoklund Olesen, llv...@cs.uiuc.edu
On Wed, Jun 12, 2013 at 2:55 PM, Jakob Stoklund Olesen <stok...@2pi.dk> wrote:
That sounds plausible to me. It seems like we might need a way of representing call graph profiling in addition to the existing branch probabilities?

Agreed. An important consideration here is WPO vs. LTO vs. TU-at-a-time call graphs.
 
FWIW, the greedy register allocator’s live range splitting algorithm is designed to consume profile information so it can push spill code into cold blocks. The primary interface is SpillPlacement::getBlockFrequency() which currently returns an estimate based on loop depth only.

It doesn't use MachineBlockFrequency? If it does, it will get a lot more than loop depth: __builtin_expect, the cold function attribute, and static branch heuristics. If it doesn't, it should, and then it will immediately benefit from this.
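For reference, the static hints listed above look like this at the source level (a sketch using the GCC/Clang extensions; the function names are made up). The cold attribute marks the handler as rarely executed, and __builtin_expect marks the error branch as unlikely:

```cpp
#include <cassert>

// Rarely-executed error path, marked cold so the compiler deprioritizes it.
__attribute__((cold, noinline)) static int handleError(int code) {
  return -code;
}

static int checkedDivide(int a, int b) {
  if (__builtin_expect(b == 0, 0)) // branch statically predicted not-taken
    return handleError(1);
  return a / b;
}
```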

Chandler Carruth

Jun 12, 2013, 6:05:38 PM
to Jakob Stoklund Olesen, llv...@cs.uiuc.edu

On Wed, Jun 12, 2013 at 3:05 PM, Chandler Carruth <chan...@google.com> wrote:
FWIW, the greedy register allocator’s live range splitting algorithm is designed to consume profile information so it can push spill code into cold blocks. The primary interface is SpillPlacement::getBlockFrequency() which currently returns an estimate based on loop depth only.

It doesn't use MachineBlockFrequency?

Err, MachineBlockFrequencyInfo -- the analysis pass. Sorry.

Jakob Stoklund Olesen

Jun 12, 2013, 6:10:21 PM
to Chandler Carruth, llv...@cs.uiuc.edu
It predates the block frequency interface. It just needs to be hooked up, patches welcome. It would also be nice to remove the floating point computations from the spill placement code.

Thanks,

Chandler Carruth

Jun 12, 2013, 6:15:31 PM
to Jakob Stoklund Olesen, llv...@cs.uiuc.edu

On Wed, Jun 12, 2013 at 3:10 PM, Jakob Stoklund Olesen <stok...@2pi.dk> wrote:
It predates the block frequency interface. It just needs to be hooked up, patches welcome. It would also be nice to remove the floating point computations from the spill placement code.

Cool, if Diego doesn't beat me to it, I may send you a patch as that seems easy and obviously beneficial.

Jakob Stoklund Olesen

Jun 12, 2013, 6:25:03 PM
to Chandler Carruth, llv...@cs.uiuc.edu
Sounds good.

The only complication is that the floats are spilling into RAGreedy where they are used in the cost model. I think they can simply be replaced with BlockFrequency everywhere.

If the block frequencies make sense, this should also be unnecessary:

struct SpillPlacement::Node {
  /// Scale - Inverse block frequency feeding into[0] or out of[1] the bundle.
  /// Ideally, these two numbers should be identical, but inaccuracies in the
  /// block frequency estimates means that we need to normalize ingoing and
  /// outgoing frequencies separately so they are commensurate.
  float Scale[2];

Xinliang David Li

Jun 12, 2013, 7:26:39 PM
to Diego Novillo, llv...@cs.uiuc.edu

After the basic profile-based transformations are working, I would like to add new sources of profile.  Mainly, I am thinking of implementing Auto FDO.

For those who are not familiar with it -- Auto FDO was originally called Sample Based FDO. Its main author is Dehao Chen @google, and Robert Hundt is one of the main pushers of the technology at Google. The latest incarnation of this technology uses LBR events available on Nehalem and above.  http://www.computer.org/csdl/trans/tc/2013/02/ttc2013020376-abs.html

Cheers,

David

Diego Novillo

Jun 13, 2013, 1:12:03 PM
to Chandler Carruth, llv...@cs.uiuc.edu
Unless you're in a hurry, I'd rather tackle this one myself.  Particularly considering that I've no idea what you two are yapping about, so it will be a good learning experience.

Chad Rosier

Jun 14, 2013, 12:57:56 PM
to Diego Novillo, llv...@cs.uiuc.edu
Bob and I were discussing this over lunch yesterday and he has some very good ideas, so I've CC'ed him to make sure he joins the thread.

 Chad

Krzysztof Parzyszek

Jun 14, 2013, 1:56:00 PM
to llv...@cs.uiuc.edu
On 6/12/2013 4:55 PM, Jakob Stoklund Olesen wrote:
>
> It seems like we might need a way of representing call graph profiling in addition to the existing branch probabilities?
>

Metadata on function definitions/declarations perhaps?

I agree---call graph profiling is of critical importance. I guess we should first see what types of data we already know may be useful, and how we would represent them within the compiler.

Here's my list (in addition to the usual data gathered in such cases):

* Function call frequency:
  - How many times has a function been called?

* Function argument value profiling:
  - Is the function called with a small set of values for the parameters?
  - Is there any value that is particularly frequent for any of the arguments?
  - If the function takes pointers, are the pointers usually aligned?
  - If the function takes pointers, are they aliased?
  - etc.

* Branch profiling:
  - Is the branch executed (i.e. taken/not-taken) in the same way as the last time, or does it often change direction?

* Path profiling:
  - Correlation between branches, function calls, etc.


-Krzysztof


--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

Mark Lacey

Jun 14, 2013, 3:10:44 PM
to Krzysztof Parzyszek, llv...@cs.uiuc.edu

On Jun 14, 2013, at 10:56 AM, Krzysztof Parzyszek <kpar...@codeaurora.org> wrote:
>
> * Function argument value profiling:
> - Is the function called with a small set of values for the parameters?
> - Is there any value that is particularly frequent for any of the arguments?
> - If the function takes pointers, are the pointers usually aligned?
> - If the function takes pointers, are they aliased?
> - etc.

More generally it can be useful to collect value data for things like:
- switch values, in order to split off one or more dominant values for an explicit check prior to a table look-up or binary-search through all the cases
- function pointers, to make it possible to explicitly check against a very common pointer value so that the generated code can do a direct call to, or inline, the specific function (including virtual functions)
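The function-pointer case can be sketched like this (all names are hypothetical): if the profile says most calls through Fp hit fastPath, the compiler can guard a direct, inlinable call with an explicit pointer comparison.

```cpp
#include <cassert>

// Sketch of profile-guided indirect call promotion.
typedef int (*Handler)(int);

static int fastPath(int x) { return x + 1; }
static int slowPath(int x) { return x * 2; }

static int dispatch(Handler Fp, int x) {
  if (Fp == &fastPath)  // common target learned from the value profile
    return fastPath(x); // direct call; now a candidate for inlining
  return Fp(x);         // fallback: the original indirect call
}
```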

> * Path profiling:
> - Correlation between branches, function calls, etc.

If interprocedural path profiling information is gathered, that should also be usable as a source for generating context-sensitive call graph profile data.

Evan Cheng

Jun 15, 2013, 2:18:15 PM
to Xinliang David Li, Chandler Carruth, llv...@cs.uiuc.edu
Apple folks are also gearing up to push on the PGO front. We are primarily interested in using instrumentation, rather than sampling, to collect profile info. However, I suspect the way profile data ends up being used in the various optimization and codegen passes will be largely similar. 

There is also some interest in pursuing profile-directed specialization. But that can wait. I think it makes sense for us to get together and discuss our plans to make sure there won't be duplication of effort. 

Evan

Sent from my iPad

Evan Cheng

Jun 15, 2013, 2:20:51 PM
to Mark Lacey, llv...@cs.uiuc.edu
Yes. But I'd like to see us tackle these as part two of the PGO push. The question is what designs we would have to decide on now to prevent re-design / re-implementation later.

Evan

Sent from my iPad

Benjamin Kramer

Jun 15, 2013, 4:39:29 PM
to Diego Novillo, llv...@cs.uiuc.edu
I didn't want to interfere with you in any way but I was working on this just this week, though from a completely different background:

In zlib's longest_match() (which happens to contain the hottest loop in deflate()) there's a loop with this layout:

do {
    if (something) continue;

    while (cond && cond && cond && cond && cond) { }

} while (something_else);

The inner while loop is freezing cold. This is one of the cases where the current estimation based on loop depth completely fails while the heuristics BlockFrequency is based on get it right. With the old spill weights we spilled parts of the "something_else" code above that had to be reloaded on every iteration so we could keep more data of the inner loop in registers, causing a huge slowdown.

I hacked up a patch and saw a whopping 5% improvement on deflate, bringing us a lot closer to GCC on this code. Other benchmarks I tried so far looked neutral or positive, but sometimes I saw a large increase in spilling that bloats code size; that still needs to be figured out.

This patch only does the grunt work of getting the BlockFrequency analysis into the spill placement code, it doesn't replace the use of floats in the register allocator. It's possible that some of the downstream calculations need an update to cope with the changed magnitude of the frequencies.

Do you want to take over this effort or should I poke more at it?

- Ben
block-frequency-spilling.patch

Diego Novillo

Jun 17, 2013, 9:54:33 AM
to Evan Cheng, llv...@cs.uiuc.edu
On 2013-06-15 14:18 , Evan Cheng wrote:
Apple folks are also gearing up to push on the PGO front. We are primarily interested in using instrumentation, rather than sampling, to collect profile info. However, I suspect the way profile ended up being used in the various optimization and codegen passes would be largely similar. 


Excellent!  We are initially interested in instrumentation, as well.  This is where we draw most of our performance with GCC. Sampling is showing a lot of promise, however.  And it really is not much different than instrumentation.  Most of what changes is the source of profile data.


There is also some interests in pursuing profile directed specialization. But that can wait. I think it makes sense for us to get together and discuss our plans to make sure there won't be duplication of efforts.

Sure. My initial plan is fairly simple.  Triage the existing instrumentation code and see what needs fixing.  I'm starting this in the next week or so.  What are your plans?


Diego.

Diego Novillo

Jun 17, 2013, 9:56:20 AM
to Benjamin Kramer, llv...@cs.uiuc.edu
On 2013-06-15 16:39 , Benjamin Kramer wrote:
> Do you want to take over this effort or should I poke more at it?

Since you've already started, it's easier if you poke more at it.
Thanks. I've got a whole bunch of other things to go through.


Diego.

Benjamin Kramer

Jun 17, 2013, 10:03:04 AM
to Diego Novillo, Jakob Stoklund Olesen, llv...@cs.uiuc.edu

On 17.06.2013, at 15:56, Diego Novillo <dnov...@google.com> wrote:

> On 2013-06-15 16:39 , Benjamin Kramer wrote:
>> Do you want to take over this effort or should I poke more at it?
>
> Since you've already started, it's easier if you poke more at it. Thanks. I've got a whole bunch of other things to go through.

OK, will do.


Jakob any comments on the patch? The only interesting part is in LiveIntervals::getSpillWeight. Do we have to scale the values somehow there?

- Ben

Jakob Stoklund Olesen

Jun 17, 2013, 12:43:54 PM
to Benjamin Kramer, llv...@cs.uiuc.edu

On Jun 17, 2013, at 7:03 AM, Benjamin Kramer <benn...@gmail.com> wrote:

>
> On 17.06.2013, at 15:56, Diego Novillo <dnov...@google.com> wrote:
>
>> On 2013-06-15 16:39 , Benjamin Kramer wrote:
>>> Do you want to take over this effort or should I poke more at it?
>>
>> Since you've already started, it's easier if you poke more at it. Thanks. I've got a whole bunch of other things to go through.
>
> OK, will do.
>
>
> Jakob any comments on the patch? The only interesting part is in LiveIntervals::getSpillWeight. Do we have to scale the values somehow there?

Yes, BlockFrequency::getFrequency() is poorly named; it returns a fixed-point number. I think you should scale it to be relative to the entry block frequency.

+LiveIntervals::getSpillWeight(bool isDef, bool isUse, BlockFrequency freq) {
+ return (isDef + isUse) * freq.getFrequency();
}

This computation can overflow.
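To illustrate the concern: the raw fixed-point frequency can be large enough that multiplying it in 64-bit integer arithmetic wraps, whereas dividing by the entry frequency first yields a small relative value. A sketch with made-up fixed-point numbers:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the scaling fix: convert the raw fixed-point frequency to a
// value relative to the (hypothetical) entry frequency before scaling
// by the number of operands, avoiding integer overflow.
static float spillWeight(bool isDef, bool isUse, uint64_t RawFreq,
                         uint64_t EntryFreq) {
  float Rel = static_cast<float>(RawFreq) / static_cast<float>(EntryFreq);
  return (isDef + isUse) * Rel;
}
```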

@@ -178,9 +180,10 @@ bool SpillPlacement::runOnMachineFunction(MachineFunction &mf) {

// Compute total ingoing and outgoing block frequencies for all bundles.
BlockFrequency.resize(mf.getNumBlockIDs());
+ MachineBlockFrequencyInfo &MBFI = getAnalysis<MachineBlockFrequencyInfo>();
for (MachineFunction::iterator I = mf.begin(), E = mf.end(); I != E; ++I) {
- float Freq = LiveIntervals::getSpillWeight(true, false,
- loops->getLoopDepth(I));
+ float Freq =
+ LiveIntervals::getSpillWeight(true, false, MBFI.getBlockFreq(I));
unsigned Num = I->getNumber();
BlockFrequency[Num] = Freq;

I think you should leave LiveIntervals out of this and just use the block frequency directly, scaled as above. The getSpillWeight() function is just used here to get the non-overflowing frequency approximation.

Otherwise, looks good.

Thanks,
/jakob

Benjamin Kramer

Jun 17, 2013, 1:48:13 PM
to Jakob Stoklund Olesen, llv...@cs.uiuc.edu
[Splitting this out from the original thread to reduce noise in it]


On 17.06.2013, at 18:43, Jakob Stoklund Olesen <stok...@2pi.dk> wrote:

>
> On Jun 17, 2013, at 7:03 AM, Benjamin Kramer <benn...@gmail.com> wrote:
>
>>
>> On 17.06.2013, at 15:56, Diego Novillo <dnov...@google.com> wrote:
>>
>>> On 2013-06-15 16:39 , Benjamin Kramer wrote:
>>>> Do you want to take over this effort or should I poke more at it?
>>>
>>> Since you've already started, it's easier if you poke more at it. Thanks. I've got a whole bunch of other things to go through.
>>
>> OK, will do.
>>
>>
>> Jakob any comments on the patch? The only interesting part is in LiveIntervals::getSpillWeight. Do we have to scale the values somehow there?
>
> Yes, BlockFrequency::getFrequency() is poorly named, it returns a fixpoint number. I think you should scale it to be relative to the entry block frequency.
>
> +LiveIntervals::getSpillWeight(bool isDef, bool isUse, BlockFrequency freq) {
> + return (isDef + isUse) * freq.getFrequency();
> }
>
> This computation can overflow.

Yep, I went down the easy route and converted it to floating point arithmetic. Is that OK here?

>
> @@ -178,9 +180,10 @@ bool SpillPlacement::runOnMachineFunction(MachineFunction &mf) {
>
> // Compute total ingoing and outgoing block frequencies for all bundles.
> BlockFrequency.resize(mf.getNumBlockIDs());
> + MachineBlockFrequencyInfo &MBFI = getAnalysis<MachineBlockFrequencyInfo>();
> for (MachineFunction::iterator I = mf.begin(), E = mf.end(); I != E; ++I) {
> - float Freq = LiveIntervals::getSpillWeight(true, false,
> - loops->getLoopDepth(I));
> + float Freq =
> + LiveIntervals::getSpillWeight(true, false, MBFI.getBlockFreq(I));
> unsigned Num = I->getNumber();
> BlockFrequency[Num] = Freq;
>
> I think you should leave LiveIntervals out of this and just the block frequency directly, scaled as above. The getSpillWeight() function is just used here to get the non-overflowing frequency approximation.

Fixed.

>
> Otherwise, looks good.

The attached patch also fixes the regression tests that failed. Mostly minor changes in register allocation. The SPARC one is a bit scary because an instruction now gets moved across inline assembly into a delay slot. Do you know if that is a safe transformation?

block-frequency-spilling.patch

Jakob Stoklund Olesen

Jun 17, 2013, 2:03:14 PM
to Benjamin Kramer, llv...@cs.uiuc.edu

On Jun 17, 2013, at 10:48 AM, Benjamin Kramer <benn...@gmail.com> wrote:

> [Splitting this out from the original thread to reduce noise in it]
>
>
> On 17.06.2013, at 18:43, Jakob Stoklund Olesen <stok...@2pi.dk> wrote:
>> +LiveIntervals::getSpillWeight(bool isDef, bool isUse, BlockFrequency freq) {
>> + return (isDef + isUse) * freq.getFrequency();
>> }
>>
>> This computation can overflow.
>
> Yep, I went down the easy route and converted it to floating point arithmetic. Is that OK here?

Yes, that should be fine.

+LiveIntervals::getSpillWeight(bool isDef, bool isUse, BlockFrequency freq) {
+ float EntryFreq = BlockFrequency::getEntryFrequency();
+ return (isDef + isUse) * (freq.getFrequency() / EntryFreq);

Nitpick: The float division can be constant folded and turned into a multiplication:

const float Scale = 1.0f / getEntryFrequency();
return (isDef + isUse) * (freq.getFrequency() * Scale);

I wouldn’t trust the compiler to do that without -fast-math.

> The attached patch also fixes the regression tests that failed. Mostly minor changes in register allocation. The SPARC one is a bit scary because an instruction now gets moved across inline assembly into a delay slot. Do you know if that is a safe transformation?

That sounds OK. As long as it is safe to move that instruction across the inline asm.

> <block-frequency-spilling.patch>

LGTM

Chandler Carruth

Jun 17, 2013, 2:43:18 PM
to Jakob Stoklund Olesen, Benjamin Kramer, llv...@cs.uiuc.edu

On Mon, Jun 17, 2013 at 11:03 AM, Jakob Stoklund Olesen <stok...@2pi.dk> wrote:
>> This computation can overflow.
>
> Yep, I went down the easy route and converted it to floating point arithmetic. Is that OK here?

Yes, that should be fine.

I really don't think this is fine. I would like for LLVM to not rely on floating point arithmetic on the host.

Benjamin Kramer

Jun 17, 2013, 2:58:49 PM
to Chandler Carruth, llv...@cs.uiuc.edu
I tend to agree. The piece of code I was referring to is replacing other floating point code and interacts with other floating point code so this is not a regression.

- Ben

Evan Cheng

Jun 17, 2013, 8:48:10 PM
to Diego Novillo, llv...@cs.uiuc.edu
Your short-term plan sounds great. We have some tentative plans to build / improve the instrumentation tools and start enhancing various passes to use profile data. Beyond that, our plan is still vague. Bob, comments?

Evan

>
>
> Diego.

Bob Wilson

Jun 18, 2013, 2:19:42 PM
to Diego Novillo, llv...@cs.uiuc.edu

On Jun 17, 2013, at 6:54 AM, Diego Novillo <dnov...@google.com> wrote:

I've been working on prototyping a new approach to instrumentation for both PGO and code coverage testing. I want to use the same data for both of those uses, and for code coverage we really need to have accurate source location info, including column positions and the starting and ending locations for every block of code. Working backward from the debug info to find source locations hasn't worked very well. The debug info just doesn't have enough detailed information. Instead, I am planning to insert the instrumentation code in the front-end. The raw profile data in this scheme is tied to source locations.

One disadvantage is that it conflicts with the goal of having graceful degradation. Simply adding a blank line invalidates all the following source locations. My plan is to ignore any profile data whenever the source changes in any way. We had some discussions about this, and it is easy to come up with cases where even a trivial change to the source, e.g., enabling a debugging flag, causes massive changes in the profile data. I don't think graceful degradation should even be a goal. It is too hard to sort out insignificant changes from those that should invalidate all the profile data.

I am still at least a few weeks away from having any details to share about this new instrumentation scheme. I want to convince myself that it is viable before making a proposal to add it to LLVM.

I don't know if this new kind of instrumentation will fit everyone's needs for PGO. I hope that it will, but regardless the most important thing is that we standardize on the representation of profile data in the IR. We have a good start on that already with the branch probability metadata. The missing piece is adding per-function metadata to record the execution counts. We had planned to add that along with the branch probabilities but haven't gotten it done yet. You can see the lack of it in BlockFrequency::getEntryFrequency(), which just returns a constant value. We will want that to allow comparing profile data across different functions, which you can't easily get from the branch probabilities alone. I don't think anyone has yet come up with a specific proposal for that.

As long as we agree on the canonical metadata, there should be no problem with supporting different kinds of profile generators, including things like Auto FDO.

I also agree about the importance of getting inlining to use the PGO data. Chandler had looked into that and found that it was really difficult to do it properly with the current pass manager. Has any progress been made on that front?

Eric Christopher

Jun 18, 2013, 5:48:03 PM
to Bob Wilson, llv...@cs.uiuc.edu
On Tue, Jun 18, 2013 at 11:19 AM, Bob Wilson <bob.w...@apple.com> wrote:
>
> On Jun 17, 2013, at 6:54 AM, Diego Novillo <dnov...@google.com> wrote:
>
>> On 2013-06-15 14:18 , Evan Cheng wrote:
>>> Apple folks are also gearing up to push on the PGO front. We are primarily interested in using instrumentation, rather than sampling, to collect profile info. However, I suspect the way profile ended up being used in the various optimization and codegen passes would be largely similar.
>>>
>>
>> Excellent! We are initially interested in instrumentation, as well. This is where we draw most of our performance with GCC. Sampling is showing a lot of promise, however. And it really is not much different than instrumentation. Most of what changes is the source of profile data.
>>
>>> There is also some interests in pursuing profile directed specialization. But that can wait. I think it makes sense for us to get together and discuss our plans to make sure there won't be duplication of efforts.
>>
>> Sure. My initial plan is fairly simple. Triage the existing instrumentation code and see what needs fixing. I'm starting this in the next week or so. What are your plans?
>
> I've been working on prototyping a new approach to instrumentation for both PGO and code coverage testing. I want to use the same data for both of those uses, and for code coverage we really need to have accurate source location info, including column positions and the starting and ending locations for every block of code. Working backward from the debug info to find source locations hasn't worked very well. The debug info just doesn't have enough detailed information.

I'm curious what problems you've had. You've surely not mentioned them
before. Note that I'm not saying that I think this is the best method,
but I'm curious what problems you've had and what you're running into.

> Instead, I am planning to insert the instrumentation code in the front-end. The raw profile data in this scheme is tied to source locations. One disadvantage is that it conflicts with the goal of having graceful degradation. Simply adding a blank line invalidates all the following source locations. My plan is to ignore any profile data whenever the source changes in any way. We had some discussions about this, and it is easy to come up with cases where even a trivial change to the source, e.g., enabling a debugging flag, causes massive changes in the profile data. I don't think graceful degradation should even be a goal. It is too hard to sort out insignificant changes from those that should invalidate all the profile data.
>

Even a comment?

-eric

Bob Wilson

Jun 18, 2013, 6:34:51 PM
to Eric Christopher, llv...@cs.uiuc.edu

On Jun 18, 2013, at 2:48 PM, Eric Christopher <echr...@gmail.com> wrote:

> On Tue, Jun 18, 2013 at 11:19 AM, Bob Wilson <bob.w...@apple.com> wrote:
>>
>> On Jun 17, 2013, at 6:54 AM, Diego Novillo <dnov...@google.com> wrote:
>>
>>> On 2013-06-15 14:18 , Evan Cheng wrote:
>>>> Apple folks are also gearing up to push on the PGO front. We are primarily interested in using instrumentation, rather than sampling, to collect profile info. However, I suspect the way profile ended up being used in the various optimization and codegen passes would be largely similar.
>>>>
>>>
>>> Excellent! We are initially interested in instrumentation, as well. This is where we draw most of our performance with GCC. Sampling is showing a lot of promise, however. And it really is not much different than instrumentation. Most of what changes is the source of profile data.
>>>
>>>> There is also some interest in pursuing profile directed specialization. But that can wait. I think it makes sense for us to get together and discuss our plans to make sure there won't be duplication of effort.
>>>
>>> Sure. My initial plan is fairly simple. Triage the existing instrumentation code and see what needs fixing. I'm starting this in the next week or so. What are your plans?
>>
>> I've been working on prototyping a new approach to instrumentation for both PGO and code coverage testing. I want to use the same data for both of those uses, and for code coverage we really need to have accurate source location info, including column positions and the starting and ending locations for every block of code. Working backward from the debug info to find source locations hasn't worked very well. The debug info just doesn't have enough detailed information.
>
> I'm curious what problems you've had. You've surely not mentioned them
> before. Note that I'm not saying that I think this is the best method,
> but I'm curious what problems you've had and what you're running into.

The main issue is that I want precise begin/end source locations for each statement. Debug info doesn't do that. Even if we enable column information, debug info just gives you one source location for each statement. That's not a problem with debug info -- it's just not what it was intended to be used for.

There's also the issue that I want PGO instrumentation to work the same regardless of the debug info setting.

>
>> Instead, I am planning to insert the instrumentation code in the front-end. The raw profile data in this scheme is tied to source locations. One disadvantage is that it conflicts with the goal of having graceful degradation. Simply adding a blank line invalidates all the following source locations. My plan is to ignore any profile data whenever the source changes in any way. We had some discussions about this, and it is easy to come up with cases where even a trivial change to the source, e.g., enabling a debugging flag, causes massive changes in the profile data. I don't think graceful degradation should even be a goal. It is too hard to sort out insignificant changes from those that should invalidate all the profile data.
>>
>
> Even a comment?

Yes. Obviously you could do that, but it would add significant complexity and little real value.

Eric Christopher

unread,
Jun 18, 2013, 6:41:20 PM6/18/13
to Bob Wilson, llv...@cs.uiuc.edu
> The main issue is that I want precise begin/end source locations for each statement. Debug info doesn't do that. Even if we enable column information, debug info just gives you one source location for each statement. That's not a problem with debug info -- it's just not what it was intended to be used for.
>

True... somewhat. Precise begin/end source locations exist and the
debug info will map them back to regions of code that are associated
with a particular line and column - this should also change when the
statement changes. Anything else should probably be a bug? The only
difference I can see is that there's no concrete "done with this
statement" in the code. It might take some processing work to get "all
statements that executed between these source ranges" because of code
motion effects, but you should be able to get a concrete range.

> There's also the issue that I want PGO instrumentation to work the same regardless of the debug info setting.

Basically it's going to be "line-tables-only" :)

Of course, I'm also not signing up to do the work, just curious what
the limitations are in the infrastructure here.

>
>>
>> Even a comment?
>
> Yes. Obviously you could do that, but it would add significant complexity and little real value.

Mmm.. I think you're underestimating the amount of usefulness here,
but we can revisit when there's some code. :)

-eric

Jakob Stoklund Olesen

unread,
Jun 21, 2013, 2:41:54 PM6/21/13
to Bob Wilson, llv...@cs.uiuc.edu

On Jun 18, 2013, at 11:19 AM, Bob Wilson <bob.w...@apple.com> wrote:

> The missing piece is adding per-function metadata to record the execution counts. We had planned to add that along with the branch probabilities but haven't gotten it done yet. You can see the lack of it in BlockFrequency::getEntryFrequency(), which just returns a constant value. We will want that to allow comparing profile data across different functions, which you can't easily get from the branch probabilities alone.

Our 64-bit fixed-point representation of block frequencies doesn’t have the dynamic range to do that. See PR16402.

/jakob
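The two problems raised here -- no per-function entry counts, and limited dynamic range in a 64-bit fixed-point representation -- can be sketched numerically. This is an illustrative model only; the constants (entry frequency, counts) are invented and do not match LLVM's actual BlockFrequency arithmetic:

```python
# Model: block "frequencies" are 64-bit values relative to a fixed
# per-function entry frequency, so they carry no information about
# how often the function itself is entered.
ENTRY_FREQ = 2**10          # hypothetical fixed entry frequency
UINT64_MAX = 2**64 - 1

def block_freq(per_entry_executions):
    """Saturating 64-bit frequency of a block, relative to one entry."""
    return min(int(ENTRY_FREQ * per_entry_executions), UINT64_MAX)

# Two functions whose hot blocks run 1000x per entry look identical,
# even if one function is entered a billion times and the other once:
assert block_freq(1000.0) == block_freq(1000.0)

# Naively scaling by a real entry count to compare across functions
# overflows 64 bits for hot code in a long-running profile:
entry_count = 10**12        # entries observed in the profile
loop_trips = 10**8          # deeply nested loop body
assert entry_count * loop_trips * ENTRY_FREQ > UINT64_MAX
```

The first assertion is the missing-entry-count problem Bob describes; the second is the dynamic-range problem Jakob points to in PR16402.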

Bob Wilson

unread,
Sep 6, 2013, 6:37:55 PM9/6/13
to llv...@cs.uiuc.edu

On Jun 18, 2013, at 11:19 AM, Bob Wilson <bob.w...@apple.com> wrote:

>
> On Jun 17, 2013, at 6:54 AM, Diego Novillo <dnov...@google.com> wrote:
>
>> On 2013-06-15 14:18 , Evan Cheng wrote:
>>> Apple folks are also gearing up to push on the PGO front. We are primarily interested in using instrumentation, rather than sampling, to collect profile info. However, I suspect the way profile ended up being used in the various optimization and codegen passes would be largely similar.
>>>
>>
>> Excellent! We are initially interested in instrumentation, as well. This is where we draw most of our performance with GCC. Sampling is showing a lot of promise, however. And it really is not much different than instrumentation. Most of what changes is the source of profile data.
>>
>>> There is also some interest in pursuing profile directed specialization. But that can wait. I think it makes sense for us to get together and discuss our plans to make sure there won't be duplication of effort.
>>
>> Sure. My initial plan is fairly simple. Triage the existing instrumentation code and see what needs fixing. I'm starting this in the next week or so. What are your plans?
>
> I've been working on prototyping a new approach to instrumentation for both PGO and code coverage testing. I want to use the same data for both of those uses, and for code coverage we really need to have accurate source location info, including column positions and the starting and ending locations for every block of code. Working backward from the debug info to find source locations hasn't worked very well. The debug info just doesn't have enough detailed information. Instead, I am planning to insert the instrumentation code in the front-end. The raw profile data in this scheme is tied to source locations. One disadvantage is that it conflicts with the goal of having graceful degradation. Simply adding a blank line invalidates all the following source locations. My plan is to ignore any profile data whenever the source changes in any way. We had some discussions about this, and it is easy to come up with cases where even a trivial change to the source, e.g., enabling a debugging flag, causes massive changes in the profile data. I don't think graceful degradation should even be a goal. It is too hard to sort out insignificant changes from those that should invalidate all the profile data.
>
> I am still at least a few weeks away from having any details to share about this new instrumentation scheme. I want to convince myself that it is viable before making a proposal to add it to LLVM.
>
> I don't know if this new kind of instrumentation will fit everyone's needs for PGO. I hope that it will, but regardless the most important thing is that we standardize on the representation of profile data in the IR. We have a good start on that already with the branch probability metadata. The missing piece is adding per-function metadata to record the execution counts. We had planned to add that along with the branch probabilities but haven't gotten it done yet. You can see the lack of it in BlockFrequency::getEntryFrequency(), which just returns a constant value. We will want that to allow comparing profile data across different functions, which you can't easily get from the branch probabilities alone. I don't think anyone has yet come up with a specific proposal for that.
>
> As long as we agree on the canonical metadata, there should be no problem with supporting different kinds of profile generators, including things like Auto FDO.
>
> I also agree about the importance of getting inlining to use the PGO data. Chandler had looked into that and found that it was really difficult to do it properly with the current pass manager. Has any progress been made on that front?

Well, it took me much longer than a few weeks, and I completely abandoned the idea of tying the data to specific source locations and requiring it to be regenerated for any source change at all, but I finally have a proposal for doing PGO with instrumentation. Because that proposal is to do the instrumentation and the metadata generation in the front-end, I've sent it to the cfe-dev list. Please have a look if you're interested.
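For context, the canonical IR representation the thread converges on looks roughly like the following. The `!prof` "branch_weights" annotation already existed at the time; the per-function execution count that Bob notes was missing later landed as "function_entry_count" metadata attached to the function itself. A minimal sketch in current IR syntax (the weights and counts here are made-up values):

```llvm
define i32 @foo(i32 %x) !prof !0 {
entry:
  %cmp = icmp eq i32 %x, 0
  br i1 %cmp, label %if.then, label %if.else, !prof !1
if.then:
  ret i32 1
if.else:
  ret i32 2
}

; Function was entered 1024 times in the profile -- this is what
; lets frequencies be compared across functions.
!0 = !{!"function_entry_count", i64 1024}
; Branch taken 64 times vs. 4 times.
!1 = !{!"branch_weights", i32 64, i32 4}
```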