[llvm-dev] [RFC] Context-sensitive Sample PGO with Pseudo-Instrumentation

Wenlei He via llvm-dev

Aug 7, 2020, 2:28:59 PM

to llvm...@lists.llvm.org, Xinliang David Li, Wei Mi, Hongtao Yu

Hi All,

Our team at Facebook is building a new context-sensitive Sample PGO as an alternative to the existing AutoFDO. We’d like to share our motivation, propose a new design, and reveal preliminary results on benchmarks. We will refer to the proposed design as CSSPGO in this RFC.

The new CSSPGO leverages simultaneous LBR and stack sampling to construct a full context-sensitive profile. It doesn’t rely on previous inlining like today’s AutoFDO to get a context-sensitive profile, and it also doesn’t need a separate post-inline context-sensitive profile like CSPGO. In addition, we introduced pseudo-instrumentation for more accurate mapping from binary samples back to IR, similar to instrumentation PGO, but without the measurable runtime overhead that is usually associated with instrumentation.

We have a functioning implementation of the new CSSPGO now. Initial results on SPEC2006 show a ~2% geomean performance win on top of AutoFDO (with MonoLTO and NewPM) and a ~4% .text size reduction at the same time.

Motivation

AutoFDO is a big success as it lowers the entry barrier for PGO significantly while still delivering a substantial performance boost. However, there’s still a gap between AutoFDO and its instrumentation counterpart. From several failed internal attempts to improve AutoFDO, we realized that the bottleneck of AutoFDO lies in its profile quality. With the current level of profile quality, it’s difficult to reap further performance wins because good heuristics are often limited by an inferior profile. That prompted a systematic effort to investigate and improve the AutoFDO framework. Specifically, we’re trying to handle the two biggest sources of profile quality issues:

AutoFDO relies on a limited context-sensitive profile collected based on previous inlining. Therefore it can only replay or prune the previous inlining. With the main CGSCC inliner, post-inline counts are not accurate due to scaling of context-less profile, which affects the effectiveness of later passes such as profile-guided code layout.
DWARF line and discriminator info isn’t always well maintained throughout the compilation, so using it as the anchor to map binary samples back to the IR can sometimes be inaccurate, which leads to inferior profile quality and limits PGO performance.

To lift the above limitations, we’d like to propose an alternative design that consists of two components: 1) Context-sensitive sample PGO, 2) Sample to IR mapping using pseudo probes. The goal is to further improve sample PGO performance while maintaining usability and keeping training runtime overhead at zero. In addition, we hope the CSSPGO framework can also open up opportunities for new optimizations with more stringent requirements on profile quality.

Context-sensitive Sample PGO

The effectiveness of BOLT, Propeller and CSPGO all demonstrated the importance of context-sensitive profile for PGO. However there are two limitations with the existing approaches.

The current solutions focus on leveraging a context-sensitive profile to attain an accurate post-inline profile that helps achieve a better code layout, but do not use the context-sensitive profile to drive better inlining.
The current solutions involve multiple training processes and profiles (e.g. a post-inline profile for CSPGO, or a post-link profile for BOLT and Propeller), which incurs higher operational cost and complicates the build and release workflow.

We propose a full context-sensitive sample profiling infrastructure that utilizes both LBR and call stack samples at the same time to synthesize a profile with a full context sensitivity. The key advantage is that rather than relying on previous inlining or a separate profile, the profile collected with the new approach will have full calling contexts recovered from both inlined and not inlined call sites. To achieve an accurate post-inline profile, a separate profile is no longer needed. Instead, the post-inline profile can be directly derived from adjusting the input profile based on all inline decisions. The richer context-sensitive profile also enables better inline decisions. The infrastructure has two key components listed below.

Synthesizing context-sensitive LBR with a virtual unwinder

To make sample PGO’s input profile context aware, we need to know the call stack of each LBR fall through path. That is done by sampling LBR and call stack simultaneously. With that, each sample will contain a call stack in addition to LBR entries. We use level 2 PEBS to control sampling skid so that the leaf frame from stack sample aligns with leaf frame from LBR. The raw call stack sample describes the calling context for the leaf LBR entry. In addition, by unwinding “call” and “return” (including implicit ones from inlinee) from LBR entries backwards on top of raw stack samples, we can recover the calling context for each of the LBR entries from the sample, thus synthesizing context-sensitive LBR profile.

We can then generate context-sensitive sample PGO profile using the context-sensitive LBR profile. In the new profile, a function’s profile becomes a collection of profiles, each representing a profile for a given calling context.
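The backward unwinding described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the function `unwind_lbr`, the tuple layout of an LBR entry, and the use of function names in place of addresses are all assumptions made for the example, and call-site offsets are omitted from the context strings.

```python
from collections import Counter

def unwind_lbr(stack, lbr):
    """Attribute each LBR range to a calling context (toy model).

    stack: function names from outermost caller to leaf, as sampled.
    lbr:   branches, most recent first, as (kind, src_func, dst_func),
           where kind is "call", "return" or "jump".
    """
    context = list(stack)
    counts = Counter()
    for kind, src, dst in lbr:
        # The code that ran after this branch landed executed under `context`.
        counts[" @ ".join(context)] += 1
        if kind == "call" and context and context[-1] == dst:
            # Undo the call: before it, the callee frame did not exist.
            context.pop()
        elif kind == "return":
            # Undo the return: before it, the returning function was the leaf.
            context.append(src)
        # Plain jumps stay in the same frame, so the context is unchanged.
    return counts

# Hypothetical sample: the stack says main -> bar -> foo at sample time.
sample_stack = ["main", "bar", "foo"]
sample_lbr = [
    ("call", "bar", "foo"),    # most recent: bar called foo
    ("return", "zoo", "bar"),  # earlier: zoo returned into bar
    ("jump", "zoo", "zoo"),    # earliest: a branch inside zoo
]
contexts = unwind_lbr(sample_stack, sample_lbr)
```

Here the raw stack only names the leaf context (main @ bar @ foo), while undoing the call and the return recovers main @ bar and main @ bar @ zoo for the older LBR ranges.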

Context-sensitive FDO/PGO framework in LLVM

In order to leverage the context-sensitive profile for inlining, and to maintain accurate post-inline counts, we introduced SampleContextTracker, which is a layer sitting between the input profile and the profile used to annotate the CFG for optimizations. We also introduced the notion of a base profile, which is the merged profile of a function’s profiles from any outstanding (not inlined) context, and a context profile, which is a function's profile for a given calling context. The framework includes four simple APIs for querying and updating profiles:

Query API:

getBaseSamplesFor: Query base profile by function name.
getContextSamplesFor: Query context profile by calling context and function name.

Update API:

MarkContextSamplesInlined: When a function is inlined for a given calling context, we need to mark the context profile for that context as inlined. This is to make sure we don't include inlined context profile when synthesizing the base profile.
PromoteMergeContextSamplesTree: When a function is not inlined for a given calling context, we need to promote the context profile tree to be top-level context. This preserves the child context under that function so later inline decisions for calls originating from that not inlined function will still be driven by an accurate context profile.

These APIs are used by SampleProfileLoader’s inlining, for better inline decisions and better post-inline counts. For optimal results, the new infrastructure needs to work with a top-down FDO inliner. We added top-down FDO inlining under a switch, and the switch was recently turned on by default upstream. There are a few other improvements for the FDO inliner that we plan to upstream soon.
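To make the base/context distinction concrete, here is a toy model of the four operations. It is only a sketch: `ToyContextTracker` and its snake_case method names are hypothetical stand-ins for the APIs named above, a profile is reduced to a single sample count, call-site offsets are dropped from contexts, and the `baz` callee with 40 samples is invented for illustration.

```python
class ToyContextTracker:
    """Toy stand-in for the tracker: one sample count per calling context."""

    def __init__(self, profiles):
        # profiles: {(outermost caller, ..., function): total samples}
        self.samples = dict(profiles)
        self.inlined = set()

    def get_context_samples_for(self, context):
        return self.samples.get(tuple(context), 0)

    def get_base_samples_for(self, func):
        # Base profile: merge every context of func that is not inlined.
        return sum(n for ctx, n in self.samples.items()
                   if ctx[-1] == func and ctx not in self.inlined)

    def mark_context_samples_inlined(self, context):
        self.inlined.add(tuple(context))

    def promote_merge_context_samples_tree(self, context):
        # The function at `context` was not inlined: re-root its subtree so
        # the function becomes a top-level context, merging on collision.
        context = tuple(context)
        dropped = len(context) - 1
        for ctx in list(self.samples):
            if ctx[:len(context)] == context:
                promoted = ctx[dropped:]
                count = self.samples.pop(ctx)
                self.inlined.discard(ctx)
                self.samples[promoted] = self.samples.get(promoted, 0) + count

tracker = ToyContextTracker({
    ("main", "bar", "foo"): 279,
    ("main", "zoo", "foo"): 1011,
    ("main", "zoo", "foo", "baz"): 40,  # hypothetical callee of foo
})
base_before = tracker.get_base_samples_for("foo")
tracker.mark_context_samples_inlined(("main", "bar", "foo"))
base_after_inline = tracker.get_base_samples_for("foo")
tracker.promote_merge_context_samples_tree(("main", "zoo", "foo"))
baz_under_foo = tracker.get_context_samples_for(("foo", "baz"))
```

After the promotion, the child context under foo survives as ("foo", "baz"), which is what lets a later inline decision for the foo-to-baz call still see a context-specific count.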

Pseudo-instrumentation for sample to IR mapping

Being able to profile production binaries is a key advantage of AutoFDO over instrumentation PGO, but it also comes with a big challenge. While using line number and discriminator as the anchor for profile mapping incurs zero runtime overhead for AutoFDO, it’s not as accurate as instrumented probes. This is because the instrumented probes are part of the IR, rather than metadata attached to the IR like !dbg. That has two implications: 1) it’s easier for optimization passes to maintain IR than metadata; 2) probes block some CFG transformations that can mess up profile correlation.

With the proposed pseudo instrumentation, we can achieve most of the benefit of instrumentation PGO with little runtime overhead. We instrument each basic block with a pseudo probe associated with the block Id. Unlike in PGO instrumentation, where a counter is implemented as a persisting operation such as an atomic read/write or a runtime helper call, a pseudo probe is implemented as a dedicated intrinsic call with the IntrInaccessibleMemOnly flag. The intrinsic comes with most of the semantics of a PGO counter but is much less optimization-intrusive.

The pseudo probe intrinsic calls are on the IR throughout the optimization and code generation pipeline and are materialized as a piece of binary data stored in a separate .pseudo_probe data section. The section is then used to map binary samples back to blocks of the CFG during profile generation. No real machine instructions are generated for a pseudo probe, and the .pseudo_probe section won’t be loaded into memory at runtime, so the probes should incur very little runtime overhead. In fact, we see no measurable performance impact from pseudo-instrumentation itself on SPEC2006 or on a big internal workload.

Pseudo-instrumentation integration and Pass Ordering

One implication of pseudo-probe instrumentation is that the profile is now sensitive to CFG changes. We now detect stale profiles for sample PGO via a CFG checksum, instead of just using them. However, the potential downside is that the CFG may change between different versions of the compiler even if the source code is unchanged. To solve that problem, we perform the pseudo instrumentation very early in the pre-LTO pipeline, before any CFG transformation. This ensures that the CFG that is instrumented and annotated is stable. We added SampleProfileProber, which performs the pseudo instrumentation and runs independently of profile annotation.
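The RFC does not say how the CFG checksum is computed, so the following is only an illustration of the stale-profile check under an assumed scheme: a CRC32 over a canonical listing of each block's successors. The helper names `cfg_checksum` and `is_profile_stale` are hypothetical.

```python
import zlib

def cfg_checksum(successors):
    """Checksum a CFG given as {block_id: [successor block ids]}."""
    parts = []
    for block in sorted(successors):
        # Successor order is kept as given, since it can be meaningful
        # (for example, taken vs. fall-through edges).
        succ = ",".join(str(s) for s in successors[block])
        parts.append(f"{block}->{succ}")
    return zlib.crc32(";".join(parts).encode())

def is_profile_stale(profile_checksum, successors):
    # A mismatch means the profile was collected against a different CFG.
    return profile_checksum != cfg_checksum(successors)

# A diamond-shaped CFG at profiling time, and the same function after an
# edit that redirects block 3 back to block 2.
profiled_cfg = {1: [2, 3], 2: [4], 3: [4], 4: []}
edited_cfg = {1: [2, 3], 2: [4], 3: [2], 4: []}
recorded = cfg_checksum(profiled_cfg)
```

With `recorded` stored alongside the profile, `is_profile_stale(recorded, profiled_cfg)` is false while `is_profile_stale(recorded, edited_cfg)` is true, so the loader can reject the profile instead of applying counts to the wrong blocks.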

A new switch -fpseudo-probe-for-profiling is added to enable sample PGO with pseudo instrumentation, similar to -fdebug-info-for-profiling for AutoFDO. Input profile is still provided through the same switch used by today’s AutoFDO, namely -fprofile-sample-use, and the profile loader will automatically decide how to load and annotate profile depending on whether input profile is dwarf-based or pseudo-probe based.

New profile format and profile generation

We extend current profile format in order to be able to represent a full context-sensitive profile and also encode pseudo-probe info. This is done without drastically diverging from today’s AutoFDO profile format so that existing tools and libraries can be reused with minor changes (e.g. llvm-profdata, profiler reader and writer).

For a context-sensitive profile, we extend the profile format by changing the function profile header line to include the calling context in addition to the function name. With today’s AutoFDO, we have a single profile header for each function to represent its accumulated profile. For example, this is the profile header for foo, with 1290 total samples and 74 header samples.

foo:1290:74

For CSSPGO, we will have multiple profile headers for a single function’s profile, each representing profile for a specific calling context as shown below. CSSPGO profile header is bracketed to differentiate from today’s AutoFDO.

[main:12 @ bar:3 @ foo]:279:54

[main:19 @ zoo:7 @ foo]:1011:20

With calling context encoded in the function header, we no longer need a nested function profile for inlinees. Instead, a context profile will be represented uniformly using context strings in the function profile header, regardless of whether the calls in the context are inlined or not. The flat structure makes sure that context profile is easily indexable. The change is mostly transparent to tools like llvm-profdata. Support for binary profile format has not been added yet, but should be easy to do.
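Both header forms shown above can be read with one small routine. This is a sketch based only on the three example lines in this message, not on the real reader in LLVM; `parse_profile_header` is a hypothetical helper, and it assumes function names contain neither ':' nor '@'. Call-site offsets are kept as strings because their exact syntax is not spelled out here.

```python
def parse_profile_header(line):
    """Parse 'foo:1290:74' or '[main:12 @ bar:3 @ foo]:279:54'."""
    body, total, head = line.strip().rsplit(":", 2)
    context = []
    if body.startswith("[") and body.endswith("]"):
        frames = [frame.strip() for frame in body[1:-1].split("@")]
        # Every frame but the last is 'caller:callsite'; the last is the
        # function the profile belongs to.
        for frame in frames[:-1]:
            name, site = frame.rsplit(":", 1)
            context.append((name, site))
        function = frames[-1]
    else:
        function = body
    return {
        "function": function,
        "context": context,
        "total_samples": int(total),
        "head_samples": int(head),
    }

flat = parse_profile_header("foo:1290:74")
via_bar = parse_profile_header("[main:12 @ bar:3 @ foo]:279:54")
via_zoo = parse_profile_header("[main:19 @ zoo:7 @ foo]:1011:20")
```

In these examples the two context profiles for foo carry 279 and 1011 total samples, which add up to the 1290 total samples of the context-less header.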

For pseudo-probe, we repurposed the line to count map of AutoFDO profile to be block Id to count map. This only changes the interpretation of profile content rather than the representation, hence all reader/writer helpers can be reused.

There's a new profile generation tool, llvm-profgen, which implements the virtual unwinder for context-sensitive profiling and uses the .pseudo_probe section to map the binary profile to a pre-opt CFG profile. Since profile generation is a critical piece of the workflow, we’d like to propose including the tool as part of LLVM, alongside llvm-profdata.

Preliminary Results

To quantitatively assess profile quality improvement brought by pseudo-instrumentation, we introduce a profile quality metric. We measure the metric by first annotating an optimized binary with the MIR block execution counts derived from a profile. The binary is then sampled and the counts are compared against the annotation. The weighted relative delta is used as an indicator for profile quality (lower is better).
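The exact formula behind the metric is not given, so the following is one plausible reading rather than the definition used for the numbers below. It assumes the sampled counts are first scaled to the same total as the annotation, and that each block's relative delta is weighted by the block's share of the annotated counts; `weighted_relative_delta` and the block counts are made up for illustration.

```python
def weighted_relative_delta(annotated, sampled):
    """Compare two {block_id: count} maps; 0.0 means a perfect match."""
    total_annotated = sum(annotated.values())
    total_sampled = sum(sampled.values())
    if total_annotated == 0 or total_sampled == 0:
        raise ValueError("both profiles need at least one sample")
    # Bring the sampled counts onto the same scale as the annotation.
    scale = total_annotated / total_sampled
    # Weighting each block's relative delta |s - a| / a by its share
    # a / total_annotated reduces to sum(|s - a|) / total_annotated.
    # Blocks that appear only in `sampled` are ignored in this sketch.
    delta = sum(abs(sampled.get(block, 0) * scale - count)
                for block, count in annotated.items())
    return delta / total_annotated

annotated_counts = {1: 100, 2: 60, 3: 40}  # from the profile annotation
sampled_counts = {1: 50, 2: 20, 3: 30}     # from sampling the binary
quality = weighted_relative_delta(annotated_counts, sampled_counts)
```

For these made-up counts the result is 0.2, that is, a 20% weighted relative delta.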

The table below shows the profile quality metric for SPEC2006. The numbers show that the profile quality of sample PGO with pseudo-instrumentation is much better than that of AutoFDO and close to that of instrumentation PGO.

Profile quality metric	Baseline AutoFDO	Instrumentation PGO	Sample PGO w/ Pseudo Instrumentation
SPEC2006	24.58%	15.70%	16.21%

We also measured performance and code size on SPEC2006 with CSSPGO. The measurement was done with MonoLTO and new pass manager, with tuning for FDO inliner to accommodate context-sensitive profile, and used training dataset for both pass1 and pass2. The result shows ~2% performance win on top of today’s AutoFDO, with ~4% .text reduction, see table below.

SPEC2006	Performance: AutoFDO over LTO	Performance: CSSPGO over LTO	Performance: CSSPGO over AutoFDO	Code Size: AutoFDO over LTO	Code Size: CSSPGO over LTO	Code Size: CSSPGO over AutoFDO
Geomean Delta %	6.80%	8.70%	2.04%	11.17%	6.66%	4.06%

While the SPEC2006 benchmark suite is different from large workloads, we think the results demonstrate the potential of CSSPGO and serve their purpose as a proof of concept. We plan to continue tuning and to start evaluating larger internal workloads soon, and we’d like to upstream our work. Feedback is welcome!

Thanks,

Wenlei & Hongtao

Xinliang David Li via llvm-dev

Aug 7, 2020, 4:24:45 PM

to Wenlei He, llvm...@lists.llvm.org, Wei Mi, Hongtao Yu

Wenlei, thanks for the interesting proposal! Please see my replies inline below.

On Fri, Aug 7, 2020 at 11:28 AM Wenlei He <wen...@fb.com> wrote:

Hi All,

Our team at Facebook is building a new context-sensitive Sample PGO as an alternative to the existing AutoFDO. We’d like to share our motivation, propose a new design, and reveal preliminary results on benchmarks. We will refer to the proposed design as CSSPGO in this RFC.

The new CSSPGO leverages simultaneous LBR and stack sampling to construct a full context-sensitive profile.

Can you share more details on this? LBR only has 32 entries, so it won't give you full call context, so stack unwinding is needed. What is the overhead you see in production environment?

It doesn’t rely on previous inlining like today’s AutoFDO to get context-sensitive profile, and it also doesn’t need a separate post-inline context-sensitive profile like CSPGO.

What is the sample profile data size impact with the full context information?

In addition, we introduced pseudo-instrumentation for more accurate mapping from binary samples back to IR, similar to instrumentation PGO, but without any measure-able runtime overhead that is usually associated with instrumentation.

Is CSSPGO inherently dependent upon pseudo-probe or is it orthogonal?  I hope that it is the latter :)

We have a functioning implementation for the new CSSPGO now. Initial results on SPEC2006 shows ~2% geomean performance win on top of AutoFDO (with MonoLTO and NewPM) and ~4% .text size reduction at the same time.

Motivation

AutoFDO is a big success as it lowers the entry barrier for PGO significantly while still delivering substantial performance boost. However, there’s still a gap between AutoFDO and its instrumentation counterpart. From several failed internal attempts to improve AutoFDO, we realized that the bottleneck of AutoFDO lies in its profile quality. With the current level of profile quality, it’s difficult to reap more performance win because good heuristics are often limited by inferior profile. That prompted a systemic effort to investigate and improve AutoFDO framework. Specifically, we’re trying to handle the two biggest sources of profile quality issues:

AutoFDO relies on a limited context-sensitive profile collected based on previous inlining. Therefore it can only replay or prune the previous inlining. With the main CGSCC inliner, post-inline counts are not accurate due to scaling of context-less profile, which affects the effectiveness of later passes such as profile-guided code layout.

I acknowledge the limitation here.

Dwarf line and discriminator info aren’t always well-maintained throughout the compilation, thus using them as anchors to map binary samples back to the IR can sometimes be inaccurate, which leads to inferior profile quality and limits PGO performance.

I think we need more quantification of the impact of using debug information for matching purposes: how much performance is left on the table due to this, and are the issues fixable or not?

To lift the above limitations, we’d like to propose an alternative design that consists of two components: 1) Context-sensitive sample PGO, 2) Sample to IR mapping using pseudo probes. The goal is to further improve sample PGO performance while maintaining usability and keeping training runtime overhead at zero. In addition, we hope the CSSPGO framework can also open up opportunities for new optimizations with more stringent requirements on profile quality.

CSSPGO is a very attractive optimization by itself.  Can you provide more motivation for the pseudo probes?

Context-sensitive Sample PGO

The effectiveness of BOLT, Propeller and CSPGO all demonstrated the importance of context-sensitive profile for PGO. However there are two limitations with the existing approaches.

The current solutions focus on leveraging a context-sensitive profile to attain an accurate post-inline profile that helps achieve a better code layout, but do not use the context-sensitive profile to drive better inlining.
The current solutions involve multiple training processes and profiles (e.g. a post-inline profile for CSPGO, or a post-link profile for BOLT and Propeller), which incurs higher operational cost and complicates the build and release workflow.

We propose a full context-sensitive sample profiling infrastructure that utilizes both LBR and call stack samples at the same time to synthesize a profile with a full context sensitivity. The key advantage is that rather than relying on previous inlining or a separate profile, the profile collected with the new approach will have full calling contexts recovered from both inlined and not inlined call sites. To achieve an accurate post-inline profile, a separate profile is no longer needed. Instead, the post-inline profile can be directly derived from adjusting the input profile based on all inline decisions. The richer context-sensitive profile also enables better inline decisions. The infrastructure has two key components listed below.

Synthesizing context-sensitive LBR with a virtual unwinder

To make sample PGO’s input profile context aware, we need to know the call stack of each LBR fall through path. That is done by sampling LBR and call stack simultaneously. With that, each sample will contain a call stack in addition to LBR entries. We use level 2 PEBS to control sampling skid so that the leaf frame from stack sample aligns with leaf frame from LBR. The raw call stack sample describes the calling context for the leaf LBR entry. In addition, by unwinding “call” and “return” (including implicit ones from inlinee) from LBR entries backwards on top of raw stack samples, we can recover the calling context for each of the LBR entries from the sample, thus synthesizing context-sensitive LBR profile.

We can then generate context-sensitive sample PGO profile using the context-sensitive LBR profile. In the new profile, a function’s profile becomes a collection of profiles, each representing a profile for a given calling context.

Sounds good -- see the overhead question posted at the beginning.

Context-sensitive FDO/PGO framework in LLVM

In order to leverage context-sensitive profile for inlining, and to maintain accurate post-inline counts, we introduced SampleContextTracker which is a layer sitting in between input profile and the profile used to annotate CFG for optimizations. We also introduced the notion of base profile which is the merged profile for function’s profiles from any outstanding (not inlined) context, and context profile which is a function's profile for a given calling context. The framework includes four simple APIs for updating and query profiles:

Query API:

getBaseSamplesFor: Query base profile by function name.
getContextSamplesFor: Query context profile by calling context and function name.

Update API:

MarkContextSamplesInlined: When a function is inlined for a given calling context, we need to mark the context profile for that context as inlined. This is to make sure we don't include inlined context profile when synthesizing the base profile.
PromoteMergeContextSamplesTree: When a function is not inlined for a given calling context, we need to promote the context profile tree to be top-level context. This preserves the child context under that function so later inline decisions for calls originating from that not inlined function will still be driven by an accurate context profile.

These APIs are used by SampleProfileLoader’s inlining, for better inline decisions and better post-inline counts. For optimal results, the new infrastructure needs to work with a top-down FDO inliner. We added top-down FDO inlining under a switch, and the switch is turned on by default in upstream recently. There’re a few other improvements for the FDO inliner that we plan to upstream soon.

The profile data should be usable by the SCC inliner as well. In bottom-up inlining, as the function gets inlined further up in the call chain, the inline instance has fewer incoming contexts to merge.

Pseudo-instrumentation for sample to IR mapping

Being able to profile production binaries is a key advantage of AutoFDO over Instrumentation PGO, but it also comes with a big challenge. While using line number and discriminator as anchor for profile mapping incurs zero run time overhead for AutoFDO, it’s not as accurate as instrumented probes. This is because the instrumented probes are part of the IR, rather than metadata attached to the IR like !dbg. That has two implications: 1) it’s easier to maintain IR than metadata for optimization passes; 2) probe blocks some CFG transformations that can mess up profile correlation.

With the proposed pseudo instrumentation, we can achieve most of the benefit of instrumentation PGO in little runtime overhead. We instrument each basic block with a pseudo probe associated with the block Id. Unlike in PGO instrumentation where a counter is implemented as a persisting operation such as atomic read/write or runtime helper call, a pseudo probe is implemented as a dedicated intrinsic call with IntrInaccessibleMemOnly flag. The intrinsic comes with most of the semantics of a PGO counter but is much less optimization-intrusive.

The pseudo probe intrinsic calls are on the IR throughout the optimization and code generation pipeline and are materialized as a piece of binary data stored in a separate .pseudo_probe data section.

How is this information maintained? Blocks can be eliminated or cloned in many optimization passes: jump threading, taildup, unrolling, peeling, etc. For instance, how do you handle a block that is merged into another? Does it lose samples because of this?

The section is then used to map binary samples back to blocks of the CFG during profile generation. No real machine instructions are generated for a pseudo probe, and the .pseudo_probe section won’t be loaded into memory at runtime, so the probes should incur very little runtime overhead. In fact, we see no measurable performance impact from pseudo-instrumentation itself on SPEC2006 or on a big internal workload.

How large are the probe sections?

Pseudo-instrumentation integration and Pass Ordering

One implication of pseudo-probe instrumentation is that the profile is now sensitive to CFG changes. We now detect stale profiles for sample PGO via a CFG checksum, instead of just using them. However, the potential downside is that the CFG may change between different versions of the compiler even if the source code is unchanged. To solve that problem, we perform the pseudo instrumentation very early in the pre-LTO pipeline, before any CFG transformation. This ensures that the CFG that is instrumented and annotated is stable. We added SampleProfileProber, which performs the pseudo instrumentation and runs independently of profile annotation.

A new switch -fpseudo-probe-for-profiling is added to enable sample PGO with pseudo instrumentation, similar to -fdebug-info-for-profiling for AutoFDO. Input profile is still provided through the same switch used by today’s AutoFDO, namely -fprofile-sample-use, and the profile loader will automatically decide how to load and annotate profile depending on whether input profile is dwarf-based or pseudo-probe based.

Can you compare the source change tolerance of the pseudo-probe based approach with that of the debug-info based approach?

New profile format and profile generation

We extend current profile format in order to be able to represent a full context-sensitive profile and also encode pseudo-probe info. This is done without drastically diverging from today’s AutoFDO profile format so that existing tools and libraries can be reused with minor changes (e.g. llvm-profdata, profiler reader and writer).

For a context-sensitive profile, we extend the profile format by changing the function profile header line to include calling context in addition to function name. With today’s AutoFDO, we have a single profile header for each function to represent its accumulative profile. E.g. This is the profile header for foo, with 1290 total samples, and 74 header samples.

foo:1290:74

For CSSPGO, we will have multiple profile headers for a single function’s profile, each representing profile for a specific calling context as shown below. CSSPGO profile header is bracketed to differentiate from today’s AutoFDO.

[main:12 @ bar:3 @ foo]:279:54

[main:19 @ zoo:7 @ foo]:1011:20

sounds good.

With calling context encoded in the function header, we no longer need a nested function profile for inlinees. Instead, a context profile will be represented uniformly using context strings in the function profile header, regardless of whether the calls in the context are inlined or not. The flat structure makes sure that context profile is easily indexable. The change is mostly transparent to tools like llvm-profdata. Support for binary profile format has not been added yet, but should be easy to do.

It is still useful to annotate (at least with a comment line) whether a profile is for a top-level function or an inline instance.

For pseudo-probe, we repurposed the line to count map of AutoFDO profile to be block Id to count map. This only changes the interpretation of profile content rather than the representation, hence all reader/writer helpers can be reused.

There's a new profile generation tool, llvm-profgen, which implements the virtual unwinder for context-sensitive profiling and uses the .pseudo_probe section to map the binary profile to a pre-opt CFG profile. Since profile generation is a critical piece of the workflow, we’d like to propose including the tool as part of LLVM, alongside llvm-profdata.

Preliminary Results

To quantitatively assess profile quality improvement brought by pseudo-instrumentation, we introduce a profile quality metric. We measure the metric by first annotating an optimized binary with the MIR block execution counts derived from a profile. The binary is then sampled and the counts are compared against the annotation. The weighted relative delta is used as an indicator for profile quality (lower is better).

Table below shows the profile quality metric for SPEC2006. We can see from the numbers that the profile quality of pseudo-instrumentation sample PGO is much better than AutoFDO and close to instrumentation PGO.

Profile quality metric	Baseline AutoFDO	Instrumentation PGO	Sample PGO w/ Pseudo Instrumentation
SPEC2006	24.58%	15.70%	16.21%

Instrumentation PGO does not have context sensitivity, so I would expect it to score worse than CSSPGO. Do you know why it is better here?

We also measured performance and code size on SPEC2006 with CSSPGO. The measurement was done with MonoLTO and new pass manager, with tuning for FDO inliner to accommodate context-sensitive profile, and used training dataset for both pass1 and pass2. The result shows ~2% performance win on top of today’s AutoFDO, with ~4% .text reduction, see table below.

SPEC2006	Performance: AutoFDO over LTO	Performance: CSSPGO over LTO	Performance: CSSPGO over AutoFDO	Code Size: AutoFDO over LTO	Code Size: CSSPGO over LTO	Code Size: CSSPGO over AutoFDO
Geomean Delta %	6.80%	8.70%	2.04%	11.17%	6.66%	4.06%

While the SPEC2006 benchmark suite is different from large workloads, we think the results demonstrated the potential of CSSPGO and served its purpose for proof of concept. We plan to continue tuning and start to evaluate larger internal workloads soon, and we’d like to upstream our work. Feedbacks are welcomed!

What is the performance win with pseudo-probe alone?

thanks,

David

Thanks,

Wenlei & Hongtao

Wenlei He via llvm-dev

Aug 7, 2020, 8:00:17 PM

to Xinliang David Li, llvm...@lists.llvm.org, Wei Mi, Hongtao Yu

Thanks for the thoughtful questions, David. See my answers inline.

Thanks,

Wenlei

From: Xinliang David Li <dav...@google.com>
Date: Friday, August 7, 2020 at 1:24 PM
To: Wenlei He <wen...@fb.com>
Cc: "llvm...@lists.llvm.org" <llvm...@lists.llvm.org>, Wei Mi <w...@google.com>, Hongtao Yu <h...@fb.com>
Subject: Re: [RFC] Context-sensitive Sample PGO with Pseudo-Instrumentation

Wenlei, thanks for the interesting proposal! Please see my replies inline below.

On Fri, Aug 7, 2020 at 11:28 AM Wenlei He <wen...@fb.com> wrote:

Hi All,

Our team at Facebook is building a new context-sensitive Sample PGO as an alternative to the existing AutoFDO. We’d like to share our motivation, propose a new design, and reveal preliminary results on benchmarks. We will refer to the proposed design as CSSPGO in this RFC.

The new CSSPGO leverages simultaneous LBR and stack sampling to construct a full context-sensitive profile.

Can you share more details on this? LBR only has 32 entries, so it won't give you full call context, so stack unwinding is needed. What is the overhead you see in production environment?

[wenlei] We are not worried about overhead in the production environment, as the sampling rate there is extremely low. We did measure locally, however: with stack sampling and level 2 PEBS on top of regular LBR sampling, the overhead still isn’t very noticeable, but I don’t have numbers at hand.

It doesn’t rely on previous inlining like today’s AutoFDO to get context-sensitive profile, and it also doesn’t need a separate post-inline context-sensitive profile like CSPGO.

What is the sample profile data size impact with the full context information?

[wenlei] A text CS profile is normally around 1x-10x the size of a regular profile, with all live contexts included. We plan to trim cold contexts, which we expect to bring the size down in a meaningful way. Another source of size increase is the context string, which could contain duplicated mangled names (these can be very long for C++ templated code), but it should be very compressible with the built-in compression support of the extended binary profile. We will move to the extended binary format and leverage the compression support if needed. We can also consider a more efficient fixed-length integer context representation (similar to a rolling hash).
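How cold contexts would be trimmed is not described here. One simple possibility, shown purely as an assumption, is to fold any context below a sample threshold into the function's context-less profile; `trim_cold_contexts`, the threshold, and the cold `baz` caller with 5 samples are all hypothetical.

```python
def trim_cold_contexts(profiles, threshold):
    """Merge contexts with fewer than `threshold` samples into the
    context-less profile of their function.

    profiles: {(outermost caller, ..., function): total samples}
    """
    trimmed = {}
    for context, count in profiles.items():
        # Hot contexts are kept as-is; cold ones keep only the function.
        key = context if count >= threshold else context[-1:]
        trimmed[key] = trimmed.get(key, 0) + count
    return trimmed

full = {
    ("main", "bar", "foo"): 279,
    ("main", "zoo", "foo"): 1011,
    ("main", "baz", "foo"): 5,  # hypothetical cold context
}
compact = trim_cold_contexts(full, threshold=100)
```

The total sample count is preserved; only the calling context of the cold entries is dropped, which shortens the context strings that have to be stored.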

In addition, we introduced pseudo-instrumentation for more accurate mapping from binary samples back to IR, similar to instrumentation PGO, but without any measure-able runtime overhead that is usually associated with instrumentation.

Is CSSPGO inherently dependent upon pseudo-probe or is it orthogonal? I hope that it is the latter :)

[wenlei] They’re orthogonal. Context-sensitive SPGO can work without pseudo-probe and still use dwarf. Our plan is to keep context-sensitive SPGO working w/ and w/o pseudo-probe functionality-wise, but we only look at perf and tune with the two combined.

We have a functioning implementation for the new CSSPGO now. Initial results on SPEC2006 shows ~2% geomean performance win on top of AutoFDO (with MonoLTO and NewPM) and ~4% .text size reduction at the same time.

Motivation

AutoFDO is a big success as it lowers the entry barrier for PGO significantly while still delivering substantial performance boost. However, there’s still a gap between AutoFDO and its instrumentation counterpart. From several failed internal attempts to improve AutoFDO, we realized that the bottleneck of AutoFDO lies in its profile quality. With the current level of profile quality, it’s difficult to reap more performance win because good heuristics are often limited by inferior profile. That prompted a systemic effort to investigate and improve AutoFDO framework. Specifically, we’re trying to handle the two biggest sources of profile quality issues:

AutoFDO relies on a limited context-sensitive profile collected based on previous inlining. Therefore it can only replay or prune the previous inlining. With the main CGSCC inliner, post-inline counts are not accurate due to scaling of context-less profile, which affects the effectiveness of later passes such as profile-guided code layout.

I acknowledge the limitation here.

Dwarf line and discriminator info aren’t always well-maintained throughout the compilation, thus using them as anchors to map binary samples back to the IR can sometimes be inaccurate, which leads to inferior profile quality and limits PGO performance.

I think we need more quantification of the impact of using debug information for matching purposes: how much performance is left on the table due to this, and are these fixable issues or not?

[wenlei] The first table in the result section compares pseudo-probe with AutoFDO and Instr. PGO, all with inlining turned off. So that’s a quantitative assessment of the effectiveness of pseudo-probe. It’s hard to assess the performance benefit though, because PGO performance is a function of profile quality and heuristics. Currently the heuristics are tuned to cope with the profile quality we have, so they may not do justice to the profile quality improvements that pseudo-probe brings us.

One example is how the FDO inliner evaluates a call site. It checks the callee’s total sample count instead of the callee’s entry count. This is less than ideal, but we couldn’t fix it due to profile quality issues – we can’t reliably get the inlinee’s entry count with the dwarf-based approach, see the discussion in https://reviews.llvm.org/D60086. That problem is solved with pseudo-probe, but until we change the inliner, we won’t see a perf win from that particular profile quality improvement. There are other similar cases too, and that’s why we used a profile quality metric instead of performance to assess pseudo-probe.

Some of the issues may be fixable with dwarf info maintenance, but the engineering cost to find and fix all issues is non-trivial. We think maintaining the anchor as IR is a more sustainable alternative than maintaining the anchor as metadata.

To lift the above limitations, we’d like to propose an alternative design that consists of two components: 1) Context-sensitive sample PGO, 2) Sample to IR mapping using pseudo probes. The goal is to further improve sample PGO performance while maintaining usability and keeping training runtime overhead at zero. In addition, we hope the CSSPGO framework can also open up opportunities for new optimizations with more stringent requirements on profile quality.

CSSPGO is a very attractive optimization by itself. Can you provide more motivation for the pseudo probes?

[wenlei] One way to look at the combination of pseudo-probe and context-sensitive sample PGO is this: the former brings sample PGO closer to instrumentation PGO, while the latter is to sample PGO what the two-stage CSPGO, or even a post-link optimizer, is to instrumentation PGO. These are two orthogonal problems that need separate solutions.

There are also secondary motivations for pseudo-probe, beyond its profile quality benefits, that I didn’t mention earlier:

1). Stale profile detection. With line numbers, it’s hard to detect and react to a stale profile. Pseudo-probes are tied to blocks, so they effectively use the CFG as the carrier for the profile, which makes stale profile detection easier.

2). Resilience to source changes. We’ve seen cases where deleting a single comment line caused an 8% perf regression for a large service because it completely messed up profile annotation for a critical path. That will not happen with pseudo-probe – any source change that does not alter the CFG will be tolerated without perf impact.

3). Possibility of offline count inference. We have an experiment that encodes edges alongside probes (blocks), so more sophisticated offline count inference algorithms are possible to further improve profile quality. Our algorithm researchers are working on a new profile inference solution now.

Context-sensitive Sample PGO

The effectiveness of BOLT, Propeller and CSPGO all demonstrated the importance of context-sensitive profile for PGO. However there are two limitations with the existing approaches.

The current solutions focus on leveraging a context-sensitive profile to attain an accurate post-inline profile that helps achieve a better code layout, but do not use the context-sensitive profile to drive better inlining.
The current solutions involve multiple training processes and profiles (e.g. a post-inline profile for CSPGO, or a post-link profile for BOLT and Propeller), which incurs higher operational cost and complicates the build and release workflow.

We propose a full context-sensitive sample profiling infrastructure that utilizes both LBR and call stack samples at the same time to synthesize a profile with a full context sensitivity. The key advantage is that rather than relying on previous inlining or a separate profile, the profile collected with the new approach will have full calling contexts recovered from both inlined and not inlined call sites. To achieve an accurate post-inline profile, a separate profile is no longer needed. Instead, the post-inline profile can be directly derived from adjusting the input profile based on all inline decisions. The richer context-sensitive profile also enables better inline decisions. The infrastructure has two key components listed below.

Synthesizing context-sensitive LBR with a virtual unwinder

To make sample PGO’s input profile context aware, we need to know the call stack of each LBR fall through path. That is done by sampling LBR and call stack simultaneously. With that, each sample will contain a call stack in addition to LBR entries. We use level 2 PEBS to control sampling skid so that the leaf frame from stack sample aligns with leaf frame from LBR. The raw call stack sample describes the calling context for the leaf LBR entry. In addition, by unwinding “call” and “return” (including implicit ones from inlinee) from LBR entries backwards on top of raw stack samples, we can recover the calling context for each of the LBR entries from the sample, thus synthesizing context-sensitive LBR profile.

We can then generate context-sensitive sample PGO profile using the context-sensitive LBR profile. In the new profile, a function’s profile becomes a collection of profiles, each representing a profile for a given calling context.
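
To make the backward unwinding concrete, here is a minimal sketch in Python. It is only an illustration of the idea described above, not the actual unwinder: the `(kind, src_func, dst_func)` branch encoding, the function name `unwind_contexts`, and the sample data are all hypothetical, and it ignores call-site offsets and inlined frames.

```python
def unwind_contexts(stack, lbr):
    """Recover the calling context of every LBR range in one sample.

    stack: call stack at sample time, outermost frame first.
    lbr:   taken branches, oldest first, as (kind, src_func, dst_func),
           where kind is "call", "return" or "jump" (hypothetical encoding).
    Returns the contexts, newest range first.
    """
    frames = list(stack)
    contexts = [tuple(frames)]          # code that ran after the newest branch
    for kind, src, dst in reversed(lbr):
        if kind == "call" and frames and frames[-1] == dst:
            frames.pop()                # undo the call: the callee frame did not exist yet
        elif kind == "return":
            frames.append(src)          # undo the return: the returning function was still live
        contexts.append(tuple(frames))  # code that ran just before this branch
    return contexts


sample_stack = ["main", "bar", "foo"]
sample_lbr = [
    ("call", "main", "zoo"),
    ("return", "zoo", "main"),
    ("call", "main", "bar"),
    ("call", "bar", "foo"),
]
contexts = unwind_contexts(sample_stack, sample_lbr)
for ctx in contexts:
    print(" @ ".join(ctx))
```

Starting from the raw stack sample (the context of the leaf LBR entry), each undone call pops a frame and each undone return pushes one, so the older range that ran inside zoo gets the context main @ zoo even though zoo is no longer on the stack at sample time.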

Sounds good -- see the overhead question posted at the beginning.

Context-sensitive FDO/PGO framework in LLVM

To leverage the context-sensitive profile for inlining, and to maintain accurate post-inline counts, we introduced SampleContextTracker, a layer that sits between the input profile and the profile used to annotate the CFG for optimizations. We also introduced the notion of a base profile, which is the merged profile of a function’s profiles from any outstanding (not inlined) context, and a context profile, which is a function’s profile for a given calling context. The framework includes four simple APIs for updating and querying profiles:

Query API:

getBaseSamplesFor: Query base profile by function name.
getContextSamplesFor: Query context profile by calling context and function name.

Update API:

MarkContextSamplesInlined: When a function is inlined for a given calling context, we need to mark the context profile for that context as inlined. This is to make sure we don't include inlined context profile when synthesizing the base profile.
PromoteMergeContextSamplesTree: When a function is not inlined for a given calling context, we need to promote the context profile tree to be top-level context. This preserves the child context under that function so later inline decisions for calls originating from that not inlined function will still be driven by an accurate context profile.

These APIs are used by SampleProfileLoader’s inlining, for better inline decisions and better post-inline counts. For optimal results, the new infrastructure needs to work with a top-down FDO inliner. We added top-down FDO inlining under a switch, and the switch was recently turned on by default upstream. There are a few other improvements for the FDO inliner that we plan to upstream soon.
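
The interplay of the four APIs can be sketched with a toy model. This is a Python illustration under simplifying assumptions, not the LLVM class: `ToyContextTracker` and its method names are hypothetical snake_case stand-ins, a context is just a tuple of function names (call-site offsets are dropped), and a profile is reduced to a single sample total.

```python
class ToyContextTracker:
    """Toy stand-in for the tracker idea; not the LLVM class."""

    def __init__(self, context_samples):
        # context (tuple of function names, outermost first) -> total samples
        self.samples = dict(context_samples)
        self.inlined = set()

    def get_context_samples_for(self, context):
        return self.samples.get(tuple(context), 0)

    def get_base_samples_for(self, func):
        # Merge every outstanding (not inlined) context that ends in func.
        return sum(n for ctx, n in self.samples.items()
                   if ctx[-1] == func and ctx not in self.inlined)

    def mark_context_samples_inlined(self, context):
        self.inlined.add(tuple(context))

    def promote_merge_context_samples_tree(self, context):
        # The last function of `context` was not inlined on this call path:
        # re-root its whole subtree so that it starts at that function.
        context = tuple(context)
        cut = len(context) - 1

        def reroot(ctx):
            return ctx[cut:] if ctx[:len(context)] == context else ctx

        merged = {}
        for ctx, n in self.samples.items():
            merged[reroot(ctx)] = merged.get(reroot(ctx), 0) + n
        self.samples = merged
        self.inlined = {reroot(ctx) for ctx in self.inlined}


tracker = ToyContextTracker({
    ("main", "bar", "foo"): 279,
    ("main", "zoo", "foo"): 1011,
})
base_before = tracker.get_base_samples_for("foo")
tracker.mark_context_samples_inlined(("main", "bar", "foo"))
base_after_inline = tracker.get_base_samples_for("foo")
tracker.promote_merge_context_samples_tree(("main", "zoo"))
promoted = tracker.get_context_samples_for(("zoo", "foo"))
print(base_before, base_after_inline, promoted)
```

Once foo is inlined under main @ bar, its 279 samples no longer count toward foo's base profile; when zoo is not inlined into main, the subtree under main @ zoo is re-rooted at zoo, so a later inline decision for the call to foo inside zoo still sees its context profile.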

The profile data should be usable by the SCC inliner as well. In bottom-up inlining, as the function gets inlined further up the call chain, the inline instance has fewer incoming contexts to merge.

[wenlei] Yes, we intentionally introduced the SampleContextTracker abstraction that is decoupled from SampleProfileLoader, so it can work with both the FDO inliner and the SCC inliner. But we expect the FDO inliner to take over more inlining for CSSPGO because the FDO inliner is no longer a replay inliner now. That is a good thing, as top-down inlining helps with specialization, which is important for context-sensitive inlining.

Pseudo-instrumentation for sample to IR mapping

Being able to profile production binaries is a key advantage of AutoFDO over Instrumentation PGO, but it also comes with a big challenge. While using line number and discriminator as anchor for profile mapping incurs zero run time overhead for AutoFDO, it’s not as accurate as instrumented probes. This is because the instrumented probes are part of the IR, rather than metadata attached to the IR like !dbg. That has two implications: 1) it’s easier to maintain IR than metadata for optimization passes; 2) probe blocks some CFG transformations that can mess up profile correlation.

With the proposed pseudo instrumentation, we can achieve most of the benefit of instrumentation PGO with little runtime overhead. We instrument each basic block with a pseudo probe associated with the block Id. Unlike in PGO instrumentation where a counter is implemented as a persisting operation such as atomic read/write or runtime helper call, a pseudo probe is implemented as a dedicated intrinsic call with IntrInaccessibleMemOnly flag. The intrinsic comes with most of the semantics of a PGO counter but is much less optimization-intrusive.

The pseudo probe intrinsic calls are on the IR throughout the optimization and code generation pipeline and are materialized as a piece of binary data stored in a separate .pseudo_probe data section.

How is this information maintained? Blocks can be eliminated or cloned in many optimization passes: jump threading, taildup, unrolling, peeling etc. For instance, how do you handle a block that is merged into another? Does it lose samples because of this?

[wenlei] They are just maintained as part of IR, like any other instructions, without special care. The key difference is they’re part of IR instead of metadata attached to IR. We can categorize relevant CFG transformations into 1) duplication, 2) merge and removal.

For any duplication (tail/head duplication, unrolling), the probe is duplicated along with the other instructions, and we don’t need the duplication factor used by the dwarf-based approach, because counts from duplicated probes are added together naturally. For merge and removal, the IntrInaccessibleMemOnly flag blocks the transformation, similar to real probes.

Pseudo-probe is a tunable framework. Depending on the semantics we put on the intrinsic, it can be as heavy as a real probe or as light as a label. IntrInaccessibleMemOnly is a carefully chosen semantic, based on our experiments, that balances runtime overhead and profile quality: even though it still blocks merging and removal, we didn’t see measurable overhead on SPEC or a large internal workload. The profile quality improvement, however, is measurable, as the first table in the result section shows.

The section is then used to map binary samples back to blocks of CFG during profile generation. There are also no real machine instructions generated for a pseudo probe and the .pseudo_probe section won’t be loaded into memory at runtime, therefore they should incur very little runtime overhead. In fact, we see no measurable performance impact from pseudo-instrumentation itself on SPEC2006 or a big internal workload.
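
As a small illustration of why duplicated probes need no duplication factor, the sketch below (Python) sums samples per probe Id. The address-to-probe map, the sample counts and the name `block_counts` are made up for the example; the real mapping comes from the .pseudo_probe section.

```python
from collections import Counter


def block_counts(probe_at_address, samples_at_address):
    """Sum samples per probe Id; copies of one probe simply add up."""
    counts = Counter()
    for address, probe_id in probe_at_address.items():
        counts[probe_id] += samples_at_address.get(address, 0)
    return dict(counts)


# Hypothetical binary where the block with probe 2 was duplicated
# (e.g. by unrolling), so the probe now lives at two addresses.
probe_at_address = {0x10: 1, 0x20: 2, 0x30: 2, 0x40: 3}
samples_at_address = {0x10: 100, 0x20: 450, 0x30: 430, 0x40: 100}
counts = block_counts(probe_at_address, samples_at_address)
print(counts)
```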

How large are the probe sections?

[wenlei] About 10% of binary size, another 2% if we encode CFG edges in addition to probes/blocks.

Pseudo-instrumentation integration and Pass Ordering

One implication of pseudo-probe instrumentation is that the profile is now sensitive to CFG changes. We now detect stale profiles for sample PGO via a CFG checksum, instead of just using them. However, the potential downside is that the CFG may change between different versions of the compiler even if the source code is unchanged. To solve that problem, we perform the pseudo instrumentation very early in the pre-LTO pipeline, before any CFG transformation. This ensures that the CFG instrumented and annotated is stable. We added SampleProfileProber that performs the pseudo instrumentation and runs independent of profile annotation.
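
A rough sketch of checksum-based stale profile detection, in Python. The CFG encoding, the use of CRC32 and the names `cfg_checksum` and `is_stale` are assumptions made for this illustration; the checksum actually computed in the implementation may be quite different.

```python
import zlib


def cfg_checksum(cfg):
    """cfg: dict mapping a block Id to the list of its successor block Ids."""
    parts = []
    for block, successors in sorted(cfg.items()):
        targets = ",".join(str(s) for s in sorted(successors))
        parts.append(f"{block}->{targets}")
    return zlib.crc32(";".join(parts).encode())


def is_stale(profile_checksum, current_cfg):
    # A profile is stale when the CFG it was collected on differs
    # from the CFG being compiled now.
    return profile_checksum != cfg_checksum(current_cfg)


profiled_cfg = {1: [2, 3], 2: [4], 3: [4], 4: []}
recorded = cfg_checksum(profiled_cfg)

same_cfg = {1: [3, 2], 2: [4], 3: [4], 4: []}      # e.g. a comment-only source change
changed_cfg = {1: [2, 3], 2: [4], 3: [1], 4: []}   # block 3 now branches back to block 1

print(is_stale(recorded, same_cfg), is_stale(recorded, changed_cfg))
```

The point of the design is that the decision depends only on CFG shape, so source edits that leave the CFG intact keep the profile usable, while a real CFG change is caught before the profile is applied.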

A new switch -fpseudo-probe-for-profiling is added to enable sample PGO with pseudo instrumentation, similar to -fdebug-info-for-profiling for AutoFDO. Input profile is still provided through the same switch used by today’s AutoFDO, namely -fprofile-sample-use, and the profile loader will automatically decide how to load and annotate profile depending on whether input profile is dwarf-based or pseudo-probe based.

Can you compare the source change tolerance of pseudo probe based approach vs debug info based approach?

[wenlei] Pseudo-probe should be more resilient to source changes; see my reply on the motivation for pseudo-probe. It can tolerate source changes as long as they don’t alter the CFG. In contrast, a change that deletes a comment and shifts line offsets can cause perf churn with the line-based approach. We've been bitten by this a few times – people making comment-only changes during a holiday freeze, only to find a surprising perf regression due to AutoFDO 😊. Pseudo-probe also opens up the possibility of fuzzy CFG matching when a source change mutates the CFG, which would make it even more resilient.

New profile format and profile generation

We extend current profile format in order to be able to represent a full context-sensitive profile and also encode pseudo-probe info. This is done without drastically diverging from today’s AutoFDO profile format so that existing tools and libraries can be reused with minor changes (e.g. llvm-profdata, profiler reader and writer).

For a context-sensitive profile, we extend the profile format by changing the function profile header line to include calling context in addition to function name. With today’s AutoFDO, we have a single profile header for each function to represent its accumulative profile. E.g. This is the profile header for foo, with 1290 total samples, and 74 header samples.

foo:1290:74

For CSSPGO, we will have multiple profile headers for a single function’s profile, each representing profile for a specific calling context as shown below. CSSPGO profile header is bracketed to differentiate from today’s AutoFDO.

[main:12 @ bar:3 @ foo]:279:54

[main:19 @ zoo:7 @ foo]:1011:20

sounds good.

With calling context encoded in the function header, we no longer need a nested function profile for inlinees. Instead, a context profile will be represented uniformly using context strings in the function profile header, regardless of whether the calls in the context are inlined or not. The flat structure makes sure that context profile is easily indexable. The change is mostly transparent to tools like llvm-profdata. Support for binary profile format has not been added yet, but should be easy to do.
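
To show how the flat, bracketed header lends itself to indexing, here is a small Python sketch that parses both header styles shown above. The function name `parse_profile_header` and the tuple layout are illustrative choices, not part of any existing tool, and call sites are kept as plain strings.

```python
def parse_profile_header(line):
    """Parse 'foo:1290:74' or '[main:12 @ bar:3 @ foo]:279:54'.

    Returns (context, total_samples, header_samples), where context is a
    tuple of (function, call_site) pairs and the leaf has call_site None.
    """
    body, total, head = line.strip().rsplit(":", 2)
    if body.startswith("[") and body.endswith("]"):
        frames = []
        for frame in body[1:-1].split(" @ "):
            name, _, call_site = frame.partition(":")
            frames.append((name, call_site or None))
    else:
        frames = [(body, None)]   # today's context-less AutoFDO header
    return tuple(frames), int(total), int(head)


headers = [
    "[main:12 @ bar:3 @ foo]:279:54",
    "[main:19 @ zoo:7 @ foo]:1011:20",
]
# Because every context profile is a top-level entry, a plain dictionary
# keyed by the context is enough to index the whole profile.
index = {}
for header in headers:
    context, total, head = parse_profile_header(header)
    index[context] = (total, head)

print(index[(("main", "12"), ("bar", "3"), ("foo", None))])
```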

It is still useful to annotate (at least with a comment line) whether a profile is for a top-level function or an inline instance.

[wenlei] Agreed, and that’s in our plan too - we need that for tuning purpose.

For pseudo-probe, we repurposed the line to count map of AutoFDO profile to be block Id to count map. This only changes the interpretation of profile content rather than the representation, hence all reader/writer helpers can be reused.

There's a new profile generation tool, llvm-profgen, which implements the virtual unwinder for context-sensitive profiling and uses the .pseudo_probe section to map the binary profile to a pre-opt CFG profile. Since profile generation is a critical piece of the workflow, we’d like to propose including the tool as part of LLVM, alongside llvm-profdata.

Preliminary Results

To quantitatively assess profile quality improvement brought by pseudo-instrumentation, we introduce a profile quality metric. We measure the metric by first annotating an optimized binary with the MIR block execution counts derived from a profile. The binary is then sampled and the counts are compared against the annotation. The weighted relative delta is used as an indicator for profile quality (lower is better).
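
The exact formula for the metric is not spelled out here. One plausible reading of "weighted relative delta", offered purely as an assumption, is the per-block relative error weighted by each block's measured count, which reduces to the sum of absolute differences divided by the total measured count. A Python sketch of that reading, with made-up block names and counts on the same scale:

```python
def weighted_relative_delta(annotated, measured):
    """Assumed metric: sum(|annotated - measured|) / sum(measured).

    This equals the relative error of each block, |a - m| / m, averaged
    with weights m / sum(measured). Lower is better.
    """
    total = sum(measured.values())
    if total == 0:
        return 0.0
    delta = sum(abs(annotated.get(block, 0) - count)
                for block, count in measured.items())
    return delta / total


annotated = {"entry": 100, "loop": 50, "exit": 10}   # counts derived from the profile
measured = {"entry": 90, "loop": 60, "exit": 10}     # counts from sampling the binary
quality = weighted_relative_delta(annotated, measured)
print(f"{quality:.2%}")
```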

Table below shows the profile quality metric for SPEC2006. We can see from the numbers that the profile quality of pseudo-instrumentation sample PGO is much better than AutoFDO and close to instrumentation PGO.

Profile quality metric    Baseline AutoFDO    Instrumentation PGO    Sample PGO w/ Pseudo Instrumentation
SPEC2006                  24.58%              15.70%                 16.21%

Instrumentation PGO does not have context sensitivity, so I would expect it scores worse than CSSPGO. Do you know why it is better here?

[wenlei] This is for evaluating effectiveness of pseudo-probe exclusively. We have all inlining turned off for this experiment, and this is without context-sensitive profile for Sample PGO either, so the comparison should be fair, and Instrumentation PGO should be the upper bound.

We also measured performance and code size on SPEC2006 with CSSPGO. The measurement was done with MonoLTO and new pass manager, with tuning for FDO inliner to accommodate context-sensitive profile, and used training dataset for both pass1 and pass2. The result shows ~2% performance win on top of today’s AutoFDO, with ~4% .text reduction, see table below.

SPEC2006           Performance                                                Code Size
                   AutoFDO over LTO   CSSPGO over LTO   CSSPGO over AutoFDO   AutoFDO over LTO   CSSPGO over LTO   CSSPGO over AutoFDO
Geomean Delta %    6.80%              8.70%             2.04%                 11.17%             6.66%             4.06%
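
As a sanity check on how the three columns relate, the short Python calculation below reproduces the "CSSPGO over AutoFDO" figures from the two "over LTO" figures. The interpretation is an inference from the numbers rather than something stated in the RFC: it assumes the performance deltas are runtime reductions relative to LTO and the code size deltas in the first two columns are .text increases relative to LTO.

```python
# Performance: read each delta as a runtime reduction relative to LTO.
autofdo_time = 1 - 6.80 / 100
csspgo_time = 1 - 8.70 / 100
perf_win_pct = (1 - csspgo_time / autofdo_time) * 100

# Code size: read each delta as a .text increase relative to LTO.
autofdo_size = 1 + 11.17 / 100
csspgo_size = 1 + 6.66 / 100
size_reduction_pct = (1 - csspgo_size / autofdo_size) * 100

print(f"{perf_win_pct:.2f}% {size_reduction_pct:.2f}%")
```

Under those assumptions the computed values round to 2.04% and 4.06%, matching the table.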

While the SPEC2006 benchmark suite is different from large workloads, we think the results demonstrated the potential of CSSPGO and served its purpose for proof of concept. We plan to continue tuning and start to evaluate larger internal workloads soon, and we’d like to upstream our work. Feedback is welcome!

What is the performance win with pseudo-probe alone?

[wenlei] We don’t have numbers for pseudo-probe alone. As I mentioned earlier, a profile quality improvement may not translate directly into a perf win without heuristic changes. That’s why we evaluate pseudo-probe exclusively with the profile quality metric. The hope is that it will open up opportunities for better optimizations; e.g. it could potentially help the Machine Function Splitting pass too. That said, pseudo-probe does bring an extra win for CSSPGO compared to line-based CSSPGO on some benchmarks, but we didn’t tune CSSPGO with a line-based profile, so we didn’t aggregate numbers, as the comparison wouldn’t be fair either.

thanks,

David

Thanks,

Wenlei & Hongtao

Wei Mi via llvm-dev

Aug 7, 2020, 8:32:13 PM

to Wenlei He, llvm...@lists.llvm.org, Xinliang David Li, Hongtao Yu

Thanks for the proposal; the performance improvement over existing AutoFDO is impressive.

On Fri, Aug 7, 2020 at 11:28 AM Wenlei He <wen...@fb.com> wrote:

Hi All,

Our team at Facebook is building a new context-sensitive Sample PGO as an alternative to the existing AutoFDO. We’d like to share our motivation, propose a new design, and reveal preliminary results on benchmarks. We will refer to the proposed design as CSSPGO in this RFC.

The new CSSPGO leverages simultaneous LBR and stack sampling to construct a full context-sensitive profile. It doesn’t rely on previous inlining like today’s AutoFDO to get context-sensitive profile, and it also doesn’t need a separate post-inline context-sensitive profile like CSPGO. In addition, we introduced pseudo-instrumentation for more accurate mapping from binary samples back to IR, similar to instrumentation PGO, but without any measure-able runtime overhead that is usually associated with instrumentation.

We have a functioning implementation for the new CSSPGO now. Initial results on SPEC2006 shows ~2% geomean performance win on top of AutoFDO (with MonoLTO and NewPM) and ~4% .text size reduction at the same time.

Motivation

AutoFDO is a big success as it lowers the entry barrier for PGO significantly while still delivering substantial performance boost. However, there’s still a gap between AutoFDO and its instrumentation counterpart. From several failed internal attempts to improve AutoFDO, we realized that the bottleneck of AutoFDO lies in its profile quality. With the current level of profile quality, it’s difficult to reap more performance win because good heuristics are often limited by inferior profile. That prompted a systemic effort to investigate and improve AutoFDO framework. Specifically, we’re trying to handle the two biggest sources of profile quality issues:

AutoFDO relies on a limited context-sensitive profile collected based on previous inlining. Therefore it can only replay or prune the previous inlining. With the main CGSCC inliner, post-inline counts are not accurate due to scaling of context-less profile, which affects the effectiveness of later passes such as profile-guided code layout.
Dwarf line and discriminator info aren’t always well-maintained throughout the compilation, thus using them as anchors to map binary samples back to the IR can sometimes be inaccurate, which leads to inferior profile quality and limits PGO performance.

I acknowledge these issues. We also found that the current AFDO profile doesn't keep edge information, which leads to a suboptimal profile in some cases. Since the profile format needs to be redesigned for component 1, I am wondering whether it is possible to extend the format so that it can incorporate edge information as well.

About pseudo probe: you don't seem to mention it in this proposal, but does it still provide the ability to solve the source drift issue you mentioned before? If it does, how is that achieved?

To lift the above limitations, we’d like to propose an alternative design that consists of two components: 1) Context-sensitive sample PGO, 2) Sample to IR mapping using pseudo probes. The goal is to further improve sample PGO performance while maintaining usability and keeping training runtime overhead at zero. In addition, we hope the CSSPGO framework can also open up opportunities for new optimizations with more stringent requirements on profile quality.

I like both ideas. Can those two components be orthogonal? For the first component, I hope the existing debug-information-based AutoFDO can benefit from it as well, with some extension to the current profile format.

Context-sensitive Sample PGO

The effectiveness of BOLT, Propeller and CSPGO all demonstrated the importance of context-sensitive profile for PGO. However there are two limitations with the existing approaches.

The current solutions focus on leveraging a context-sensitive profile to attain an accurate post-inline profile that helps achieve a better code layout, but do not use the context-sensitive profile to drive better inlining.
The current solutions involve multiple training processes and profiles (e.g. a post-inline profile for CSPGO, or a post-link profile for BOLT and Propeller), which incurs higher operational cost and complicates the build and release workflow.

We propose a full context-sensitive sample profiling infrastructure that utilizes both LBR and call stack samples at the same time to synthesize a profile with a full context sensitivity. The key advantage is that rather than relying on previous inlining or a separate profile, the profile collected with the new approach will have full calling contexts recovered from both inlined and not inlined call sites. To achieve an accurate post-inline profile, a separate profile is no longer needed. Instead, the post-inline profile can be directly derived from adjusting the input profile based on all inline decisions. The richer context-sensitive profile also enables better inline decisions. The infrastructure has two key components listed below.

Synthesizing context-sensitive LBR with a virtual unwinder

To make sample PGO’s input profile context aware, we need to know the call stack of each LBR fall through path. That is done by sampling LBR and call stack simultaneously. With that, each sample will contain a call stack in addition to LBR entries. We use level 2 PEBS to control sampling skid so that the leaf frame from stack sample aligns with leaf frame from LBR. The raw call stack sample describes the calling context for the leaf LBR entry. In addition, by unwinding “call” and “return” (including implicit ones from inlinee) from LBR entries backwards on top of raw stack samples, we can recover the calling context for each of the LBR entries from the sample, thus synthesizing context-sensitive LBR profile.

What if the stack unwinding is not intact? For example, tail call optimization may currently cause unwinding issues in perf, and the frame pointer or call frame information may not be properly maintained.

We can then generate context-sensitive sample PGO profile using the context-sensitive LBR profile. In the new profile, a function’s profile becomes a collection of profiles, each representing a profile for a given calling context.

Will the profile size be significantly larger?

Wenlei He via llvm-dev

Aug 7, 2020, 9:18:38 PM

to Wei Mi, llvm...@lists.llvm.org, Xinliang David Li, Hongtao Yu

Thanks for the feedback and questions, Wei. See my replies inline.

From: Wei Mi <w...@google.com>
Date: Friday, August 7, 2020 at 5:32 PM
To: Wenlei He <wen...@fb.com>
Cc: "llvm...@lists.llvm.org" <llvm...@lists.llvm.org>, Xinliang David Li <dav...@google.com>, Hongtao Yu <h...@fb.com>
Subject: Re: [RFC] Context-sensitive Sample PGO with Pseudo-Instrumentation

Thanks for the proposal; the performance improvement over existing AutoFDO is impressive.

On Fri, Aug 7, 2020 at 11:28 AM Wenlei He <wen...@fb.com> wrote:

Hi All,

Our team at Facebook is building a new context-sensitive Sample PGO as an alternative to the existing AutoFDO. We’d like to share our motivation, propose a new design, and reveal preliminary results on benchmarks. We will refer to the proposed design as CSSPGO in this RFC.

The new CSSPGO leverages simultaneous LBR and stack sampling to construct a full context-sensitive profile. It doesn’t rely on previous inlining like today’s AutoFDO to get context-sensitive profile, and it also doesn’t need a separate post-inline context-sensitive profile like CSPGO. In addition, we introduced pseudo-instrumentation for more accurate mapping from binary samples back to IR, similar to instrumentation PGO, but without any measure-able runtime overhead that is usually associated with instrumentation.

We have a functioning implementation for the new CSSPGO now. Initial results on SPEC2006 shows ~2% geomean performance win on top of AutoFDO (with MonoLTO and NewPM) and ~4% .text size reduction at the same time.

Motivation

AutoFDO is a big success as it lowers the entry barrier for PGO significantly while still delivering substantial performance boost. However, there’s still a gap between AutoFDO and its instrumentation counterpart. From several failed internal attempts to improve AutoFDO, we realized that the bottleneck of AutoFDO lies in its profile quality. With the current level of profile quality, it’s difficult to reap more performance win because good heuristics are often limited by inferior profile. That prompted a systemic effort to investigate and improve AutoFDO framework. Specifically, we’re trying to handle the two biggest sources of profile quality issues:

AutoFDO relies on a limited context-sensitive profile collected based on previous inlining. Therefore it can only replay or prune the previous inlining. With the main CGSCC inliner, post-inline counts are not accurate due to scaling of context-less profile, which affects the effectiveness of later passes such as profile-guided code layout.
Dwarf line and discriminator info aren’t always well-maintained throughout the compilation, thus using them as anchors to map binary samples back to the IR can sometimes be inaccurate, which leads to inferior profile quality and limits PGO performance.

[wenlei] Yes, we have implemented an “add-on” that can encode edges in addition to probes/blocks in the .pseudo_probe section, and we also have a way to represent edges in the new profile. But that’s not critical for the framework and initial evaluation, which is why it’s not mentioned in this RFC. We did that mostly to enable offline count inference algorithm experiments. We will share more details on that later. I’m curious: what issue did you see due to the lack of edge info?

About pseudo probe: you don't seem to mention it in this proposal, but does it still provide the ability to solve the source drift issue you mentioned before? If it does, how is that achieved?

[wenlei] Pseudo-probe handles source drift reasonably well and has good resilience against source changes. It can tolerate any source change that doesn’t alter the CFG, so the issue we ran into with the line-based approach, where deleting a comment led to a big regression, isn’t going to happen with pseudo-probe. For changes that do alter the CFG, we could also employ fuzzy CFG matching in the future. The bottom line is that using probes and the CFG as the profile carrier inherently provides richer info, so it’s easier for PGO to see through source changes and still make sense of a stale profile. (We didn’t expand on the source drift issue in the initial RFC, but I just mentioned it in my reply to David, as a secondary motivation for pseudo-probe.)

To lift the above limitations, we’d like to propose an alternative design that consists of two components: 1) Context-sensitive sample PGO, 2) Sample to IR mapping using pseudo probes. The goal is to further improve sample PGO performance while maintaining usability and keeping training runtime overhead at zero. In addition, we hope the CSSPGO framework can also open up opportunities for new optimizations with more stringent requirements on profile quality.

[wenlei] Thanks. Yes, they’re orthogonal. But we need both for peak performance, and we want to focus tuning effort on the combination. Also see my reply to David’s questions.

Context-sensitive Sample PGO

The effectiveness of BOLT, Propeller and CSPGO all demonstrated the importance of context-sensitive profile for PGO. However there are two limitations with the existing approaches.

The current solutions focus on leveraging a context-sensitive profile to attain an accurate post-inline profile that helps achieve a better code layout, but do not use the context-sensitive profile to drive better inlining.
The current solutions involve multiple training processes and profiles (e.g. a post-inline profile for CSPGO, or a post-link profile for BOLT and Propeller), which incurs higher operational cost and complicates the build and release workflow.

We propose a full context-sensitive sample profiling infrastructure that utilizes both LBR and call stack samples at the same time to synthesize a profile with a full context sensitivity. The key advantage is that rather than relying on previous inlining or a separate profile, the profile collected with the new approach will have full calling contexts recovered from both inlined and not inlined call sites. To achieve an accurate post-inline profile, a separate profile is no longer needed. Instead, the post-inline profile can be directly derived from adjusting the input profile based on all inline decisions. The richer context-sensitive profile also enables better inline decisions. The infrastructure has two key components listed below.

Synthesizing context-sensitive LBR with a virtual unwinder

To make sample PGO’s input profile context aware, we need to know the call stack of each LBR fall through path. That is done by sampling LBR and call stack simultaneously. With that, each sample will contain a call stack in addition to LBR entries. We use level 2 PEBS to control sampling skid so that the leaf frame from stack sample aligns with leaf frame from LBR. The raw call stack sample describes the calling context for the leaf LBR entry. In addition, by unwinding “call” and “return” (including implicit ones from inlinee) from LBR entries backwards on top of raw stack samples, we can recover the calling context for each of the LBR entries from the sample, thus synthesizing context-sensitive LBR profile.
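To make the backward unwinding step concrete, here is a minimal sketch in Python. It is an illustration under simplifying assumptions, not the actual implementation: branches are described by function names rather than addresses, implicit calls and returns from inlinees are ignored, and the names `Branch` and `contexts_for_lbr` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Branch(NamedTuple):
    kind: str      # "call", "return", or "jump"
    src_func: str  # function containing the branch instruction
    dst_func: str  # function containing the branch target


def contexts_for_lbr(stack: List[str], lbr: List[Branch]) -> List[Tuple[str, ...]]:
    """Recover the calling context that was live right after each LBR branch.

    stack: sampled call stack, outermost caller first, leaf last.
    lbr:   LBR entries, newest first.
    Returns one context per LBR entry (newest first). Unwinding stops early
    if the stack and the LBR disagree, e.g. because the stack was truncated.
    """
    frames = list(stack)
    contexts = []
    for br in lbr:
        # After the branch executed, the leaf frame must be the branch target.
        if not frames or frames[-1] != br.dst_func:
            break
        contexts.append(tuple(frames))
        if br.kind == "call":
            # Undo the call: the callee frame did not exist before it.
            frames.pop()
        elif br.kind == "return":
            # Undo the return: the callee frame was still on top before it.
            frames.append(br.src_func)
        # A plain jump leaves the frame chain unchanged.
    return contexts


# main calls foo, foo calls bar, bar returns, foo calls baz, sample lands in baz.
example_stack = ["main", "foo", "baz"]
example_lbr = [
    Branch("call", "foo", "baz"),
    Branch("return", "bar", "foo"),
    Branch("call", "foo", "bar"),
    Branch("call", "main", "foo"),
]
for ctx in contexts_for_lbr(example_stack, example_lbr):
    print(" @ ".join(ctx))
```

Each recovered context can then be attached to the fall-through range that starts at the corresponding branch target, which is what turns a flat LBR sample into a context-sensitive one.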

What if the stack unwinding is not intact? For example, tail call optimization may currently cause unwinding issues in perf, and frame pointer or call frame information may not be properly maintained.

[wenlei] That’s a good question. Currently, we have frame pointer optimization (FPO) and tail call optimization disabled for experiments. FPO is disabled for our production builds as well, so it’s not a problem for us. For tail call, we’ll need to evaluate the cost-benefit and see what we can do. We know there’s a heuristic to recover a single missing frame due to tail call, which we haven’t implemented yet; beyond that, perhaps we can revisit leveraging dwarf unwinding, or live with either an imperfect profile or tail call disabled. We also implemented a special case for samples that land in the prolog or epilog, where the frame chain isn’t ready. However, even with both FPO and tail call disabled, we still see truncated stack samples, which we’re investigating. But the perf results are with profiles containing truncated/imperfect stack samples, so it looks like a small portion of imperfect profile doesn’t impact the effectiveness of CSSPGO too much.

We can then generate context-sensitive sample PGO profile using the context-sensitive LBR profile. In the new profile, a function’s profile becomes a collection of profiles, each representing a profile for a given calling context.

Will the profile size be significantly larger?

[wenlei] Currently the text CS profile is 1-10x larger. But there are ways to bring the size down and we’re working on them: 1) trim cold contexts, 2) leverage compression from the extended binary format (should be effective for context strings that have duplicated long C++ mangled names), 3) consider a fixed-length integer context representation, e.g. rolling hash. Also see my replies to David’s question on this.
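As a rough illustration of option 1, the sketch below models a context-sensitive profile as a map from calling context to sample count and trims contexts below a threshold. Folding the trimmed samples into a context-less base entry for the same function is an assumption made here so that totals are preserved; the thread does not say what happens to trimmed samples. The name `trim_cold_contexts` and the threshold are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Tuple

# A calling context: outermost caller first, profiled function last.
Context = Tuple[str, ...]


def trim_cold_contexts(profile: Dict[Context, int], threshold: int) -> Dict[Context, int]:
    """Drop contexts with fewer than `threshold` samples.

    Assumption: samples of a trimmed context are merged into the context-less
    base profile of the same function, so no samples are lost.
    """
    trimmed: Dict[Context, int] = defaultdict(int)
    for ctx, samples in profile.items():
        if len(ctx) > 1 and samples < threshold:
            trimmed[(ctx[-1],)] += samples  # keep only the function itself
        else:
            trimmed[ctx] += samples
    return dict(trimmed)


cs_profile = {
    ("main", "foo", "bar"): 900,
    ("main", "baz", "bar"): 3,
    ("init", "bar"): 2,
    ("main", "foo"): 950,
}
print(trim_cold_contexts(cs_profile, threshold=10))
```

In this toy input the two cold `bar` contexts collapse into a single base entry, so the profile shrinks from four contexts to three while the total sample count stays the same.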

Xinliang David Li via llvm-dev

Aug 7, 2020, 10:57:52 PM8/7/20

to Wenlei He, llvm...@lists.llvm.org, Wei Mi, Hongtao Yu

On Fri, Aug 7, 2020 at 4:44 PM Wenlei He <wen...@fb.com> wrote:

Thanks for the thoughtful questions, David. See my answers inline.

Thanks,

Wenlei

From: Xinliang David Li <dav...@google.com>
Date: Friday, August 7, 2020 at 1:24 PM
To: Wenlei He <wen...@fb.com>
Cc: "llvm...@lists.llvm.org" <llvm...@lists.llvm.org>, Wei Mi <w...@google.com>, Hongtao Yu <h...@fb.com>
Subject: Re: [RFC] Context-sensitive Sample PGO with Pseudo-Instrumentation

Wenlei, Thanks for the interesting proposal! please see my replies inline below.

On Fri, Aug 7, 2020 at 11:28 AM Wenlei He <wen...@fb.com> wrote:

Hi All,

Our team at Facebook is building a new context-sensitive Sample PGO as an alternative to the existing AutoFDO. We’d like to share our motivation, propose a new design, and reveal preliminary results on benchmarks. We will refer to the proposed design as CSSPGO in this RFC.

The new CSSPGO leverages simultaneous LBR and stack sampling to construct a full context-sensitive profile.

Can you share more details on this? LBR only has 32 entries, so it won't give you full call context, so stack unwinding is needed. What is the overhead you see in production environment?

[wenlei] We are not worried about overhead in the production environment, as the sampling rate there is extremely low. We did measure locally, however: for stack sampling and level 2 PEBS on top of regular LBR sampling, the overhead still isn’t very noticeable, but I don’t have numbers at hand.

I assume this is with no-omit-frame-pointer option right?

It doesn’t rely on previous inlining like today’s AutoFDO to get context-sensitive profile, and it also doesn’t need a separate post-inline context-sensitive profile like CSPGO.

What is the sample profile data size impact with the full context information?

[wenlei] Text CS profile is normally around 1x-10x of regular profile size, with all live context included. We plan to trim cold context, which we expect to bring the size down in a meaningful way. Another source of size increase is the context string, which could contain duplicated mangled names (can be very long for C++ templated code), but should be very compressible with the built-in compression support from extended binary profile. We will move to extended binary format, and leverage the compression support if needed. We can also consider more efficient fixed-length integer context representation (similar to rolling hash).

What is the average and max number of live contexts you have seen? Statically it grows exponentially as the depth of the context increases.

In addition, we introduced pseudo-instrumentation for more accurate mapping from binary samples back to IR, similar to instrumentation PGO, but without any measure-able runtime overhead that is usually associated with instrumentation.

Is CSSPGO inherently dependent upon pseudo-probe or is it orthogonal? I hope that it is the latter :)

[wenlei] They’re orthogonal. Context-sensitive SPGO can work without pseudo-probe and still use dwarf. Our plan is to keep context-sensitive SPGO working w/ and w/o pseudo-probe functionality-wise, but we only look at perf and tune with the two combined.

great.

We have a functioning implementation for the new CSSPGO now. Initial results on SPEC2006 shows ~2% geomean performance win on top of AutoFDO (with MonoLTO and NewPM) and ~4% .text size reduction at the same time.

Motivation

AutoFDO is a big success as it lowers the entry barrier for PGO significantly while still delivering substantial performance boost. However, there’s still a gap between AutoFDO and its instrumentation counterpart. From several failed internal attempts to improve AutoFDO, we realized that the bottleneck of AutoFDO lies in its profile quality. With the current level of profile quality, it’s difficult to reap more performance win because good heuristics are often limited by inferior profile. That prompted a systemic effort to investigate and improve AutoFDO framework. Specifically, we’re trying to handle the two biggest sources of profile quality issues:

AutoFDO relies on a limited context-sensitive profile collected based on previous inlining. Therefore it can only replay or prune the previous inlining. With the main CGSCC inliner, post-inline counts are not accurate due to scaling of context-less profile, which affects the effectiveness of later passes such as profile-guided code layout.

I acknowledge the limitation here.

Dwarf line and discriminator info aren’t always well-maintained throughout the compilation, thus using them as anchors to map binary samples back to the IR can sometimes be inaccurate, which leads to inferior profile quality and limits PGO performance.

I think we need more quantification of the impact of using debug information for matching purposes: how much performance is left on the table due to this, and are these issues fixable or not?

[wenlei] The first table in the result section is comparing pseudo-probe with AutoFDO and Instr. PGO, all with inlining turned off. So that’s a quantitative assessment of the effectiveness of pseudo-probe. It’s hard to assess performance benefit though, because PGO performance is a function of profile quality and heuristic. Currently heuristics are tuned to cope with the profile quality we have, so it may not do justice for profile quality improvements that pseudo-probe brings us.

One example is how FDO inliner evaluates call site. It checks callee’s total sample count instead of callee’s entry count. This is less than ideal, but we couldn’t fix it due to profile quality issues – we can’t reliably get inlinee’s entry count with dwarf based approach, see discussion in https://reviews.llvm.org/D60086. That problem is solved with pseudo-probe, but until we change the inliner, we won’t see perf win from that particular profile quality improvement. There are other similar cases too, and that’s why we used profile quality metric instead of performance to assess pseudo-probe.

Can you change the inliner to use entry count when probe based profile is used?

Some of the issues may be fixable with dwarf info maintenance, but the engineering cost to find and fix all issues is non-trivial. We think maintaining anchor as IR is a more sustainable alternative than maintaining anchor as metadata.

To lift the above limitations, we’d like to propose an alternative design that consists of two components: 1) Context-sensitive sample PGO, 2) Sample to IR mapping using pseudo probes. The goal is to further improve sample PGO performance while maintaining usability and keeping training runtime overhead at zero. In addition, we hope the CSSPGO framework can also open up opportunities for new optimizations with more stringent requirements on profile quality.

CSSPGO is a very attractive optimization by itself. Can you provide more motivation for the pseudo probes?

[wenlei] One way to look at the combination of pseudo-probe and context-sensitive sample PGO is that the former brings sample PGO closer to instrumentation PGO, while the latter is to sample PGO what the two-stage CSPGO, or even a post-link optimizer, is to instrumentation PGO. These are two orthogonal problems that need separate solutions.

There are also differences though:

1) CSPGO has lots of flow sensitivity and PLO has even more flow sensitivity, while CSSPGO does not. CSSPGO has the advantage of guiding the inliner as well.

2) Pseudo-probes are inserted pretty early in the pipeline, so it won't beat instrumentation PGO performance as the latter has early inlining to expose some CS. In other words, Pseudo-probe depends on CSSPGO, but not the other way around.

There’re other secondary motivations for pseudo-probe as well beyond its profile quality benefits that I didn’t mention earlier:

1). Stale profile detection. With line numbers, it’s hard to detect and react to stale profile. Pseudo-probes are tied to blocks so it’s effectively using CFG as carrier for profile, which makes stale profile detection easier.

2). Resilience to source changes. We’ve seen cases where deleting a single line of comment caused an 8% perf regression for a large service because it completely messed up profile annotation for a critical path. That will not happen with pseudo-probe – any source change not altering CFG will be tolerated without perf impact.

While this is true, the problem with the CFG based approach is that a local CFG change can cause the whole function to lose its profile, which can be bad too. The debug info based approach allows partial matching while relying on a propagation algorithm to compensate for the rest.

3). Possibility of offline count inference. We have an experiment that encodes edges alongside with probes (blocks), so more sophisticated offline count inference algorithm is possible to further improve profile quality. Our algorithm researchers are working on new profile inference solution now.

Is this needed because critical edges cannot be split as in instrumentation based PGO?

Ok. Also see my reply above. It seems to me that the line shifting problem should be solvable for AFDO (or make it more tolerant).  

It would be nice to see the main source of precision loss of AFDO here. Probably related to the missing edge information Wei mentioned.

thanks,

David

Wei Mi via llvm-dev

Aug 8, 2020, 12:56:17 AM8/8/20

to Wenlei He, llvm...@lists.llvm.org, Xinliang David Li, Hongtao Yu

On Fri, Aug 7, 2020 at 6:18 PM Wenlei He <wen...@fb.com> wrote:

Thanks for the feedback and questions, Wei. See my replies inline.

From: Wei Mi <w...@google.com>
Date: Friday, August 7, 2020 at 5:32 PM
To: Wenlei He <wen...@fb.com>
Cc: "llvm...@lists.llvm.org" <llvm...@lists.llvm.org>, Xinliang David Li <dav...@google.com>, Hongtao Yu <h...@fb.com>
Subject: Re: [RFC] Context-sensitive Sample PGO with Pseudo-Instrumentation

Thanks for the proposal and the performance improvement over existing AutoFDO is impressive.

On Fri, Aug 7, 2020 at 11:28 AM Wenlei He <wen...@fb.com> wrote:

Hi All,

Our team at Facebook is building a new context-sensitive Sample PGO as an alternative to the existing AutoFDO. We’d like to share our motivation, propose a new design, and reveal preliminary results on benchmarks. We will refer to the proposed design as CSSPGO in this RFC.

The new CSSPGO leverages simultaneous LBR and stack sampling to construct a full context-sensitive profile. It doesn’t rely on previous inlining like today’s AutoFDO to get context-sensitive profile, and it also doesn’t need a separate post-inline context-sensitive profile like CSPGO. In addition, we introduced pseudo-instrumentation for more accurate mapping from binary samples back to IR, similar to instrumentation PGO, but without any measure-able runtime overhead that is usually associated with instrumentation.

We have a functioning implementation for the new CSSPGO now. Initial results on SPEC2006 shows ~2% geomean performance win on top of AutoFDO (with MonoLTO and NewPM) and ~4% .text size reduction at the same time.

Motivation

AutoFDO is a big success as it lowers the entry barrier for PGO significantly while still delivering substantial performance boost. However, there’s still a gap between AutoFDO and its instrumentation counterpart. From several failed internal attempts to improve AutoFDO, we realized that the bottleneck of AutoFDO lies in its profile quality. With the current level of profile quality, it’s difficult to reap more performance win because good heuristics are often limited by inferior profile. That prompted a systemic effort to investigate and improve AutoFDO framework. Specifically, we’re trying to handle the two biggest sources of profile quality issues:

AutoFDO relies on a limited context-sensitive profile collected based on previous inlining. Therefore it can only replay or prune the previous inlining. With the main CGSCC inliner, post-inline counts are not accurate due to scaling of context-less profile, which affects the effectiveness of later passes such as profile-guided code layout.
Dwarf line and discriminator info aren’t always well-maintained throughout the compilation, thus using them as anchors to map binary samples back to the IR can sometimes be inaccurate, which leads to inferior profile quality and limits PGO performance.

I acknowledge these issues. We also found that the current AFDO profile doesn't keep edge information, which leads to a suboptimal profile in some cases. Since the profile format needs to be redesigned for component 1, I am wondering whether it is possible to extend the profile format so that it can incorporate edge information as well.

[wenlei] Yes, we have implemented an “add-on” that could encode edges in addition to probes/blocks in .pseudo_probe section, and we also have a way to represent edges in new profile. But that’s not critical for the framework and initial evaluation, which is why it’s not mentioned in this RFC. We did that mostly for enabling offline count inference algorithm experiments. We will share more details on that later. Curious what is the issue you saw due to lack of edge info?

There are critical edges in the CFG. The compiler cannot infer all the edge counts from basic block counts when critical edges are involved, so the probabilities of some branches are imprecise.
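A small worked example of that ambiguity, sketched in Python. The CFG fragment is hypothetical: blocks A and B each branch to both C and D, so all four edges are critical. Two different edge-count assignments then produce identical block counts, which is why block counts alone cannot pin down the branch probabilities.

```python
from collections import defaultdict
from typing import Dict, Tuple

Edge = Tuple[str, str]  # (source block, destination block)


def block_counts(edge_counts: Dict[Edge, int]) -> Dict[str, int]:
    """Block counts implied by edge counts for this fragment.

    A and B only appear as sources and C and D only as destinations, so a
    block's count is simply the sum over the edges that touch it.
    """
    counts: Dict[str, int] = defaultdict(int)
    for (src, dst), n in edge_counts.items():
        counts[src] += n
        counts[dst] += n
    return dict(counts)


# A always goes to C and B always goes to D.
skewed = {("A", "C"): 10, ("A", "D"): 0, ("B", "C"): 0, ("B", "D"): 10}
# A and B both split evenly between C and D.
even = {("A", "C"): 5, ("A", "D"): 5, ("B", "C"): 5, ("B", "D"): 5}

print(block_counts(skewed))
print(block_counts(even))
```

Both assignments give every block a count of 10, yet the probability of the A to C branch is 1.0 in one case and 0.5 in the other. Recording edges in the profile, or splitting critical edges as instrumentation does, removes this ambiguity.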

About pseudo probe, it seems you don't mention it in this proposal, but does it still provide the ability to solve the source drift issue you mentioned before? If it does, how is that achieved?

[wenlei] Pseudo-probe handles source drift reasonably well, and has good resilience against source changes. It can tolerate any source change that doesn’t alter the CFG, so the issue we ran into with the line-based approach, where deleting a comment led to a big regression, isn’t going to happen with pseudo-probe. For changes that do alter the CFG, we could also employ fuzzy CFG matching in the future. The bottom line is that using probes and the CFG as the profile carrier inherently provides richer info, so it’s easier for PGO to see through source changes and still make sense of a stale profile. (We didn’t expand on the source drift issue in the initial RFC, but I just mentioned that part in my reply to David, as a secondary motivation for pseudo-probe.)

I see.

To lift the above limitations, we’d like to propose an alternative design that consists of two components: 1) Context-sensitive sample PGO, 2) Sample to IR mapping using pseudo probes. The goal is to further improve sample PGO performance while maintaining usability and keeping training runtime overhead at zero. In addition, we hope the CSSPGO framework can also open up opportunities for new optimizations with more stringent requirements on profile quality.

I like both ideas. Can those two components be orthogonal? For the first component, I hope the existing debug information based AutoFDO can benefit from it as well, with some extension to the current profile format.

[wenlei] Thanks. Yes, they’re orthogonal. But we need both for peak performance, and we want to focus tuning effort on the combination. Also see my reply to David’s questions.

That is great!

Context-sensitive Sample PGO

The effectiveness of BOLT, Propeller and CSPGO all demonstrated the importance of context-sensitive profile for PGO. However there are two limitations with the existing approaches.

The current solutions focus on leveraging a context-sensitive profile to attain an accurate post-inline profile that helps achieve a better code layout, but do not use the context-sensitive profile to drive better inlining.
The current solutions involve multiple training processes and profiles (e.g. a post-inline profile for CSPGO, or a post-link profile for BOLT and Propeller), which incurs higher operational cost and complicates the build and release workflow.

We propose a full context-sensitive sample profiling infrastructure that utilizes both LBR and call stack samples at the same time to synthesize a profile with a full context sensitivity. The key advantage is that rather than relying on previous inlining or a separate profile, the profile collected with the new approach will have full calling contexts recovered from both inlined and not inlined call sites. To achieve an accurate post-inline profile, a separate profile is no longer needed. Instead, the post-inline profile can be directly derived from adjusting the input profile based on all inline decisions. The richer context-sensitive profile also enables better inline decisions. The infrastructure has two key components listed below.

Synthesizing context-sensitive LBR with a virtual unwinder

To make sample PGO’s input profile context aware, we need to know the call stack of each LBR fall through path. That is done by sampling LBR and call stack simultaneously. With that, each sample will contain a call stack in addition to LBR entries. We use level 2 PEBS to control sampling skid so that the leaf frame from stack sample aligns with leaf frame from LBR. The raw call stack sample describes the calling context for the leaf LBR entry. In addition, by unwinding “call” and “return” (including implicit ones from inlinee) from LBR entries backwards on top of raw stack samples, we can recover the calling context for each of the LBR entries from the sample, thus synthesizing context-sensitive LBR profile.

What if the stack unwinding is not intact? For example, tail call optimization may currently cause unwinding issues in perf, and frame pointer or call frame information may not be properly maintained.

[wenlei] That’s a good question. Currently, we have frame pointer optimization (FPO) and tail call optimization disabled for experiments. FPO is disabled for our production builds as well, so it’s not a problem for us. For tail call, we’ll need to evaluate the cost-benefit and see what we can do. We know there’s a heuristic to recover a single missing frame due to tail call, which we haven’t implemented yet; beyond that, perhaps we can revisit leveraging dwarf unwinding, or live with either an imperfect profile or tail call disabled. We also implemented a special case for samples that land in the prolog or epilog, where the frame chain isn’t ready. However, even with both FPO and tail call disabled, we still see truncated stack samples, which we’re investigating. But the perf results are with profiles containing truncated/imperfect stack samples, so it looks like a small portion of imperfect profile doesn’t impact the effectiveness of CSSPGO too much.

Thanks.

We can then generate context-sensitive sample PGO profile using the context-sensitive LBR profile. In the new profile, a function’s profile becomes a collection of profiles, each representing a profile for a given calling context.

Will the profile size be significantly larger?

[wenlei] Currently the text CS profile is 1-10x larger. But there are ways to bring the size down and we’re working on them: 1) trim cold contexts, 2) leverage compression from the extended binary format (should be effective for context strings that have duplicated long C++ mangled names), 3) consider a fixed-length integer context representation, e.g. rolling hash. Also see my replies to David’s question on this.

Trimming cold contexts could be very effective.

Wenlei He via llvm-dev

Aug 8, 2020, 1:53:50 AM8/8/20

to Xinliang David Li, llvm...@lists.llvm.org, Wei Mi, Hongtao Yu

See my answers inline.

From: Xinliang David Li <dav...@google.com>
Date: Friday, August 7, 2020 at 7:57 PM
To: Wenlei He <wen...@fb.com>
Cc: "llvm...@lists.llvm.org" <llvm...@lists.llvm.org>, Wei Mi <w...@google.com>, Hongtao Yu <h...@fb.com>
Subject: Re: [RFC] Context-sensitive Sample PGO with Pseudo-Instrumentation

On Fri, Aug 7, 2020 at 4:44 PM Wenlei He <wen...@fb.com> wrote:

Thanks for the thoughtful questions, David. See my answers inline.

Thanks,

Wenlei

From: Xinliang David Li <dav...@google.com>
Date: Friday, August 7, 2020 at 1:24 PM
To: Wenlei He <wen...@fb.com>
Cc: "llvm...@lists.llvm.org" <llvm...@lists.llvm.org>, Wei Mi <w...@google.com>, Hongtao Yu <h...@fb.com>
Subject: Re: [RFC] Context-sensitive Sample PGO with Pseudo-Instrumentation

Wenlei, Thanks for the interesting proposal! please see my replies inline below.

On Fri, Aug 7, 2020 at 11:28 AM Wenlei He <wen...@fb.com> wrote:

Hi All,

Our team at Facebook is building a new context-sensitive Sample PGO as an alternative to the existing AutoFDO. We’d like to share our motivation, propose a new design, and reveal preliminary results on benchmarks. We will refer to the proposed design as CSSPGO in this RFC.

The new CSSPGO leverages simultaneous LBR and stack sampling to construct a full context-sensitive profile.

Can you share more details on this? LBR only has 32 entries, so it won't give you full call context, so stack unwinding is needed. What is the overhead you see in production environment?

[wenlei] We are not worried about overhead in the production environment, as the sampling rate there is extremely low. We did measure locally, however: for stack sampling and level 2 PEBS on top of regular LBR sampling, the overhead still isn’t very noticeable, but I don’t have numbers at hand.

I assume this is with no-omit-frame-pointer option right?

[wenlei] Right, and tail call is off too for our experiments, but we may keep it on for production usage later. See my reply to Wei’s question on this.

It doesn’t rely on previous inlining like today’s AutoFDO to get context-sensitive profile, and it also doesn’t need a separate post-inline context-sensitive profile like CSPGO.

What is the sample profile data size impact with the full context information?

[wenlei] Text CS profile is normally around 1x-10x of regular profile size, with all live context included. We plan to trim cold context, which we expect to bring the size down in a meaningful way. Another source of size increase is the context string, which could contain duplicated mangled names (can be very long for C++ templated code), but should be very compressible with the built-in compression support from extended binary profile. We will move to extended binary format, and leverage the compression support if needed. We can also consider more efficient fixed-length integer context representation (similar to rolling hash).

What is the average and max number of live contexts you have seen? Statically it grows exponentially as the depth of the context increases.

[wenlei] I guess you meant the ratio of number of live contexts to number of functions? I haven’t looked, but I’d expect profile size ratio to be a good proxy for that.

In addition, we introduced pseudo-instrumentation for more accurate mapping from binary samples back to IR, similar to instrumentation PGO, but without any measure-able runtime overhead that is usually associated with instrumentation.

Is CSSPGO inherently dependent upon pseudo-probe or is it orthogonal? I hope that it is the latter :)

[wenlei] They’re orthogonal. Context-sensitive SPGO can work without pseudo-probe and still use dwarf. Our plan is to keep context-sensitive SPGO working w/ and w/o pseudo-probe functionality-wise, but we only look at perf and tune with the two combined.

great.

We have a functioning implementation for the new CSSPGO now. Initial results on SPEC2006 shows ~2% geomean performance win on top of AutoFDO (with MonoLTO and NewPM) and ~4% .text size reduction at the same time.

Motivation

AutoFDO is a big success as it lowers the entry barrier for PGO significantly while still delivering substantial performance boost. However, there’s still a gap between AutoFDO and its instrumentation counterpart. From several failed internal attempts to improve AutoFDO, we realized that the bottleneck of AutoFDO lies in its profile quality. With the current level of profile quality, it’s difficult to reap more performance win because good heuristics are often limited by inferior profile. That prompted a systemic effort to investigate and improve AutoFDO framework. Specifically, we’re trying to handle the two biggest sources of profile quality issues:

AutoFDO relies on a limited context-sensitive profile collected based on previous inlining. Therefore it can only replay or prune the previous inlining. With the main CGSCC inliner, post-inline counts are not accurate due to scaling of context-less profile, which affects the effectiveness of later passes such as profile-guided code layout.

I acknowledge the limitation here.

Dwarf line and discriminator info aren’t always well-maintained throughout the compilation, thus using them as anchors to map binary samples back to the IR can sometimes be inaccurate, which leads to inferior profile quality and limits PGO performance.

I think we need more quantification of the impact of using debug information for matching purposes: how much performance is left on the table due to this, and are these issues fixable or not?

[wenlei] The first table in the result section is comparing pseudo-probe with AutoFDO and Instr. PGO, all with inlining turned off. So that’s a quantitative assessment of the effectiveness of pseudo-probe. It’s hard to assess performance benefit though, because PGO performance is a function of profile quality and heuristic. Currently heuristics are tuned to cope with the profile quality we have, so it may not do justice for profile quality improvements that pseudo-probe brings us.

One example is how FDO inliner evaluates call site. It checks callee’s total sample count instead of callee’s entry count. This is less than ideal, but we couldn’t fix it due to profile quality issues – we can’t reliably get inlinee’s entry count with dwarf based approach, see discussion in https://reviews.llvm.org/D60086. That problem is solved with pseudo-probe, but until we change the inliner, we won’t see perf win from that particular profile quality improvement. There are other similar cases too, and that’s why we used profile quality metric instead of performance to assess pseudo-probe.

Can you change the inliner to use entry count when probe based profile is used?

[wenlei] Yes, we already made that change, and that’s one of the “few other improvements for the FDO inliner” I mentioned in the RFC. This is one example of the coupling between heuristic and profile quality.

Some of the issues may be fixable with dwarf info maintenance, but the engineering cost to find and fix all issues is non-trivial. We think maintaining anchor as IR is a more sustainable alternative than maintaining anchor as metadata.

To lift the above limitations, we’d like to propose an alternative design that consists of two components: 1) Context-sensitive sample PGO, 2) Sample to IR mapping using pseudo probes. The goal is to further improve sample PGO performance while maintaining usability and keeping training runtime overhead at zero. In addition, we hope the CSSPGO framework can also open up opportunities for new optimizations with more stringent requirements on profile quality.

CSSPGO is a very attractive optimization by itself. Can you provide more motivation for the pseudo probes?

[wenlei] One way to look at the combination of pseudo-probe and context-sensitive sample PGO is that the former brings sample PGO closer to instrumentation PGO, while the latter is to sample PGO what the two-stage CSPGO, or even a post-link optimizer, is to instrumentation PGO. These are two orthogonal problems that need separate solutions.

There are also differences though:

1) CSPGO has lots of flow sensitivity and PLO has even more flow sensitivity, while CSSPGO does not. CSSPGO has the advantage of guiding the inliner as well.

[wenlei] Fair point. Though I’m wondering how much perf win flow sensitivity brings to PGO. Curious if you have data for this. My guess is that context sensitivity is much more visible than flow sensitivity for PGO’s effectiveness.

2) Pseudo-probes are inserted pretty early in the pipeline, so it won't beat instrumentation PGO performance as the latter has early inlining to expose some CS. In other words, Pseudo-probe depends on CSSPGO, but not the other way around.

[wenlei] We intentionally insert pseudo-probe early for better resilience to compiler version changes, knowing that context-sensitivity will be covered by CSSPGO. We could also insert pseudo-probe later, like Instr PGO, to cover some context-sensitivity. We chose to do pseudo-instrumentation early because we view the combination as a package, even though the two can be decoupled for a clean design. That said, I agree that it’s easier for CSSPGO to work without pseudo-probe than for pseudo-probe to work without CSSPGO.

There’re other secondary motivations for pseudo-probe as well beyond its profile quality benefits that I didn’t mention earlier:

1). Stale profile detection. With line numbers, it’s hard to detect and react to stale profile. Pseudo-probes are tied to blocks so it’s effectively using CFG as carrier for profile, which makes stale profile detection easier.

2). Resilience to source changes. We’ve seen cases where deleting a single comment line caused an 8% perf regression for a large service because it completely messed up profile annotation for a critical path. That will not happen with pseudo-probe – any source change that does not alter the CFG will be tolerated without perf impact.

While this is true, the problem with the CFG-based approach is that a local CFG change can cause the whole profile to be lost, which can be bad too. The debug-info-based approach allows partial matching while relying on a propagation algorithm to compensate for the rest.

[wenlei] If we want to tolerate a local CFG change and still match the majority of the CFG, we could employ fuzzy CFG matching and still use propagation to infer the unmatched parts. I think that should be easy to do, and still more effective than line-based fuzzy/partial matching. That’s something we planned to implement too.

3). Possibility of offline count inference. We have an experiment that encodes edges alongside with probes (blocks), so more sophisticated offline count inference algorithm is possible to further improve profile quality. Our algorithm researchers are working on new profile inference solution now.

This is needed because critical edges cannot be split as in instrumentation-based PGO?

[wenlei] Yes, this is one of the cases we want to cover. We also have the option to insert nop for critical edges, but we want to avoid that, as it may lead to visible run time overhead.
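
The resilience argument in 2) above can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Block` type, the line offsets, the probe IDs, and all counts are invented, and real profile annotation is considerably more involved.

```python
from typing import Dict, List, NamedTuple


class Block(NamedTuple):
    name: str
    line_offset: int  # line relative to function start (debug-info anchor)
    probe_id: int     # hypothetical per-block probe ID (CFG anchor)


# Function layout at the time the profile was collected.
old_blocks = [Block("entry", 2, 1), Block("loop", 5, 2), Block("exit", 7, 3)]
line_profile = {2: 10, 5: 900, 7: 10}   # counts keyed by line offset
probe_profile = {1: 10, 2: 900, 3: 10}  # counts keyed by probe ID


def delete_line(blocks: List[Block], deleted_offset: int) -> List[Block]:
    """Model deleting one source line: every later block moves up by one."""
    return [
        b._replace(line_offset=b.line_offset - 1) if b.line_offset > deleted_offset else b
        for b in blocks
    ]


def annotate(blocks: List[Block], profile: Dict[int, int], key: str) -> Dict[str, int]:
    """Attach counts to blocks by looking up the chosen anchor in the profile."""
    return {b.name: profile.get(getattr(b, key), 0) for b in blocks}


# A comment at line offset 3 is deleted; the CFG is unchanged.
new_blocks = delete_line(old_blocks, 3)

by_line = annotate(new_blocks, line_profile, "line_offset")
by_probe = annotate(new_blocks, probe_profile, "probe_id")
print(by_line)   # the hot loop no longer matches its old line offset
print(by_probe)  # probe IDs still match every block
```

In this model the line-keyed lookup loses the counts for every block after the deleted line, while the probe-keyed lookup is unaffected because the block structure did not change.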

[wenlei] Agreed that we can do better with line number approach too. But CFG as profile carrier has richer info than line, and is closer to profile which is inherently CFG based. So I think it should be easier with probe and CFG.

[wenlei] The edge count issue Wei mentioned isn’t handled by pseudo probe either, at least not for now. From our investigation, the problem here is more like death by a thousand cuts.

Hongtao Yu via llvm-dev

Aug 8, 2020, 1:14:58 PM8/8/20

to Wenlei He, Xinliang David Li, llvm...@lists.llvm.org, Wei Mi

A few add-ons.

1. AutoFDO relies on a limited context-sensitive profile collected based on previous inlining. Therefore it can only replay or prune the previous inlining. With the main CGSCC inliner, post-inline counts are not accurate due to scaling of context-less profile, which affects the effectiveness of later passes such as profile-guided code layout.

I acknowledge the limitation here.

2. Dwarf line and discriminator info aren’t always well-maintained throughout the compilation, thus using them as anchors to map binary samples back to the IR can sometimes be inaccurate, which leads to inferior profile quality and limits PGO performance.

I think we need more quantification of the impact of using debug information for matching purposes: how much performance is left on the table due to this, and are the issues fixable or not?

[wenlei] The first table in the result section is comparing pseudo-probe with AutoFDO and Instr. PGO, all with inlining turned off. So that’s a quantitative assessment of the effectiveness of pseudo-probe. It’s hard to assess performance benefit though, because PGO performance is a function of profile quality and heuristic. Currently heuristics are tuned to cope with the profile quality we have, so it may not do justice for profile quality improvements that pseudo-probe brings us.

One example is how FDO inliner evaluates call site. It checks callee’s total sample count instead of callee’s entry count. This is less than ideal, but we couldn’t fix it due to profile quality issues – we can’t reliably get inlinee’s entry count with dwarf based approach, see discussion in https://reviews.llvm.org/D60086. That problem is solved with pseudo-probe, but until we change the inliner, we won’t see perf win from that particular profile quality improvement. There are other similar cases too, and that’s why we used profile quality metric instead of performance to assess pseudo-probe.

Can you change the inliner to use entry count when probe based profile is used?

[hongtao] Yes, we strive to reach peak performance with the FDO inliner tuned for the combination of CSSPGO and pseudo probe. We haven't tuned for pseudo probe individually, despite promising initial results over AutoFDO on quite a few SPEC2k6 benchmarks.

Some of the issues may be fixable with dwarf info maintenance, but the engineering cost to find and fix all issues is non-trivial. We think maintaining the anchor as IR is a more sustainable alternative than maintaining the anchor as metadata.

To lift the above limitations, we’d like to propose an alternative design that consists of two components: 1) Context-sensitive sample PGO, 2) Sample to IR mapping using pseudo probes. The goal is to further improve sample PGO performance while maintaining usability and keeping training runtime overhead at zero. In addition, we hope the CSSPGO framework can also open up opportunities for new optimizations with more stringent requirements on profile quality.

CSSPGO is a very attractive optimization by itself. Can you provide more motivation for the pseudo probes?

[wenlei] One way to look at the combination of pseudo-probe and context-sensitive sample PGO is that the former brings sample PGO closer to instrumentation PGO, and the latter is to sample PGO what the two-stage CSPGO, or even a post-link optimizer, is to instrumentation PGO. These are two orthogonal problems that need separate solutions.

There are also differences though:

1) CSPGO has lots of flow sensitivity and PLO has even more flow sensitivity while CSSPGO does not. CSSPGO has the advantage to guide inliner as well

[hongtao] By flow-sensitivity, do you mean the execution trace of blocks in a function? This is missing from CSSPGO currently. Pseudo probe can be viewed as a cost-free instrumentation technique that correlates hardware samples to the IR for sample profiling. It may never achieve the precision of real instrumentation. It is currently combined with CSSPGO to obtain a context-sensitive profile. It can also be extended for flow-sensitivity (based on LBR) and value profiling (based on hardware register snapshot).

There’re other secondary motivations for pseudo-probe as well beyond its profile quality benefits that I didn’t mention earlier:

1). Stale profile detection. With line numbers, it’s hard to detect and react to stale profile. Pseudo-probes are tied to blocks so it’s effectively using CFG as carrier for profile, which makes stale profile detection easier.

2). Resilience to source changes. We’ve seen cases where deleting a single comment line caused an 8% perf regression for a large service because it completely messed up profile annotation for a critical path. That will not happen with pseudo-probe – any source change that does not alter the CFG will be tolerated without perf impact.

[hongtao] Yes, a local CFG change may invalidate a CFG-based profile. We are looking into a fuzzy CFG matching approach to minimize the invalidation. It may be based on CFG region analysis and value-numbering branch compares and function calls. On the other hand, the debug-info-based approach may not be resilient to code refactoring changes or semantics changes like branch flipping. We’d like users to be notified about such changes so that they can keep their profiles up-to-date.

3). Possibility of offline count inference. We have an experiment that encodes edges alongside with probes (blocks), so more sophisticated offline count inference algorithm is possible to further improve profile quality. Our algorithm researchers are working on new profile inference solution now.

This is needed because critical edges cannot be split as in instrumentation-based PGO?

[wenlei] Yes, this is one of the cases we want to cover. We also have the option to insert nop for critical edges, but we want to avoid that, as it may lead to visible run time overhead.
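
The thread does not spell out how stale profiles are detected. One possible scheme, assumed here purely for illustration, is to store a checksum of the function's CFG shape in the profile and compare it against the CFG seen at compile time. The function names and the CFG encoding below are hypothetical.

```python
import hashlib
from typing import Dict, List

# A CFG is modeled as {block_id: [successor block_ids]}.
CFG = Dict[int, List[int]]


def cfg_checksum(cfg: CFG) -> str:
    """Hash a canonical text form of the CFG so equal shapes hash equally."""
    canon = ";".join(
        f"{block}->{','.join(str(s) for s in sorted(succs))}"
        for block, succs in sorted(cfg.items())
    )
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()[:16]


def profile_is_stale(profile_checksum: str, current_cfg: CFG) -> bool:
    """A profile is treated as stale when the CFG shape no longer matches."""
    return profile_checksum != cfg_checksum(current_cfg)


# CFG at profiling time: 1 -> {2, 3}, 2 -> 4, 3 -> 4.
profiled_cfg = {1: [2, 3], 2: [4], 3: [4], 4: []}
recorded = cfg_checksum(profiled_cfg)

# Editing a comment leaves the CFG untouched; adding a branch changes it.
same_shape = {1: [3, 2], 2: [4], 3: [4], 4: []}
new_branch = {1: [2, 3], 2: [4, 5], 3: [4], 4: [], 5: [4]}

print(profile_is_stale(recorded, same_shape))  # CFG unchanged
print(profile_is_stale(recorded, new_branch))  # CFG changed
```

An exact checksum like this is all-or-nothing, which is the concern raised above about local CFG changes; the fuzzy CFG matching mentioned in the replies would relax that.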

Context-sensitive Sample PGO

The effectiveness of BOLT, Propeller and CSPGO all demonstrated the importance of context-sensitive profile for PGO. However there are two limitations with the existing approaches.

1. The current solutions focus on leveraging a context-sensitive profile to attain an accurate post-inline profile that helps achieve a better code layout, but do not use the context-sensitive profile to drive better inlining.

2. The current solutions involve multiple training processes and profiles (e.g. a post-inline profile for CSPGO, or a post-link profile for BOLT and Propeller), which incurs higher operational cost and complicates the build and release workflow.
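
As a rough illustration of why a context-sensitive profile matters for post-inline counts, the sketch below compares scaling a context-less profile against looking up the exact calling context. The profile layout, function names, and counts are all made up; this is not the CSSPGO profile format.

```python
from typing import Dict, Tuple

BlockCounts = Dict[str, int]

# Hypothetical context-sensitive samples for callee `foo`, keyed by call path.
cs_profile: Dict[Tuple[str, ...], BlockCounts] = {
    ("main", "a", "foo"): {"fast_path": 900, "slow_path": 0},
    ("main", "b", "foo"): {"fast_path": 0, "slow_path": 100},
}


def contextless(cs: Dict[Tuple[str, ...], BlockCounts]) -> BlockCounts:
    """Merge all contexts into one flat profile, as a context-less profile would."""
    merged: BlockCounts = {}
    for counts in cs.values():
        for block, n in counts.items():
            merged[block] = merged.get(block, 0) + n
    return merged


def scaled_inline_counts(flat: BlockCounts, callsite_count: int, callee_entry: int) -> BlockCounts:
    """Estimate post-inline counts by scaling the flat profile to the call site."""
    return {block: n * callsite_count // callee_entry for block, n in flat.items()}


flat = contextless(cs_profile)

# Inline `foo` into `b`: that call site accounts for 100 of foo's 1000 entries.
estimated = scaled_inline_counts(flat, callsite_count=100, callee_entry=1000)
exact = cs_profile[("main", "b", "foo")]

print(estimated)  # scaling predicts the fast path dominates
print(exact)      # in this context only the slow path runs
```

With scaling, the inlined copy in `b` inherits the hot/cold split of the merged profile, so a layout pass would favor the wrong path; the context-sensitive lookup gives the counts for that call path directly.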

Rahman Lavaee via llvm-dev

Aug 8, 2020, 1:15:25 PM8/8/20

to Wenlei He, llvm...@lists.llvm.org, Xinliang David Li, Wei Mi, Hongtao Yu

Hi Wenlei and Hongtao,

This sounds like an interesting (and complex) project. Do you think you can utilize the BB-info section (https://lists.llvm.org/pipermail/llvm-dev/2020-July/143512.html) as an alternative to pseudo probes?

Hongtao Yu via llvm-dev

Aug 8, 2020, 2:15:22 PM8/8/20

to Rahman Lavaee, Wenlei He, llvm...@lists.llvm.org, Xinliang David Li, Wei Mi

Hi Rahman,

Thanks for sharing the BB-info section proposal, which is a shiny idea. I think BB-info and pseudo probes deal with a similar problem in different spaces, i.e., mapping hardware samples to corresponding basic blocks. In the context of pseudo probes, we mainly focus on mapping samples back to source-level blocks, which are the input to the optimizer. Therefore we are building a persistent probe for each block that lives through extensive machine-independent and machine-dependent transforms. Besides basic blocks, a probe can also be attached to each value site of interest. So far only direct/indirect call sites are supported.

Xinliang David Li via llvm-dev

Aug 8, 2020, 2:15:45 PM8/8/20

to Wenlei He, llvm...@lists.llvm.org, Wei Mi, Hongtao Yu

right.

In addition, we introduced pseudo-instrumentation for more accurate mapping from binary samples back to IR, similar to instrumentation PGO, but without any measure-able runtime overhead that is usually associated with instrumentation.

Is CSSPGO inherently dependent upon pseudo-probe or is it orthogonal? I hope that it is the latter :)

[wenlei] They’re orthogonal. Context-sensitive SPGO can work without pseudo-probe and still use dwarf. Our plan is to keep context-sensitive SPGO working w/ and w/o pseudo-probe functionality-wise, but we only look at perf and tune with the two combined.

great.

We have a functioning implementation for the new CSSPGO now. Initial results on SPEC2006 shows ~2% geomean performance win on top of AutoFDO (with MonoLTO and NewPM) and ~4% .text size reduction at the same time.

Motivation

AutoFDO is a big success as it lowers the entry barrier for PGO significantly while still delivering substantial performance boost. However, there’s still a gap between AutoFDO and its instrumentation counterpart. From several failed internal attempts to improve AutoFDO, we realized that the bottleneck of AutoFDO lies in its profile quality. With the current level of profile quality, it’s difficult to reap more performance win because good heuristics are often limited by inferior profile. That prompted a systemic effort to investigate and improve AutoFDO framework. Specifically, we’re trying to handle the two biggest sources of profile quality issues:

AutoFDO relies on a limited context-sensitive profile collected based on previous inlining. Therefore it can only replay or prune the previous inlining. With the main CGSCC inliner, post-inline counts are not accurate due to scaling of context-less profile, which affects the effectiveness of later passes such as profile-guided code layout.

I acknowledge the limitation here.

Dwarf line and discriminator info aren’t always well-maintained throughout the compilation, thus using them as anchors to map binary samples back to the IR can sometimes be inaccurate, which leads to inferior profile quality and limits PGO performance.

I think we need more quantification of the impact of using debug information for matching purposes: how much performance is left on the table due to this, and are the issues fixable or not?

[wenlei] The first table in the result section is comparing pseudo-probe with AutoFDO and Instr. PGO, all with inlining turned off. So that’s a quantitative assessment of the effectiveness of pseudo-probe. It’s hard to assess performance benefit though, because PGO performance is a function of profile quality and heuristic. Currently heuristics are tuned to cope with the profile quality we have, so it may not do justice for profile quality improvements that pseudo-probe brings us.

One example is how FDO inliner evaluates call site. It checks callee’s total sample count instead of callee’s entry count. This is less than ideal, but we couldn’t fix it due to profile quality issues – we can’t reliably get inlinee’s entry count with dwarf based approach, see discussion in https://reviews.llvm.org/D60086. That problem is solved with pseudo-probe, but until we change the inliner, we won’t see perf win from that particular profile quality improvement. There are other similar cases too, and that’s why we used profile quality metric instead of performance to assess pseudo-probe.

Can you change the inliner to use entry count when probe based profile is used?

[wenlei] Yes, we already made that change, and that’s one of the “few other improvements for the FDO inliner” I mentioned in the RFC. This is one example of the coupling between heuristic and profile quality.

One way to measure performance is to use the exact pipeline setup for probe insertion as instrumentation PGO, and then do a 3-way comparison.

Some of the issues may be fixable with dwarf info maintenance, but the engineering cost to find and fix all issues is non-trivial. We think maintaining the anchor as IR is a more sustainable alternative than maintaining the anchor as metadata.

To lift the above limitations, we’d like to propose an alternative design that consists of two components: 1) Context-sensitive sample PGO, 2) Sample to IR mapping using pseudo probes. The goal is to further improve sample PGO performance while maintaining usability and keeping training runtime overhead at zero. In addition, we hope the CSSPGO framework can also open up opportunities for new optimizations with more stringent requirements on profile quality.

CSSPGO is a very attractive optimization by itself. Can you provide more motivation for the pseudo probes?

[wenlei] One way to look at the combination of pseudo-probe and context-sensitive sample PGO is that the former brings sample PGO closer to instrumentation PGO, and the latter is to sample PGO what the two-stage CSPGO, or even a post-link optimizer, is to instrumentation PGO. These are two orthogonal problems that need separate solutions.

There are also differences though:

1) CSPGO has lots of flow sensitivity and PLO has even more flow sensitivity while CSSPGO does not. CSSPGO has the advantage to guide inliner as well

[wenlei] Fair point. Though I’m wondering how much perf win does flow sensitivity bring to PGO? Curious if you have data for this. My guess is context sensitivity is much more visible than flow sensitivity for PGO’s effectiveness.

2) Pseudo-probes are inserted pretty early in the pipeline, so it won't beat instrumentation PGO performance as the latter has early inlining to expose some CS. In other words, Pseudo-probe depends on CSSPGO, but not the other way around.

[wenlei] We intentionally insert pseudo-probe early for better resilience to compiler version changes, knowing that context-sensitivity will be covered by CSSPGO. We could also insert pseudo-probe later, like Instr PGO, to cover some context-sensitivity. We chose to do pseudo-instrumentation early because we view the combination as a package, even though the two can be decoupled for a clean design. That said, I agree that it’s easier for CSSPGO to work without pseudo-probe than for pseudo-probe to work without CSSPGO.

There’re other secondary motivations for pseudo-probe as well beyond its profile quality benefits that I didn’t mention earlier:

1). Stale profile detection. With line numbers, it’s hard to detect and react to stale profile. Pseudo-probes are tied to blocks so it’s effectively using CFG as carrier for profile, which makes stale profile detection easier.

2). Resilience to source changes. We’ve seen cases where deleting a single comment line caused an 8% perf regression for a large service because it completely messed up profile annotation for a critical path. That will not happen with pseudo-probe – any source change that does not alter the CFG will be tolerated without perf impact.

While this is true, the problem with the CFG-based approach is that a local CFG change can cause the whole profile to be lost, which can be bad too. The debug-info-based approach allows partial matching while relying on a propagation algorithm to compensate for the rest.

[wenlei] If we want to tolerate a local CFG change and still match the majority of the CFG, we could employ fuzzy CFG matching and still use propagation to infer the unmatched parts. I think that should be easy to do, and still more effective than line-based fuzzy/partial matching. That’s something we planned to implement too.

ok.

thanks,

David

Xinliang David Li via llvm-dev

Aug 8, 2020, 2:16:15 PM8/8/20

to Hongtao Yu, llvm...@lists.llvm.org, Wei Mi

It would probably also be interesting to see some performance numbers for large server workloads :) Top-down inlining can potentially bloat code a lot, leading to worse performance for programs with a large instruction working set -- but this is of course tunable.

Some of the issues may be fixable with dwarf info maintenance, but the engineering cost to find and fix all issues is non-trivial. We think maintaining the anchor as IR is a more sustainable alternative than maintaining the anchor as metadata.

To lift the above limitations, we’d like to propose an alternative design that consists of two components: 1) Context-sensitive sample PGO, 2) Sample to IR mapping using pseudo probes. The goal is to further improve sample PGO performance while maintaining usability and keeping training runtime overhead at zero. In addition, we hope the CSSPGO framework can also open up opportunities for new optimizations with more stringent requirements on profile quality.

CSSPGO is a very attractive optimization by itself. Can you provide more motivation for the pseudo probes?

[wenlei] One way to look at the combination of pseudo-probe and context-sensitive sample PGO is that the former brings sample PGO closer to instrumentation PGO, and the latter is to sample PGO what the two-stage CSPGO, or even a post-link optimizer, is to instrumentation PGO. These are two orthogonal problems that need separate solutions.

There are also differences though:

1) CSPGO has lots of flow sensitivity and PLO has even more flow sensitivity while CSSPGO does not. CSSPGO has the advantage to guide inliner as well

[wenlei] Fair point. Though I’m wondering how much perf win does flow sensitivity bring to PGO? Curious if you have data for this. My guess is context sensitivity is much more visible than flow sensitivity for PGO’s effectiveness.

2) Pseudo-probes are inserted pretty early in the pipeline, so it won't beat instrumentation PGO performance as the latter has early inlining to expose some CS. In other words, Pseudo-probe depends on CSSPGO, but not the other way around.

[wenlei] We intentionally insert pseudo-probe early for better resilience to compiler version changes, knowing that context-sensitivity will be covered by CSSPGO. We could also insert pseudo-probe later, like Instr PGO, to cover some context-sensitivity. We chose to do pseudo-instrumentation early because we view the combination as a package, even though the two can be decoupled for a clean design. That said, I agree that it’s easier for CSSPGO to work without pseudo-probe than for pseudo-probe to work without CSSPGO.

[hongtao] By flow-sensitivity, do you mean the execution trace of blocks in a function?

More like a path-sensitive profile -- a realistic way of getting that is from post-CFG-transformation profiles. Rong is going to share a proposal based on the current AFDO implementation.

This is missing from CSSPGO currently. Pseudo probe can be viewed as a cost-free instrumentation technique that correlates hardware samples to the IR for sample profiling. It may never achieve the precision of real instrumentation. It is currently combined with CSSPGO to obtain a context-sensitive profile. It can also be extended for flow-sensitivity (based on LBR) and value profiling (based on hardware register snapshot).

There’re other secondary motivations for pseudo-probe as well beyond its profile quality benefits that I didn’t mention earlier:

1). Stale profile detection. With line numbers, it’s hard to detect and react to stale profile. Pseudo-probes are tied to blocks so it’s effectively using CFG as carrier for profile, which makes stale profile detection easier.

2). Resilience to source changes. We’ve seen cases where deleting a single comment line caused an 8% perf regression for a large service because it completely messed up profile annotation for a critical path. That will not happen with pseudo-probe – any source change that does not alter the CFG will be tolerated without perf impact.

While this is true, the problem with the CFG-based approach is that a local CFG change can cause the whole profile to be lost, which can be bad too. The debug-info-based approach allows partial matching while relying on a propagation algorithm to compensate for the rest.

[wenlei] If we want to tolerate a local CFG change and still match the majority of the CFG, we could employ fuzzy CFG matching and still use propagation to infer the unmatched parts. I think that should be easy to do, and still more effective than line-based fuzzy/partial matching. That’s something we planned to implement too.

[hongtao] Yes, a local CFG change may invalidate a CFG-based profile. We are looking into a fuzzy CFG matching approach to minimize the invalidation. It may be based on CFG region analysis and value-numbering branch compares and function calls. On the other hand, the debug-info-based approach may not be resilient to code refactoring changes or semantics changes like branch flipping. We’d like users to be notified about such changes so that they can keep their profiles up-to-date.

Would matching a profile against a flipped branch lead to wrongly swapped taken/not-taken weights?

thanks,

David

Hongtao Yu via llvm-dev

Aug 8, 2020, 3:06:22 PM8/8/20

to Xinliang David Li, llvm...@lists.llvm.org, Wei Mi

Replied inline.

[hongtao] Exactly. We haven’t tried large workloads yet, but it’s definitely one of our ultimate goals. We did refine the inliner with more size controls, but a lot more tuning remains. Upstreaming everything we have is our first step. We hope to see potential co-development/coordination in the future.

Some of the issues may be fixable with dwarf info maintenance, but the engineering cost to find and fix all issues is non-trivial. We think maintaining the anchor as IR is a more sustainable alternative than maintaining the anchor as metadata.

To lift the above limitations, we’d like to propose an alternative design that consists of two components: 1) Context-sensitive sample PGO, 2) Sample to IR mapping using pseudo probes. The goal is to further improve sample PGO performance while maintaining usability and keeping training runtime overhead at zero. In addition, we hope the CSSPGO framework can also open up opportunities for new optimizations with more stringent requirements on profile quality.

CSSPGO is a very attractive optimization by itself. Can you provide more motivation for the pseudo probes?

[wenlei] One way to look at the combination of pseudo-probe and context-sensitive sample PGO is that the former brings sample PGO closer to instrumentation PGO, while the latter is to sample PGO what the two-stage CSPGO, or even a post-link optimizer, is to instrumentation PGO. These are two orthogonal problems that need separate solutions.

There are also differences though:

1) CSPGO has lots of flow sensitivity and PLO has even more flow sensitivity while CSSPGO does not. CSSPGO has the advantage to guide inliner as well

[wenlei] Fair point. Though I’m wondering how much perf win does flow sensitivity bring to PGO? Curious if you have data for this. My guess is context sensitivity is much more visible than flow sensitivity for PGO’s effectiveness.

2) Pseudo-probes are inserted pretty early in the pipeline, so it won't beat instrumentation PGO performance as the latter has early inlining to expose some CS. In other words, Pseudo-probe depends on CSSPGO, but not the other way around.

[wenlei] We intentionally insert pseudo-probes early for better resilience to compiler version changes, knowing that context-sensitivity will be covered by CSSPGO. We could also insert pseudo-probes later, like Instr PGO, to cover some context-sensitivity. We chose to do pseudo-instrumentation early because we view the combination as a package, even though the two can be decoupled for a clean design. That said, I agree that it’s easier for CSSPGO to work without pseudo-probe than for pseudo-probe to work without CSSPGO.

[hongtao] By flow-sensitivity, do you mean the execution trace of blocks in a function?

More like the path sensitive profile -- a realistic way of getting that is from post cfg transformation profiles. Rong is going to share a proposal based on the current AFDO implementation.

[hongtao] Great, looking forward to Rong’s proposal.

This is missing from CSSPGO currently. Pseudo probe can be viewed as a cost-free instrumentation technique that correlates hardware samples to the IR for sample profiling. It may never achieve the precision of real instrumentation. It is currently combined with CSSPGO to obtain a context-sensitive profile. It can also be extended for flow-sensitivity (based on LBR) and value profiling (based on hardware register snapshot).

There’re other secondary motivations for pseudo-probe as well beyond its profile quality benefits that I didn’t mention earlier:

1). Stale profile detection. With line numbers, it’s hard to detect and react to stale profile. Pseudo-probes are tied to blocks so it’s effectively using CFG as carrier for profile, which makes stale profile detection easier.

2). Resilience to source changes. We’ve seen cases where deleting a single line of comment caused an 8% perf regression for a large service because it completely messed up profile annotation for a critical path. That will not happen with pseudo-probe: any source change that does not alter the CFG will be tolerated without perf impact.

While this is true, the problem with the CFG-based approach is that a local CFG change can invalidate the whole profile, which can be bad too. The debug-info-based approach allows partial matching while relying on a propagation algorithm to compensate for the rest.

[wenlei] If we want to tolerate local CFG changes and still match the majority of the CFG, we could employ fuzzy CFG matching and still use propagation to infer the unmatched parts. I think that should be easy to do, and still more effective than line-based fuzzy/partial matching. That’s something we planned to implement too.

[hongtao] Yes, a local CFG change may invalidate a CFG-based profile. We are looking into a fuzzy CFG matching approach to minimize the invalidation. It may be based on CFG region analysis and value-numbering branch compares and function calls. On the other hand, the debug-info-based approach may not be resilient to code refactoring changes or semantics changes like branch flipping. We’d like users to be notified about such changes so that they can keep their profiles up-to-date.

Would matching a profile against a flipped branch swap the taken/not-taken weights incorrectly?

[hongtao] Yes. If a user flips the then-else blocks of an if-statement, the current compiler will still apply the profile from the original code, which will lead to wrong branch weights.

Wenlei He via llvm-dev

Aug 8, 2020, 3:06:54 PM

to Hongtao Yu, Xinliang David Li, llvm...@lists.llvm.org, Wei Mi

Also see my replies inline.

[wenlei] Yes, as Hongtao pointed out the ultimate goal is definitely to improve performance of large server workloads. We wanted to start upstreaming the changes while working on evaluating perf on larger workloads. I think there’s benefit in upstreaming this work now, as it makes it possible for others to evaluate early, and also avoid us having to keep a large chunk of changes as private patches. What do you think?

You’re right that we cannot let the top-down inliner run unbounded. The current FDO inlining is bounded by the previous SCC inlining as it only does replay, so it’s very simple. Top-down inlining with a CS profile can go far beyond replay, so we needed call-site-prioritized BFS top-down inlining with a growth or size cap, which is already implemented internally. Again, this is among the “improvements for the FDO inliner” I mentioned earlier. 😊 There’s lots of tuning to be done, and we will likely have to constrain the FDO inliner initially and gradually let it take over more inlining for PGO as it matures. But I think the perf and size numbers from SPEC are a very good sign.

I also wanted to point out that even though we haven’t gotten to the point where we have perf numbers for large workloads (we simply haven’t tried yet, as we’re still refining the infrastructure), we do see quite a few cases in large workloads where top-down inlining with a CS profile and its specialization would help derive better inline decisions.

Eventually, with all pieces in place, we expect top-down inlining with a CS profile to save code size and hence help reduce the working set. This is because top-down inlining with a CS profile is more selective.
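To make the bounded top-down inlining idea concrete, here is a minimal sketch in Python. It is only one possible reading of "call site prioritized ... with a growth or size cap", not the internal implementation: the profile layout (a dict keyed by calling context), the size table, and the function name `top_down_inline` are all hypothetical.

```python
import heapq

def top_down_inline(root, cs_profile, sizes, size_cap):
    """Greedy top-down inlining sketch (hypothetical data model).

    cs_profile maps a calling context (tuple of function names, outermost
    first) to a list of (callee, call_count) pairs observed in that context.
    sizes maps a function name to its size. Hottest call sites are inlined
    first until the size cap would be exceeded.
    """
    total = sizes[root]
    inlined = []
    heap = []
    seq = 0  # tie-breaker so heap entries never compare tuples of names

    def push_call_sites(context):
        nonlocal seq
        for callee, count in cs_profile.get(context, []):
            heapq.heappush(heap, (-count, seq, context + (callee,)))
            seq += 1

    push_call_sites((root,))
    while heap:
        _, _, path = heapq.heappop(heap)
        callee = path[-1]
        if callee in path[:-1]:
            continue  # do not inline recursive calls in this sketch
        if total + sizes[callee] > size_cap:
            continue  # size cap: skip this candidate
        total += sizes[callee]
        inlined.append(path)
        push_call_sites(path)  # the callee's call sites, in this context
    return inlined, total

# Example: "baz" is hot under main->foo but cold under main->bar.
cs_profile = {
    ("main",): [("foo", 100), ("bar", 10)],
    ("main", "foo"): [("baz", 90)],
    ("main", "bar"): [("baz", 1)],
}
sizes = {"main": 10, "foo": 5, "bar": 5, "baz": 20}
decisions, final_size = top_down_inline("main", cs_profile, sizes, size_cap=40)
```

In this example the same callee is inlined in the hot context and left alone in the cold one, which is the kind of selectivity the context-sensitive profile is meant to enable.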

Some of the issues may be fixable with DWARF info maintenance, but the engineering cost to find and fix all issues is non-trivial. We think maintaining the anchor as IR is a more sustainable alternative than maintaining the anchor as metadata.

To lift the above limitations, we’d like to propose an alternative design that consists of two components: 1) Context-sensitive sample PGO, 2) Sample to IR mapping using pseudo probes. The goal is to further improve sample PGO performance while maintaining usability and keeping training runtime overhead at zero. In addition, we hope the CSSPGO framework can also open up opportunities for new optimizations with more stringent requirements on profile quality.

CSSPGO is a very attractive optimization by itself. Can you provide more motivation for the pseudo probes?

[wenlei] One way to look at the combination of pseudo-probe and context-sensitive sample PGO is that the former brings sample PGO closer to instrumentation PGO, while the latter is to sample PGO what the two-stage CSPGO, or even a post-link optimizer, is to instrumentation PGO. These are two orthogonal problems that need separate solutions.

There are also differences though:

1) CSPGO has lots of flow sensitivity and PLO has even more flow sensitivity while CSSPGO does not. CSSPGO has the advantage to guide inliner as well

[wenlei] Fair point. Though I’m wondering how much perf win does flow sensitivity bring to PGO? Curious if you have data for this. My guess is context sensitivity is much more visible than flow sensitivity for PGO’s effectiveness.

2) Pseudo-probes are inserted pretty early in the pipeline, so it won't beat instrumentation PGO performance as the latter has early inlining to expose some CS. In other words, Pseudo-probe depends on CSSPGO, but not the other way around.

[wenlei] We intentionally insert pseudo-probes early for better resilience to compiler version changes, knowing that context-sensitivity will be covered by CSSPGO. We could also insert pseudo-probes later, like Instr PGO, to cover some context-sensitivity. We chose to do pseudo-instrumentation early because we view the combination as a package, even though the two can be decoupled for a clean design. That said, I agree that it’s easier for CSSPGO to work without pseudo-probe than for pseudo-probe to work without CSSPGO.

[hongtao] By flow-sensitivity, do you mean the execution trace of blocks in a function?

More like the path sensitive profile -- a realistic way of getting that is from post cfg transformation profiles. Rong is going to share a proposal based on the current AFDO implementation.

[wenlei] How significant is flow-sensitivity compared to context-sensitivity? Looking forward to the proposal, and wondering if it can be combined with CSSPGO and pseudo-probe.

[hongtao] Great, looking forward to Rong’s proposal.

This is missing from CSSPGO currently. Pseudo probe can be viewed as a cost-free instrumentation technique that correlates hardware samples to the IR for sample profiling. It may never achieve the precision of real instrumentation. It is currently combined with CSSPGO to obtain a context-sensitive profile. It can also be extended for flow-sensitivity (based on LBR) and value profiling (based on hardware register snapshot).

There’re other secondary motivations for pseudo-probe as well beyond its profile quality benefits that I didn’t mention earlier:

1). Stale profile detection. With line numbers, it’s hard to detect and react to stale profile. Pseudo-probes are tied to blocks so it’s effectively using CFG as carrier for profile, which makes stale profile detection easier.

2). Resilience to source changes. We’ve seen cases where deleting a single line of comment caused an 8% perf regression for a large service because it completely messed up profile annotation for a critical path. That will not happen with pseudo-probe: any source change that does not alter the CFG will be tolerated without perf impact.

While this is true, the problem with the CFG-based approach is that a local CFG change can invalidate the whole profile, which can be bad too. The debug-info-based approach allows partial matching while relying on a propagation algorithm to compensate for the rest.

[wenlei] If we want to tolerate local CFG changes and still match the majority of the CFG, we could employ fuzzy CFG matching and still use propagation to infer the unmatched parts. I think that should be easy to do, and still more effective than line-based fuzzy/partial matching. That’s something we planned to implement too.

[hongtao] Yes, a local CFG change may invalidate a CFG-based profile. We are looking into a fuzzy CFG matching approach to minimize the invalidation. It may be based on CFG region analysis and value-numbering branch compares and function calls. On the other hand, the debug-info-based approach may not be resilient to code refactoring changes or semantics changes like branch flipping. We’d like users to be notified about such changes so that they can keep their profiles up-to-date.

Would matching a profile against a flipped branch swap the taken/not-taken weights incorrectly?

[hongtao] Yes. If a user flips the then-else blocks of an if-statement, the current compiler will still apply the profile from the original code, which will lead to wrong branch weights.

[wenlei] Right, that would lead to wrong weights, which is problematic for the line-based approach as it cannot tell that flipping has happened. A CFG/probe-based approach can do better.
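The flipped-branch problem can be illustrated with a toy model. This is only a sketch under simplifying assumptions: a "profile" here is a dict from line offset to sample count, and the block names and the helper `annotate_by_line` are hypothetical, not part of any LLVM API.

```python
def annotate_by_line(profile, source):
    """Assign counts to blocks purely by line offset.

    profile: {line_offset: sample_count}, collected on an older source.
    source:  {line_offset: block_name} for the source being compiled now.
    """
    return {block: profile.get(line, 0) for line, block in source.items()}

# Source the profile was collected on: the then-block (line 2) is hot.
old_source = {1: "cond", 2: "fast_path", 4: "slow_path"}
profile = {1: 1000, 2: 990, 4: 10}

# The user flips the then/else blocks; the line offsets stay the same.
new_source = {1: "cond", 2: "slow_path", 4: "fast_path"}

before = annotate_by_line(profile, old_source)
after = annotate_by_line(profile, new_source)
# The hot count now lands on the slow path, and nothing in the
# line-keyed profile signals that the source has changed.
```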

Rahman Lavaee via llvm-dev

Aug 8, 2020, 11:15:58 PM

to Hongtao Yu, llvm...@lists.llvm.org, Xinliang David Li, Wei Mi

Thanks for the kind words.

For the basic block mapping, would it not be sufficient if we add IR basic block ids to every BB info record? Since BB info emission is done at the end of codegen, the final BB records are all the machine basic blocks which have made it into the final binary.

Hongtao Yu via llvm-dev

Aug 8, 2020, 11:16:36 PM

to Rahman Lavaee, llvm...@lists.llvm.org, Xinliang David Li, Wei Mi

In addition to an IR block ID or probe ID, we’ll also need to know the inline context of a probe if it comes from an inlinee. The current pseudo probe encoding is based on a DFS walk of the inline tree. A MIR BB may contain probes from different inlinees, and we may need to extend the BB-info format to encode the inline contexts there. I’m happy to work with you on an encoding format that can be used for both Propeller and pseudo probes.

This is our current encoding format:

// FUNCTION BODY (one for each uninlined function present in the text section)
//     GUID (uint64)
//         GUID of the function
//     NPROBES (ULEB128)
//         Number of probes originating from this function.
//     NUM_INLINED_FUNCTIONS (ULEB128)
//         Number of callees inlined into this function, aka number of
//         first-level inlinees
//     PROBE RECORDS
//         A list of NPROBES entries. Each entry contains:
//           INDEX (ULEB128)
//           TYPE (uint4)
//             0 - block probe, 1 - indirect call, 2 - direct call
//           ATTRIBUTE (uint3)
//             1 - internal linkage, 2 - dangling
//           ADDRESS_TYPE (uint1)
//             0 - code address, 1 - address delta
//           CODE_ADDRESS (uint64 or ULEB128)
//             code address or address delta, depending on ADDRESS_TYPE
//     INLINED FUNCTION RECORDS
//         A list of NUM_INLINED_FUNCTIONS entries describing each of the inlined
//         callees. Each record contains:
//           INLINE SITE
//             GUID of the inlinee (uint64)
//             Line number | Discriminator (ULEB128)
//           FUNCTION BODY
//             A FUNCTION BODY entry describing the inlined function.
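As a rough illustration of how such a layout could be walked, here is a small Python sketch of a recursive reader. Several details are assumptions that the description above does not settle: little-endian uint64 fields, TYPE/ATTRIBUTE/ADDRESS_TYPE packed into one byte with TYPE in the low four bits and ADDRESS_TYPE in the top bit, and an address delta taken relative to the previously decoded probe. The function names are hypothetical; only the ULEB128 decoding follows the standard definition.

```python
import struct

def read_uleb128(buf, pos):
    """Decode an unsigned LEB128 value; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7

def read_u64(buf, pos):
    # Assumption: fixed-width fields are little-endian.
    return struct.unpack_from("<Q", buf, pos)[0], pos + 8

def read_function_body(buf, pos, last_addr=0):
    """Parse one FUNCTION BODY; return (record, next_pos, last_addr)."""
    guid, pos = read_u64(buf, pos)
    nprobes, pos = read_uleb128(buf, pos)
    ninlined, pos = read_uleb128(buf, pos)
    probes = []
    for _ in range(nprobes):
        index, pos = read_uleb128(buf, pos)
        packed = buf[pos]  # assumed bit layout, see the lead-in
        pos += 1
        if packed >> 7:  # ADDRESS_TYPE == 1: address delta
            delta, pos = read_uleb128(buf, pos)
            addr = last_addr + delta  # assumed: delta from previous probe
        else:
            addr, pos = read_u64(buf, pos)
        last_addr = addr
        probes.append({"index": index, "type": packed & 0x0F,
                       "attribute": (packed >> 4) & 0x07, "address": addr})
    inlinees = []
    for _ in range(ninlined):
        callee_guid, pos = read_u64(buf, pos)
        site, pos = read_uleb128(buf, pos)  # line number | discriminator
        body, pos, last_addr = read_function_body(buf, pos, last_addr)
        inlinees.append({"guid": callee_guid, "site": site, "body": body})
    return {"guid": guid, "probes": probes, "inlinees": inlinees}, pos, last_addr

# Hand-built example: one function with two probes and one inlinee.
buf = (struct.pack("<Q", 0x1111) + bytes([2, 1])
       + bytes([1, 0x00]) + struct.pack("<Q", 0x400000)  # block probe, absolute
       + bytes([2, 0x82, 0xAC, 0x02])                    # direct call, delta 300
       + struct.pack("<Q", 0x2222) + bytes([2])          # inline site
       + struct.pack("<Q", 0x2222) + bytes([1, 0])       # inlinee body header
       + bytes([1, 0x80, 0x04]))                         # block probe, delta 4
record, end, _ = read_function_body(buf, 0)
```

The recursion mirrors the DFS walk of the inline tree: each inlinee record embeds a full FUNCTION BODY, so the inline context of a probe is the chain of inline sites on the way down to it.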

From: Rahman Lavaee <rah...@google.com>

Date: Saturday, August 8, 2020 at 1:09 PM
To: Hongtao Yu <h...@fb.com>

Xinliang David Li via llvm-dev

Aug 8, 2020, 11:17:00 PM

to Wenlei He, llvm...@lists.llvm.org, Xinliang David Li, Wei Mi, Hongtao Yu

Sounds good to me. We will discuss how to organize the changes in a way that is most maintainable in patch reviews.

You’re right that we cannot let the top-down inliner run unbounded. The current FDO inlining is bounded by the previous SCC inlining as it only does replay, so it’s very simple. Top-down inlining with a CS profile can go far beyond replay, so we needed call-site-prioritized BFS top-down inlining with a growth or size cap, which is already implemented internally. Again, this is among the “improvements for the FDO inliner” I mentioned earlier. 😊 There’s lots of tuning to be done, and we will likely have to constrain the FDO inliner initially and gradually let it take over more inlining for PGO as it matures. But I think the perf and size numbers from SPEC are a very good sign.

I also wanted to point out that even though we haven’t gotten to the point where we have perf numbers for large workloads (we simply haven’t tried yet, as we’re still refining the infrastructure), we do see quite a few cases in large workloads where top-down inlining with a CS profile and its specialization would help derive better inline decisions.

Eventually, with all pieces in place, we expect top-down inlining with a CS profile to save code size and hence help reduce the working set. This is because top-down inlining with a CS profile is more selective.

sounds good.

Some of the issues may be fixable with DWARF info maintenance, but the engineering cost to find and fix all issues is non-trivial. We think maintaining the anchor as IR is a more sustainable alternative than maintaining the anchor as metadata.

To lift the above limitations, we’d like to propose an alternative design that consists of two components: 1) Context-sensitive sample PGO, 2) Sample to IR mapping using pseudo probes. The goal is to further improve sample PGO performance while maintaining usability and keeping training runtime overhead at zero. In addition, we hope the CSSPGO framework can also open up opportunities for new optimizations with more stringent requirements on profile quality.

CSSPGO is a very attractive optimization by itself. Can you provide more motivation for the pseudo probes?

[wenlei] One way to look at the combination of pseudo-probe and context-sensitive sample PGO is that the former brings sample PGO closer to instrumentation PGO, while the latter is to sample PGO what the two-stage CSPGO, or even a post-link optimizer, is to instrumentation PGO. These are two orthogonal problems that need separate solutions.

There are also differences though:

1) CSPGO has lots of flow sensitivity and PLO has even more flow sensitivity while CSSPGO does not. CSSPGO has the advantage to guide inliner as well

[wenlei] Fair point. Though I’m wondering how much perf win does flow sensitivity bring to PGO? Curious if you have data for this. My guess is context sensitivity is much more visible than flow sensitivity for PGO’s effectiveness.

2) Pseudo-probes are inserted pretty early in the pipeline, so it won't beat instrumentation PGO performance as the latter has early inlining to expose some CS. In other words, Pseudo-probe depends on CSSPGO, but not the other way around.

[wenlei] We intentionally insert pseudo-probes early for better resilience to compiler version changes, knowing that context-sensitivity will be covered by CSSPGO. We could also insert pseudo-probes later, like Instr PGO, to cover some context-sensitivity. We chose to do pseudo-instrumentation early because we view the combination as a package, even though the two can be decoupled for a clean design. That said, I agree that it’s easier for CSSPGO to work without pseudo-probe than for pseudo-probe to work without CSSPGO.

[hongtao] By flow-sensitivity, do you mean the execution trace of blocks in a function?

More like the path sensitive profile -- a realistic way of getting that is from post cfg transformation profiles. Rong is going to share a proposal based on the current AFDO implementation.

[wenlei] How significant is flow-sensitivity compared to context-sensitivity? Looking forward to the proposal, and wondering if it can be combined with CSSPGO and pseudo-probe.

They are mostly orthogonal though.

David

Xinliang David Li via llvm-dev

Aug 8, 2020, 11:17:32 PM

to Rahman Lavaee, llvm...@lists.llvm.org, Wei Mi, Hongtao Yu

On Sat, Aug 8, 2020 at 1:09 PM Rahman Lavaee <rah...@google.com> wrote:

Thanks for the kind words.
For the basic block mapping, would it not be sufficient if we add IR basic block ids to every BB info record? Since BB info emission is done at the end of codegen, the final BB records are all the machine basic blocks which have made it into the final binary.

My understanding is that pseudo probes need to be inserted early and that they do not rely on existing inlining behavior to get context-sensitive info for profiles.

David

Hongtao Yu via llvm-dev

Aug 8, 2020, 11:17:53 PM

to Xinliang David Li, Rahman Lavaee, llvm...@lists.llvm.org, Wei Mi

Pseudo probes are inserted very early so that inlinees’ probes can be propagated into the inliner. We rely on the inlined probes to construct a full context-sensitive profile for inlinees. This is needed so that we can collect a CS profile on a production build.

Wenlei He via llvm-dev

Aug 11, 2020, 1:56:13 AM

to Xinliang David Li, llvm...@lists.llvm.org, Xinliang David Li, Wei Mi, Hongtao Yu

Thanks for the feedback and discussion! We will be sending up patches soon then. The patches will be organized in three categories: 1) Context-sensitive sample PGO, 2) Pseudo Instrumentation, 3) A new profile generation tool that #1 and #2 depend on.

Rahman Lavaee via llvm-dev

Aug 12, 2020, 6:14:28 PM

to Hongtao Yu, llvm...@lists.llvm.org, Xinliang David Li, Wei Mi

Thanks for sharing the detailed description of the pseudo probes. It sheds light on the fact that pseudo probes are used not just for address mapping back to IR, but also building the "context-sensitive" profile of each function.
One question (although I think David asked this previously): how precise is the CODE_ADDRESS (specifically when basic blocks are duplicated/merged by machine passes)?

Thanks a lot for the detailed description.

Hongtao Yu via llvm-dev

Aug 12, 2020, 6:15:17 PM

to Rahman Lavaee, llvm...@lists.llvm.org, Xinliang David Li, Wei Mi

That’s a good question. During offline counts processing, the samples collected on the first physical instruction following a probe are counted towards the probe. This is mostly accurate unless the physical instruction is not on the same control-flow path as the probe (e.g., with a label sitting in between). The accuracy comes from the semantics of a block probe, which require the probe to be virtually executed exactly the same number of times before and after an optimization. We rely on a sophisticated counts inference tool to deal with corner cases and hardware noise.

Regarding duplicated blocks, the probes are naturally distributed to the newly created blocks, and the counts collected on the duplicated probes are accumulated to the original probe. Block merging is blocked by pseudo probes because, in the form of intrinsic calls, they differ in their call arguments. However, pseudo probes don’t block instruction merging.
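The two rules above (samples on the instruction right after a probe count towards that probe, and duplicated probes accumulate into the original) can be sketched as follows. The data model is an assumption for illustration only: `probe_at` maps the address of the first instruction after each probe copy to a hypothetical (function GUID, probe index) pair, and `samples` maps addresses to hit counts. This is not the actual profile generation tool.

```python
from collections import Counter

def attribute_samples(probe_at, samples):
    """Sum sample hits per probe identity.

    probe_at: {address: (func_guid, probe_index)} for the first physical
              instruction following each probe copy in the binary.
    samples:  {address: hit_count} from the sampled profile.
    """
    counts = Counter()
    for addr, hits in samples.items():
        probe = probe_at.get(addr)
        if probe is not None:
            counts[probe] += hits  # duplicated copies share one identity
    return counts

GUID = 0xABC
probe_at = {
    0x1000: (GUID, 1),
    0x1010: (GUID, 3),  # original copy of probe 3
    0x1040: (GUID, 3),  # copy created when the block was duplicated
}
samples = {0x1000: 50, 0x1010: 30, 0x1040: 20, 0x1044: 7}
counts = attribute_samples(probe_at, samples)
```

Samples at addresses that are not anchored to a probe (0x1044 here) are simply ignored in this sketch; in the described design, corner cases like that are left to the counts inference step.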

Thanks,

Hongtao
