[LLVMdev] Capabilities of Clang's PGO (e.g. improving code density)


Lee Hunt

unread,
May 27, 2015, 12:28:31 AM5/27/15
to llv...@cs.uiuc.edu

Hello –

 

I’m an engineer in Microsoft Office looking into the possible advantages of using PGO for our Android applications.

 

We at Microsoft have deep experience with Visual C++’s Profile Guided Optimization and often see a 10% or more reduction in the size of application code loaded after using PGO for key scenarios (e.g. application launch). Making applications launch quickly is very important to us, and reducing the number of code pages loaded helps with this goal.

 

Before we dig into turning it on, I’m wondering if there’s any pre-existing research / case studies about the code page reduction seen in other Clang PGO-enabled applications?  It sounds like there are some possible performance problems during instrumented runs due to counter contention, resulting in sluggish execution and perhaps skewed profile data: https://groups.google.com/forum/#!topic/llvm-dev/cDqYgnxNEhY.  I’d like an overview of the optimizations that PGO does, but I don’t find much by looking at the Clang PGO section: http://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization.
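For reference, the instrumentation workflow that manual section documents is roughly the following (a sketch with made-up file names; verify the exact flag spellings against your Clang version):

```shell
# 1. Build with frontend instrumentation.
clang -O2 -fprofile-instr-generate app.c -o app.instrumented

# 2. Run representative scenarios; each run writes a raw profile
#    (location controllable via the LLVM_PROFILE_FILE env var).
LLVM_PROFILE_FILE=launch.profraw ./app.instrumented

# 3. Merge raw profiles into the indexed format the compiler consumes.
llvm-profdata merge -output=app.profdata launch.profraw

# 4. Rebuild, feeding the profile back in.
clang -O2 -fprofile-instr-use=app.profdata app.c -o app
```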

 

For example, from reading different pages on how Clang PGO works, it’s unclear if it does “block reordering” (i.e. moving unexecuted code blocks to a distant code page, leaving only ‘hot’ executed code packed together for greater code density).  I find mention of “hot arc” optimization (-fprofile-arcs), but I’m unclear if this is the same thing.  Does Clang PGO do block reordering?

 

Thanks,

--Lee

Diego Novillo

unread,
May 27, 2015, 10:46:43 AM5/27/15
to Lee Hunt, llv...@cs.uiuc.edu
On Tue, May 26, 2015 at 11:47 PM, Lee Hunt <le...@exchange.microsoft.com> wrote:

> For example, from reading different pages on how Clang PGO, it’s unclear if
> it does “block reordering” (i.e. moving unexecuted code blocks to a distant
> code page, leaving only ‘hot’ executed code packed together for greater code
> density). I find mention of “hot arc” optimization (-fprofile-arcs) , but
> I’m unclear if this is the same thing. Does Clang PGO do block reordering?

A small clarification: Clang itself does not implement any
optimizations. Clang limits itself to generating LLVM IR. The
annotated IR is then used by some LLVM optimizers to guide decisions.
At this time, only a few optimization passes use the profile
information: block reordering and register allocation (to avoid
spilling on cold paths).

There are no other significant transformations that use profiling
information. We are working on that. Notably, we'd like to add
profiling-based decisions to the inliner, loop optimizers and the
vectorizer.


Diego.

_______________________________________________
LLVM Developers mailing list
LLV...@cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Xinliang David Li

unread,
May 27, 2015, 12:33:38 PM5/27/15
to Lee Hunt, llv...@cs.uiuc.edu
On Tue, May 26, 2015 at 8:47 PM, Lee Hunt <le...@exchange.microsoft.com> wrote:

Hello –

 

I’m an Engineer in Microsoft Office after looking into possible advantages of using PGO for our Android Applications.

 

We at Microsoft have deep experience with Visual C++’s Profile Guided Optimization and often see 10% or more reduction in the size of application code loaded after using PGO for key scenarios (e.g. application launch).  


Yes, this is true for GCC too.  Clang's PGO does not shrink code size yet.
 

Making applications launch quickly is very important to us, and reducing the number of code pages loaded helps with this goal.

 

Before we dig into turning it on, I’m wondering if there’s any pre-existing research / case studies about the code page reduction seen in other Clang PGO-enabled applications?  It sounds like there are some possible performance problems during instrumented runs due to counter contention, resulting in sluggish execution and perhaps skewed profile data: https://groups.google.com/forum/#!topic/llvm-dev/cDqYgnxNEhY


Counter contention is one issue. Redundant counter updates are another major issue (due to the early instrumentation). We are working on the latter and seeing great speedups.

 

I’d like an overview of the optimizations that PGO does, but I don’t find much from looking at the Clang PGO section: http://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization.


Profile data is not used in any IPA passes yet. It is used by many post-inline optimizations, though -- including block layout, the register allocator, etc.

 

 

For example, from reading different pages on how Clang PGO works, it’s unclear if it does “block reordering” (i.e. moving unexecuted code blocks to a distant code page, leaving only ‘hot’ executed code packed together for greater code density).


LLVM's block placement uses branch probability and frequency data, but there is no function splitting optimization yet.

 I find mention of “hot arc” optimization (-fprofile-arcs) , but I’m unclear if this is the same thing.  Does Clang PGO do block reordering?


It does reordering, but does not do splitting/partitioning.

David

 

 

Thanks,

--Lee

Lee Hunt

unread,
May 27, 2015, 1:21:49 PM5/27/15
to Xinliang David Li, llv...@cs.uiuc.edu

Thanks! Comments in-line below, marked [LeeHu]…

 

 

From: Xinliang David Li [mailto:xinli...@gmail.com]
Sent: Wednesday, May 27, 2015 9:29 AM
To: Lee Hunt
Cc: llv...@cs.uiuc.edu
Subject: Re: [LLVMdev] Capabilities of Clang's PGO (e.g. improving code density)

 

On Tue, May 26, 2015 at 8:47 PM, Lee Hunt <le...@exchange.microsoft.com> wrote:

Hello –

 

I’m an Engineer in Microsoft Office after looking into possible advantages of using PGO for our Android Applications.

 

We at Microsoft have deep experience with Visual C++’s Profile Guided Optimization and often see 10% or more reduction in the size of application code loaded after using PGO for key scenarios (e.g. application launch).  

 

Yes, this is true for GCC too.  Clang's PGO does not shrink code size yet.

 

[LeeHu] Note: I’m not talking about shrinking code size, but rather reordering it such that only ‘active’ branches within the profiled functions are grouped together in ‘hot’ code pages.  This is a very big optimization for us in the VC++ PGO toolchain.

We also have the “/LTCG” flag – which is seemingly similar to the “-flto” Clang flag -- that *does* shrink code by various means (dead code removal, common IL tree collapsing) because it can see all the object code for an entire produced target binary (e.g. .exe or .dll).

Does -flto also shrink code?

 

 Making application launch quickly is very important to us, and reducing the number of code pages loaded helps with this goal.

 

Before we dig into turning it on, I’m wondering if there’s any pre-existing research / case studies about possible code page reduction seen from other Clang PGO-enabled applications?  It sounds like there is some possible instrumented run performance problems due to counter contention resulting in sluggish performance and perhaps skewed profile data: https://groups.google.com/forum/#!topic/llvm-dev/cDqYgnxNEhY

 

Counter contention is one issue. Redundant counter updates are another major issue (due to the early instrumentation). We are working on the latter and seeing great speedups.

 

 

I’d like an overview of the optimizations that PGO does, but I don’t find much from looking at the Clang PGO section: http://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization.

 

Profile data is not used in any IPA passes yet. It is used by many post-inline optimizations, though -- including block layout, the register allocator, etc.

 

[LeeHu]: sorry for the naïve question, but what is IPA?  And what post-inline optimizations are currently being done?   We’re currently using Clang 3.5, if that matters.

Xinliang David Li

unread,
May 27, 2015, 1:28:58 PM5/27/15
to Lee Hunt, llv...@cs.uiuc.edu
On Wed, May 27, 2015 at 10:11 AM, Lee Hunt <le...@exchange.microsoft.com> wrote:

Thanks! CIL [LeeHu] for a few comments…

 

 

From: Xinliang David Li [mailto:xinli...@gmail.com]
Sent: Wednesday, May 27, 2015 9:29 AM
To: Lee Hunt
Cc: llv...@cs.uiuc.edu
Subject: Re: [LLVMdev] Capabilities of Clang's PGO (e.g. improving code density)

 

 

On Tue, May 26, 2015 at 8:47 PM, Lee Hunt <le...@exchange.microsoft.com> wrote:

Hello –

 

I’m an Engineer in Microsoft Office after looking into possible advantages of using PGO for our Android Applications.

 

We at Microsoft have deep experience with Visual C++’s Profile Guided Optimization and often see 10% or more reduction in the size of application code loaded after using PGO for key scenarios (e.g. application launch).  

 

yes. This is true for the GCC too.  Clang's PGO does not shrink code size yet.

 

[LeeHu] Note: I’m not talking about shrinking code size, but rather reordering it such that only ‘active’ branches within the profiled functions are grouped together in ‘hot’ code pages.  This is a very big optimization for us in VC++ toolchain in PGO.

We also have the “/LTCG” flag – which is seemingly similar to the “-flto” Clang flag -- that *does* shrink code by various means (dead code removal, common IL tree collapsing) because it can see all the object code for an entire produced target binary (e.g. .exe or .dll).

Does -flto also shrink code?

  


That depends on other options used (e.g., -Os). With LTO, the compiler sees a larger scope and performs cross-module inlining and dead function elimination. It does have more opportunities to shrink code.
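A minimal sketch of such a build (assuming a toolchain whose linker supports LLVM's LTO plugin, e.g. gold or lld; file names are made up):

```shell
# Compile to LLVM bitcode instead of native object code.
clang -Os -flto -c a.c -o a.o
clang -Os -flto -c b.c -o b.o

# The link step runs whole-program optimization across both modules,
# enabling cross-module inlining and dead-function elimination.
clang -Os -flto a.o b.o -o app
```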


 

 Making application launch quickly is very important to us, and reducing the number of code pages loaded helps with this goal.

 

Before we dig into turning it on, I’m wondering if there’s any pre-existing research / case studies about possible code page reduction seen from other Clang PGO-enabled applications?  It sounds like there is some possible instrumented run performance problems due to counter contention resulting in sluggish performance and perhaps skewed profile data: https://groups.google.com/forum/#!topic/llvm-dev/cDqYgnxNEhY

 

Counter contention is one issue. Redundant counter updates is another major issue (due to the early instrumentation). We are working on the later and see great speed ups.

 

 

I’d like an overview of the optimizations that PGO does, but I don’t find much from looking at the Clang PGO section: http://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization.

 

Profile data is not used in any IPA passes yet. It is used by any post inline optimizations though -- including block layout, register allocator etc.

 

[LeeHu]: sorry for naïve question, but what is IPA? 



Inter-procedural analysis/optimizations.

Duncan P. N. Exon Smith

unread,
May 27, 2015, 2:15:28 PM5/27/15
to Lee Hunt, llv...@cs.uiuc.edu

> On 2015 May 27, at 07:42, Diego Novillo <dnov...@google.com> wrote:
>
> On Tue, May 26, 2015 at 11:47 PM, Lee Hunt <le...@exchange.microsoft.com> wrote:
>
>> For example, from reading different pages on how Clang PGO, it’s unclear if
>> it does “block reordering” (i.e. moving unexecuted code blocks to a distant
>> code page, leaving only ‘hot’ executed code packed together for greater code
>> density). I find mention of “hot arc” optimization (-fprofile-arcs) , but
>> I’m unclear if this is the same thing. Does Clang PGO do block reordering?
>
> A small clarification. Clang itself does not implement any
> optimizations. Clang limits itself to generate LLVM IR. The
> annotated IR is then used by some LLVM optimizers to guide decisions.
> At this time, there are few optimization passes that use the profile
> information: block reordering and register allocation (to avoid
> spilling on cold paths).
>
> There are no other significant transformations that use profiling
> information. We are working on that. Notably, we'd like to add
> profiling-based decisions to the inliner

Just a quick note about the inliner. Although the inliner itself
doesn't know how to use the profile, clang's IRGen has been modified
to add an 'inlinehint' attribute to hot functions and the 'cold'
attribute to cold functions. Indirectly, PGO does affect the
inliner. (We'll remove this once the inliner does the right thing on
its own.)

Xinliang David Li

unread,
May 27, 2015, 3:55:29 PM5/27/15
to Randy Chapman, Lee Hunt, llv...@cs.uiuc.edu


On Wed, May 27, 2015 at 12:40 PM, Randy Chapman <ran...@microsoft.com> wrote:

 

Hi David!

 

Thanks again for your help!  I was wondering if you could clarify one thing for me?

 I find mention of “hot arc” optimization (-fprofile-arcs) , but I’m unclear if this is the same thing.  Does Clang PGO do block reordering?

It does reordering, but does not do splitting/partitioning.

I take this to mean that PGO does block reordering within the function?  I don’t see that the clang driver passes anything to the linker to drive function ordering at the linker level as well.  Is there something there that I missed, or are you aware of any readily available tools to do so?  If not, we’ve done some work locally on enabling that, which we will continue.

 


OK. There are three reordering-related optimizations:

1) Intra-procedural basic block reordering, to reduce branch cost, icache misses and front-end stalls.
2) Function splitting/partitioning -- splitting the rarely executed parts of a function into unlikely-executed text sections.
3) Function reordering based on affinity and hotness -- reordering functions by the linker/plugin (guided by compiler annotations).

Clang currently only does 1).
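For 3), a rough approximation is possible today outside the PGO pipeline by giving each function its own section and handing the linker an ordering; a sketch (the ordering-file mechanism varies by linker -- lld's --symbol-ordering-file is shown here, and the symbol names are hypothetical):

```shell
# Place every function in its own .text.<name> section so the
# linker is free to lay functions out independently.
clang -O2 -ffunction-sections -c app.c -o app.o

# hot_first.txt lists symbols in the desired layout order, e.g.:
#   main
#   parse_input
#   cold_error_handler
clang -fuse-ld=lld -Wl,--symbol-ordering-file=hot_first.txt app.o -o app
```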

Hope this clarifies.

thanks,

David


 

Thanks :)

--randy

Randy Chapman

unread,
May 27, 2015, 4:51:08 PM5/27/15
to xinli...@gmail.com, Lee Hunt, llv...@cs.uiuc.edu

 

Hi David!

 

Thanks again for your help!  I was wondering if you could clarify one thing for me?

 I find mention of “hot arc” optimization (-fprofile-arcs) , but I’m unclear if this is the same thing.  Does Clang PGO do block reordering?

It does reordering, but does not do splitting/partitioning.

I take this to mean that PGO does block reordering within the function?  I don’t see that the clang driver passes anything to the linker to drive function ordering at the linker level as well.  Is there something there that I missed, or are you aware of any readily available tools to do so?  If not, we’ve done some work locally on enabling that, which we will continue.

 

Thanks :)

--randy

Randy Chapman

unread,
May 27, 2015, 4:52:52 PM5/27/15
to Xinliang David Li, Lee Hunt, llv...@cs.uiuc.edu

 

David,

 

Yes, that is very helpful.  Thanks!

Lee Hunt

unread,
May 27, 2015, 7:59:58 PM5/27/15
to Xinliang David Li, Randy Chapman, llv...@cs.uiuc.edu

Yes, thanks David!

 

For the intra-procedural Basic Block Reordering, do you have any data as to how much improvement that gives speed-wise for any perf tests you’ve measured?

 

I’m thinking this may speed things up by a couple of percent for things like application launch.  For perf-intensive code (e.g. spreadsheet recalc), I would expect it to be more.

Xinliang David Li

unread,
May 28, 2015, 2:16:43 AM5/28/15
to Lee Hunt, Randy Chapman, llv...@cs.uiuc.edu
On Wed, May 27, 2015 at 4:56 PM, Lee Hunt <le...@exchange.microsoft.com> wrote:

Yes, thanks David!

 

For the intra-procedural Basic Block Reordering, do you have any data as to how much improvement that gives speed-wise for any perf tests you’ve measured?


Yes. Most of the benchmarks we have see improvement with better layout -- some improvements are small and some are large. Of course this also depends on the layout algorithm, which we are working on improving too.
 

 

I’m thinking this may speed things up for things like application launch by a couple %. 


Function reordering may be more important for this, which needs a call-trace profile. Trace-based layout will reduce the number of page faults during program start.

David

Philip Reames

unread,
May 28, 2015, 1:06:01 PM5/28/15
to Duncan P. N. Exon Smith, Lee Hunt, llv...@cs.uiuc.edu

On 05/27/2015 11:13 AM, Duncan P. N. Exon Smith wrote:
>> On 2015 May 27, at 07:42, Diego Novillo <dnov...@google.com> wrote:
>>
>> On Tue, May 26, 2015 at 11:47 PM, Lee Hunt <le...@exchange.microsoft.com> wrote:
>>
>>> For example, from reading different pages on how Clang PGO, it’s unclear if
>>> it does “block reordering” (i.e. moving unexecuted code blocks to a distant
>>> code page, leaving only ‘hot’ executed code packed together for greater code
>>> density). I find mention of “hot arc” optimization (-fprofile-arcs) , but
>>> I’m unclear if this is the same thing. Does Clang PGO do block reordering?
>> A small clarification. Clang itself does not implement any
>> optimizations. Clang limits itself to generate LLVM IR. The
>> annotated IR is then used by some LLVM optimizers to guide decisions.
>> At this time, there are few optimization passes that use the profile
>> information: block reordering and register allocation (to avoid
>> spilling on cold paths).
>>
>> There are no other significant transformations that use profiling
>> information. We are working on that. Notably, we'd like to add
>> profiling-based decisions to the inliner
> Just a quick note about the inliner. Although the inliner itself
> doesn't know how to use the profile, clang's IRGen has been modified
> to add an 'inlinehint' attribute to hot functions and the 'cold'
> attribute to cold functions. Indirectly, PGO does affect the
> inliner. (We'll remove this once the inliner does the right thing on
> its own.)

OT: Can you give me a pointer to the clang code involved? I wasn't
aware of this.

Duncan P. N. Exon Smith

unread,
May 28, 2015, 2:10:45 PM5/28/15
to Philip Reames, Lee Hunt, llv...@cs.uiuc.edu

Have a look at `CodeGenPGO::applyFunctionAttributes()` around line
760 of lib/CodeGen/CodeGenPGO.cpp.

Teresa Johnson

unread,
May 28, 2015, 2:15:58 PM5/28/15
to Philip Reames, Lee Hunt, llv...@cs.uiuc.edu

It is set in clang/lib/CodeGen/CodeGenPGO.cpp
CodeGenPGO::applyFunctionAttributes.

Note that it uses the function entry count to determine hotness. This
means that functions entered infrequently but containing very hot
loops would be marked cold; perhaps this works since it is only used
for inlining and is presumably a stand-in for call edge hotness. The
MaxFunctionCount for the profile is also the max of all the function
entry counts (set during profile writing).

Teresa


--
Teresa Johnson | Software Engineer | tejo...@google.com | 408-460-2413

Diego Novillo

unread,
May 28, 2015, 3:31:29 PM5/28/15
to Teresa Johnson, Philip Reames, Lee Hunt, llv...@cs.uiuc.edu


On 05/28/15 14:08, Teresa Johnson wrote:
> On Thu, May 28, 2015 at 9:56 AM, Philip Reames
> <list...@philipreames.com> wrote:
>> OT: Can you give me a pointer to the clang code involved? I wasn't aware of
>> this.
> It is set in clang/lib/CodeGen/CodeGenPGO.cpp
> CodeGenPGO::applyFunctionAttributes.
>
> Note that it uses the function entry count to determine hotness. This
> means that functions entered infrequently but containing very hot
> loops would be marked cold, perhaps this works since it is only used
> for inlining and is presumably a stand-in for call edge hotness. The
> MaxFunctionCount for the profile is also the max of all the function
> entry counts (set during profile writing).

Right. We now also have function entry counts propagated into the IR.
This gives the inliner a way to compute global hotness using entry
counts and internal frequencies.
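In textual IR, a propagated entry count shows up as function-level !prof metadata; a hand-written sketch (names invented):

```llvm
define i32 @hot_entry(i32 %x) !prof !0 {
entry:
  %r = add i32 %x, 1
  ret i32 %r
}

; 1000 recorded entries into @hot_entry during the training run.
!0 = !{!"function_entry_count", i64 1000}
```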