[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)


Kostya Serebryany

Apr 17, 2014, 8:21:46 AM
to Duncan P. N. Exon Smith, Bob Wilson, llv...@cs.uiuc.edu, Justin Bogner, Diego Novillo, Chandler Carruth, Dmitry Vyukov
Hi, 

The current design of -fprofile-instr-generate has the same fundamental flaw 
as the old gcc's gcov instrumentation: it has contention on counters. 
A trivial synthetic test case was described here: 

For the problem to appear we need to have a hot function that is simultaneously executed 
by multiple threads -- then we will have high contention on the racy profile counters.

Such a situation is not necessarily very frequent, but when it happens,
-fprofile-instr-generate becomes barely usable due to a huge slowdown (5x-10x).

An example is the multi-threaded vp9 video encoder.

cd libvpx/
F="-no-integrated-as -fprofile-instr-generate"; CC="clang $F" CXX="clang++ $F" LD="clang++ $F" ./configure
make -j32 
time ./vpxenc -o /dev/null -j 8 akiyo_cif.y4m  

When running single-threaded, -fprofile-instr-generate adds a reasonable ~15% overhead
(8.5 vs 10 seconds).
When running with 8 threads, it has 7x overhead (3.5 seconds vs 26 seconds).

I am not saying that this flaw is a showstopper, but with the continued move
towards multithreading it will hurt more and more users of coverage and PGO.
AFAICT, most of our PGO users simply cannot run their software in single-threaded mode,
and some of them surely have hot functions running in all threads at once. 

At the very least we should document this problem, but it would be better to try to fix it. 

Some ideas:

- per-thread counters. Solves the problem at a huge cost in RAM per thread.
- 8-bit per-thread counters, dumping into central counters on overflow. 
- per-cpu counters (not portable, requires a very modern kernel with lots of patches)
- sharded counters: each counter is represented as N counters sitting in different cache lines. Every thread accesses the counter with index TID%N. Solves the problem partially, better with larger values of N, but then again it costs RAM. (A rough sketch follows after this list.)
- reduce contention on hot counters by not incrementing them if they are big enough:
   {if (counter < 65536) counter++}; This reduces the accuracy though. Is that bad for PGO?
- self-cooling logarithmic counters: if ((fast_random() % (1 << counter)) == 0) counter++;
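A minimal sketch of the sharded-counters option above, with purely illustrative names (this is not the actual compiler-rt runtime, just the shape of the idea):

#include <stdint.h>

enum { kNumShards = 8, kCacheLine = 64 };

// Each shard lives in its own cache line, so threads hitting different
// shards never bounce the same line between cores.
struct alignas(kCacheLine) CounterShard { uint64_t count; };

static CounterShard counter_shards[kNumShards];

// Hypothetical hook: the instrumented code passes some cheap thread id.
static inline void increment_counter(unsigned tid) {
  counter_shards[tid % kNumShards].count++;
}

// At dump time the shards are summed back into a single value.
static uint64_t read_counter() {
  uint64_t total = 0;
  for (int i = 0; i < kNumShards; i++)
    total += counter_shards[i].count;
  return total;
}

The RAM cost becomes N cache lines per counter instead of 8 bytes, which is the tradeoff mentioned in the list item.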

Other thoughts?

--kcc 



Yaron Keren

Apr 17, 2014, 10:10:38 AM
to Kostya Serebryany, llv...@cs.uiuc.edu, Dmitry Vyukov
If accuracy is not critical, incrementing the counters without any guards might be good enough.
Hot areas will still be hot and cold areas will not be affected.

Yaron





Kostya Serebryany

Apr 17, 2014, 10:13:58 AM
to Yaron Keren, llv...@cs.uiuc.edu, Dmitry Vyukov
On Thu, Apr 17, 2014 at 6:10 PM, Yaron Keren <yaron...@gmail.com> wrote:
If accuracy is not critical, incrementing the counters without any guards might be good enough.

No.  Contention on the counters leads to 5x-10x slowdown. This is never good enough. 

--kcc 

Jonathan Roelofs

Apr 17, 2014, 12:37:34 PM
to Kostya Serebryany, Yaron Keren, Dmitry Vyukov, llv...@cs.uiuc.edu
How about per-thread if the counter is hot enough?

Jon

On 4/17/14, 7:13 AM, Kostya Serebryany wrote:
>
>
>
> On Thu, Apr 17, 2014 at 6:10 PM, Yaron Keren <yaron...@gmail.com
> <mailto:yaron...@gmail.com>> wrote:
>
> If accuracy is not critical, incrementing the counters without any guards
> might be good enough.
>
>
> No. Contention on the counters leads to 5x-10x slowdown. This is never good
> enough.
>
> --kcc
>
> Hot areas will still be hot and cold areas will not be affected.
>
> Yaron
>
>
>

--
Jon Roelofs
jona...@codesourcery.com
CodeSourcery / Mentor Embedded

Kostya Serebryany

Apr 17, 2014, 12:39:52 PM
to Jonathan Roelofs, Dmitry Vyukov, llv...@cs.uiuc.edu
On Thu, Apr 17, 2014 at 8:37 PM, Jonathan Roelofs <jona...@codesourcery.com> wrote:
How about per-thread if the counter is hot enough?

Err. How do you know if the counter is hot w/o first profiling the app? 

Jonathan Roelofs

Apr 17, 2014, 12:45:08 PM
to Kostya Serebryany, Dmitry Vyukov, llv...@cs.uiuc.edu
if (counter_is_local)
  counter++
  if (counter > threshold)
    counter_is_local=false
else
  increment_thread_local_counter()

Kostya Serebryany

Apr 17, 2014, 12:48:24 PM
to Jonathan Roelofs, Dmitry Vyukov, llv...@cs.uiuc.edu
On Thu, Apr 17, 2014 at 8:45 PM, Jonathan Roelofs <jona...@codesourcery.com> wrote:
if (counter_is_local)
  counter++
  if (counter > threshold)
    counter_is_local=false
else
  increment_thread_local_counter()

This might be another choice, but unless you implement increment_thread_local_counter() in some clever way 
it will still require the same RAM overhead as just keeping the counters in TLS

Jonathan Roelofs

Apr 17, 2014, 1:11:24 PM
to Kostya Serebryany, Dmitry Vyukov, llv...@cs.uiuc.edu


On 4/17/14, 9:48 AM, Kostya Serebryany wrote:
>
>
>
> On Thu, Apr 17, 2014 at 8:45 PM, Jonathan Roelofs <jona...@codesourcery.com
> <mailto:jona...@codesourcery.com>> wrote:
>
> if (counter_is_local)
> counter++
> if (counter > threshold)
> counter_is_local=false
> else
> increment_thread_local_counter()
>
>
> This might be another choice, but unless you implement
> increment_thread_local_counter() in some clever way
> it will still require the same RAM overhead as just keeping the counters in TLS
Right, so you wouldn't want to have every thread have its own copy of the whole
list of counters, but I think you may be able to get away with keeping smaller
per-thread pools, and bump-ptr allocating within those pools.
>
>
>


--
Jon Roelofs
jona...@codesourcery.com
CodeSourcery / Mentor Embedded

Xinliang David Li

Apr 17, 2014, 1:38:27 PM
to Kostya Serebryany, llv...@cs.uiuc.edu, Dmitry Vyukov
On Thu, Apr 17, 2014 at 5:21 AM, Kostya Serebryany <k...@google.com> wrote:
Hi, 

The current design of -fprofile-instr-generate has the same fundamental flaw 
as the old gcc's gcov instrumentation: it has contention on counters. 
A trivial synthetic test case was described here: 

For the problem to appear we need to have a hot function that is simultaneously executed 
by multiple threads -- then we will have high contention on the racy profile counters.

Such situation is not necessary very frequent, but when it happens 
-fprofile-instr-generate becomes barely usable due to huge slowdown (5x-10x)

We have seen even larger slowdowns, but it is uncommon, and I have not heard many complaints about it.

 

An example is the multi-threaded vp9 video encoder.

cd libvpx/
F="-no-integrated-as -fprofile-instr-generate"; CC="clang $F" CXX="clang++ $F" LD="clang++ $F" ./configure
make -j32 
time ./vpxenc -o /dev/null -j 8 akiyo_cif.y4m  

When running single-threaded, -fprofile-instr-generate adds reasonable ~15% overhead
(8.5 vs 10 seconds)
When running with 8 threads, it has 7x overhead (3.5 seconds vs 26 seconds).

I am not saying that this flaw is a showstopper, but with the continued move
towards multithreading it will be hurting more and more users of coverage and PGO.
AFAICT, most of our PGO users simply can not run their software in single-threaded mode,
and some of them surely have hot functions running in all threads at once. 

At the very least we should document this problem, but better try fixing it. 


Users can also select a smaller (but still representative) training data set to solve the problem.

 
Some ideas:

- per-thread counters. Solves the problem at huge cost in RAM per-thread

It is not practical. Especially for TLS counters -- it creates huge pressure on stack memory.
 
- 8-bit per-thread counters, dumping into central counters on overflow. 

The overflow will happen very quickly with an 8-bit counter.

 
- per-cpu counters (not portable, requires very modern kernel with lots of patches)
- sharded counters: each counter represented as N counters sitting in different cache lines. Every thread accesses the counter with index TID%N. Solves the problem partially, better with larger values of N, but then again it costs RAM.

Interesting idea. This may work well.
 
- reduce contention on hot counters by not incrementing them if they are big enough:
   {if (counter < 65536) counter++}; This reduces the accuracy though. Is that bad for PGO?

yes, it will be bad.
 
- self-cooling logarithmic counters: if ((fast_random() % (1 << counter)) == 0) counter++;


Another idea is to use stack-local counters per function -- synced up with global counters on entry and exit. The problem with it is that for deeply recursive calls, stack pressure can be too high.

In Google GCC, we implemented another technique which proves to be very effective -- it is called FDO sampling. Basically counters will be updated every N samples.  

David
 
Other thoughts?

--kcc 



Duncan P. N. Exon Smith

Apr 17, 2014, 1:58:21 PM
to Xinliang David Li, llv...@cs.uiuc.edu, Dmitry Vyukov

On 2014-Apr-17, at 10:38, Xinliang David Li <xinli...@gmail.com> wrote:

>
> Another idea is to use stack local counters per function -- synced up with global counters on entry and exit. the problem with it is for deeply recursive calls, stack pressure can be too high.

I think they'd need to be synced with global counters before function
calls as well, since any function call can call "exit()".

Xinliang David Li

Apr 17, 2014, 2:09:51 PM
to Duncan P. N. Exon Smith, llv...@cs.uiuc.edu, Dmitry Vyukov
On Thu, Apr 17, 2014 at 10:58 AM, Duncan P. N. Exon Smith <dexon...@apple.com> wrote:

On 2014-Apr-17, at 10:38, Xinliang David Li <xinli...@gmail.com> wrote:

>
> Another idea is to use stack local counters per function -- synced up with global counters on entry and exit. the problem with it is for deeply recursive calls, stack pressure can be too high.

I think they'd need to be synced with global counters before function
calls as well, since any function call can call "exit()".

Right -- but it might be better to handle this in other ways. For instance, a stack of counters is maintained for each frame; at exit, they are flushed in a batch. Or simply ignore it in the case of program exit.

David 

Bob Wilson

Apr 17, 2014, 2:22:12 PM
to Xinliang David Li, llv...@cs.uiuc.edu, Dmitry Vyukov
It seems to me like we’re going to have a hard time getting good multithreaded performance without significant impact on the single-threaded behavior. We might need to add an option to choose between those. There’s a lot of room for improvement in the performance with the current instrumentation, so maybe we can find a way to make things incrementally better in a way that helps both, but avoiding the multithreaded cache conflicts seems like it’s going to be expensive in other ways.

Xinliang David Li

Apr 17, 2014, 2:33:35 PM
to Bob Wilson, llv...@cs.uiuc.edu, Dmitry Vyukov
Yes -- an option. It would be unwise to penalize single-threaded apps unconditionally.

David

Chandler Carruth

Apr 17, 2014, 2:41:07 PM
to Bob Wilson, llv...@cs.uiuc.edu, Dmitry Vyukov
On Thu, Apr 17, 2014 at 11:22 AM, Bob Wilson <bob.w...@apple.com> wrote:

I don't really agree.

First, multithreaded applications are going to be the majority soon, even if they aren't already. We should design for them and support them well by default. If, once we have that, we find single threaded performance dramatically suffers, then maybe we should add a flag. But it doesn't make sense to do this before we even have data.

Bob Wilson

Apr 17, 2014, 2:47:36 PM
to Chandler Carruth, llv...@cs.uiuc.edu, Dmitry Vyukov
If someone wants to revise the instrumentation in a way that works better for multithreaded code, that’s great. Before the change is committed, we should have performance data comparing it to the current code. If there is no regression, then fine. If it significantly hurts single-threaded performance, then we will need a flag.

Chandler Carruth

Apr 17, 2014, 2:50:42 PM
to Bob Wilson, llv...@cs.uiuc.edu, Dmitry Vyukov
If you want to default Darwin into slow multithreaded instrumentation to make single-threaded instrumentation faster, go for it. That is not the correct default for Clang as a project or the Linux port though.

Anyways, all of this is somewhat moot. We really need the *architecture* of PGO instrumentation to not be multiple times slower for multithreaded applications. I think this is a critical limitation of the current design.

Chandler Carruth

Apr 17, 2014, 4:06:32 PM
to Kostya Serebryany, llv...@cs.uiuc.edu, Dmitry Vyukov
Having thought a bit about the best strategy to solve this, I think we should use a tradeoff of memory to reduce contention. I don't really like any of the other options as much, if we can get that one to work. Here is my specific suggestion:

On Thu, Apr 17, 2014 at 5:21 AM, Kostya Serebryany <k...@google.com> wrote:
- per-cpu counters (not portable, requires very modern kernel with lots of patches)
- sharded counters: each counter represented as N counters sitting in different cache lines. Every thread accesses the counter with index TID%N. Solves the problem partially, better with larger values of N, but then again it costs RAM.

I think we should combine these somewhat.

At an abstract level, I think we should design the profiling to support up to N shards of counters.

I think we should have a dynamic number of shards if possible. The goal here is that if you never need multiple shards (single threaded) you pay essentially zero cost. I would have a global number of shards that changes rarely, and re-compute it on entry to each function with something along the lines of:

if (thread-ID != main's thread-ID && shard_count == 1) {
  shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, NUMBER_OF_CORES));
  // if shard_count changed with this, we can also call a library routine here that does the work of allocating the actual extra shards.
}

MAX is a fixed cap so even on systems with 100s of cores we don't do something silly. NUMBER_OF_THREADS, if supported on the OS, can limit the shards when we only have a small number of threads in the program. NUMBER_OF_CORES, if supported on the OS, can limit the shards. If we don't have the number of threads, we just use the number of cores. If we don't have the number of cores, we can just guess 8 (or something).

Then, we can gracefully fall back on the following strategies to pick an index into the shards:

- Per-core non-atomic counter updates (if we support them) using restartable sequences
- Use a core-ID as the index into the shards to statistically minimize the contention, and do the increments atomically so it remains correct even if the core-ID load happens before a core migration and the counter increment occurs afterward
- Use (thread-ID % number of cores) if we don't have support for getting a core-ID from the target OS. This will still have a reasonable distribution I suspect, even if not perfect. (A sketch of this fallback order follows below.)
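A rough sketch of that selection order (restartable sequences omitted), assuming a Linux/glibc target where sched_getcpu() is available and a flat array of shard_count * counters_per_shard atomic counters; the names are illustrative only:

#include <sched.h>
#include <atomic>
#include <thread>
#include <stdint.h>

static unsigned pick_shard(unsigned shard_count) {
  int cpu = sched_getcpu();              // core ID, when the OS exposes it
  if (cpu >= 0)
    return static_cast<unsigned>(cpu) % shard_count;
  // Fallback: hash the thread ID when no core ID is available.
  return static_cast<unsigned>(
      std::hash<std::thread::id>()(std::this_thread::get_id()) % shard_count);
}

static void increment(std::atomic<uint64_t> *shards, unsigned shard_count,
                      unsigned counters_per_shard, unsigned counter_idx) {
  unsigned shard = pick_shard(shard_count);
  // Atomic increment keeps the count correct even if the thread migrated
  // to another core after pick_shard() ran.
  shards[shard * counters_per_shard + counter_idx]
      .fetch_add(1, std::memory_order_relaxed);
}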


Finally, I wouldn't merge on shutdown if possible. I would just write larger raw profile data for multithreaded runs, and let the profdata tool merge them.


If this is still too much memory, then I would suggest doing the above, but doing it independently for each function so that only those functions actually called via multithreaded code end up sharding their counters.


I think this would be reasonably straightforward to implement, not significantly grow the cost of single-threaded instrumentation, and largely mitigate the contention on the counters. It can benefit from advanced hooks into the OS when those are available, but seems pretty easy to implement on roughly any OS with reasonable tradeoffs. The only really hard requirement is the ability to get a thread-id, but I think that is fairly reasonable (C++ even makes this essentially mandatory).

Thoughts?

Justin Bogner

Apr 17, 2014, 4:27:56 PM
to Chandler Carruth, llv...@cs.uiuc.edu, Dmitry Vyukov
Chandler Carruth <chan...@google.com> writes:
> Having thought a bit about the best strategy to solve this, I think we should
> use a tradeoff of memory to reduce contention. I don't really like any of the
> other options as much, if we can get that one to work. Here is my specific
> suggestion:
>
> On Thu, Apr 17, 2014 at 5:21 AM, Kostya Serebryany <k...@google.com> wrote:
>
> - per-cpu counters (not portable, requires very modern kernel with lots of
> patches)
> - sharded counters: each counter represented as N counters sitting in
> different cache lines. Every thread accesses the counter with index TID%N.
> Solves the problem partially, better with larger values of N, but then
> again it costs RAM.
>
> I think we should combine these somewhat.
>
> At an abstract level, I think we should design the profiling to support up to
> N shards of counters.
>
> I think we should have a dynamic number of shards if possible. The goal here
> is that if you never need multiple shards (single threaded) you pay
> essentially zero cost. I would have a global number of shards that changes
> rarely, and re-compute it on entry to each function with something along the
> lines of:
>
> if (thread-ID != main's thread-ID && shard_count == 1) {
>   shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, NUMBER_OF_CORES));
>   // if shard_count changed with this, we can also call a library routine here
> that does the work of allocating the actual extra shards.
> }

Is it possible to hook on something more clever than function entry?
Choosing to do this based on thread creation or something like that
could make this extra check very cheap.

> MAX is a fixed cap so even on systems with 100s of cores we don't do something
> silly. NUBER_OF_THREADS, if supported on the OS, can limit the shards when we
> only have a small number of threads in the program. NUMBER_OF_CORES, if
> supported on the OS, can limit the shards. If we don't have the number of
> threads, we just use the number of cores. If we don't have the number of
> cores, we can just guess 8 (or something).

I like the general idea of this approach. The obvious TLS approach
worried me in that it explodes memory usage by much more than you'll
generally need.

> Then, we can gracefully fall back on the following strategies to pick an index
> into the shards:
>
> - Per-core non-atomic counter updates (if we support them) using restartable
> sequences
> - Use a core-ID as the index into the shards to statistically minimize the
> contention, and do the increments atomically so it remains correct even if the
> core-ID load happens before a core migration and the counter increment occurs
> afterward
> - Use (thread-ID % number of cores) if we don't have support for getting a
> core-ID from the target OS. This will still have a reasonable distribution I
> suspect, even if not perfect.
>
> Finally, I wouldn't merge on shutdown if possible. I would just write larger
> raw profile data for multithreaded runs, and let the profdata tool merge them.

Agreed. Doing the merging offline is a clear win. The space overhead is
short lived enough that it shouldn't be a big issue.

> If this is still too much memory, then I would suggest doing the above, but
> doing it independently for each function so that only those functions actually
> called via multithreaded code end up sharding their counters.

I'd worry that doing this per-function might be too fine of granularity,
but there are certainly use cases where large chunks of the code are
only exercised by one (of many) threads, so something along these lines
could be worthwhile.

> I think this would be reasonably straightforward to implement, not
> significantly grow the cost of single-threaded instrumentation, and largely
> mitigate the contention on the counters. It can benefit from advanced hooks
> into the OS when those are available, but seems pretty easy to implement on
> roughly any OS with reasonable tradeoffs. The only really hard requirement is
> the ability to get a thread-id, but I think that is fairly reasonable (C++
> even makes this essentially mandatory).
>
> Thoughts?


Xinliang David Li

Apr 17, 2014, 4:51:52 PM
to Chandler Carruth, Dmitry Vyukov, llv...@cs.uiuc.edu
Good thinking, but why do you think runtime selection of shard count is better than compile-time selection? For single-threaded apps, the shard count is always 1, so why pay the penalty of checking the thread id each time a function is entered?

For multi-threaded apps, I would expect MAX to be smaller than NUM_OF_CORES to avoid excessive memory consumption, so you always end up with N == MAX. If MAX is larger than NUM_OF_CORES, for large MT apps the # of threads tends to be larger than NUM_OF_CORES, so it also ends up with N == MAX. In rare cases the shard count may switch between MAX and NUM_OF_CORES, but then you also pay the penalty of reallocating/memcpying the counter arrays each time it changes.

Making N a non-compile-time constant also makes the indexing more expensive. Of course we can ignore thread migration and do CSE on it.

David

Chandler Carruth

Apr 17, 2014, 5:04:28 PM
to Justin Bogner, llv...@cs.uiuc.edu, Dmitry Vyukov
On Thu, Apr 17, 2014 at 1:27 PM, Justin Bogner <ma...@justinbogner.com> wrote:
Chandler Carruth <chan...@google.com> writes: 
> if (thread-ID != main's thread-ID && shard_count < std::min(MAX, NUMBER_OF_CORES)) {

>   shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, NUMBER_OF_CORES));
>   // if shard_count changed with this, we can also call a library routine here
> that does the work of allocating the actual extra shards.
> }

Note, I had a bug here. The first time I had this only grow the shard count once, and later thought it might be nice to grow this incrementally so that applications that only need 2 or 3 threads, don't pay for more. I fixed the guard here, but naturally this is still hand-wavy. =D
 

Is it possible to hook on something more clever than function entry?
Choosing to do this based on thread creation or something like that
could make this extra check very cheap.

Yea, *if* you can do it on thread creation, that would be awesome. But we don't really have a good way to model that right now. On the flip side:
Yea. If we need to do it in a fine-grained way (IE, not whole-program sharding), I think my favorite granularity would be per-post-inliner-function. The other thing to do is to set up function-local variables at that phase that are used to locate the counters cheaply within the function body.

One way that might work would be to use essentially synthetic intrinsics to introduce the *markers* for instrumentation (and the function names, frontend hashes etc, all the stuff that needs language-level semantics) into the code, but have LLVM transform these into the actual instrumentation code after inlining and optimization. Then it can also inject the necessary code into the more coarse grained boundaries.

Chandler Carruth

Apr 17, 2014, 5:09:15 PM
to Xinliang David Li, Dmitry Vyukov, llv...@cs.uiuc.edu
On Thu, Apr 17, 2014 at 1:51 PM, Xinliang David Li <xinli...@gmail.com> wrote:
Good thinking, but why do you think runtime  selection of shard count is better than compile time selection? For single threaded apps, shard count is always 1, so why paying the penalty to check thread id each time function is entered?

Because extremely few applications statically decide how many threads to use in the real world (in my experience). This is even more relevant if you consider each <unit of code, maybe post-inlined function> independently, where you might have many threads but near 0 overlapping functions on those threads. The number of cores also changes from machine to machine, and can even change based on the particular OS mode in which your application runs.
 
For multi-threaded apps, I would expect MAX to be smaller than NUM_OF_CORES to avoid excessive memory consumption, then you always end up with N == MAX. If MAX is larger than NUM_OF_CORES,  for large MT apps, the # of  threads tends to be larger than NUM_OF_CORES, so it also ends up with N == MAX.  For rare cases, the shard count may switch between MAX and NUM_OF_CORES, but you also pay the penalty to reallocate/memcpy counter arrays each time it changes.

Sorry, this was just pseudo code, and very rough at that.

The goal was to allow programs with >1 thread but significantly fewer threads than cores to not pay (in memory) for all of the shards. There are common patterns here such as applications that are essentially single threaded, but with one or two background threads. Also, the hard compile-time max is a compile time constant, but the number of cores isn't (see above) so at least once per execution of the program, we'll need to dynamically take the min of the two.


Making N non compile time constant also makes the indexing more expensive. Of course we can ignore thread migration and do CSE on it.

Yes, and a certain amount of this is actually fine because the whole point was to minimize contention rather than perfectly eliminate it.

Xinliang David Li

Apr 17, 2014, 5:33:14 PM
to Chandler Carruth, Dmitry Vyukov, llv...@cs.uiuc.edu
On Thu, Apr 17, 2014 at 2:09 PM, Chandler Carruth <chan...@google.com> wrote:

On Thu, Apr 17, 2014 at 1:51 PM, Xinliang David Li <xinli...@gmail.com> wrote:
Good thinking, but why do you think runtime  selection of shard count is better than compile time selection? For single threaded apps, shard count is always 1, so why paying the penalty to check thread id each time function is entered?

Because extremely few applications statically decide how many threads to use in the real world (in my experience). This is even more relevant if you consider each <unit of code, maybe post-inlined function> independently, where you might have many threads but near 0 overlapping functions on those threads. The number of cores also changes from machine to machine, and can even change based on the particular OS mode in which your application runs.


We are talking about developers here. Nobody would know the exact thread counts, but developers know the ballpark number, which should be enough. E.g. 1) my program is single threaded; 2) my program is mostly single threaded with some lightweight helper threads; 3) my program is heavily threaded without a single hotspot; 4) my program is heavily threaded with hotspot contention, etc. Only 4) is of concern here. Besides, the user can always find out whether the instrumented build is too slow and decide which strategy to use. For apps with distinct phases (e.g. ST->MT->ST), the proposed approach may be useful, but those won't be the majority. 


 
For multi-threaded apps, I would expect MAX to be smaller than NUM_OF_CORES to avoid excessive memory consumption, then you always end up with N == MAX. If MAX is larger than NUM_OF_CORES,  for large MT apps, the # of  threads tends to be larger than NUM_OF_CORES, so it also ends up with N == MAX.  For rare cases, the shard count may switch between MAX and NUM_OF_CORES, but you also pay the penalty to reallocate/memcpy counter arrays each time it changes.

Sorry, this was just pseudo code, and very rough at that.

The goal was to allow programs with >1 thread but significantly fewer threads than cores to not pay (in memory) for all of the shards. There are common patterns here such as applications that are essentially single threaded, but with one or two background threads. Also, the hard compile-time max is a compile time constant, but the number of cores isn't (see above) so at least once per execution of the program, we'll need to dynamically take the min of the two.

See above -- for such cases (scenario 2), the user normally has prior knowledge.
 

Making N non compile time constant also makes the indexing more expensive. Of course we can ignore thread migration and do CSE on it.

Yes, and a certain amount of this is actually fine because the whole point was to minimize contention rather than perfectly eliminate it.


Another danger involved with dynamically resizing the counters is that it requires a global or per-function lock to access the counters. The cost of this can be really high.

David 

Bob Wilson

Apr 17, 2014, 8:19:10 PM
to Chandler Carruth, llv...@cs.uiuc.edu, Dmitry Vyukov
I really don’t think there is a *right* answer here. If you can make this work without a significant performance regression for cases where there is little contention for the counters, then great, but I’m pretty skeptical of that. I would love to be surprised.

If someone wants to implement a scheme like this, then we can get real performance numbers instead of just speculation and we’ll see.

Joerg Sonnenberger

Apr 17, 2014, 9:24:00 PM
to llv...@cs.uiuc.edu
On Thu, Apr 17, 2014 at 04:21:46PM +0400, Kostya Serebryany wrote:
> - sharded counters: each counter represented as N counters sitting in
> different cache lines. Every thread accesses the counter with index TID%N.
> Solves the problem partially, better with larger values of N, but then
> again it costs RAM.

I'd strongly go with this scheme, with one tweak: use the stack pointer
as the base value, with some fudging so as not to just use the lowest bits. It is
typically easier to get.

Joerg

Dmitry Vyukov

Apr 18, 2014, 3:13:13 AM
to Bob Wilson, llv...@cs.uiuc.edu
Hi,

This is a long thread, so I will combine several comments into a single email.


>> - 8-bit per-thread counters, dumping into central counters on overflow.
>The overflow will happen very quickly with 8bit counter.

Yes, but it reduces contention by 256x (a thread must execute at least 256 loop iterations between increments). In practice, if you reduce contention below some threshold, it does not represent a problem anymore.
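A minimal sketch of that scheme, assuming one byte of TLS per counter (names are illustrative; a real runtime would also flush the residual local count at thread exit):

#include <atomic>
#include <stdint.h>

static std::atomic<uint64_t> central_counter;   // shared, touched only on overflow
static __thread uint8_t local_counter;          // per-thread, never contended

static inline void increment() {
  if (++local_counter == 0) {
    // Wrapped around after 256 increments: flush them into the central counter.
    central_counter.fetch_add(256, std::memory_order_relaxed);
  }
}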



>> - per-thread counters. Solves the problem at huge cost in RAM per-thread
>It is not practical. Especially for TLS counters -- it creates huge pressure on stack memory.

Do we have any numbers on the number of counters? If there are 100K 1-byte counters, I would consider it practical.


 
> - self-cooling logarithmic counters: if ((fast_random() % (1 << counter)) == 0) counter++;

This option did not receive any attention, even though it has O(1) memory consumption and reduces contention.
Note that there are several tunables here: first -- how to implement fast_rand: it can be a per-thread LCG, a per-thread counter or set of counters, or something else; second -- the frequency of increments, e.g. it can actually be additive and/or bounded (because if we reduce the frequency of increments by 1000x, that must be enough).
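For the first tunable, a per-thread LCG could be as simple as the following sketch (the constants are the common Numerical Recipes LCG parameters and are only an illustration, not a tuned choice):

#include <stdint.h>

static __thread uint32_t lcg_state = 1;

static inline uint32_t fast_random() {
  // One multiply and one add per call; no sharing between threads.
  lcg_state = lcg_state * 1664525u + 1013904223u;
  // Return the high half: the low bits of an LCG are the least random ones,
  // and the caller masks the result with small powers of two.
  return lcg_state >> 16;
}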



> In Google GCC, we implemented another technique which proves to be very effective -- it is called FDO sampling.
> Basically counters will be updated every N samples.

How does it work?



>> It seems to me like we’re going to have a hard time getting good multithreaded performance without significant impact on the single-threaded behavior.
> I don't really agree.

Disagreement seconded.
For the single-threaded case we are talking about constant overhead, while for the multithreaded case -- about superlinear overhead.
Having said that, if there is a *simple* way to modify the final scheme to significantly reduce single-threaded overheads, then of course it's worth doing. But it's not the cornerstone.



> MAX is a fixed cap so even on systems with 100s of cores we don't do something silly.

Makes no sense to me.
Why do you want only systems with up to 100 cores to be fast and scalable? 100+ core systems *especially* need good scalability (contention tends to be superlinear).
What you are proposing is basically: if I have 10 engineers in my company, I probably want to give them 10 working desks as well. But let's not go insane -- if I have 1000 engineers, 100 desks must be enough for them. This must reduce costs.
The baseline memory consumption for systems (and the amount of RAM!) is O(NCORES), not O(1). In some read-mostly cases it's possible to achieve O(1) memory consumption, and that's great. But if that's not the case here, let it be so.



> shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, NUMBER_OF_CORES))

Threads do not produce contention, it's cores that produce contention.
The formula must be:  shard_count = k*NCORES
And if you want less memory in single-threaded case, then: shard_count = min(k*NCORES, c*NTHREADS)



>We are talking about developers here. Nobody would know the exact thread counts, but developers know the ballpark number

I strongly believe that we must relieve developers of this choice at build time, and do our best to auto-tune (if the final scheme requires tuning).
First, such questions put an unnecessary burden on developers. We don't ask what register allocation algorithm to use for each function, right?
Second, there are significant chances we will get a wrong answer, because e.g. the developer's view of how threaded the app is can differ from reality or from our classification.
Third, the app can be built by a build engineer; or the procedure can be applied to a codebase with 10'000 apps; or the threaded-ness can change; or the app can work in different modes; or the app can have different phases.



>Another danger involved with dynamically resizing the counter is that it requires a global or per function lock to access the counters. The cost of this can be really high.

It's possible to do it w/o a mutex; the fast-path overhead is only an atomic consume load, which is virtually free if implemented properly in the compiler:

const int maxshard = 4096;
uint64* shard[maxshard];
atomic<int> shardmask;

void inline inccounter(int idx)
{
  int shardidx = gettid() & atomic_load(&shardmask, memory_order_consume);
  shard[shardidx][idx]++;
}

int pthread_create(...)
{
  if (updateshardcount()) {
    shardlock();
    if (updateshardcount()) {
      int newcount = computeshardcount();
      for (int i = oldcount; i < newcount; i++)
        shard[i] = malloc(ncounter*sizeof(uint64));
      atomic_store(&shardmask, newcount-1, memory_order_release);
    }
    shardunlock();
  }
  ...
}


Kostya Serebryany

Apr 18, 2014, 3:25:49 AM
to llv...@cs.uiuc.edu
On Fri, Apr 18, 2014 at 5:24 AM, Joerg Sonnenberger <jo...@britannica.bec.de> wrote:
On Thu, Apr 17, 2014 at 04:21:46PM +0400, Kostya Serebryany wrote:
> - sharded counters: each counter represented as N counters sitting in
> different cache lines. Every thread accesses the counter with index TID%N.
> Solves the problem partially, better with larger values of N, but then
> again it costs RAM.

I'd strongly go with this schema with one tweak. Use the stack pointer
as base value with some fudging to not just use the lowest bits. It is
typically easier to get.

Indeed, several middle bits of %sp may be used instead of TID%N.
This would heavily depend on the pthread implementation (how it allocates stacks) though.
It may be tricky to come up with the same constant scheme across all platforms. 

Dmitry Vyukov

Apr 18, 2014, 3:32:01 AM
to Bob Wilson, llv...@cs.uiuc.edu
If we go with this scheme, for tid I would do:

__thread int threadid;

int pthread_create(...)
{
    threadid = goohhash(gettid());
    ...
}

void userfunc()
{
    int tid = threadid+1;
    threadid = tid;
    // use tid in profiling
    ...
}

This avoids the problem of potentially expensive gettid(), and avoids a possible persistent interference between tids.

Dmitry Vyukov

Apr 18, 2014, 3:44:37 AM
to Bob Wilson, llv...@cs.uiuc.edu
Kostya noted that this increment is a bad idea, because each thread will access all possible cache lines for each counter. I agree, this is a bad idea.

What bothers me is that if we do tid%something, there is a non-zero chance that a significant number of active threads will be mapped onto the same shard. And then we are back to where we started -- we have significant contention.

Ideally it would be something like:

on_thread_rescheduling()
{
  tid = getcpuid();
}

but common OSes do not provide necessary interfaces.

Kostya Serebryany

Apr 18, 2014, 3:50:46 AM
to Dmitry Vyukov, llv...@cs.uiuc.edu
Note also: if we store the tid in a thread-local variable, we'll either have to load it on every counter increment or keep it in a register,
which by itself will bring an additional ~10% slowdown on x86_64. This is why taking bits from %sp or %fs sounds attractive (it has problems too).

Chandler Carruth

Apr 18, 2014, 5:04:59 AM
to Dmitry Vyukov, llv...@cs.uiuc.edu
On Fri, Apr 18, 2014 at 12:13 AM, Dmitry Vyukov <dvy...@google.com> wrote:
> MAX is a fixed cap so even on systems with 100s of cores we don't do something silly.

Makes not sense to me.
Why do you want only systems with up to 100 cores to be fast and scalable? 100+ core system *especially* need good scalability (contention tends to be superlinear).

Please don't argue motives, and instead argue the technical merits. I do want systems with 100s of cores to be fast and scalable.

What you are proposing is basically: If I have 10 engineers in my company, I probably want to give them 10 working desks as well. But let's not go insane. If I have 1000 engineers, 100 desks must be enough for them. This must reduce costs.
The baseline memory consumption for systems (and amount of RAM!) is O(NCORES), not O(1). In some read-mostly cases it's possible to achieve O(1) memory consumption, and that's great. But if it's not the case here, let it be so.

I think you are drastically overstating what I am suggesting. The bad analogy isn't helping.

The only way we get contention is if we have the same function accessing the same counters within the function on multiple cores at the same time. It is entirely conceivable that programs which manage to do this for *every* core in a system with many hundreds of cores are rare. As a consequence, it might be a practical compromise to reduce the number of shards below the number of cores if the memory overhead is not providing commensurate performance. Clearly, measurements and such are needed here, but it is at least a tunable knob that we should know about and consider in our measurements.
 



> shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, NUMBER_OF_CORES))

Threads do not produce contention, it's cores that produce contention.

It is the *combination* of threads on cores which produces contention. The above 'max' should have been a 'min', sorry for that confusion. The point was to reduce the shards if *either* the number of cores is small or the number of threads is small.

The formula must be:  shard_count = k*NCORES
And if you want less memory in single-threaded case, then: shard_count = min(k*NCORES, c*NTHREADS)

Which is what I intended to write.

Kostya Serebryany

Apr 18, 2014, 5:10:18 AM
to Dmitry Vyukov, llv...@cs.uiuc.edu
One more proposal: simple per-thread counters allocated with mmap(MAP_NORESERVE), the same trick that works so well for asan/tsan/msan.

Chrome has ~3M basic blocks instrumented for coverage, 
so even the largest applications will hardly have more than, say, 10M basic blocks
(the number can be configurable at application start time). This gives us 80Mb for the array of 64-bit counters. 
That's a lot if multiplied by the number of threads, but the MAP_NORESERVE trick solves the problem -- 
each thread will only touch the pages where it actually increments the counters. 
On thread exit the whole 80Mb counter array will be merged into a central array of counters and then discarded, 
but we can also postpone this until another new thread is created -- then we just reuse the counter array. 
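A rough sketch of the allocation and merge parts (Linux-specific; error handling and the merge synchronization are elided, and the function names are illustrative):

#include <sys/mman.h>
#include <stdint.h>
#include <stddef.h>

// Reserve address space for one thread's counters; physical pages are
// allocated lazily, only where the thread actually increments something.
static uint64_t *alloc_thread_counters(size_t num_counters) {
  void *p = mmap(nullptr, num_counters * sizeof(uint64_t),
                 PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  return p == MAP_FAILED ? nullptr : static_cast<uint64_t *>(p);
}

// On thread exit (or before reuse), fold the per-thread array into the
// central counters; a real runtime would synchronize this merge.
static void merge_counters(uint64_t *central, const uint64_t *local, size_t n) {
  for (size_t i = 0; i < n; i++)
    if (local[i])
      central[i] += local[i];
}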

This brings two challenges. 

#1. The basic blocks should be numbered sequentially. I see only one way to accomplish this: with the help of the linker (and the dynamic linker for DSOs).
The compiler would emit code using offsets that will later be transformed into constants by the linker. 
Not sure if any existing linker supports this kind of thing. Anyone? 

#2. How to access the per-thread counter array. If we simply store the pointer to the array in TLS, the instrumentation will be more expensive just because of the need to load and keep this pointer. 
If the counter array is part of TLS itself, we'll have to intrude into the pthread library (or wrap it) so that this part of TLS is mapped with MAP_NORESERVE. 

This is all far from trivial, but it
  - has the same performance in single-threaded case compared to the current design 
  - has no contention in multi-threaded case. 
  - should use moderate amount of RAM due to the MAP_NORESERVE trick.

?? 

--kcc 




Chandler Carruth

Apr 18, 2014, 5:21:13 AM
to Kostya Serebryany, llv...@cs.uiuc.edu, Dmitry Vyukov
On Fri, Apr 18, 2014 at 2:10 AM, Kostya Serebryany <k...@google.com> wrote:
One more proposal: simple per-thread counters allocated with mmap(MAP_NORESERVE), the same trick that works so well for asan/tsan/msan.

Chrome has ~3M basic blocks instrumented for coverage, 
so even largest applications will hardly have more than, say, 10M basic blocks

I think this is a *gross* underestimation. I work with applications more than one order of magnitude larger than Chrome.
 
(number can be configurable at application start time). This gives us 80Mb for the array of 64-bit counters. 
That's a lot if multiplied by the number of threads, but the MAP_NORESERVE trick solves the problem -- 
each thread will only touch the pages where it actually increment the counters. 
On thread exit the whole 80Mb counter array are will be merged into a central array of counters and then discarded, 
but we can also postpone this until another new thread is created -- then we just reuse the counter array. 

This brings two challenges. 

#1. The basic blocks should be numbered sequentially. I see only one way to accomplish this: with the help of linker (and dynamic linker for DSOs).
The compiler would emit code using offsets that will later be transformed into constants by the linker. 
Not sure if any existing linker support this kind of thing. Anyone? 

#2. How to access the per-thread counter array. If we simply store the pointer to the array in TLS, the instrumentation will be more expensive just because of need to load and keep this pointer. 
If the counter array is part of TLS itself, we'll have to intrude into the pthread library (or wrap it) so that this part of TLS is mapped with MAP_NORESERVE. 

#3. It essentially *requires* a complex merge on shutdown rather than a simple flush. I'm not even sure how to do the merge without dirtying still more pages of the no-reserve memory.


It's not at all clear to me that this scales up (either in memory usage, memory reservation, or shutdown time) to larger applications. Chrome isn't a useful upper bound here.

Dmitry Vyukov

Apr 18, 2014, 5:27:04 AM
to Chandler Carruth, llv...@cs.uiuc.edu
On Fri, Apr 18, 2014 at 1:04 PM, Chandler Carruth <chan...@google.com> wrote:

On Fri, Apr 18, 2014 at 12:13 AM, Dmitry Vyukov <dvy...@google.com> wrote:
> MAX is a fixed cap so even on systems with 100s of cores we don't do something silly.

Makes not sense to me.
Why do you want only systems with up to 100 cores to be fast and scalable? 100+ core system *especially* need good scalability (contention tends to be superlinear).

Please don't argue motives, and instead argue the technical merits. I do want systems with 100s of cores to be fast and scalable.


That's what I've read in the code above.
If there was a subsequent correction, sorry, this thread is long.


 
What you are proposing is basically: If I have 10 engineers in my company, I probably want to give them 10 working desks as well. But let's not go insane. If I have 1000 engineers, 100 desks must be enough for them. This must reduce costs.
The baseline memory consumption for systems (and amount of RAM!) is O(NCORES), not O(1). In some read-mostly cases it's possible to achieve O(1) memory consumption, and that's great. But if it's not the case here, let it be so.

I think you are drastically overstating what I am suggesting. The bad analogy isn't helping.

The only way we get contention is if we have the same function accessing the same counters within the function on multiple cores at the same time. It is entirely conceivable that programs which manage to do this for *every* core in a system with many hundreds of cores are rare. As a consequence, it might be a practical compromise to reduce the number of shards below the number of cores if the memory overhead is not providing commensurate performance. Clearly, measurements and such are needed here, but it is at least a tunable knob that we should know about and consider in our measurements.
 

I do not agree.
First, lots of programs determine the number of cores and use all of them.
Second, HPC applications all execute the same small computational kernel in all threads. Server applications can do mostly decryption or decompression or deserialization on all cores. Data-parallel client applications for video/image/audio processing execute the same code on all cores.
If we combine these, we get exactly that -- all cores execute the same small piece of code.

Kostya Serebryany

Apr 18, 2014, 5:29:06 AM
to Chandler Carruth, llv...@cs.uiuc.edu, Dmitry Vyukov
On Fri, Apr 18, 2014 at 1:21 PM, Chandler Carruth <chan...@google.com> wrote:
On Fri, Apr 18, 2014 at 2:10 AM, Kostya Serebryany <k...@google.com> wrote:
One more proposal: simple per-thread counters allocated with mmap(MAP_NORESERVE), the same trick that works so well for asan/tsan/msan.

Chrome has ~3M basic blocks instrumented for coverage, 
so even largest applications will hardly have more than, say, 10M basic blocks

I think this is a *gross* underestimation. I work with applications more than one order of magnitude larger than Chrome.

Agreed, Chrome is comparatively small. But the picture does not change even if we have 100M basic blocks. 
The hypothesis (which needs to be checked) is that every thread will touch a small portion of the BBs => a small portion of the pages in the counter array.
 
 
(number can be configurable at application start time). This gives us 80Mb for the array of 64-bit counters. 
That's a lot if multiplied by the number of threads, but the MAP_NORESERVE trick solves the problem -- 
each thread will only touch the pages where it actually increment the counters. 
On thread exit the whole 80Mb counter array are will be merged into a central array of counters and then discarded, 
but we can also postpone this until another new thread is created -- then we just reuse the counter array. 

This brings two challenges. 

#1. The basic blocks should be numbered sequentially. I see only one way to accomplish this: with the help of linker (and dynamic linker for DSOs).
The compiler would emit code using offsets that will later be transformed into constants by the linker. 
Not sure if any existing linker support this kind of thing. Anyone? 

#2. How to access the per-thread counter array. If we simply store the pointer to the array in TLS, the instrumentation will be more expensive just because of need to load and keep this pointer. 
If the counter array is part of TLS itself, we'll have to intrude into the pthread library (or wrap it) so that this part of TLS is mapped with MAP_NORESERVE. 

#3. It essentially *requires* a complex merge on shutdown rather than a simple flush.
yep
 
I'm not even sure how to do the merge without dirtying still more pages of the no-reserve memory.
and yep again. I don't know a way to check if a mmaped page is unused. 

--kcc 

Dmitry Vyukov

Apr 18, 2014, 5:30:01 AM
to Chandler Carruth, llv...@cs.uiuc.edu
Array processing is fast. Contention is slow. I would expect this to be a net win.
For the additional memory consumption during final merge, we can process one per-thread array, unmap it, process second array, unmap it, and so on. This will not require bringing all the pages into memory.



Chandler Carruth

Apr 18, 2014, 5:41:07 AM
to Dmitry Vyukov, llv...@cs.uiuc.edu

On Fri, Apr 18, 2014 at 2:30 AM, Dmitry Vyukov <dvy...@google.com> wrote:
It's not at all clear to me that this scales up (either in memory usage, memory reservation, or shutdown time) to larger applications. Chrome isn't a useful upper bound here.

Array processing is fast. Contention is slow. I would expect this to be a net win.
For the additional memory consumption during final merge, we can process one per-thread array, unmap it, process second array, unmap it, and so on. This will not require bringing all the pages into memory.

Array processing is fast, but paging in a large % of the pages in your address space is not at all fast. This will murder the kernel's page table, and do other very slow things I suspect.

Dmitry Vyukov

Apr 18, 2014, 5:53:18 AM
to Chandler Carruth, llv...@cs.uiuc.edu
If the pages were already written to, then the pages are already mapped. If the pages were not written to, and you are only *reading* them, then kernel premaps a single zero page for them. It's virtually zero cost for both time and memory.
I do not foresee any issues here.
So on second thought my previous statement that we need to unmap processed arrays is wrong, there is no need to unmap them.

Joerg Sonnenberger

Apr 18, 2014, 6:55:39 AM
to llv...@cs.uiuc.edu
On Fri, Apr 18, 2014 at 11:25:49AM +0400, Kostya Serebryany wrote:
> On Fri, Apr 18, 2014 at 5:24 AM, Joerg Sonnenberger <jo...@britannica.bec.de
> > wrote:
>
> > On Thu, Apr 17, 2014 at 04:21:46PM +0400, Kostya Serebryany wrote:
> > > - sharded counters: each counter represented as N counters sitting in
> > > different cache lines. Every thread accesses the counter with index
> > TID%N.
> > > Solves the problem partially, better with larger values of N, but then
> > > again it costs RAM.
> >
> > I'd strongly go with this schema with one tweak. Use the stack pointer
> > as base value with some fudging to not just use the lowest bits. It is
> > typically easier to get.
> >
>
> Indeed, several middle bits of %sp may be used instead of TID%N.
> This would heavily depend on the pthread implementation (how it allocates
> stacks) though.
> It may be tricky to come up with the same constant scheme across all
> platforms.

Since it doesn't have to be stable, just multiply with a (random)
constant and pick the high bits?
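That could look roughly like the following sketch (the multiplier and shift are arbitrary illustrative choices, not tuned values):

#include <stdint.h>

static inline unsigned shard_from_sp(unsigned shard_count) {
  char probe;                                  // address of a local ~ current %sp
  uintptr_t sp = reinterpret_cast<uintptr_t>(&probe);
  // Multiply by a large odd constant and keep the high bits, so that the
  // mostly-aligned, mostly-equal low bits of the stack pointer don't matter.
  uint64_t h = static_cast<uint64_t>(sp) * 0x9E3779B97F4A7C15ull;
  return static_cast<unsigned>(h >> 56) % shard_count;
}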

Xinliang David Li

Apr 18, 2014, 3:45:22 PM
to Dmitry Vyukov, llv...@cs.uiuc.edu
On Fri, Apr 18, 2014 at 12:13 AM, Dmitry Vyukov <dvy...@google.com> wrote:
Hi,

This is long thread, so I will combine several comments into single email.


>> - 8-bit per-thread counters, dumping into central counters on overflow.
>The overflow will happen very quickly with 8bit counter.

Yes, but it reduces contention by 256x (a thread must execute at least 256 loop iterations between increments). In practice, if you reduce contention below some threshold, it does not represent a problem anymore.



>> - per-thread counters. Solves the problem at huge cost in RAM per-thread
>It is not practical. Especially for TLS counters -- it creates huge pressure on stack memory.

Do we have any numbers about number of counters? If there are 100K 1-byte counters, I would consider it as practical.


A medium sized app I looked at has about 10M counters (arcs only).  It is also not uncommon to see such apps running with hundreds of threads.

 




> In Google GCC, we implemented another technique which proves to be very effective -- it is called FDO sampling.
> Basically counters will be updated every N samples.

How does it work?

Similar to how occurrence-based PMU sampling works. Setting the sampling period to 100 can reduce the instrumentation overhead by close to 100x without introducing much precision loss.
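A minimal sketch of what such a sampled update could look like (the period and the TLS countdown are illustrative; this is not the actual GCC implementation):

#include <stdint.h>

enum { kSamplePeriod = 100 };
static __thread int32_t sample_countdown = kSamplePeriod;

static inline void profile_increment(uint64_t *counter) {
  if (--sample_countdown > 0)
    return;                      // the common, cheap path: no shared write
  sample_countdown = kSamplePeriod;
  *counter += kSamplePeriod;     // account for the skipped updates in one shot
}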
 



>> It seems to me like we’re going to have a hard time getting good multithreaded performance without significant impact on the single-threaded behavior.
> I don't really agree.

>We are talking about developers here. Nobody would know the exact thread counts, but developers know the ballpark number

I strongly believe that we must relief developers from this choice during build time, and do our best to auto-tune (if the final scheme requires tuning).


That really depends.   If the tuning space is small, it won't be a problem for the developer/builder. 

 
First, such questions puts unnecessary burden on developers. We don't ask what register allocation algorithm to use for each function, right?

Crazy developers can actually do that via internal options, but this is a totally different case. People just need one flag to turn sharding on/off. When sharding is on, the compiler can pick the best 'N' according to some heuristics at compile time.

 
Second, there are significant chances we will get a wrong answer, because e.g. developer's view of how threaded is the app can differ from reality or from our classification.
Third, the app can be build by a build engineer; or the procedure can be applied to a base with 10'000 apps; or threaded-ness can change; or the app can work in different modes; or the app can have different phases.


We have forgotten to mention the benefit of implementation simplicity. If the static/simple solution solves the problem for most of the use cases, designing a fancy dynamic solution sounds like over-engineering to me. It (the overhead reduction technique) may also get in the way of easier functional enhancement in the future.

 David

Kostya Serebryany

Apr 23, 2014, 10:31:29 AM
to Xinliang David Li, Dmitry Vyukov, llv...@cs.uiuc.edu
I've run one proprietary benchmark that reflects a large portion of Google's server-side code.
-fprofile-instr-generate leads to a 14x slowdown due to counter contention. That's serious. 
Admittedly, there is a single hot function that accounts for half of that slowdown, 
but even if I rebuild that function w/o -fprofile-instr-generate, the slowdown remains above 5x.
This is not toy code that I've written to prove my point -- this is real code one may want to profile with -fprofile-instr-generate.
We need another approach for threaded code. 

There is another ungood feature of the current instrumentation. Consider this function: 
std::vector<int> v(1000);
void foo() { v[0] = 42; }

Here we have a single basic block and a call, but since the coverage instrumentation is emitted by the 
FE before inlining (and is also emitted for std::vector methods), we get this assembly at -O2:
0000000000400b90 <_Z3foov>:
  400b90:       48 ff 05 11 25 20 00    incq   0x202511(%rip)        # 6030a8 <__llvm_profile_counters__Z3foov>
  400b97:       48 ff 05 42 25 20 00    incq   0x202542(%rip)        # 6030e0 <__llvm_profile_counters__ZNSt6vectorIiSaIiEEixEm>
  400b9e:       48 8b 05 4b 26 20 00    mov    0x20264b(%rip),%rax        # 6031f0 <v>
  400ba5:       c7 00 2a 00 00 00       movl   $0x2a,(%rax)
  400bab:       c3                      retq   

Suddenly, an innocent function that uses std::vector becomes a terrible point of contention.
Full test case below; -fprofile-instr-generate leads to a 10x slowdown. 

=========================

Now, here is a more detailed proposal for the logarithmic self-cooling counter mentioned before. Please comment. 
The counter is a number of the form (2^k-1). 
It starts at 0.
After the first update it is 1.
After *approximately* 1 more update it becomes 3
After *approximately* 2 more updates it becomes 7
After *approximately* 4 more updates it becomes 15
...
After *approximately* 2^k more updates it becomes 2^(k+2)-1

The code would look like this:
  if ((fast_thread_local_rand() & counter) == 0)
    counter = 2 * counter + 1;

Possible implementation for fast_thread_local_rand: 
long fast_thread_local_rand() {
  static __thread long r;
  return r++;
}
Although I would try to find something cheaper than this. (Ideas?)


The counter is not precise (well, the current racy counters are not precise either).
But statistically it should be no more than 2x away from the real counter. 
Will this accuracy be enough for the major use cases? 

Moreover, this approach allows implementing the counter increment using a callback: 
  if ((fast_thread_local_rand() & counter) == 0)   __cov_increment(&counter);
which in turn will let us use the same hack as in AsanCoverage: use the PC to map the counter to the source code. 
(= no need to create separate section in the objects).
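
To make the idea concrete, here is a rough sketch of what such a hypothetical __cov_increment could do with the caller's PC; this is an assumption of mine, not an existing runtime interface, and since the callback fires only when the counter advances, even a lock is tolerable here:

// Sketch: a hypothetical __cov_increment that keys counts off the caller's
// PC (AsanCoverage-style) instead of a dedicated counters section.
#include <cstdint>
#include <mutex>
#include <unordered_map>

namespace {
std::mutex g_mu;
std::unordered_map<void *, uint64_t> g_pc_counts;   // call-site PC -> count
}

extern "C" void __cov_increment(uint64_t *counter) {
  void *pc = __builtin_return_address(0);  // identifies the instrumented site
  std::lock_guard<std::mutex> lock(g_mu);  // cold path: counter just advanced
  ++g_pc_counts[pc];
  *counter = 2 * *counter + 1;             // keep updating the in-place counter
}

int main() {
  uint64_t c = 0;
  __cov_increment(&c);                     // toy call just to exercise it
  return c == 1 ? 0 : 1;
}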

Thoughts? 

--kcc


% clang++ -O2 -lpthread coverage_mt_vec.cc && time ./a.out 
TIME: real: 0.219; user: 0.430; system: 0.000
% clang++ -O2 -lpthread -fprofile-instr-generate coverage_mt_vec.cc && time ./a.out 
TIME: real: 3.743; user: 7.280; system: 0.000

% cat coverage_mt_vec.cc
#include <pthread.h>
#include <vector>

std::vector<int> v(1000);

__attribute__((noinline)) void foo() { v[0] = 42; }
  
void *Thread1(void *) {
  for (int i = 0; i < 100000000; i++)
    foo(); 
  return 0;
}   

__attribute__((noinline)) void bar() { v[999] = 66; }

void *Thread2(void *) {
  for (int i = 0; i < 100000000; i++)
    bar();
  return 0;
}

int main() {
  static const int kNumThreads = 16;
  pthread_t t[kNumThreads];
  pthread_create(&t[0], 0, Thread1, 0);
  pthread_create(&t[1], 0, Thread2, 0);
  pthread_join(t[0], 0);
  pthread_join(t[1], 0);
  return 0;
}




Xinliang David Li

unread,
Apr 23, 2014, 12:13:01 PM4/23/14
to Kostya Serebryany, Dmitry Vyukov, llv...@cs.uiuc.edu
This is very neat. The concern is that for a long-running process the precision loss can be huge.  Run-to-run counter variance can also be very large (without atomic updates).

Moreover, this approach allows to implement the counter increment using a callback: 
  if ((fast_thread_local_rand() & counter) == 0)   __cov_increment(&counter);
which in turn will let us use the same hack as in AsanCoverage: use the PC to map the counter to the source code. 
(= no need to create separate section in the objects).

A similar callback mechanism (GCC-based) is used at Google for testing coverage.

David

Bob Wilson

unread,
Apr 23, 2014, 2:48:31 PM4/23/14
to Kostya Serebryany, Dmitry Vyukov, llv...@cs.uiuc.edu

I can see that the behavior of our current instrumentation is going to be a problem for the kinds of applications that you’re looking at. If you can find a way to get the overhead down without losing accuracy and without hurting the performance for applications without significant contention, then we can just adopt that. If you can’t do it without tradeoffs, then we should have a separate option to let those who care switch between different kinds of instrumentation.

Using the PC to map to the source code is simply not going to work with -fprofile-instr-generate. The mapping from counter values to the user-visible execution counts is complex and relies on detailed knowledge of the clang ASTs.

Kostya Serebryany

unread,
Apr 24, 2014, 4:16:44 AM4/24/14
to Bob Wilson, Dmitry Vyukov, llv...@cs.uiuc.edu

I can see that the behavior of our current instrumentation is going to be a problem for the kinds of applications that you’re looking at. If you can find a way to get the overhead down without losing accuracy and without hurting the performance for applications without significant contention, then we can just adopt that. If you can’t do it without tradeoffs, then we should have a separate option to let those who care switch between different kinds of instrumentation.
So far none of the ideas we discussed can guarantee the same single-threaded performance, memory consumption, and functionality as the current -fprofile-instr-generate.
We'll keep looking for the solution (and continue using AsanCoverage in the meantime). 
I am really surprised that the applications you care about do not have this issue, but you probably know better :)

--kcc 

Dmitry Vyukov

unread,
Apr 24, 2014, 4:33:54 AM4/24/14
to Bob Wilson, llv...@cs.uiuc.edu

What are your requirements for accuracy?
Current implementation does not provide 100% accuracy, so it's
something less than 100%. What is it?
What use cases for numeric counters (as opposed to bool flag) do we
need to support? Is it only feedback-driven optimizations?


> and without
> hurting the performance for applications without significant contention,

What is the acceptable threshold? 0.01% would be fine, right? What is
the maximum value that you are ready to agree with?

Duncan P. N. Exon Smith

unread,
Apr 25, 2014, 12:30:35 PM4/25/14
to Dmitry Vyukov, llv...@cs.uiuc.edu
(Sorry to jump in before reading the whole thread...)

On 2014-Apr-24, at 1:33, Dmitry Vyukov <dvy...@google.com> wrote:

> On Wed, Apr 23, 2014 at 10:48 PM, Bob Wilson <bob.w...@apple.com> wrote:
>> I can see that the behavior of our current instrumentation is going to be a
>> problem for the kinds of applications that you’re looking at. If you can
>> find a way to get the overhead down without losing accuracy
>
> What are your requirements for accuracy?
> Current implementation does not provide 100% accuracy, so it's
> something less than 100%.

Modulo any bugs, my understanding is that the current implementation
*does* provide 100% accuracy for single-threaded applications, as long
as the counters don't overflow. Am I missing something?

Dmitry Vyukov

unread,
Apr 25, 2014, 12:42:35 PM4/25/14
to Duncan P. N. Exon Smith, llv...@cs.uiuc.edu
On Fri, Apr 25, 2014 at 8:30 PM, Duncan P. N. Exon Smith
<dexon...@apple.com> wrote:
> (Sorry to jump in before reading the whole thread...)
>
> On 2014-Apr-24, at 1:33, Dmitry Vyukov <dvy...@google.com> wrote:
>
>> On Wed, Apr 23, 2014 at 10:48 PM, Bob Wilson <bob.w...@apple.com> wrote:
>>> I can see that the behavior of our current instrumentation is going to be a
>>> problem for the kinds of applications that you’re looking at. If you can
>>> find a way to get the overhead down without losing accuracy
>>
>> What are your requirements for accuracy?
>> Current implementation does not provide 100% accuracy, so it's
>> something less than 100%.
>
> Modulo any bugs, my understanding is that the current implementation
> *does* provide 100% accuracy for single-threaded applications, as long
> as the counters don't overflow. Am I missing something?


This is mostly correct. Probably modulo signals/interrupts (which can
arrive in the middle of an increment and lead to the loss of an arbitrary
number of increments -- namely, all the increments done in the handler itself).
But it's difficult to say what a single-threaded app is today, as lots
of libraries can create threads (it's usually not part of the contract).
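
For what it's worth, a tiny sketch of that window (my own illustration, not anything the compiler generates): the non-atomic increment is a load/add/store sequence, and a handler that runs between the load and the store has all of its own increments overwritten:

// Sketch: a signal arriving between the load and the store of a plain
// counter++ wipes out every increment the handler performed.
#include <csignal>
#include <cstdint>

volatile uint64_t counter;            // stand-in for a profile counter

void handler(int) {
  for (int i = 0; i < 1000; i++)
    counter = counter + 1;            // the handler's increments...
}

void hot_path() {
  uint64_t tmp = counter;             // load
  // <-- if the signal is delivered here, the store below discards
  //     everything the handler just added
  counter = tmp + 1;                  // store
}

int main() {
  std::signal(SIGUSR1, handler);
  hot_path();                         // in a real program the signal arrives
  return 0;                           // asynchronously, not on demand
}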

Bob Wilson

unread,
Apr 25, 2014, 12:44:38 PM4/25/14
to Dmitry Vyukov, llv...@cs.uiuc.edu
On Apr 24, 2014, at 1:33 AM, Dmitry Vyukov <dvy...@google.com> wrote:

I can see that the behavior of our current instrumentation is going to be a
problem for the kinds of applications that you’re looking at. If you can
find a way to get the overhead down without losing accuracy

What are your requirements for accuracy?
Current implementation does not provide 100% accuracy, so it's
something less than 100%. What is it?
What use cases for numeric counters (as opposed to bool flag) do we
need to support? Is it only feedback-driven optimizations?

That’s a fair point. The current implementation is potentially inaccurate because the counter increments are not thread-safe. In a low-contention situation, that won’t matter much, but the counts could become quite inaccurate if there are multiple threads running the same code at the same time.

I don’t have a specific goal in mind right now for accuracy. We plan to use this instrumentation for both PGO and code coverage. Some coverage users only care about a boolean check but others want to see the actual execution counts. If we display the count values and they are significantly different from the real execution counts, that will lead to much confusion.



and without
hurting the performance for applications without significant contention,

What is the acceptable threshold? 0.01% would be fine, right? What is
the maximum value that you are ready to agree with?

Like I said, I don’t have a specific value in mind. My sense is that most of the applications we care about are quite different from the massively parallel code that Google cares about. We may have many threads but they’re all doing different things and contention is much less likely to be a problem.

I really think we need to see specifics before we can decide anything here.

Duncan P. N. Exon Smith

unread,
Apr 25, 2014, 1:11:40 PM4/25/14
to Kostya Serebryany, Dmitry Vyukov, llv...@cs.uiuc.edu

On 2014-Apr-23, at 7:31, Kostya Serebryany <k...@google.com> wrote:

> I've run one proprietary benchmark that reflects a large portion of the google's server side code.
> -fprofile-instr-generate leads to 14x slowdown due to counter contention. That's serious.
> Admittedly, there is a single hot function that accounts for half of that slowdown,
> but even if I rebuild that function w/o -fprofile-instr-generate, the slowdown remains above 5x.
> This is not a toy code that I've written to prove my point -- this is real code one may want to profile with -fprofile-instr-generate.
> We need another approach for threaded code.
>
> There is another ungood feature of the current instrumentation. Consider this function:
> std::vector<int> v(1000);
> void foo() { v[0] = 42; }
>
> Here we have a single basic block and a call, but since the coverage is emitted by the
> FE before inlining (and is also emitted for std::vector methods) we get this assembler at -O2:
> 0000000000400b90 <_Z3foov>:
> 400b90: 48 ff 05 11 25 20 00 incq 0x202511(%rip) # 6030a8 <__llvm_profile_counters__Z3foov>
> 400b97: 48 ff 05 42 25 20 00 incq 0x202542(%rip) # 6030e0 <__llvm_profile_counters__ZNSt6vectorIiSaIiEEixEm>
> 400b9e: 48 8b 05 4b 26 20 00 mov 0x20264b(%rip),%rax # 6031f0 <v>
> 400ba5: c7 00 2a 00 00 00 movl $0x2a,(%rax)
> 400bab: c3 retq
>
> Suddenly, an innocent function that uses std::vector becomes a terrible point of contention.

I know you're just using std::vector<> as an example, but I think there
should be an option to avoid instrumenting the STL. STL
implementations have typically been hand-tuned already. I think the
benefit of PGO is extremely small there (although I have no numbers to
back this up).

> Full test case below, -fprofile-instr-generate leads to 10x slowdown.
>
> =========================
>
> Now, here is a more detailed proposal of logarithmic self-cooling counter mentioned before. Please comment.
> The counter is a number of the form (2^k-1).
> It starts with 0.
> After the first update it is 1.
> After *approximately* 1 more update it becomes 3
> After *approximately* 2 more updates it becomes 7
> After *approximately* 4 more updates it becomes 15
> ...
> After *approximately* 2^k more updates it becomes 2^(k+2)-1
>
> The code would look like this:
> if ((fast_thread_local_rand() & counter) == 0)
> counter = 2 * counter + 1;
>
> Possible implementation for fast_thread_local_rand:
> long fast_thread_local_rand() {
> static __thread long r;
> return r++;
> }
> Although I would try to find something cheaper that this. (Ideas?)

Very cool.

> The counter is not precise (well, the current racy counters are not precise either).
> But statistically it should be no more than 2x away from the real counter.
> Will this accuracy be enough for the major use cases?

I really like this idea. I think the loss of accuracy should be
opt-in (-fprofile-instr-generate=fuzzy), but I think this should be
available.

I think this could be cleanly integrated into -fprofile-instr-generate
without changing -fprofile-instr-use, compiler-rt, or llvm-profdata.
Having this option looks like a clear win to me.

> Moreover, this approach allows to implement the counter increment using a callback:
> if ((fast_thread_local_rand() & counter) == 0) __cov_increment(&counter);
> which in turn will let us use the same hack as in AsanCoverage: use the PC to map the counter to the source code.
> (= no need to create separate section in the objects).

As Bob said already, the PC hack doesn't work for the frontend-based
instrumentation. I'm skeptical even about relying on debug info for
retrieving function names. But optimizing the binary size is a
separate topic anyway.

Dmitry Vyukov

unread,
Apr 25, 2014, 1:22:27 PM4/25/14
to Bob Wilson, llv...@cs.uiuc.edu
On Fri, Apr 25, 2014 at 8:44 PM, Bob Wilson <bob.w...@apple.com> wrote:
>
> On Apr 24, 2014, at 1:33 AM, Dmitry Vyukov <dvy...@google.com> wrote:
>
>
> I can see that the behavior of our current instrumentation is going to be a
> problem for the kinds of applications that you’re looking at. If you can
> find a way to get the overhead down without losing accuracy
>
>
> What are your requirements for accuracy?
> Current implementation does not provide 100% accuracy, so it's
> something less than 100%. What is it?
> What use cases for numeric counters (as opposed to bool flag) do we
> need to support? Is it only feedback-driven optimizations?
>
>
> That’s a fair point. The current implementation is potentially inaccurate
> because the counter increments are not thread-safe. In a low-contention
> situation, that won’t matter much, but the counts could become quite
> inaccurate if there are multiple threads running the same code at the same
> time.
>
> I don’t have a specific goal in mind right now for accuracy. We plan to use
> this instrumentation for both PGO and code coverage. Some coverage users
> only care about a boolean check but others want to see the actual execution
> counts. If we display the count values and they are significantly different
> from the real execution counts, that will lead to much confusion.

If I looked at a coverage report with exact numbers and observed two
statements that must have the same hit count but show different hit
counts in the report, I would be highly confused.
The Go coverage profiler has 3 modes of operation: set, count and atomic:
http://golang.org/cmd/go/#hdr-Description_of_testing_flags
As you can guess, the last one (atomic) is intended for exactly such a
case -- when the user looks at exact numbers.
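
Roughly (my paraphrase, not Go's actual generated code), the three modes differ only in what each instrumentation point does:

// Sketch of what the three modes boil down to at an instrumentation point.
#include <atomic>
#include <cstdint>

uint8_t               hit;        // "set":    did this block ever execute?
uint64_t              count;      // "count":  plain racy increment
std::atomic<uint64_t> acount;     // "atomic": exact, but slowest when contended

inline void instrumentation_point() {
  hit = 1;                                           // set
  ++count;                                           // count
  acount.fetch_add(1, std::memory_order_relaxed);    // atomic
}

int main() {
  instrumentation_point();
  return hit == 1 ? 0 : 1;
}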

> and without
> hurting the performance for applications without significant contention,
>
>
> What is the acceptable threshold? 0.01% would be fine, right? What is
> the maximum value that you are ready to agree with?
>
>
> Like I said, I don’t have a specific value in mind. My sense is that most of
> the applications we care about are quite different from the massively
> parallel code that Google cares about. We may have many threads but they’re
> all doing different things and contention is much less likely to be a
> problem.


Note that contention can arise even if threads are doing different
things but use common components. E.g. if you include a custom malloc
implementation in your program, then everything that allocates
becomes a point of contention.


Bob Wilson

unread,
Apr 25, 2014, 1:28:49 PM4/25/14
to Dmitry Vyukov, llv...@cs.uiuc.edu
I haven’t worked with Go, but that is exactly the kind of distinction that I had in mind when I suggested that we may want to support different kinds of instrumentation. It sounds like the “set” mode is pretty appealing for your target applications, and I won’t be at all surprised if we end up wanting something like “atomic” for some code coverage users. I’m guessing that the current -fprofile-instr-generate approach is similar to the “count” mode.

>
>
>
>> and without
>> hurting the performance for applications without significant contention,
>>
>>
>> What is the acceptable threshold? 0.01% would be fine, right? What is
>> the maximum value that you are ready to agree with?
>>
>>
>> Like I said, I don’t have a specific value in mind. My sense is that most of
>> the applications we care about are quite different from the massively
>> parallel code that Google cares about. We may have many threads but they’re
>> all doing different things and contention is much less likely to be a
>> problem.
>
>
> Note that contention can arise even if threads are doing different
> things but use common components. E.g. if you include a custom malloc
> implementation into your program, then everything that allocates
> becomes a point of contention.

Yes, of course.

Dmitry Vyukov

unread,
Apr 25, 2014, 1:39:16 PM4/25/14
to Bob Wilson, llv...@cs.uiuc.edu

I think you are right.
Our compiler folks will correct me if I am wrong, but I would expect
that we also want something like a "scalable-count" mode for FDO. The
14x slowdown that we've observed on a server app will render most of
our apps broken due to pervasive timeouts. This mode must provide
precision good enough for FDO and minimal performance impact for
highly multi-threaded apps.
