If accuracy is not critical, incrementing the counters without any guards might be good enough.
How about switching to per-thread counters once a counter gets hot enough?

  if (counter_is_local) {            // still using the plain shared counter
    counter++;
    if (counter > threshold)         // hot: promote to per-thread updates
      counter_is_local = false;
  } else {
    increment_thread_local_counter();
  }
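To make that concrete, here is a minimal C++ sketch of the promotion scheme (the counter-array size and the merge-on-thread-exit runtime are my assumptions, not part of the proposal above):

  #include <atomic>
  #include <cstdint>

  constexpr unsigned kNumCounters = 1024;   // assumed size, for illustration
  constexpr uint64_t kHotThreshold = 65536;

  struct Counter {
    std::atomic<uint64_t> shared{0};  // contended until promoted
    std::atomic<bool> hot{false};     // sticky "promoted" flag
  };

  Counter counters[kNumCounters];
  // Per-thread slots; a real runtime would merge these into the shared
  // counters on thread exit.
  thread_local uint64_t tls_counters[kNumCounters];

  inline void increment(unsigned id) {
    Counter &c = counters[id];
    if (!c.hot.load(std::memory_order_relaxed)) {
      if (c.shared.fetch_add(1, std::memory_order_relaxed) >= kHotThreshold)
        c.hot.store(true, std::memory_order_relaxed);
    } else {
      ++tls_counters[id];  // uncontended once hot
    }
  }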
Hi,

The current design of -fprofile-instr-generate has the same fundamental flaw as the old gcc's gcov instrumentation: it has contention on counters. A trivial synthetic test case was described here:

For the problem to appear we need to have a hot function that is simultaneously executed by multiple threads -- then we will have high contention on the racy profile counters. Such a situation is not necessarily very frequent, but when it happens -fprofile-instr-generate becomes barely usable due to huge slowdown (5x-10x).
An example is the multi-threaded vp9 video encoder:

  cd libvpx/
  F="-no-integrated-as -fprofile-instr-generate"; CC="clang $F" CXX="clang++ $F" LD="clang++ $F" ./configure
  make -j32
  # get sample video from https://media.xiph.org/video/derf/y4m/akiyo_cif.y4m
  time ./vpxenc -o /dev/null -j 8 akiyo_cif.y4m

When running single-threaded, -fprofile-instr-generate adds a reasonable ~15% overhead (8.5 vs 10 seconds). When running with 8 threads, it has 7x overhead (3.5 seconds vs 26 seconds).

I am not saying that this flaw is a showstopper, but with the continued move towards multithreading it will be hurting more and more users of coverage and PGO. AFAICT, most of our PGO users simply can not run their software in single-threaded mode, and some of them surely have hot functions running in all threads at once. At the very least we should document this problem, but better to try fixing it.
Some ideas:
- per-thread counters. Solves the problem at a huge cost in RAM per thread.
- 8-bit per-thread counters, dumping into central counters on overflow.
- per-cpu counters (not portable, requires a very modern kernel with lots of patches).
- sharded counters: each counter represented as N counters sitting in different cache lines. Every thread accesses the counter with index TID%N. Solves the problem partially, better with larger values of N, but then again it costs RAM.
- reduce contention on hot counters by not incrementing them once they are big enough: {if (counter < 65536) counter++}. This reduces the accuracy though. Is that bad for PGO?
- self-cooling logarithmic counters: if ((fast_random() % (1 << counter)) == 0) counter++; (a sketch of these last two ideas follows below)
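To make the last two ideas concrete, a minimal C++ sketch (the xorshift generator here just stands in for whatever cheap thread-local fast_random() would really be used):

  #include <cstdint>

  // Cheap thread-local PRNG (xorshift64); any fast generator would do.
  static inline uint64_t fast_random() {
    thread_local uint64_t state = 0x9e3779b97f4a7c15ull;
    state ^= state << 13;
    state ^= state >> 7;
    state ^= state << 17;
    return state;
  }

  // Idea 5: stop touching the cache line once the counter is big enough.
  static inline void capped_increment(uint64_t &counter) {
    if (counter < 65536) counter++;
  }

  // Idea 6: self-cooling logarithmic counter -- the probability of an
  // increment halves each time the counter grows by one, so the stored
  // value approximates log2 of the true count.
  static inline void log_increment(uint64_t &counter) {
    uint64_t shift = counter < 63 ? counter : 63;  // avoid UB on huge counts
    if ((fast_random() % (uint64_t(1) << shift)) == 0) counter++;
  }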
Other thoughts?

--kcc
On 2014-Apr-17, at 10:38, Xinliang David Li <xinli...@gmail.com> wrote:
>
> Another idea is to use stack local counters per function -- synced up with global counters on entry and exit. The problem with it is that for deeply recursive calls, stack pressure can be too high.

I think they'd need to be synced with global counters before function calls as well, since any function call can call "exit()".
Is it possible to hook on something more clever than function entry?
Choosing to do this based on thread creation or something like that
could make this extra check very cheap.
> MAX is a fixed cap so even on systems with 100s of cores we don't do something
> silly. NUMBER_OF_THREADS, if supported on the OS, can limit the shards when we
> only have a small number of threads in the program. NUMBER_OF_CORES, if
> supported on the OS, can limit the shards. If we don't have the number of
> threads, we just use the number of cores. If we don't have the number of
> cores, we can just guess 8 (or something).
I like the general idea of this approach. The obvious TLS approach
worried me in that it explodes memory usage by much more than you'll
generally need.
> Then, we can gracefully fall back on the following strategies to pick an index
> into the shards:
>
> - Per-core non-atomic counter updates (if we support them) using restartable
> sequences
> - Use a core-ID as the index into the shards to statistically minimize the
> contention, and do the increments atomically so it remains correct even if the
> core-ID load happens before a core migration and the counter increment occurs
> afterward
> - Use (thread-ID % number of cores) if we don't have support for getting a
> core-ID from the target OS. This will still have a reasonable distribution I
> suspect, even if not perfect.
>
> Finally, I wouldn't merge on shutdown if possible. I would just write larger
> raw profile data for multithreaded runs, and let the profdata tool merge them.
Agreed. Doing the merging offline is a clear win. The space overhead is
short lived enough that it shouldn't be a big issue.
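(For a concrete picture of the offline step: raw profiles from several runs can be combined with the profdata tool, along the lines of `llvm-profdata merge -o merged.profdata run1.profraw run2.profraw`, where the file names are just placeholders.)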
> If this is still too much memory, then I would suggest doing the above, but
> doing it independently for each function so that only those functions actually
> called via multithreaded code end up sharding their counters.
I'd worry that doing this per-function might be too fine of granularity,
but there are certainly use cases where large chunks of the code are
only exercised by one (of many) threads, so something along these lines
could be worthwhile.
> I think this would be reasonably straightforward to implement, not
> significantly grow the cost of single-threaded instrumentation, and largely
> mitigate the contention on the counters. It can benefit from advanced hooks
> into the OS when those are available, but seems pretty easy to implement on
> roughly any OS with reasonable tradeoffs. The only really hard requirement is
> the ability to get a thread-id, but I think that is fairly reasonable (C++
> even makes this essentially mandatory).
>
> Thoughts?
Chandler Carruth <chan...@google.com> writes:
> if (thread-ID != main's thread-ID && shard_count < std::min(MAX, NUMBER_OF_CORES)) {
>   shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, NUMBER_OF_CORES));
>   // if shard_count changed with this, we can also call a library routine here
>   // that does the work of allocating the actual extra shards.
> }
Is it possible to hook on something more clever than function entry?
Choosing to do this based on thread creation or something like that
could make this extra check very cheap.
Good thinking, but why do you think runtime selection of the shard count is better than compile-time selection? For single-threaded apps, the shard count is always 1, so why pay the penalty of checking the thread id each time a function is entered?

For multi-threaded apps, I would expect MAX to be smaller than NUM_OF_CORES to avoid excessive memory consumption, so you always end up with N == MAX. If MAX is larger than NUM_OF_CORES, then for large MT apps the # of threads tends to be larger than NUM_OF_CORES, so it also ends up with N == MAX. In rare cases the shard count may switch between MAX and NUM_OF_CORES, but then you also pay the penalty of reallocating/memcpying the counter arrays each time it changes.

Making N a non-compile-time constant also makes the indexing more expensive. Of course we can ignore thread migration and do CSE on it.
On Thu, Apr 17, 2014 at 1:51 PM, Xinliang David Li <xinli...@gmail.com> wrote:
> Good thinking, but why do you think runtime selection of the shard count is better than compile-time selection? For single-threaded apps, the shard count is always 1, so why pay the penalty of checking the thread id each time a function is entered?

Because extremely few applications statically decide how many threads to use in the real world (in my experience). This is even more relevant if you consider each <unit of code, maybe post-inlined function> independently, where you might have many threads but near 0 overlapping functions on those threads. The number of cores also changes from machine to machine, and can even change based on the particular OS mode in which your application runs.

> For multi-threaded apps, I would expect MAX to be smaller than NUM_OF_CORES to avoid excessive memory consumption, so you always end up with N == MAX. If MAX is larger than NUM_OF_CORES, then for large MT apps the # of threads tends to be larger than NUM_OF_CORES, so it also ends up with N == MAX. In rare cases the shard count may switch between MAX and NUM_OF_CORES, but then you also pay the penalty of reallocating/memcpying the counter arrays each time it changes.

Sorry, this was just pseudo code, and very rough at that. The goal was to allow programs with >1 thread but significantly fewer threads than cores to not pay (in memory) for all of the shards. There are common patterns here, such as applications that are essentially single threaded but with one or two background threads. Also, the hard compile-time max is a compile-time constant, but the number of cores isn't (see above), so at least once per execution of the program we'll need to dynamically take the min of the two.

> Making N a non-compile-time constant also makes the indexing more expensive. Of course we can ignore thread migration and do CSE on it.

Yes, and a certain amount of this is actually fine, because the whole point was to minimize contention rather than perfectly eliminate it.
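To illustrate the sharding being discussed, a rough C++ sketch (the shard cap and the thread-id hashing are illustrative choices here, not a settled design):

  #include <atomic>
  #include <cstdint>
  #include <functional>
  #include <thread>

  constexpr unsigned kMaxShards = 8;  // the "MAX" cap from the discussion

  // Each shard lives on its own cache line to avoid false sharing.
  struct alignas(64) Shard {
    std::atomic<uint64_t> value{0};
  };

  struct ShardedCounter {
    Shard shards[kMaxShards];

    void increment() {
      // Hash the thread id into a shard index; after a core migration the
      // increment still lands somewhere valid, just possibly contended.
      size_t tid = std::hash<std::thread::id>()(std::this_thread::get_id());
      shards[tid % kMaxShards].value.fetch_add(1, std::memory_order_relaxed);
    }

    // Offline-style merge: sum the shards when the profile is written out.
    uint64_t total() const {
      uint64_t sum = 0;
      for (const Shard &s : shards)
        sum += s.value.load(std::memory_order_relaxed);
      return sum;
    }
  };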
On Thu, Apr 17, 2014 at 04:21:46PM +0400, Kostya Serebryany wrote:
> - sharded counters: each counter represented as N counters sitting in
> different cache lines. Every thread accesses the counter with index TID%N.
> Solves the problem partially, better with larger values of N, but then
> again it costs RAM.

I'd strongly go with this scheme, with one tweak: use the stack pointer as the base value, with some fudging to not just use the lowest bits. It is typically easier to get than a thread id.
> MAX is a fixed cap so even on systems with 100s of cores we don't do something silly.

Makes no sense to me. Why do you want only systems with up to 100 cores to be fast and scalable? 100+ core systems *especially* need good scalability (contention tends to be superlinear).
What you are proposing is basically: If I have 10 engineers in my company, I probably want to give them 10 working desks as well. But let's not go insane. If I have 1000 engineers, 100 desks must be enough for them. This must reduce costs.
The baseline memory consumption for systems (and amount of RAM!) is O(NCORES), not O(1). In some read-mostly cases it's possible to achieve O(1) memory consumption, and that's great. But if it's not the case here, let it be so.
Threads do not produce contention, it's cores that produce contention.
> shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, NUMBER_OF_CORES))
The formula must be: shard_count = k*NCORES
And if you want less memory in single-threaded case, then: shard_count = min(k*NCORES, c*NTHREADS)
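In code, that formula would look something like this (k and c are tuning constants that would have to be measured, and the thread count would have to come from the profiling runtime; both are assumptions here):

  #include <algorithm>
  #include <thread>

  // Tuning constants; the right values would have to be measured.
  constexpr unsigned k = 4;
  constexpr unsigned c = 1;

  unsigned shard_count(unsigned nthreads) {
    // hardware_concurrency() may return 0 if unknown; fall back to 8,
    // matching the "just guess 8" suggestion earlier in the thread.
    unsigned ncores = std::thread::hardware_concurrency();
    if (ncores == 0) ncores = 8;
    return std::max(1u, std::min(k * ncores, c * nthreads));
  }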
One more proposal: simple per-thread counters allocated with mmap(MAP_NORESERVE), the same trick that works so well for asan/tsan/msan.

Chrome has ~3M basic blocks instrumented for coverage, so even the largest applications will hardly have more than, say, 10M basic blocks (the number can be configurable at application start time). This gives us 80Mb for the array of 64-bit counters. That's a lot if multiplied by the number of threads, but the MAP_NORESERVE trick solves the problem -- each thread will only touch the pages where it actually increments the counters. On thread exit the whole 80Mb counter array will be merged into a central array of counters and then discarded, but we can also postpone this until another new thread is created -- then we just reuse the counter array.

This brings two challenges.
#1. The basic blocks should be numbered sequentially. I see only one way to accomplish this: with the help of the linker (and the dynamic linker for DSOs). The compiler would emit code using offsets that will later be transformed into constants by the linker. Not sure if any existing linker supports this kind of thing. Anyone?
#2. How to access the per-thread counter array. If we simply store the pointer to the array in TLS, the instrumentation will be more expensive just because of the need to load and keep this pointer. If the counter array is part of TLS itself, we'll have to intrude into the pthread library (or wrap it) so that this part of TLS is mapped with MAP_NORESERVE.
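A minimal sketch of the allocation and merge sides of this proposal (Linux-specific; the 10M figure is the estimate above, and the merge is deliberately simplified):

  #include <sys/mman.h>
  #include <cstddef>
  #include <cstdint>

  constexpr size_t kNumCounters = 10 * 1000 * 1000;           // ~10M basic blocks
  constexpr size_t kBytes = kNumCounters * sizeof(uint64_t);  // ~80Mb

  // Reserve address space for a per-thread array without committing RAM;
  // only pages the thread actually writes get backed by real memory.
  uint64_t *create_thread_counters() {
    void *p = mmap(nullptr, kBytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return p == MAP_FAILED ? nullptr : static_cast<uint64_t *>(p);
  }

  // On thread exit: fold the thread's counters into the central array,
  // then release the pages.
  void merge_and_destroy(uint64_t *local, uint64_t *central) {
    for (size_t i = 0; i < kNumCounters; i++)
      if (local[i])  // untouched pages read as zeros from the zero page
        __sync_fetch_and_add(&central[i], local[i]);
    munmap(local, kBytes);
  }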
On Fri, Apr 18, 2014 at 12:13 AM, Dmitry Vyukov <dvy...@google.com> wrote:
> MAX is a fixed cap so even on systems with 100s of cores we don't do something silly.
>
> Makes no sense to me. Why do you want only systems with up to 100 cores to be fast and scalable? 100+ core systems *especially* need good scalability (contention tends to be superlinear).

Please don't argue motives, and instead argue the technical merits. I do want systems with 100s of cores to be fast and scalable.

> What you are proposing is basically: If I have 10 engineers in my company, I probably want to give them 10 working desks as well. But let's not go insane. If I have 1000 engineers, 100 desks must be enough for them. This must reduce costs.
>
> The baseline memory consumption for systems (and amount of RAM!) is O(NCORES), not O(1). In some read-mostly cases it's possible to achieve O(1) memory consumption, and that's great. But if it's not the case here, let it be so.

I think you are drastically overstating what I am suggesting, and the bad analogy isn't helping. The only way we get contention is if we have the same function accessing the same counters within the function on multiple cores at the same time. It is entirely conceivable that programs which manage to do this for *every* core in a system with many hundreds of cores are rare. As a consequence, it might be a practical compromise to reduce the number of shards below the number of cores if the memory overhead is not providing commensurate performance. Clearly, measurements and such are needed here, but it is at least a tunable knob that we should know about and consider in our measurements.
On Fri, Apr 18, 2014 at 2:10 AM, Kostya Serebryany <k...@google.com> wrote:
> One more proposal: simple per-thread counters allocated with mmap(MAP_NORESERVE), the same trick that works so well for asan/tsan/msan. Chrome has ~3M basic blocks instrumented for coverage, so even the largest applications will hardly have more than, say, 10M basic blocks

I think this is a *gross* underestimation. I work with applications more than one order of magnitude larger than Chrome.

> (the number can be configurable at application start time). This gives us 80Mb for the array of 64-bit counters. [...]
> #1. The basic blocks should be numbered sequentially. [...]
> #2. How to access the per-thread counter array. [...]

#3. It essentially *requires* a complex merge on shutdown rather than a simple flush. I'm not even sure how to do the merge without dirtying still more pages of the no-reserve memory.

It's not at all clear to me that this scales up (either in memory usage, memory reservation, or shutdown time) to larger applications. Chrome isn't a useful upper bound here.

Array processing is fast. Contention is slow. I would expect this to be a net win. As for the additional memory consumption during the final merge, we can process one per-thread array, unmap it, process the second array, unmap it, and so on. This will not require bringing all the pages into memory.
Hi,
This is a long thread, so I will combine several comments into a single email.
>> - 8-bit per-thread counters, dumping into central counters on overflow.
> The overflow will happen very quickly with an 8-bit counter.

Yes, but it reduces contention by 256x (a thread must execute at least 256 loop iterations between increments of the central counter). In practice, if you reduce contention below some threshold, it does not represent a problem anymore.
>> - per-thread counters. Solves the problem at huge cost in RAM per-thread
> It is not practical. Especially for TLS counters -- it creates huge pressure on stack memory.

Do we have any numbers on the number of counters? If there are 100K 1-byte counters, I would consider it practical.
> In Google GCC, we implemented another technique which proves to be very effective -- it is called FDO sampling.
> Basically counters will be updated every N samples.

How does it work?
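The thread doesn't spell out the mechanism, but one plausible shape for such sampled instrumentation is something like the following (entirely my guess, with made-up names):

  #include <cstdint>

  constexpr uint32_t kSamplePeriod = 100;  // "N" -- update every N executions

  thread_local uint32_t sample_countdown = kSamplePeriod;

  // Instead of incrementing on every execution, only the 1-in-N sampled
  // executions touch the shared counter, scaled back up by N.
  inline void sampled_increment(uint64_t &counter) {
    if (--sample_countdown == 0) {
      sample_countdown = kSamplePeriod;
      counter += kSamplePeriod;  // approximate the skipped increments
    }
  }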
>> It seems to me like we're going to have a hard time getting good multithreaded performance without significant impact on the single-threaded behavior.
> I don't really agree.
> We are talking about developers here. Nobody would know the exact thread counts, but developers know the ballpark number

I strongly believe that we must relieve developers of this choice at build time, and do our best to auto-tune (if the final scheme requires tuning).

First, such questions put an unnecessary burden on developers. We don't ask what register allocation algorithm to use for each function, right?

Second, there is a significant chance we will get a wrong answer, because e.g. the developer's view of how threaded the app is can differ from reality or from our classification.

Third, the app can be built by a build engineer; or the procedure can be applied to a codebase with 10'000 apps; or the threaded-ness can change; or the app can work in different modes; or the app can have different phases.
Moreover, this approach allows us to implement the counter increment using a callback:

  if ((fast_thread_local_rand() & counter) == 0) __cov_increment(&counter);

which in turn will let us use the same hack as in AsanCoverage: use the PC to map the counter to the source code (= no need to create a separate section in the objects).
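A self-contained sketch of that increment path (the xorshift PRNG and the callback body are placeholders; only the guarding expression comes from the snippet above):

  #include <cstdint>

  // Cheap per-thread PRNG; any fast thread-local generator would do.
  static inline uint64_t fast_thread_local_rand() {
    thread_local uint64_t s = 88172645463325252ull;
    s ^= s << 13; s ^= s >> 7; s ^= s << 17;
    return s;
  }

  // Hypothetical runtime callback: bumps the counter, and (in the real
  // scheme) would use the return PC to map it back to source, as
  // AsanCoverage does.
  extern "C" void __cov_increment(uint64_t *counter) { ++*counter; }

  // Instrumented site: the more bits set in the counter, the rarer the
  // out-of-line callback fires.
  static inline void cov_hit(uint64_t &counter) {
    if ((fast_thread_local_rand() & counter) == 0)
      __cov_increment(&counter);
  }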
—kcc
I can see that the behavior of our current instrumentation is going to be a problem for the kinds of applications that you’re looking at. If you can find a way to get the overhead down without losing accuracy and without hurting the performance for applications without significant contention, then we can just adopt that. If you can’t do it without tradeoffs, then we should have a separate option to let those who care switch between different kinds of instrumentation.
What are your requirements for accuracy?
Current implementation does not provide 100% accuracy, so it's
something less than 100%. What is it?
What use cases for numeric counters (as opposed to bool flag) do we
need to support? Is it only feedback-driven optimizations?
> and without
> hurting the performance for applications without significant contention,
What is the acceptable threshold? 0.01% would be fine, right? What is
the maximum value that you are ready to agree with?
This is mostly correct, probably modulo signals/interrupts, which can arrive in the middle of an increment and lead to the loss of an arbitrary number of increments (all the increments in the handler itself).

But it's difficult to say what a single-threaded app is today, as lots of libraries can create threads (it's usually not part of the contract).
If I looked at a coverage report with exact numbers and observed 2 statements that must have the same hit count, but in the report they have different hit counts, I would be highly confused.
Go coverage profiler has 3 modes of operation: set, count and atomic:
http://golang.org/cmd/go/#hdr-Description_of_testing_flags
As you can guess, the last one (atomic) is intended for exactly this case -- when the user looks at exact numbers.
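In C++ terms, the difference between the "count" and "atomic" modes is roughly the following (illustrative only, not Go's actual implementation):

  #include <atomic>
  #include <cstdint>

  uint64_t plain_counter;                 // "count" mode: fast but racy --
  void count_mode() { plain_counter++; }  // concurrent increments can be lost

  std::atomic<uint64_t> exact_counter;    // "atomic" mode: exact hit counts,
  void atomic_mode() {                    // but contended under parallelism
    exact_counter.fetch_add(1, std::memory_order_relaxed);
  }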
> and without
> hurting the performance for applications without significant contention,
>
>
> What is the acceptable threshold? 0.01% would be fine, right? What is
> the maximum value that you are ready to agree with?
>
>
> Like I said, I don’t have a specific value in mind. My sense is that most of
> the applications we care about are quite different from the massively
> parallel code that Google cares about. We may have many threads but they’re
> all doing different things and contention is much less likely to be a
> problem.
Note that contention can arise even if threads are doing different
things but use common components. E.g. if you include a custom malloc
implementation into your program, then everything that allocates
becomes a point of contention.
I think you are right.
Our compiler folks will correct me if I am wrong, but I would expect
that we also want something like a "scalable-count" mode for FDO. The
14x slowdown that we've observed on a server app will render most of
our apps broken due to pervasive timeouts. This mode must provide
precision good enough for FDO and minimal performance impact for
highly multi-threaded apps.