TL;DR: std::string is responsible for almost half of all allocations in the Chrome browser process; please be careful how you use it!
In the course of optimizing SyzyASan performance, the Syzygy team discovered that nearly 25000 (!!) allocations are made for every keystroke in the Omnibox. We've since built some rudimentary memory profiling tools and have found a few issues:
std::strings being passed as const char* (via c_str()) and then converted back to std::string
Using wrappers to StringAppendV, such as StringPrintf, to concatenate strings or for simple number conversions (see note below)
Building a temporary set just to call find on it and return true/false if a value is present, which makes O(n) allocations and runs in O(n log n) time. Ouch!
Not reserving space in a vector when the size is known
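To make the last two anti-patterns concrete, here is a standalone sketch (the function names are made up for illustration; this is not actual Chromium code):

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Anti-pattern: building a temporary set only to call find() once.
// O(n) node allocations and O(n log n) construction for a single
// membership test.
bool ContainsViaSet(const std::vector<int>& values, int needle) {
  std::set<int> lookup(values.begin(), values.end());
  return lookup.find(needle) != lookup.end();
}

// A linear scan allocates nothing and is O(n).
bool Contains(const std::vector<int>& values, int needle) {
  return std::find(values.begin(), values.end(), needle) != values.end();
}

// Reserving when the final size is known turns roughly log2(n)
// reallocations (and the element copies they imply) into one allocation.
std::vector<int> Squares(int n) {
  std::vector<int> v;
  v.reserve(n);
  for (int i = 0; i < n; ++i)
    v.push_back(i * i);
  return v;
}
```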
We have put together the following CLs, which reduce the number of temporary allocations: 753603002, 747013003 & 745373002. The impact of applying these patches is a decrease of around 10% in the total number of allocations made by the browser process (roughly two-thirds of those are strings), all of which correspond to temporary allocations (a new immediately followed by a delete). We are looking at ways of getting better metrics, including gains in CPU time. If anyone has a good idea of how we can better benchmark the impact of these CLs, we would love your input.
Also note that these optimizations do not reduce the memory footprint of Chrome; they reduce temporary allocations and increase performance. We are also looking into the problem of total memory footprint.
Following these first tests, here are some observations that we noted:
StringPrintf and its variants make a lot of temporary allocations and are not a good fit for string concatenation. We ran a test pitting StringPrintf against append for string concatenation and found the latter 2.5x to 3x faster. This is not surprising after analyzing base/strings/stringprintf.cc.
Often, a std::string is passed as a const char* (using c_str()), only to be converted back to a std::string. Using base::StringPiece, or passing a const std::string& all the way down, is much more efficient.
When dealing with a vector or string, space is not always reserved up front, which increases the number of allocations during insertions.
Note: StringAppendVT, which is used by StringPrintf and its variants, is, for various reasons, quite complex code. Many temporaries are created and a lot of copying is performed. We're looking into ways to optimize this heavily used function.
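As an illustration of why append wins, here is a standalone sketch of the two concatenation styles. JoinPrintf and JoinAppend are hypothetical names, and the snprintf stand-in is much simpler than the real StringAppendVT; it exists only to show the extra buffer and copy that printf-style concatenation implies:

```cpp
#include <cstdio>
#include <string>

// Printf-style concatenation: format parsing, a temporary buffer, and a
// copy into a brand-new string. (Sketch assumes inputs fit in the buffer.)
std::string JoinPrintf(const std::string& a, const std::string& b) {
  char buf[256];
  std::snprintf(buf, sizeof(buf), "%s/%s", a.c_str(), b.c_str());
  return std::string(buf);  // one more copy into a fresh allocation
}

// Append-style: reserve once, then copy bytes. No format parsing and no
// intermediate buffer.
std::string JoinAppend(const std::string& a, const std::string& b) {
  std::string out;
  out.reserve(a.size() + 1 + b.size());
  out.append(a).append(1, '/').append(b);
  return out;
}
```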
--
--
Chromium Developers mailing list: chromi...@chromium.org
View archives, change email options, or unsubscribe:
http://groups.google.com/a/chromium.org/group/chromium-dev
Hard to get meaningful numbers here (we're trying), but these changes do alleviate heap lock contention. It's worth noting that Chrome in general is *very* close to the threshold where heap lock contention causes noticeable UI lag. With ASAN instrumentation, the additional bit of work that occurs in the allocator is enough to push it over that edge on some machines. So while we're not typically affected by this in a normal release of Chrome, instrumented versions definitely feel it. The amount of breathing room is even less on underpowered machines (i.e. mobile), so it is an issue worth taking note of.
Regardless of how good your allocator is there is going to be some amount of extra fragmentation induced when you increase the volume of allocations, and hopefully this will show up in cache performance. Georges is actively working on measuring this.
The StringPrintf issues have more to do with the way the code is written than with vsnprintf itself. It doesn't appear that its ability to return the required buffer size is actually being used, other than on the Windows platform. Maybe not all platforms are consistent with this? Either way, it might be worthwhile to look into optimizing base::StringAppendVT (the workhorse function for StringPrintf).
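One plausible shape for that optimization, sketched here with a hypothetical FormatOnce helper rather than the actual base code, is to use vsnprintf's C99 return value (the size the output would need) to measure first and then allocate exactly once:

```cpp
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>

// Two-pass formatting: pass 1 measures with a null buffer, pass 2 fills a
// string sized exactly right. One allocation, no growth, no retry loop.
std::string FormatOnce(const char* fmt, ...) {
  va_list ap;
  va_start(ap, fmt);
  const int needed = std::vsnprintf(nullptr, 0, fmt, ap);  // pass 1: measure
  va_end(ap);
  if (needed < 0)
    return std::string();  // formatting error
  std::string out(static_cast<std::size_t>(needed) + 1, '\0');  // room for NUL
  va_start(ap, fmt);
  std::vsnprintf(&out[0], out.size(), fmt, ap);  // pass 2: fill, exactly once
  va_end(ap);
  out.resize(static_cast<std::size_t>(needed));  // drop the trailing NUL
  return out;
}
```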
Overall, I think it worthwhile to highlight the problem in the name of education; these allocations are generally trivial to avoid when writing code, and a little extra thought at design time can significantly reduce the number of temporary allocations incurred.
On Mon, Dec 1, 2014 at 5:09 PM, Chris Hamilton <chr...@chromium.org> wrote:

> Hard to get meaningful numbers here (we're trying), but these changes do
> alleviate heap lock contention. It's worth noting that Chrome in general
> is *very* close to the threshold where heap lock contention causes
> noticeable UI lag. With ASAN instrumentation the additional bit of work
> that occurs in the allocator is enough to push it over that edge on some
> machines. So while we're not typically affected by this in a normal
> release of Chrome, instrumented versions definitely feel it. The amount of
> breathing room here is even less on underpowered machines (i.e. mobile),
> so it is an issue worth taking note of.

Right, Tom Hudson reminded me that we aren't using tcmalloc on Android, but dlmalloc, so the insanely awesome 50ns malloc/free pairs (from http://goog-perftools.sourceforge.net/doc/tcmalloc.html ) that we get on desktop aren't true for mobile.

Note that you mention heap lock contention, but tcmalloc's use of thread-local structures is intentionally designed to avoid this. Are you talking about the general case when using instrumentation / WinHeap? Otherwise, I'm quite surprised we're even close at all.
> Regardless of how good your allocator is, there is going to be some
> amount of extra fragmentation induced when you increase the volume of
> allocations, and hopefully this will show up in cache performance.
> Georges is actively working on measuring this.

Well, in the general case, yeah, but as I was chatting with Tom, std::string is somewhat of a special little butterfly. By that I mean its allocation strategy is fairly fixed in both initial size (due to the small string optimizations implemented by, at last check, all of our std::string libraries) and growth strategy (typically 2x). So not only do these allocations fall rather nicely into the bucket sizes of most allocators, but their allocations/frees are highly reusable among other strings, especially in these "large allocations of small strings" cases.
That's why I asked for some of the measurements: of all the objects, std::string is perhaps the nicest behaved with allocators, especially tcmalloc, so I'm surprised this would even come up, except under instrumentation (or, as pointed out, on Android).
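The fixed initial size and geometric growth described above are easy to observe directly. A sketch (the SSO capacity and growth factor are implementation-defined, so the exact counts vary by standard library):

```cpp
#include <cstddef>
#include <string>

// Counts how many (re)allocations a string performs while growing one
// character at a time. Short strings live inline (small string
// optimization), so growth starts from the SSO capacity; past that,
// capacity grows geometrically, giving only O(log n) allocations whose
// sizes fall neatly into allocator bucket sizes.
std::size_t HeapAllocationsForAppends(std::size_t total_chars) {
  std::string s;
  std::size_t allocations = 0;
  std::size_t last_capacity = s.capacity();  // SSO capacity: no heap yet
  for (std::size_t i = 0; i < total_chars; ++i) {
    s.push_back('x');
    if (s.capacity() != last_capacity) {
      ++allocations;  // a capacity change implies a fresh allocation
      last_capacity = s.capacity();
    }
  }
  return allocations;
}
```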
> The StringPrintf issues have more to do with the way the code is written
> than with vsnprintf itself. It doesn't appear that its ability to return
> the required buffer size is actually being used, other than on the
> Windows platform. Maybe not all platforms are consistent with this?
> Either way, it might be worthwhile to look into optimizing
> base::StringAppendVT (the workhorse function for StringPrintf).

Right, it sounds like you're saying StringPrintf vs SStringPrintf, thus forcing new strings when an existing string may be perfect (and of an optimal size due to the prior growth strategy).
> Overall, I think it worthwhile to highlight the problem in the name of
> education; these are generally trivial to avoid when writing code, and a
> little extra thought at design time can significantly reduce the number
> of temporary allocations incurred.

Violent agreement here that more education about common, easily fixed anti-patterns is a good thing. However, I'm not entirely convinced it's a "std::string emergency", and am nervous considering some of the trends I've seen where optimization tends to micro-optimize past readability.

So we're 90% in agreement, 10% in concern, and 100% curious about the data behind this, so we can make more informed recommendations/guidelines and justifiably weight some of the readability-vs-complexity discussions that these threads can spawn :)
Sorry, I should have pointed that out: we're stuck with using the WinHeap when instrumented. (This actually raises a question I don't know the answer to: when using tcmalloc, what happens to the process heap? Presumably it's still a WinHeap, or is it somehow redirected as well?)
On Mon Dec 01 2014 at 12:25:18 PM Ryan Sleevi <rsl...@chromium.org> wrote:

> Right, it sounds like you're saying StringPrintf vs SStringPrintf, thus
> forcing new strings when an existing string may be perfect (and an
> optimal size due to prior growth strategy)

I haven't looked at the implementation of SStringPrintf, but it sounds like it is better behaved. I also agree with the general sentiment that a sequence of explicit string appends is generally uglier, from a readability standpoint, than a single StringPrintf call. Georges, maybe look into addressing the problem at the level of StringPrintf directly? Or switch to using SStringPrintf where appropriate.

> Violent agreement here that more education of common, easily fixed
> anti-patterns is a good thing. However, I'm not entirely convinced it's
> an "std::string emergency", and am nervous considering some of the trends
> I've seen on optimization that might tend to microoptimize past
> readability.

We'll try to come back with some more data.

Cheers,
Chris
Only partially related to this thread, but could someone comment on the performance of dlmalloc on Android? What should we expect from it? Are there improvements coming?
The measurement has only been in memory space right now. We were chasing down an apparent UI lag in the Omnibox when using ASAN instrumentation, and observed some excessive use of temporaries.
Note that this isn't really about strings being evil, or temporary allocations being bad. It's about being careful with interfaces, as there are cases where we swap from const std::string& to const char* and back, and this impedance mismatch is simply wasteful. The focus should be on using base::StringPiece where usable so as not to incur temporaries and copying at every second function call.
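A standalone sketch of that impedance mismatch, using C++17 std::string_view as a stand-in for base::StringPiece (the function names are hypothetical):

```cpp
#include <string>
#include <string_view>

// Wasteful chain: std::string -> c_str() -> const char* -> new std::string.
// Every hop below the first re-measures and re-copies the bytes.
bool HasSchemeLegacy(const char* url) {
  std::string copy(url);  // allocation + copy at the boundary
  return copy.compare(0, 7, "http://") == 0;
}

// With a string piece, every layer shares the caller's bytes; no
// allocation happens anywhere in the chain, and substr of a view is free.
bool HasScheme(std::string_view url) {
  return url.substr(0, 7) == std::string_view("http://");
}
```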
Chromebook Pixel 40.0.2194.3 (dev channel) is pretty janky for me.
On Mon, Dec 1, 2014 at 3:43 PM, Peter Kasting <pkas...@chromium.org> wrote:
> On Mon, Dec 1, 2014 at 3:23 PM, James Cook <jame...@chromium.org> wrote:
>>
>> Chromebook Pixel 40.0.2194.3 (dev channel) is pretty janky for me.
>
>
> OmahaProxy reports that that build has a base position of 300226.
> Enabling RenderTextHarfBuzz on CrOS landed at 300348.
>
> I believe you can try using about:flags to flip "HarfBuzz for UI text" to
> "Enabled". If that doesn't help, file a bug about the particular jank
> you're seeing and if it involves particular omnibox inputs we can ask you
> for the relevant histogram data.
>
> PK
>
- If you control the whole tree of functions below you, or are a leaf, then lean towards const StringPiece&. Otherwise, lean towards const std::string&.
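A sketch of that guideline, again with C++17 std::string_view standing in for base::StringPiece (both functions are hypothetical):

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Leaf function: it never needs an owned string, so taking a string piece
// costs nothing for any caller, whatever they hold.
bool IsBlank(std::string_view s) {
  return s.find_first_not_of(" \t") == std::string_view::npos;
}

// This function hands the value to code that must own a std::string.
// Taking const std::string& means a caller who already has one pays
// nothing, whereas a string-piece parameter would force a conversion copy
// at this boundary anyway.
std::size_t StoreAndMeasure(const std::string& s, std::string* storage) {
  *storage = s;  // ownership is needed: a copy is unavoidable here
  return storage->size();
}
```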
On Mon, Dec 1, 2014 at 10:15 AM, Chris Hamilton <chr...@chromium.org> wrote:

> Sorry, I should have pointed that out: we're stuck with using the WinHeap
> when instrumented. (This actually raises a question I don't know the
> answer to: when using tcmalloc, what happens to the process heap?
> Presumably it's still a WinHeap, or is it somehow redirected as well?)

We don't touch HeapAlloc/HeapCreate, etc. (and indeed we had some ridiculous number of heaps at last count). Only the CRT-level functions are revectored (the stuff in https://code.google.com/p/chromium/codesearch#chromium/src/base/allocator/prep_libc.py&l=59 ).

However, because of tcmalloc's inability to release address space, cpu@ has been experimenting with not using tcmalloc on Windows, and at the moment at head, we do not. https://code.google.com/p/chromium/codesearch#chromium/src/build/common.gypi&l=1265
+jschuh: Don't we have some important security enhancements in Chromium's tcmalloc?
On Monday, December 1, 2014 10:43:05 AM UTC-8, Scott Graham wrote:

> However, because of tcmalloc's inability to release address space, cpu@
> has been experimenting with not using tcmalloc on Windows, and at the
> moment at head, we do not.

It is still a bit early to tell, but the switch to the WinHeap had a small impact on perf, almost exclusively on Windows XP [1] doing SVG, and guess what, some very nice speedups; see the bug for details.

Could be that either the WinHeap is not as bad as this thread would lead you to believe, or tcmalloc is not as awesome as this thread would lead you to believe. Maybe I get to do a MythBusters episode, or maybe I get to eat my words.
On Tue, Dec 2, 2014 at 10:52 AM, Carlos Pizano <c...@chromium.org> wrote:

> It is still a bit early to tell, but the switch to the WinHeap had a
> small impact on perf, almost exclusively on Windows XP [1] doing SVG, and
> guess what, some very nice speedups; see the bug for details. Could be
> that either the WinHeap is not as bad as this thread would lead you to
> believe, or tcmalloc is not as awesome as this thread would lead you to
> believe. Maybe I get to do a MythBusters episode, or maybe I get to eat
> my words.

I think tcmalloc isn't that awesome. Or, more specifically, it probably is awesome for massively multithreaded server loads where there's lots of memory to spare. That is not the workload we need an allocator optimized for in Chrome. Switching parts of Blink from tcmalloc to PartitionAlloc led to speedups and memory savings.

We were thinking of switching the browser process away from tcmalloc to PartitionAlloc, which is tracked (and stalled) at https://code.google.com/p/chromium/issues/detail?id=339604. On Linux x64, we saw 30% memory savings in some cases.

Linking back to the std::string discussion, the comment in that bug at https://code.google.com/p/chromium/issues/detail?id=339604#c66 is interesting: it's a histogram of allocation sizes, and we do seem to bias towards very small sizes. Perhaps that hits a pathological case in tcmalloc.