Be careful when using std::string

50,096 views
Skip to first unread message

Georges Khalil

unread,
Dec 1, 2014, 11:12:37 AM12/1/14
to “chromium-dev@chromium.org”

TL;DR: std::string is responsible for almost half of all allocations in the Chrome browser process; please be careful how you use it!



In the course of optimizing SyzyASan performance, the Syzygy team discovered that nearly 25000 (!!) allocations are made for every keystroke in the Omnibox. We've since built some rudimentary memory profiling tools and have found a few issues:


  • strings being passed as char* (using c_str) and then converted back to string

  • Using wrappers to StringAppendV, such as StringPrintf, to concatenate strings or for simple number conversions (see note below)

  • Using a temporary set to search for a specific value, which creates O(n) allocations and runs in O(n.logn) time, only to call find on it to return true/false if the value is present. Ouch!

  • Not reserving space in a vector when the size is known


We have put together the following CLs which reduce the number of temporary allocations: 753603002, 747013003 & 745373002. The impact seen by applying these patches is a decrease of around 10% of the total number of allocations made by the browser process (~2/3rd of those are strings), all of which correspond to temporary allocations (a new followed by a delete). We are looking at ways of getting better metrics, including gains on the cpu time. If anyone has a good idea of how we can better benchmark the impact of those CLs, we would love your input.


Also note that these optimizations do not reduce the memory footprint of Chrome; they reduce temporary allocations and increase performance. We are also looking into the problem of total memory footprint.


Following these first tests, here are some observations that we noted:

  • StringPrintF and its variants make a lot of temporary allocations and are not a good use for string concatenation. We ran a test pitting StringPrintf against append for string concatenation and found the latter 2.5x to 3x faster. This is not surprising when analyzing base/strings/stringprintf.cc

  • A lot of times, a std::string is being passed as a const char* (using c_str()), only to be converted back to std::string. Using base::StringPiece or passing a const ref to a std::string all the way is much more efficient

  • When dealing with a vector/string, memory reservation is not always used, which increases the number of allocations during insertions


Note: StringAppendVT, which is used by StringPrintf and other variants, is for various reasons, quite complex code. Many temporaries are created and much copying performed. We're looking into ways to optimize this much used function.


Ryan Sleevi

unread,
Dec 1, 2014, 11:24:38 AM12/1/14
to geor...@chromium.org, “chromium-dev@chromium.org”
Do you have numbers quantifying the allocation cost with other metrics and resources, such as CPU time, cache misses, memory usage, etc?

I ask because much of the focus in the past years has been on improving allocator performance such that CPU time, cache misses, and memory uses are significantly negligible in the common case.

For profiling tools, this certainly gives them a hard time, and when using things like the WinHeap, this can be a downright awful time. But that's largely irrelevant for our users, and that's where we should be spending resources on.

That said, your comments about StringPrintF surprise me, as generally vnsprintf and friends are more favourable to string usage (because they can pre-scan to find the desired length prior to the actual conversion, thus reducing the growth factor typically employed by .append()). Is it a fundamental issue in StringPrintF, or is it more of a usage anti-pattern compared to SStringPrintF (which can use the supplied string, thus its already-allocated and grown value)?

While there is some good practices involved here - e.g. not using set<> for temporaries (especially large ones) and greater use of base::StringPiece - I'd feel comfortable knowing that measurable impact this has in the "real world", given that the focus on so many allocators has been code will behave exactly as you described.

Chris Hamilton

unread,
Dec 1, 2014, 12:09:41 PM12/1/14
to rsl...@chromium.org, geor...@chromium.org, “chromium-dev@chromium.org”
Hard to get meaningful numbers here (we're trying), but these changes do alleviate heap lock contention. It's worth noting that Chrome in general is *very* close to the threshold where heap lock contention causes noticeable UI lag. With ASAN instrumentation the additional bit of work that occurs in the allocator is enough to push it over that edge on some machines. So while we're not typically affected by this in a normal release of Chrome, instrumented versions definitely feel it. The amount of breathing room here is even less on underpowered machines (ie: mobile), so it is an issue worth taking note it.

Regardless of how good your allocator is there is going to be some amount of extra fragmentation induced when you increase the volume of allocations, and hopefully this will show up in cache performance. Georges is actively working on measuring this.

The StringPrintf issues have more to do with the way the code is written rather than vsprintf. It doesn't appear that it's ability to return the required buffer size is actually being used, other than on the Windows platform. Maybe not all platforms are consistent with this? Either way, it might be worthwhile to look into optimizing base::StringAppendVT (the workhorse function for StringPrintf).

Overall, think it worthwhile to highlight the problem in the name of education; these are generally trivial to avoid when writing code, and a little extra thought at design time can significantly reduce the number of temporary allocations incurred.

--
--
Chromium Developers mailing list: chromi...@chromium.org
View archives, change email options, or unsubscribe:
http://groups.google.com/a/chromium.org/group/chromium-dev

Ryan Sleevi

unread,
Dec 1, 2014, 12:25:48 PM12/1/14
to Chris Hamilton, Ryan Sleevi, geor...@chromium.org, “chromium-dev@chromium.org”, tomh...@chromium.org
On Mon, Dec 1, 2014 at 5:09 PM, Chris Hamilton <chr...@chromium.org> wrote:
Hard to get meaningful numbers here (we're trying), but these changes do alleviate heap lock contention. It's worth noting that Chrome in general is *very* close to the threshold where heap lock contention causes noticeable UI lag. With ASAN instrumentation the additional bit of work that occurs in the allocator is enough to push it over that edge on some machines. So while we're not typically affected by this in a normal release of Chrome, instrumented versions definitely feel it. The amount of breathing room here is even less on underpowered machines (ie: mobile), so it is an issue worth taking note it.

Right, Tom Hudson reminded me that we aren't using tcmalloc on Android, but dlmalloc, so the insanely awesome 50ns malloc/free pairs (from http://goog-perftools.sourceforge.net/doc/tcmalloc.html ) that we get on desktop isn't true for mobile.

Note that you mention heap lock contention, but tcmalloc's use of thread-local structures is intentionally designed to avoid this. Are you talking in the general case when using instrumentation / WinHeap? Otherwise, I'm quite surprised we're even close at all.
 

Regardless of how good your allocator is there is going to be some amount of extra fragmentation induced when you increase the volume of allocations, and hopefully this will show up in cache performance. Georges is actively working on measuring this.

Well, in the general case, yeah, but as I was chatting with Tom, std::string() is somewhat of a special little butterfly. And I mean that it's allocation strategy is fairly fixed in both initial size (due to small string optimizations implemented by, at last check, all of our std::string() libraries) and growth strategies (typically 2x). So not only do these allocations fall rather nicely into the bucket sizes of most allocators, but their allocations/frees are highly reusuable among other strings, especially in these "large allocations of small strings" cases.

That's why I asked for some of the measurement, because of all the objects, std::string() is perhaps the nicest behaving with allocators, especially tcmalloc, so I'm surprised this would even come up, except under instrumentation (or, as pointed out, Android)
 

The StringPrintf issues have more to do with the way the code is written rather than vsprintf. It doesn't appear that it's ability to return the required buffer size is actually being used, other than on the Windows platform. Maybe not all platforms are consistent with this? Either way, it might be worthwhile to look into optimizing base::StringAppendVT (the workhorse function for StringPrintf).

Right, it sounds like you're saying StringPrintf vs SStringPrintf, thus forcing new strings when an existing string may be perfect (and an optimal size due to prior growth strategy)
 

Overall, think it worthwhile to highlight the problem in the name of education; these are generally trivial to avoid when writing code, and a little extra thought at design time can significantly reduce the number of temporary allocations incurred.

Violent agreement here that more education of common, easily fixed anti-patterns is a good thing. However, I'm not entirely convinced it's an "std::string emergency", and am nervous considering some of the trends I've seen on optimization that might tend to microoptimize past readability.

So we're 90% in agreement, 10% in concern, and 100% curious about the data behind this so we can make more informed recommendations/guidelines and justifiably weight some of the readability-vs-complexity discussions that these threads can spawn :)

Chris Hamilton

unread,
Dec 1, 2014, 1:16:23 PM12/1/14
to rsl...@chromium.org, geor...@chromium.org, “chromium-dev@chromium.org”, tomh...@chromium.org
On Mon Dec 01 2014 at 12:25:18 PM Ryan Sleevi <rsl...@chromium.org> wrote:
On Mon, Dec 1, 2014 at 5:09 PM, Chris Hamilton <chr...@chromium.org> wrote:
Hard to get meaningful numbers here (we're trying), but these changes do alleviate heap lock contention. It's worth noting that Chrome in general is *very* close to the threshold where heap lock contention causes noticeable UI lag. With ASAN instrumentation the additional bit of work that occurs in the allocator is enough to push it over that edge on some machines. So while we're not typically affected by this in a normal release of Chrome, instrumented versions definitely feel it. The amount of breathing room here is even less on underpowered machines (ie: mobile), so it is an issue worth taking note it.

Right, Tom Hudson reminded me that we aren't using tcmalloc on Android, but dlmalloc, so the insanely awesome 50ns malloc/free pairs (from http://goog-perftools.sourceforge.net/doc/tcmalloc.html ) that we get on desktop isn't true for mobile.

Note that you mention heap lock contention, but tcmalloc's use of thread-local structures is intentionally designed to avoid this. Are you talking in the general case when using instrumentation / WinHeap? Otherwise, I'm quite surprised we're even close at all.

Sorry, I should have pointed that out: we're stuck with using the WinHeap when instrumented. (This actually raises a question I don't know the answer to: when using tcmalloc, what happens to the process heap? Presumably it's still a WinHeap, or is it somehow redirected as well?)
 
 

Regardless of how good your allocator is there is going to be some amount of extra fragmentation induced when you increase the volume of allocations, and hopefully this will show up in cache performance. Georges is actively working on measuring this.

Well, in the general case, yeah, but as I was chatting with Tom, std::string() is somewhat of a special little butterfly. And I mean that it's allocation strategy is fairly fixed in both initial size (due to small string optimizations implemented by, at last check, all of our std::string() libraries) and growth strategies (typically 2x). So not only do these allocations fall rather nicely into the bucket sizes of most allocators, but their allocations/frees are highly reusuable among other strings, especially in these "large allocations of small strings" cases.

Note that this isn't really about strings being evil, or temporary allocations being bad. It's about being careful with interfaces, as there are cases where we swap from const std::string& to const char* and back, and this impedance mismatch is simply wasteful. The focus should be on using base::StringPiece where usable so as not to incur temporaries and copying at every second function call.
 
That's why I asked for some of the measurement, because of all the objects, std::string() is perhaps the nicest behaving with allocators, especially tcmalloc, so I'm surprised this would even come up, except under instrumentation (or, as pointed out, Android)

The measurement has only been in memory space right now. We were chasing down an apparent UI lag in the Omnibox when using ASAN instrumentation, and observed some excessive use of temporaries.
 
 

The StringPrintf issues have more to do with the way the code is written rather than vsprintf. It doesn't appear that it's ability to return the required buffer size is actually being used, other than on the Windows platform. Maybe not all platforms are consistent with this? Either way, it might be worthwhile to look into optimizing base::StringAppendVT (the workhorse function for StringPrintf).

Right, it sounds like you're saying StringPrintf vs SStringPrintf, thus forcing new strings when an existing string may be perfect (and an optimal size due to prior growth strategy)

I haven't looked at the implementation of SStringPrintf, but it sounds like it is better behaved. I agree with the general sentiment that explicit string appends are generally uglier from a readability standpoint (when compared to a sequence of append operations). Georges, maybe look into addressing the problem at the level of StringPrintf directly? Or switching to using SStringPrintf where appropriate.
 
 

Overall, think it worthwhile to highlight the problem in the name of education; these are generally trivial to avoid when writing code, and a little extra thought at design time can significantly reduce the number of temporary allocations incurred.

Violent agreement here that more education of common, easily fixed anti-patterns is a good thing. However, I'm not entirely convinced it's an "std::string emergency", and am nervous considering some of the trends I've seen on optimization that might tend to microoptimize past readability.

So we're 90% in agreement, 10% in concern, and 100% curious about the data behind this so we can make more informed recommendations/guidelines and justifiably weight some of the readability-vs-complexity discussions that these threads can spawn :)

We'll try to come back with some more data.

Cheers,

Chris

Erik Wright

unread,
Dec 1, 2014, 1:17:25 PM12/1/14
to chr...@chromium.org, Ryan Sleevi, geor...@chromium.org, “chromium-dev@chromium.org”
This is probably a naïve question, but presumably these std::string allocations entail not just some bookkeeping but also some copying of string data from one buffer to another? So the allocator optimizations Ryan mentioned would reduce one area for concern but not have any impact on the cost of shoveling data around... no?

Ryan Sleevi

unread,
Dec 1, 2014, 1:22:59 PM12/1/14
to Erik Wright, Chris Hamilton, Ryan Sleevi, geor...@chromium.org, “chromium-dev@chromium.org”
Correct. Not free (neither is 50ns a malloc/free pair any short of cheap), but also fairly inexpensive for "most" string usages ( https://software.intel.com/en-us/node/513420 ).

Generally doesn't show up anywhere remotely on a CPU profile until you're writing things like HTTP servers :)

Ryan Sleevi

unread,
Dec 1, 2014, 1:27:47 PM12/1/14
to Chris Hamilton, Ryan Sleevi, geor...@chromium.org, “chromium-dev@chromium.org”, tomh...@chromium.org
On Mon, Dec 1, 2014 at 6:15 PM, Chris Hamilton <chr...@chromium.org> wrote:

Sorry, I should have pointed that out: we're stuck with using the WinHeap when instrumented. (This actually raises a question I don't know the answer to: when using tcmalloc, what happens to the process heap? Presumably it's still a WinHeap, or is it somehow redirected as well?)
 

tcmalloc only interposes at the new/delete/malloc/free layer. You can still get to the WinHeap via the Win API entry points (HeapCreate and friends). https://code.google.com/p/chromium/codesearch#chromium/src/base/allocator/prep_libc.py&l=1 for the hijinks where we mess with the MSVCRT bits.

Chris Harrelson

unread,
Dec 1, 2014, 1:38:38 PM12/1/14
to rsl...@chromium.org, Chris Hamilton, geor...@chromium.org, “chromium-dev@chromium.org”, tomh...@chromium.org
Only partially related to this thread, but could someone comment on the performance of dlmalloc on Android? What should we expect from it? Are there improvements coming?

Chris

--
--
Chromium Developers mailing list: chromi...@chromium.org
View archives, change email options, or unsubscribe:
http://groups.google.com/a/chromium.org/group/chromium-dev

To unsubscribe from this group and stop receiving emails from it, send an email to chromium-dev...@chromium.org.

Scott Graham

unread,
Dec 1, 2014, 1:43:05 PM12/1/14
to chr...@chromium.org, Ryan Sleevi, geor...@chromium.org, “chromium-dev@chromium.org”, tomh...@chromium.org
On Mon, Dec 1, 2014 at 10:15 AM, Chris Hamilton <chr...@chromium.org> wrote:


On Mon Dec 01 2014 at 12:25:18 PM Ryan Sleevi <rsl...@chromium.org> wrote:
On Mon, Dec 1, 2014 at 5:09 PM, Chris Hamilton <chr...@chromium.org> wrote:
Hard to get meaningful numbers here (we're trying), but these changes do alleviate heap lock contention. It's worth noting that Chrome in general is *very* close to the threshold where heap lock contention causes noticeable UI lag. With ASAN instrumentation the additional bit of work that occurs in the allocator is enough to push it over that edge on some machines. So while we're not typically affected by this in a normal release of Chrome, instrumented versions definitely feel it. The amount of breathing room here is even less on underpowered machines (ie: mobile), so it is an issue worth taking note it.

Right, Tom Hudson reminded me that we aren't using tcmalloc on Android, but dlmalloc, so the insanely awesome 50ns malloc/free pairs (from http://goog-perftools.sourceforge.net/doc/tcmalloc.html ) that we get on desktop isn't true for mobile.

Note that you mention heap lock contention, but tcmalloc's use of thread-local structures is intentionally designed to avoid this. Are you talking in the general case when using instrumentation / WinHeap? Otherwise, I'm quite surprised we're even close at all.

Sorry, I should have pointed that out: we're stuck with using the WinHeap when instrumented. (This actually raises a question I don't know the answer to: when using tcmalloc, what happens to the process heap? Presumably it's still a WinHeap, or is it somehow redirected as well?)

We don't touch HeapAlloc/HeapCreate, etc. (and indeed we had some ridiculous number of heaps at last count). Only the CRT level functions are revectored (the stuff in https://code.google.com/p/chromium/codesearch#chromium/src/base/allocator/prep_libc.py&l=59 ).

However, because of tcmalloc's inability to release address space, cpu@ has been experimenting with not using tcmalloc on Windows, and at the moment at head, we do not. https://code.google.com/p/chromium/codesearch#chromium/src/build/common.gypi&l=1265


 
 
 

Regardless of how good your allocator is there is going to be some amount of extra fragmentation induced when you increase the volume of allocations, and hopefully this will show up in cache performance. Georges is actively working on measuring this.

Well, in the general case, yeah, but as I was chatting with Tom, std::string() is somewhat of a special little butterfly. And I mean that it's allocation strategy is fairly fixed in both initial size (due to small string optimizations implemented by, at last check, all of our std::string() libraries) and growth strategies (typically 2x). So not only do these allocations fall rather nicely into the bucket sizes of most allocators, but their allocations/frees are highly reusuable among other strings, especially in these "large allocations of small strings" cases.

Note that this isn't really about strings being evil, or temporary allocations being bad. It's about being careful with interfaces, as there are cases where we swap from const std::string& to const char* and back, and this impedance mismatch is simply wasteful. The focus should be on using base::StringPiece where usable so as not to incur temporaries and copying at every second function call.
 
That's why I asked for some of the measurement, because of all the objects, std::string() is perhaps the nicest behaving with allocators, especially tcmalloc, so I'm surprised this would even come up, except under instrumentation (or, as pointed out, Android)

The measurement has only been in memory space right now. We were chasing down an apparent UI lag in the Omnibox when using ASAN instrumentation, and observed some excessive use of temporaries.
 
 

The StringPrintf issues have more to do with the way the code is written rather than vsprintf. It doesn't appear that it's ability to return the required buffer size is actually being used, other than on the Windows platform. Maybe not all platforms are consistent with this? Either way, it might be worthwhile to look into optimizing base::StringAppendVT (the workhorse function for StringPrintf).

Right, it sounds like you're saying StringPrintf vs SStringPrintf, thus forcing new strings when an existing string may be perfect (and an optimal size due to prior growth strategy)

I haven't looked at the implementation of SStringPrintf, but it sounds like it is better behaved. I agree with the general sentiment that explicit string appends are generally uglier from a readability standpoint (when compared to a sequence of append operations). Georges, maybe look into addressing the problem at the level of StringPrintf directly? Or switching to using SStringPrintf where appropriate.
 
 

Overall, think it worthwhile to highlight the problem in the name of education; these are generally trivial to avoid when writing code, and a little extra thought at design time can significantly reduce the number of temporary allocations incurred.

Violent agreement here that more education of common, easily fixed anti-patterns is a good thing. However, I'm not entirely convinced it's an "std::string emergency", and am nervous considering some of the trends I've seen on optimization that might tend to microoptimize past readability.

So we're 90% in agreement, 10% in concern, and 100% curious about the data behind this so we can make more informed recommendations/guidelines and justifiably weight some of the readability-vs-complexity discussions that these threads can spawn :)

We'll try to come back with some more data.

Cheers,

Chris

--

Tom Hudson

unread,
Dec 1, 2014, 1:46:07 PM12/1/14
to Chris Harrelson, Ryan Sleevi, Chris Hamilton, geor...@chromium.org, “chromium-dev@chromium.org”, Primiano Tucci
On Mon, Dec 1, 2014 at 1:37 PM, Chris Harrelson <chri...@chromium.org> wrote:
Only partially related to this thread, but could someone comment on the performance of dlmalloc on Android? What should we expect from it? Are there improvements coming?


As far as I know, we used dlmalloc() months/years ago, but we're using jemalloc() and PartitionAlloc on Android today.

From a recent android-native@ thread: apparently with L, OEMs have a choice, and low-end devices are using dlmalloc() while high-end devices are using jemalloc()? libc on older devices will still use dlmalloc, which is explicitly not intended for use in multi-threaded applications.

primiano@ is the Chrome Android expert on a lot of this stuff in London; I don't know who his MTV counterpart is.

PhistucK

unread,
Dec 1, 2014, 1:51:56 PM12/1/14
to chr...@chromium.org, Ryan Sleevi, geor...@chromium.org, “chromium-dev@chromium.org”, tomh...@chromium.org

On Mon, Dec 1, 2014 at 8:15 PM, Chris Hamilton <chr...@chromium.org> wrote:
The measurement has only been in memory space right now. We were chasing down an apparent UI lag in the Omnibox when using ASAN instrumentation, and observed some excessive use of temporaries.

​On my (probably 'underpowered') machine, there is serious lag when typing in the omnibox, especially the 'Control + T and type' scenario. The entire browser often hangs for a few seconds. Is that, by any chance, related to heap contention and whatever else this thread is discussing?



PhistucK

Peter Kasting

unread,
Dec 1, 2014, 6:17:17 PM12/1/14
to Chris Hamilton, Ryan Sleevi, geor...@chromium.org, “chromium-dev@chromium.org”, tomh...@chromium.org
On Mon, Dec 1, 2014 at 10:15 AM, Chris Hamilton <chr...@chromium.org> wrote:
Note that this isn't really about strings being evil, or temporary allocations being bad. It's about being careful with interfaces, as there are cases where we swap from const std::string& to const char* and back, and this impedance mismatch is simply wasteful. The focus should be on using base::StringPiece where usable so as not to incur temporaries and copying at every second function call.

I would say that in general the lesson here should be "avoid using char* (or its array form) except when defining compile-time constants".  Explicitly telling people to use StringPiece is IMO not as important as telling them _not_ to use char*.  Especially as our toolchains support C++11 increasingly, doing things like just passing around std::strings ought to become "good enough" compared to StringPiece, which our codebase uses rarely enough that it's not particularly idiomatic.

I'm very happy about your change to make the prefs APIs take strings instead of char*.  Using char* never made sense to me and uglified caller code.

PK

Peter Kasting

unread,
Dec 1, 2014, 6:18:40 PM12/1/14
to PhistucK Productions, Chris Hamilton, Ryan Sleevi, geor...@chromium.org, “chromium-dev@chromium.org”, tomh...@chromium.org
Platform and version?  We were seeing a ton of omnibox slowdown on Linux boxes until recently due to not using harfbuzz (IIRC).

PK 

James Cook

unread,
Dec 1, 2014, 6:23:46 PM12/1/14
to Peter Kasting, PhistucK Productions, Chris Hamilton, Ryan Sleevi, geor...@chromium.org, “chromium-dev@chromium.org”, tomh...@chromium.org
Chromebook Pixel 40.0.2194.3 (dev channel) is pretty janky for me.


--
--
Chromium Developers mailing list: chromi...@chromium.org
View archives, change email options, or unsubscribe:
http://groups.google.com/a/chromium.org/group/chromium-dev

To unsubscribe from this group and stop receiving emails from it, send an email to chromium-dev...@chromium.org.

Peter Kasting

unread,
Dec 1, 2014, 6:43:33 PM12/1/14
to James Cook, PhistucK Productions, Chris Hamilton, Ryan Sleevi, geor...@chromium.org, “chromium-dev@chromium.org”, tomh...@chromium.org
On Mon, Dec 1, 2014 at 3:23 PM, James Cook <jame...@chromium.org> wrote:
Chromebook Pixel 40.0.2194.3 (dev channel) is pretty janky for me.

OmahaProxy reports that has a base position of 300226.  Enabling RenderTextHarfBuzz on CrOS landed as 300348.

I believe you can try using about:flags to flip "HarfBuzz for UI text" to "Enabled".  If that doesn't help, file a bug about the particular jank you're seeing and if it involves particular omnibox inputs we can ask you for the relevant histogram data.

PK

Eric Seidel

unread,
Dec 1, 2014, 10:19:15 PM12/1/14
to Peter Kasting, James Cook, PhistucK Productions, Chris Hamilton, Ryan Sleevi, geor...@chromium.org, “chromium-dev@chromium.org”, tomh...@chromium.org
Blink has historically had spent a lot of effort optimizing string
allocation (and retention). Blink however uses immutable,
reference-counted strings as well as atomized strings (as well as
fancy compile-time allocated static atomized strings). We have a
std::string<char>-like owned-buffer class called CString, but it's
used little.

http://blink.lc/blink/tree/Source/wtf/text/WTFString.h //
RefPtr<WTF::String> + helpers
http://blink.lc/blink/tree/Source/wtf/text/StringImpl.h // includes
the StaticStringTable.
http://blink.lc/blink/tree/Source/wtf/text/AtomicString.h // Atomized
StringImpl subclass.
http://blink.lc/blink/tree/Source/wtf/text/CString.h // Vector<char> + helpers

Eventually Blink's WTF is likely to merge with Chromium's base library
in some form as Blink slowly merges with Chromium. One of the
problems to solve there will be about merging some of the string fancy
from Blink into base (since Blink is highly tuned with the assumption
of usage of these string types).

It's possible that some of this string fancy may be useful to the rest
of Chromium. But honestly I don't know enough about what string
manipulation/retention patterns are common in Chromium, or whether
ref-counted strings like Blink would be useful there. I'm also aware
that threads are much more common in Chromium than Blink and thus the
copying-nature of std::string is possibly more desirable.

Just throwing that out there as context to consider as these
string-usage investigations continue. :)

Viet-Trung Luu

unread,
Dec 1, 2014, 10:53:58 PM12/1/14
to Eric Seidel, Peter Kasting, James Cook, PhistucK Productions, Chris Hamilton, Ryan Sleevi, geor...@chromium.org, “chromium-dev@chromium.org”, tomh...@chromium.org
Chromium doesn't do much that's sophisticated in terms of strings.

As Chris mentioned, passing base::StringPiece (or, as Chromium prefers, const base::StringPiece&) everywhere instead of const std::string& would be great. Encoding issues (e.g., UTF-8 vs -16) aside, every string representation is cheaply convertible to a StringPiece (which is just a pointer and a length). So (assuming no encoding problems), functions taking StringPiece would work well with Blink-ish string types.

Unfortunately, one function that demands a const std::string& that you need to call poisons the well. E.g., if your function looks like:

void Foo(const base::StringPiece& s) {
  A(s);
  B(s);
  C(s);
  ...
}

If A, B, C, ... all take a StringPiece, everything is great. But if only a single one of them demands an const std::string&, you're better off taking a const std::string& parameter yourself. (If s came from an std::string, then you're strictly better off. If not, and it came from, say, a const char*, you're still no worse off.)

And, as Peter says, Chromium tends strongly towards just passing const std::string& (probably at least partly due to the above reason). :(
 

On Mon, Dec 1, 2014 at 3:43 PM, Peter Kasting <pkas...@chromium.org> wrote:
> On Mon, Dec 1, 2014 at 3:23 PM, James Cook <jame...@chromium.org> wrote:
>>
>> Chromebook Pixel 40.0.2194.3 (dev channel) is pretty janky for me.
>
>
> OmahaProxy reports that has a base position of 300226.  Enabling
> RenderTextHarfBuzz on CrOS landed as 300348.
>
> I believe you can try using about:flags to flip "HarfBuzz for UI text" to
> "Enabled".  If that doesn't help, file a bug about the particular jank
> you're seeing and if it involves particular omnibox inputs we can ask you
> for the relevant histogram data.
>
> PK
>
> --
> --
> Chromium Developers mailing list: chromi...@chromium.org
> View archives, change email options, or unsubscribe:
> http://groups.google.com/a/chromium.org/group/chromium-dev

--
--
Chromium Developers mailing list: chromi...@chromium.org
View archives, change email options, or unsubscribe:
    http://groups.google.com/a/chromium.org/group/chromium-dev

Brett Wilson

unread,
Dec 2, 2014, 12:17:58 AM12/2/14
to Viet-Trung Luu, Eric Seidel, Peter Kasting, James Cook, PhistucK Productions, Chris Hamilton, Ryan Sleevi, geor...@chromium.org, “chromium-dev@chromium.org”, tomh...@chromium.org
In my experience converting stuff to use StringPiece, I've found the
above limitation to be very limiting. Even for some leaf-node-seeming
functions in base, it's not uncommon for some special case to call
into some different code that takes a std::string.

The result can easily be much less optimal code than you started with.
And even if your code is perfectly StringPiece consistent now,
somebody may need to add a call to some random Android API later and
will ruin the whole thing.

There are some specialized workloads, especially when dealing with
parsing like Blink and GN do, where it makes sense to make your
subsystem consistently use StringPiece. There are also some *really*
low level functions like UTF8ToUTF16 where StringPiece input makes
perfect sense. But in most of the rest of the cases, optimizing
strings adds bugs, hurts readability, and doesn't make your code
faster anyway. Instead, focus on making your overall algorithm avoid
lots of unnecessary allocations rather than micro-optimizing strings.

Brett

PhistucK

unread,
Dec 2, 2014, 2:32:15 AM12/2/14
to Peter Kasting, Chris Hamilton, Ryan Sleevi, geor...@chromium.org, “chromium-dev@chromium.org”, tomh...@chromium.org
Windows (whatever version), since almost ever. :(


PhistucK

Chris Hamilton

unread,
Dec 2, 2014, 9:46:49 AM12/2/14
to Brett Wilson, Viet-Trung Luu, Eric Seidel, Peter Kasting, James Cook, PhistucK Productions, Ryan Sleevi, geor...@chromium.org, “chromium-dev@chromium.org”, tomh...@chromium.org
Yeah, StringPiece only really works when it's StringPiece *all* the way down, as you point out. And in many situations there is code at the leaves that we don't control, so for any given tree of function calls you're better to pay the conversion once at the root then multiple times at the leaves.

I also agree that micro-optimization is unnecessary, and that effort could be better spent elsewhere.

This discussion was motivated by a simple accidental observation of some very wasteful usage of std::string. The few changes suggested by Georges cut allocation volume by about 10% in the Chrome browser process. The impact of these tiny changes was such that we felt a discussion would be worthwhile, to at least bring the issue into the minds of devs. I don't think we need to spend a lot of effort micro optimizing the existing code base, but a little clean up is justified.

Given the various points brought up in the thread I think the takeaways are something like the following (please feel free to agree, modify or vehemently disagree):

- Avoid using const char* in interfaces
- If you control the whole tree of functions below you, or are a leaf, then lean towards const StringPiece&. Otherwise, lean towards const std::string&.
- Think about (temporary) allocation volume when writing code, but don't obsess about it.

I also think that it's worth following up looking into replacing the StringPrintf-like functions with a Google3-like templated StrCat. The efficiency gains from doing this by hand in a few locations[1] are sufficient to justify the experiment, at least. Expect some follow-up here.

[1] The resulting code is currently far less readable (and won't land as is) as it consists of a chain of std::string::append calls rather than a single easy-to-read StringPrintf. StrCat is equally readable, and essentially expands to the series of manual appends.

Alex Vakulenko

unread,
Dec 2, 2014, 10:38:35 AM12/2/14
to chr...@chromium.org, Brett Wilson, Viet-Trung Luu, Eric Seidel, Peter Kasting, James Cook, PhistucK Productions, Ryan Sleevi, geor...@chromium.org, “chromium-dev@chromium.org”, tomh...@chromium.org
- If you control the whole tree of functions below you, or are a leaf, then lean towards const StringPiece&. Otherwise, lean towards const std::string&.

Should we lean towards StringPiece and not const StringPiece& instead? I've read numerous reports from C++ library team recommending to pass StringPiece by value instead by reference.
 

James Cook

unread,
Dec 2, 2014, 11:33:26 AM12/2/14
to Alex Vakulenko, Chris Hamilton, Brett Wilson, Viet-Trung Luu, Eric Seidel, Peter Kasting, PhistucK Productions, Ryan Sleevi, geor...@chromium.org, “chromium-dev@chromium.org”, tomh...@chromium.org
FYI - the omnibox jank I was seeing is gone in 41.0.2220.0 dev channel.

Anecdotes are not data, but here's an anecdote about vector::reserve(). I went to a talk at GDC back in the original Xbox days. Microsoft maintained an internal lab where they would bring in developers of high profile games and help them optimize their source. Not interesting.  But they also maintained a database of every individual optimization they implemented, grouped them by class and collected statistics on impact.

Many games build lists of things during their main loop. One of the largest impact changes was making sure all these vectors called std::vector::reserve() with a reasonable number.

Caveats: I have no real data. I have no idea what allocator they were using.  But if you have a guess at the size of your vector, a call to reserve() probably doesn't hurt.

+1 to const std::string& over char*. :-)

Kudos to the Syzygy team for noticing this and bringing it up.

James

Carlos Pizano

unread,
Dec 2, 2014, 1:52:17 PM12/2/14
to chromi...@chromium.org, chr...@chromium.org, rsl...@chromium.org, geor...@chromium.org, tomh...@chromium.org


On Monday, December 1, 2014 10:43:05 AM UTC-8, Scott Graham wrote:


On Mon, Dec 1, 2014 at 10:15 AM, Chris Hamilton <chr...@chromium.org> wrote:


On Mon Dec 01 2014 at 12:25:18 PM Ryan Sleevi <rsl...@chromium.org> wrote:
On Mon, Dec 1, 2014 at 5:09 PM, Chris Hamilton <chr...@chromium.org> wrote:
Hard to get meaningful numbers here (we're trying), but these changes do alleviate heap lock contention. It's worth noting that Chrome in general is *very* close to the threshold where heap lock contention causes noticeable UI lag. With ASAN instrumentation the additional bit of work that occurs in the allocator is enough to push it over that edge on some machines. So while we're not typically affected by this in a normal release of Chrome, instrumented versions definitely feel it. The amount of breathing room here is even less on underpowered machines (ie: mobile), so it is an issue worth taking note it.

Right, Tom Hudson reminded me that we aren't using tcmalloc on Android, but dlmalloc, so the insanely awesome 50ns malloc/free pairs (from http://goog-perftools.sourceforge.net/doc/tcmalloc.html ) that we get on desktop isn't true for mobile.

Note that you mention heap lock contention, but tcmalloc's use of thread-local structures is intentionally designed to avoid this. Are you talking in the general case when using instrumentation / WinHeap? Otherwise, I'm quite surprised we're even close at all.

Sorry, I should have pointed that out: we're stuck with using the WinHeap when instrumented. (This actually raises a question I don't know the answer to: when using tcmalloc, what happens to the process heap? Presumably it's still a WinHeap, or is it somehow redirected as well?)

We don't touch HeapAlloc/HeapCreate, etc. (and indeed we had some ridiculous number of heaps at last count). Only the CRT level functions are revectored (the stuff in https://code.google.com/p/chromium/codesearch#chromium/src/base/allocator/prep_libc.py&l=59 ).

However, because of tcmalloc's inability to release address space, cpu@ has been experimenting with not using tcmalloc on Windows, and at the moment at head, we do not. https://code.google.com/p/chromium/codesearch#chromium/src/build/common.gypi&l=1265



It is still a bit early to tell, but the switch to winheap had a small impact in perf, almost exclusively in windows xp [1] doing svg, and guess what, some very nice speedups, see the bug for details. 

Could be that either winheap is not as bad as this thread would lead you to believe or tcmalloc is not as awesome as this thread will led you to believe. Maybe I get to do a Myth busters episode or maybe I get to eat my words.

--cpu


[1] xp heaps are not LFH by default.

Chris Hamilton

unread,
Dec 2, 2014, 3:32:44 PM12/2/14
to Carlos Pizano, chromi...@chromium.org, rsl...@chromium.org, geor...@chromium.org, tomh...@chromium.org
On Tue Dec 02 2014 at 1:52:20 PM Carlos Pizano <c...@chromium.org> wrote:
It is still a bit early to tell, but the switch to winheap had a small impact in perf, almost exclusively in windows xp [1] doing svg, and guess what, some very nice speedups, see the bug for details. 

Could be that either winheap is not as bad as this thread would lead you to believe or tcmalloc is not as awesome as this thread will led you to believe. Maybe I get to do a Myth busters episode or maybe I get to eat my words.


The WinHeap got seriously better sometime around Win7, and we (anecdotally) stopped noticing a difference when enabling it in Chrome.

My point wasn't specifically about any heap implementation being bad or good. My point was simply that so many allocations are performed in some spots of Chrome, that HeapAlloc/HeapFree new/delete malloc/free aren't that far from being a bottleneck. Slow them down by an order of magnitude (ie: when using ASAN instrumentation) and jank starts occurring in those busy spots.

Paweł Hajdan, Jr.

unread,
Dec 3, 2014, 5:01:45 AM12/3/14
to Chris Hamilton, Justin Schuh, Carlos Pizano, chromium-dev, Ryan Sleevi, geor...@chromium.org, tomh...@chromium.org
+jschuh

Don't we have some important security enhancements in Chromium tcmalloc?

Paweł

--

Justin Schuh

unread,
Dec 3, 2014, 12:37:27 PM12/3/14
to Paweł Hajdan, Jr., Chris Hamilton, Carlos Pizano, chromium-dev, Ryan Sleevi, geor...@chromium.org, tomh...@chromium.org
On Wed, Dec 3, 2014 at 2:00 AM, Paweł Hajdan, Jr. <phajd...@chromium.org> wrote:
+jschuh

Don't we have some important security enhancements in Chromium tcmalloc?

We have in our own version of tcmalloc, but WinHeap is actually very well hardened from a security perspective. So, I'd consider the security impact to be neutral at worst.

Chris Evans

unread,
Dec 3, 2014, 2:33:07 PM12/3/14
to Carlos Pizano, chromium-dev, chr...@chromium.org, Ryan Sleevi, geor...@chromium.org, tomh...@chromium.org
On Tue, Dec 2, 2014 at 10:52 AM, Carlos Pizano <c...@chromium.org> wrote:


On Monday, December 1, 2014 10:43:05 AM UTC-8, Scott Graham wrote:


On Mon, Dec 1, 2014 at 10:15 AM, Chris Hamilton <chr...@chromium.org> wrote:


On Mon Dec 01 2014 at 12:25:18 PM Ryan Sleevi <rsl...@chromium.org> wrote:
On Mon, Dec 1, 2014 at 5:09 PM, Chris Hamilton <chr...@chromium.org> wrote:
Hard to get meaningful numbers here (we're trying), but these changes do alleviate heap lock contention. It's worth noting that Chrome in general is *very* close to the threshold where heap lock contention causes noticeable UI lag. With ASAN instrumentation the additional bit of work that occurs in the allocator is enough to push it over that edge on some machines. So while we're not typically affected by this in a normal release of Chrome, instrumented versions definitely feel it. The amount of breathing room here is even less on underpowered machines (ie: mobile), so it is an issue worth taking note it.

Right, Tom Hudson reminded me that we aren't using tcmalloc on Android, but dlmalloc, so the insanely awesome 50ns malloc/free pairs (from http://goog-perftools.sourceforge.net/doc/tcmalloc.html ) that we get on desktop isn't true for mobile.

Note that you mention heap lock contention, but tcmalloc's use of thread-local structures is intentionally designed to avoid this. Are you talking in the general case when using instrumentation / WinHeap? Otherwise, I'm quite surprised we're even close at all.

Sorry, I should have pointed that out: we're stuck with using the WinHeap when instrumented. (This actually raises a question I don't know the answer to: when using tcmalloc, what happens to the process heap? Presumably it's still a WinHeap, or is it somehow redirected as well?)

We don't touch HeapAlloc/HeapCreate, etc. (and indeed we had some ridiculous number of heaps at last count). Only the CRT level functions are revectored (the stuff in https://code.google.com/p/chromium/codesearch#chromium/src/base/allocator/prep_libc.py&l=59 ).

However, because of tcmalloc's inability to release address space, cpu@ has been experimenting with not using tcmalloc on Windows, and at the moment at head, we do not. https://code.google.com/p/chromium/codesearch#chromium/src/build/common.gypi&l=1265



It is still a bit early to tell, but the switch to winheap had a small impact in perf, almost exclusively in windows xp [1] doing svg, and guess what, some very nice speedups, see the bug for details. 

Could be that either winheap is not as bad as this thread would lead you to believe or tcmalloc is not as awesome as this thread will led you to believe. Maybe I get to do a Myth busters episode or maybe I get to eat my words.

I think tcmalloc isn't that awesome. Or more specifically, it probably is awesome for massively multithreaded server loads where there's lots of memory to spare. This is not how we need an allocator optimized for Chrome.

Switching parts of Blink from tcmalloc to PartitionAlloc led to speedups and memory savings.

We were thinking of switching the browser process away from tcmalloc to PartitionAlloc, which is tracked (and stalled) at https://code.google.com/p/chromium/issues/detail?id=339604. On Linux x64, we saw 30% memory savings for some cases.

Linking back to the std::string discussion, the comment in that bug at https://code.google.com/p/chromium/issues/detail?id=339604#c66 is interesting -- it's a histogram of allocation sizes, and we do seem to bias towards very small sizes. Perhaps that hits a pathological case in tcmalloc.


Cheers
Chris

Peter Kasting

unread,
Dec 3, 2014, 2:48:29 PM12/3/14
to Chris Evans, Carlos Pizano, chromium-dev, Chris Hamilton, Ryan Sleevi, geor...@chromium.org, tomh...@chromium.org
On Wed, Dec 3, 2014 at 11:31 AM, Chris Evans <cev...@chromium.org> wrote:
On Tue, Dec 2, 2014 at 10:52 AM, Carlos Pizano <c...@chromium.org> wrote:
It is still a bit early to tell, but the switch to winheap had a small impact in perf, almost exclusively in windows xp [1] doing svg, and guess what, some very nice speedups, see the bug for details. 

Could be that either winheap is not as bad as this thread would lead you to believe or tcmalloc is not as awesome as this thread will led you to believe. Maybe I get to do a Myth busters episode or maybe I get to eat my words.

I think tcmalloc isn't that awesome. Or more specifically, it probably is awesome for massively multithreaded server loads where there's lots of memory to spare. This is not how we need an allocator optimized for Chrome.

Switching parts of Blink from tcmalloc to PartitionAlloc led to speedups and memory savings.

We were thinking of switching the browser process away from tcmalloc to PartitionAlloc, which is tracked (and stalled) at https://code.google.com/p/chromium/issues/detail?id=339604. On Linux x64, we saw 30% memory savings for some cases.

Linking back to the std::string discussion, the comment in that bug at https://code.google.com/p/chromium/issues/detail?id=339604#c66 is interesting -- it's a histogram of allocation sizes, and we do seem to bias towards very small sizes. Perhaps that hits a pathological case in tcmalloc.

It would be interesting to see on Windows how this allocator compares to the Windows native allocator.  This is probably due to ignorance, but I feel like we should be biased towards using platform native allocators unless our allocators are provably better.  Don't ask me to rigorously defend that assumption :)

Randomly, I also note:
* Firefox uses jemalloc.  Have we investigated this?  I heard anecdotally we may not be interested in it.
* Mozilla did some interesting work recently to implement a library that can log all allocation requests and replay them, for better testing of allocators and allocator strategies.  Perhaps we can use their code?  See http://glandium.org/blog/?p=3337 for some more info.

PK

Chris Hamilton

unread,
Dec 3, 2014, 3:14:47 PM12/3/14
to Chris Evans, Carlos Pizano, chromium-dev, Ryan Sleevi, geor...@chromium.org, tomh...@chromium.org
Bit of an aside:

SyzyASAN can optionally use ctmalloc or the windows heap, so we could actually use it as a platform to evaluate the use of ctmalloc in Chrome. With all of the 'ASAN' features turned off the instrumentation amounts to nothing more than hijacking the heap functions and pointing them at custom ones...

It also can be used for doing all of the memory profiling you can imagine, including recording and playing back 'traces' of heap use for benchmarking (which we use internally for profiling and benchmarking the SyzyASAN RTL).

Chris Hamilton

unread,
Dec 3, 2014, 3:16:23 PM12/3/14
to Peter Kasting, Chris Evans, Carlos Pizano, chromium-dev, Ryan Sleevi, geor...@chromium.org, tomh...@chromium.org
Syzygy can be used to do this and works out of the box with Chrome. (The replay code isn't all landed yet, but is being used internally for benchmarking the SyzyASAN RTL Should all be ready for consumption in Q1.)

Primiano Tucci

unread,
Dec 4, 2014, 5:54:43 AM12/4/14
to Peter Kasting, Chris Evans, Carlos Pizano, chromium-dev, Chris Hamilton, Ryan Sleevi, geor...@chromium.org, Tom Hudson
> Firefox uses jemalloc.  Have we investigated this?  I heard anecdotally we may not be interested in it.

Android had historically the option to switch (at build time) the allocator between dlmalloc and jemalloc in both bionic [1] (the libc-equivalent) and art [2] (the java runtime).
The default option in both cases is jemalloc but most of the board configs (the device specific overrides) set that to dlmalloc (e.g., Nexus 4 [3]. So the effective default for most devices (at least Nexus-es) is dlmalloc.

In the Android world, there are active investigations in these days on jemalloc [4]. So far, all the measurements seem to suggest that jemalloc ends up using much more memory than dlmalloc (up to +150M system-wide) when used out of the box. However, there seem to be active work aimed at tuning and improving jemalloc (e.g. [5] sent out 16 hours ago).

As regards Chrome for Android the situation, IIRC, is the following: the default Chrome allocator == default Android allocator (so jemalloc or dlmalloc depending on the board config). However, we have a number of specializations in our codebase, as it has been mentioned in this thread  (e.g., Blink FastMalloc which proxies on  part alloc).
The rationale is that we trust Android doing the right choice for apps the various target devices (low end vs super high specs). IMHO if and when the dlmalloc -> jemalloc switch will happen in Android we'll need to carry our (Chrome) investigations to check the impact of that on Chrome.


[4] http://paste/6288123575140352 (internal only, sorry)

--

Peter Kasting

unread,
Dec 4, 2014, 5:38:28 PM12/4/14
to Primiano Tucci, Chris Evans, Carlos Pizano, chromium-dev, Chris Hamilton, Ryan Sleevi, geor...@chromium.org, Tom Hudson
On Thu, Dec 4, 2014 at 2:54 AM, Primiano Tucci <prim...@chromium.org> wrote:
As regards Chrome for Android the situation, IIRC, is the following: the default Chrome allocator == default Android allocator (so jemalloc or dlmalloc depending on the board config). However, we have a number of specializations in our codebase, as it has been mentioned in this thread  (e.g., Blink FastMalloc which proxies on  part alloc).
The rationale is that we trust Android doing the right choice for apps the various target devices (low end vs super high specs). IMHO if and when the dlmalloc -> jemalloc switch will happen in Android we'll need to carry our (Chrome) investigations to check the impact of that on Chrome.

Thanks for this detail.  This sort of dovetails with my general impression that we ought to default to using system allocators in most places.

I'm probably more interested in the desktop situation where we've been using tcmalloc already; I don't know if anyone knows the story on jemalloc there.

PK

Nico Weber

unread,
Dec 4, 2014, 5:44:50 PM12/4/14
to Peter Kasting, Primiano Tucci, Chris Evans, Carlos Pizano, chromium-dev, Chris Hamilton, Ryan Sleevi, geor...@chromium.org, Tom Hudson
I believe mbelshe played with it on either Windows or Linux long ago. It did worse that tcmalloc back then, but jemalloc has improved since and we haven't tried again. The base/allocator library still has most of the plumbing in place to support allocator switching, and you can probably recover the jemalloc code from history (if my memory is correct and it was in fact there) if you want to give it another try.

Xavier Bigand

unread,
Dec 18, 2014, 4:00:39 PM12/18/14
to chromi...@chromium.org, rsl...@chromium.org, geor...@chromium.org


Le lundi 1 décembre 2014 18:09:41 UTC+1, Chris Hamilton a écrit :
Hard to get meaningful numbers here (we're trying), but these changes do alleviate heap lock contention. It's worth noting that Chrome in general is *very* close to the threshold where heap lock contention causes noticeable UI lag. With ASAN instrumentation the additional bit of work that occurs in the allocator is enough to push it over that edge on some machines. So while we're not typically affected by this in a normal release of Chrome, instrumented versions definitely feel it. The amount of breathing room here is even less on underpowered machines (ie: mobile), so it is an issue worth taking note it.

Regardless of how good your allocator is there is going to be some amount of extra fragmentation induced when you increase the volume of allocations, and hopefully this will show up in cache performance. Georges is actively working on measuring this.

The StringPrintf issues have more to do with the way the code is written rather than vsprintf. It doesn't appear that it's ability to return the required buffer size is actually being used, other than on the Windows platform. Maybe not all platforms are consistent with this? Either way, it might be worthwhile to look into optimizing base::StringAppendVT (the workhorse function for StringPrintf).



I don't know if it can help you, but I know that vsnprintf can return the correct size required for the buffer on all platforms. We use it successfully on Android (gcc), Windows (mingw, cl) and MacOSX (clang, gcc) to format our std::string.
Here is our code (static function) :
std::string	StringUtilities::sprintf(const char* format, ...)
{
    va_list     vlist;
    int         length = 0;
    std::string text;
    char*       buffer = nullptr;

    va_start(vlist, format);
    length = vsnprintf(buffer, 0, format, vlist);
    if (length > 0)
    {
        buffer = (char*)malloc(length + 1);
        length = vsnprintf(buffer, length + 1, format, vlist);
        buffer[length] = 0;
    }
    va_end(vlist);
    text = buffer;        // It can be possible to avoid this copy
    free(buffer);
    return text;
}


Please notice, there is an issue with it :
 - We don't fill the resulting std::string directly, so there is a copy can be avoid here. I didn't found (and searched) a way to construct a std::string with a manually allocated buffer, the std::string API don't allow to set his internal buffer size (and capacity). Else it would be possible to preallocate the std::string buffer and give it to vsnprintf instead of using an temporary buffer. (string::resize can be used, but it will initialize buffer with a memset I suppose).
The way of doing the copy also require a strlen call, can be avoided too by using the buffer constructor.

 
Overall, think it worthwhile to highlight the problem in the name of education; these are generally trivial to avoid when writing code, and a little extra thought at design time can significantly reduce the number of temporary allocations incurred.

On Mon Dec 01 2014 at 11:24:14 AM Ryan Sleevi <rsl...@chromium.org> wrote:
On Mon, Dec 1, 2014 at 4:10 PM, Georges Khalil <geor...@chromium.org> wrote:

TL;DR: std::string is responsible for almost half of all allocations in the Chrome browser process; please be careful how you use it!



In the course of optimizing SyzyASan performance, the Syzygy team discovered that nearly 25000 (!!) allocations are made for every keystroke in the Omnibox. We've since built some rudimentary memory profiling tools and have found a few issues:


  • strings being passed as char* (using c_str) and then converted back to string

  • Using wrappers to StringAppendV, such as StringPrintf, to concatenate strings or for simple number conversions (see note below)

  • Using a temporary set to search for a specific value, which creates O(n) allocations and runs in O(n.logn) time, only to call find on it to return true/false if the value is present. Ouch!

  • Not reserving space in a vector when the size is known


We have put together the following CLs which reduce the number of temporary allocations: 753603002, 747013003 & 745373002. The impact seen by applying these patches is a decrease of around 10% of the total number of allocations made by the browser process (~2/3rd of those are strings), all of which correspond to temporary allocations (a new followed by a delete). We are looking at ways of getting better metrics, including gains on the cpu time. If anyone has a good idea of how we can better benchmark the impact of those CLs, we would love your input.


Also note that these optimizations do not reduce the memory footprint of Chrome; they reduce temporary allocations and increase performance. We are also looking into the problem of total memory footprint.


Following these first tests, here are some observations that we noted:

  • StringPrintF and its variants make a lot of temporary allocations and are not a good use for string concatenation. We ran a test pitting StringPrintf against append for string concatenation and found the latter 2.5x to 3x faster. This is not surprising when analyzing base/strings/stringprintf.cc

  • A lot of times, a std::string is being passed as a const char* (using c_str()), only to be converted back to std::string. Using base::StringPiece or passing a const ref to a std::string all the way is much more efficient

  • When dealing with a vector/string, memory reservation is not always used, which increases the number of allocations during insertions


Note: StringAppendVT, which is used by StringPrintf and other variants, is for various reasons, quite complex code. Many temporaries are created and much copying performed. We're looking into ways to optimize this much used function.


Do you have numbers quantifying the allocation cost with other metrics and resources, such as CPU time, cache misses, memory usage, etc?

I ask because much of the focus in the past years has been on improving allocator performance such that CPU time, cache misses, and memory uses are significantly negligible in the common case.

For profiling tools, this certainly gives them a hard time, and when using things like the WinHeap, this can be a downright awful time. But that's largely irrelevant for our users, and that's where we should be spending resources on.

That said, your comments about StringPrintF surprise me, as generally vnsprintf and friends are more favourable to string usage (because they can pre-scan to find the desired length prior to the actual conversion, thus reducing the growth factor typically employed by .append()). Is it a fundamental issue in StringPrintF, or is it more of a usage anti-pattern compared to SStringPrintF (which can use the supplied string, thus its already-allocated and grown value)?

While there is some good practices involved here - e.g. not using set<> for temporaries (especially large ones) and greater use of base::StringPiece - I'd feel comfortable knowing that measurable impact this has in the "real world", given that the focus on so many allocators has been code will behave exactly as you described.

Xavier Bigand

unread,
Dec 18, 2014, 5:01:44 PM12/18/14
to chromi...@chromium.org, rsl...@chromium.org, geor...@chromium.org
Updated code without the temporary buffer (tested on Android, Windows (mingw)) :

std::string	StringUtilities::sprintf(const char* format, ...)
{
    va_list     vlist;
    int         length = 0;
    std::string text;

    // vlist is restarted for the second use to work on MacOS X 64bits
    va_start(vlist, format);
    length = vsnprintf(nullptr, 0, format, vlist);
    va_end(vlist);
    if (length > 0)
    {
        text.resize(length + 1);    // We need add 1 for the final '\0' character
        va_start(vlist, format);
        length = vsnprintf((char*)text.data(), text.capacity(), format, vlist);
        va_end(vlist);
    }
    return text;

Chris Hopman

unread,
Dec 18, 2014, 5:07:12 PM12/18/14
to flamaro...@gmail.com, chromi...@chromium.org, rsl...@chromium.org, geor...@chromium.org
We already use vsnprintf in this way. We handle the case where there is a recoverable error (that flow might only happen for wide strings, though).

See https://code.google.com/p/chromium/codesearch#chromium/src/base/strings/stringprintf.cc&sq=package:chromium&type=cs&l=42

Chris Hopman

unread,
Dec 18, 2014, 5:17:37 PM12/18/14
to flamaro...@gmail.com, chromi...@chromium.org, rsl...@chromium.org, geor...@chromium.org
I guess the claim is that vsnprintf(nullptr, ...) will always return the correct size. It's possible that that is better, though if many of such formattings fit in the stack buffer I would be surprised if essentially doing the formatting twice is going to be any faster (and if most don't fit in that buffer, why do we have it?) . If vsnprintf(nullptr, ...) does always return the correct size we could do something like:

vsnprintf to stack buffer
if (failed) {
  vsnprintf(null, ...) to get size
  vsnprintf() for real
}

Does that work for the wide string case using vswprintf? Is it any better than what we do now? Is it so much better that we would want a non-wide string specific implementation?

Xavier Bigand

unread,
Dec 18, 2014, 7:58:38 PM12/18/14
to chromi...@chromium.org, flamaro...@gmail.com, rsl...@chromium.org, geor...@chromium.org
Yes that the point, it seems there is a special case when passing a nullptr as buffer addr. This make vsnprintf always returning the length of the fully processed format string. I firstly tried like you to use a stack allocated buffer, but in this case on some architectures when the buffer size wasn't enough the returned length was -1 instead of the size necessary to fit the complete result.

By using nullptr, there is no need to increase the buffer size in a loop.

Brett Wilson

unread,
Dec 18, 2014, 10:55:33 PM12/18/14
to flamaro...@gmail.com, Chromium-dev, Ryan Sleevi, geor...@chromium.org
Do we often (ever?) wsprintf strings >1024 characters long? I would
expect a double call for each printf to be much more expensive on
average than what we do now.

Brett
Reply all
Reply to author
Forward
0 new messages