RefCounted vs RefCountedThreadSafe

William Chan (陈智昌)

unread,

Mar 18, 2011, 2:07:10 AM3/18/11

to chromium-dev

I landed http://src.chromium.org/viewvc/chrome?view=rev&revision=78649 which switched base::RefCounted to use RefCountedThreadSafeBase instead of RefCountedBase (which I deleted). Preliminary examination of the perf dashboard indicates no negative impact. I'm leaving this running overnight. Feel free to check http://build.chromium.org/f/chromium/perf/dashboard/overview.html to see if you see any impact. I don't.

Given that on our performance tests show no apparent impact, and given the plethora of data races on reference counting bugs we've seen before (http://code.google.com/p/chromium/issues/list?can=1&q=data+race+reference+counter for a partial list), such as crbug.com/15577 (which some may remember as the nasty heap corruption crash from hell, back in Chrome 3), I'd like to propose:

(a) We rename base::RefCounted to base::RefCountedThreadUnsafe and comment it heavily with nasty comments warning against use.

(b) We rename base::RefCountedThreadSafe to base::RefCounted

As for the children of the existing RefCounted/RefCountedThreadSafe, I propose we make them *all* inherit from RefCountedThreadSafe (maybe with a *few* exceptions if people really insist that certain base library classes on principle should not inherit from base::RefCountedThreadSafe, but I'd push back hard on almost everything).

I'd like to point out in favor of these rather large changes, the fact that the performance gains of using RefCounted instead of RefCountedThreadSafe in Chromium code are apparently negligible. On the other hand, the costs are very high. The kinds of bugs caused by data races on refcounts are very hard to discover (TSan helps a lot, but that only works if you have enough test coverage) and it causes heap corruption, which makes it very hard to track down the source of the problem.

Scott Hess

unread,

Mar 18, 2011, 2:23:16 AM3/18/11

to will...@chromium.org, chromium-dev

On Thu, Mar 17, 2011 at 11:07 PM, William Chan (陈智昌)
<will...@chromium.org> wrote:
> I'd like to point out in favor of these rather large changes, the fact that
> the performance gains of using RefCounted instead of RefCountedThreadSafe in
> Chromium code are apparently negligible. On the other hand, the costs are
> very high. The kinds of bugs caused by data races on refcounts are very hard
> to discover (TSan helps a lot, but that only works if you have enough test
> coverage) and it causes heap corruption, which makes it very hard to track
> down the source of the problem.

Are there any existing cases where you do believe that
RefCountedThreadUnsafe is warranted?

Are all of those cases warranted because the problem cannot be
refactored to not need that type of ref-counter, or would they allow
for alternate solutions which did not need the RefCountedThreadUnsafe?

Developers are all-the-time developing code which can't afford to not
use the fastest possible calls. But in this problem space has proven
to be beyond even well-qualified developers, and when a mistake is
made a LOT of developer time is wasted. Perhaps we'd be better off to
simply not provide a non-threadsafe version at all, and be done with
it.

-scott

Albert J. Wong (王重傑)

unread,

Mar 18, 2011, 2:26:47 AM3/18/11

to sh...@google.com, Scott Hess, will...@chromium.org, chromium-dev

+1

Either that, or throw it into base::subtle.

William Chan (陈智昌)

unread,

Mar 18, 2011, 2:39:10 AM3/18/11

to Scott Hess, chromium-dev

You may very well be right for Chromium code. I partially addressed this when I stated:

"maybe with a *few* exceptions if people really insist that certain base library classes on principle should not inherit from base::RefCountedThreadSafe, but I'd push back hard on almost everything"

This caveat is simply dependent on whether we consider all the code we write to be only consumed by Chromium. Within our refcounting uses in Chromium code proper, I find it difficult to believe RefCountedThreadUnsafe would be justifiable. But in base/ or other low-level libraries, where one could imagine them being more general purpose, I would say that RefCountedThreadUnsafe may indeed have its place, since it's conceivable that you are preventing someone from writing really performance intensive code due to not using RefCountedThreadUnsafe in certain core utility classes.

Within Chromium code, I suspect that if we mass convert all code to using RefCountedThreadSafe (to be renamed RefCounted) and put scary comments on RefCountedThreadUnsafe, that people won't use it. People usually copy what they see. No one currently knows about the existence of a RefCountedThreadUnsafe and would have to learn about it and use it, despite the lack of existing examples using it and the scary warnings on it.

Perhaps I am too naive and not draconian enough (I usually think I'm more draconian than most folks!). I think Albert's base::subtle suggestion is good too.

-scott

Paweł Hajdan, Jr.

unread,

Mar 18, 2011, 4:33:54 AM3/18/11

to will...@chromium.org, chromium-dev

On Fri, Mar 18, 2011 at 07:07, William Chan (陈智昌) <will...@chromium.org> wrote:

I'd like to point out in favor of these rather large changes, the fact that the performance gains of using RefCounted instead of RefCountedThreadSafe in Chromium code are apparently negligible. On the other hand, the costs are very high. The kinds of bugs caused by data races on refcounts are very hard to discover (TSan helps a lot, but that only works if you have enough test coverage) and it causes heap corruption, which makes it very hard to track down the source of the problem.

If the performance difference is negligible, I think it'd be better to get rid of the thread-unsafe variant. Even if someone uses it safely, some (possibly hard to correlate) code changes might make it unsafe.

We have some possible plans now (RefCountedThreadUnsafe, base::subtle), and here is mine: don't provide the ThreadUnsafe variant at all. If someone asks for it later one of the above can still be applied.

Eradicating an entire class of really crazy bugs sounds like a huge win to me.

Timur Iskhodzhanov

unread,

Mar 18, 2011, 8:15:55 AM3/18/11

to will...@chromium.org, chromium-dev

LGTM++

If for some reason you keep the RefCountedThreadUnsafe class, please
make sure to un-comment the DFAKE_SCOPED_LOCK_THREAD_LOCKED lines.
This will "sort of" protect us from inappropriate usage of this class.

Thanks for doing this!
btw, why did you start this effort? Have you found some particular bug recently?

> --
> Chromium Developers mailing list: chromi...@chromium.org
> View archives, change email options, or unsubscribe:
> http://groups.google.com/a/chromium.org/group/chromium-dev
>

Antoine Labour

unread,

Mar 18, 2011, 12:36:23 PM3/18/11

to will...@chromium.org, chromium-dev

On Thu, Mar 17, 2011 at 11:07 PM, William Chan (陈智昌) <will...@chromium.org> wrote:

I landed http://src.chromium.org/viewvc/chrome?view=rev&revision=78649 which switched base::RefCounted to use RefCountedThreadSafeBase instead of RefCountedBase (which I deleted). Preliminary examination of the perf dashboard indicates no negative impact. I'm leaving this running overnight. Feel free to check http://build.chromium.org/f/chromium/perf/dashboard/overview.html to see if you see any impact. I don't.

Is any of these measured on an in-order core ? E.g. a CR-48 ? I expect the relative cost of atomic operations to be higher there. Since it's also our weakest platform, I'd like to see measurements there...

Antoine

William Chan (陈智昌)

unread,

Mar 18, 2011, 1:06:07 PM3/18/11

to Antoine Labour, chromium-dev, ch...@chromium.org, Marc-Antoine Ruel

[ +chase, +maruel to help with questions about bots ]

On Fri, Mar 18, 2011 at 9:36 AM, Antoine Labour <pi...@google.com> wrote:

On Thu, Mar 17, 2011 at 11:07 PM, William Chan (陈智昌) <will...@chromium.org> wrote:

I landed http://src.chromium.org/viewvc/chrome?view=rev&revision=78649 which switched base::RefCounted to use RefCountedThreadSafeBase instead of RefCountedBase (which I deleted). Preliminary examination of the perf dashboard indicates no negative impact. I'm leaving this running overnight. Feel free to check http://build.chromium.org/f/chromium/perf/dashboard/overview.html to see if you see any impact. I don't.

Is any of these measured on an in-order core ? E.g. a CR-48 ? I expect the relative cost of atomic operations to be higher there. Since it's also our weakest platform, I'd like to see measurements there...

I do not believe so. I think you've pointed out an important hole in our performance testing coverage. I have no idea how to run the perf tests on a CR-48. I think if we care about this platform (and I believe we do), we should add bots for it to the waterfall. Are the Linux ChromiumOS/ARM bots using hardware similar to a CR-48? Do we simply need to turn on perf tests for them?

That said, I'm skeptical we'll see much difference. The browser process just doesn't do that much performance-intensive work. And the renderer code is primarily WebKit/V8, so it won't be affected much by a change to Chromium refcounting.

Chase Phillips

unread,

Mar 18, 2011, 1:49:35 PM3/18/11

to William Chan (陈智昌), Antoine Labour, chromium-dev, Marc-Antoine Ruel

On Fri, Mar 18, 2011 at 10:06, William Chan (陈智昌) <will...@chromium.org> wrote:

[ +chase, +maruel to help with questions about bots ]

On Fri, Mar 18, 2011 at 9:36 AM, Antoine Labour <pi...@google.com> wrote:

On Thu, Mar 17, 2011 at 11:07 PM, William Chan (陈智昌) <will...@chromium.org> wrote:

I landed http://src.chromium.org/viewvc/chrome?view=rev&revision=78649 which switched base::RefCounted to use RefCountedThreadSafeBase instead of RefCountedBase (which I deleted). Preliminary examination of the perf dashboard indicates no negative impact. I'm leaving this running overnight. Feel free to check http://build.chromium.org/f/chromium/perf/dashboard/overview.html to see if you see any impact. I don't.

Is any of these measured on an in-order core ? E.g. a CR-48 ? I expect the relative cost of atomic operations to be higher there. Since it's also our weakest platform, I'd like to see measurements there...

I do not believe so. I think you've pointed out an important hole in our performance testing coverage. I have no idea how to run the perf tests on a CR-48. I think if we care about this platform (and I believe we do), we should add bots for it to the waterfall. Are the Linux ChromiumOS/ARM bots using hardware similar to a CR-48? Do we simply need to turn on perf tests for them?

No, the Chromium Perf waterfall has XP/Vista (dual|single) core, Mac 10.(5|6), and Linux systems, no CrOS atm. Re: in-order, I believe all cores on chromium.perf are out-of-order cores.

Lately I've been looking into adding Chromium on CR-48, but it's not been much more than some brief conversations with others. If there's significant interest in seeing whatever ChromiumOS-specific perf results we can muster per-Chromium-CL externally, I can look at what it would take to make these available via a dashboard.

Marc-Antoine Ruel

unread,

Mar 18, 2011, 2:12:19 PM3/18/11

to Chase Phillips, William Chan (陈智昌), Antoine Labour, chromium-dev

So I guess the best solution is to do a 3 steps tests run of page cycler:

- Reference build

- Custom build without the patch

- Custom build with the patch

Are RefCounted objects touched sufficiently often to affect be visible from a perf standpoint somehow? How many RefCounted reference count updates are done per second during a page cycler run?

M-A

jbates

unread,

Mar 18, 2011, 4:20:08 PM3/18/11

to Chromium-dev

+1

What a coincidence, I was just looking through the RefCounted header
and wondering why the default is not thread safe. This will be a huge
safety improvement. Atomic ops are cheaper than cache misses,
allocations, etc, which are all over our codebase. These refcounts
shouldn't be triggered in any performance critical inner loops anyway.
That would be a design flaw?

On Mar 17, 11:07 pm, William Chan (陈智昌) <willc...@chromium.org> wrote:
> I landedhttp://src.chromium.org/viewvc/chrome?view=rev&revision=78649which

> switched base::RefCounted to use RefCountedThreadSafeBase instead of
> RefCountedBase (which I deleted). Preliminary examination of the perf
> dashboard indicates no negative impact. I'm leaving this running overnight.

> Feel free to checkhttp://build.chromium.org/f/chromium/perf/dashboard/overview.htmlto see if

> you see any impact. I don't.
>
> Given that on our performance tests show no apparent impact, and given the

> plethora of data races on reference counting bugs we've seen before (http://code.google.com/p/chromium/issues/list?can=1&q=data+race+refer...

Darin Fisher

unread,

Mar 18, 2011, 5:04:45 PM3/18/11

to William Chan (陈智昌), Chromium-dev

[I didn't mean to take this off list. Bringing it back.]

On Fri, Mar 18, 2011 at 11:13 AM, William Chan (陈智昌) <will...@chromium.org> wrote:

On Fri, Mar 18, 2011 at 10:29 AM, Darin Fisher <da...@google.com> wrote:

This is very interesting, and you make a strong case.

My concern is that our tests may not be conclusive. I worry about being in situations where we are trying to optimize code that is suffering from "death by a thousand cuts", in which things are slow, and there is no low hanging fruit. Atomic reference counting can negatively impact CPU perf, no?

You're absolutely right that atomic refcounting *can* hurt CPU performance. I think that our tests may not be 100% conclusive, but I think the experiment makes a big statement. It shows that even if we converted *all* RefCounted to RefCountedThreadSafe, the results aren't noticeable by our current tests.

Our tests mostly highlight page load time (dominated by WebKit performance) and startup time. The tests don't necessarily cover things that may stress single-threaded Chromium code. Maybe there are some interesting UI actions or history backend processes that would be more sensitive to the change you are proposing here. I don't know exactly what is missing, but I do know that we should be cautious about assuming that just because a change didn't regress our tests doesn't mean that the change was good.

Case in point:

For Chrome 2, the V8 team changed an algorithm that had no visible impact on any of our performance tests. It helped reduce Gmail memory usage IIRC. However, just as we were about to ship Chrome 2, we learned that it greatly impacted page load time for many Google properties as observed by those web sites' own metrics. It was too late to fix. When we uncovered the problem and fixed it, a new page cycler was then added (morejs), so that we would not regress it again.

My point is that absence of regression in our existing tests is not proof that you haven't made the product slower. Moreover, we even know that most of our existing tests probably aren't that relevant since they mainly exercise WebKit. (If you made the same change to WTF::RefCounted<T>, perhaps you would notice a regression on the page cycler tests.)

Going back to the analogy, it shows that even a thousand cuts seems to have any noticeable impact by our existing metrics. I think this makes the strong case that by default, we should choose RefCountedThreadSafe. It may be possible that for certain low-level utility classes, or otherwise performance intensive code (I'm thinking of stuff in base/), we may want to stick with RefCounted.

Also, most of our code is single threaded, so it seems a bit unfortunate to impose these admittedly unknown costs.

Finally, why is this a concern now? What changed? Isn't TSAN solving the problem? We also have assertions in the code that help with many common sources of this issue.

Nothing really changed. TSAN continues to keep catching bugs. That shows that people keep getting this wrong. TSAN also only works when we have appropriate test coverage. So that means we can be pretty sure that there are bugs that aren't be caught by TSAN. TSAN also is too slow for Mac and Windows browser/UI tests, so that means that there's definitely a huge chunk of code that's not being tested. It also happens to be the code where we do the most task posting across threads, so that's unfortunate.

The problem is every time one of these slips through our testing, it takes many engineers many hours to track down the source of the bug, because it causes heap corruption (double free, use-after-free). The source of heap corruption is super difficult to track down, so it takes forever to fix. This happened to us in Chrome 3 where the TCMalloc heap corruption was discovered to be due to a data race on a refcount. It took many weeks with many engineers working full-time to track down the single bug. And they didn't track it down, Timurrrr happened to catch it on TSAN which happened to fix the crashes, but no one was sure that it would fix it.

Currently there's another heap corruption that's causing Chrome 11 crash rates to skyrocket, and many engineers have spent well over a week working on it with no progress. I'm not saying that it's due to refcounting again, but I think that if we can basically eliminate an entire category of bugs that are hugely difficult to track down, for no *perceivable costs*, it's probably worth it.

As I said above, this is a very compelling argument. I'm mostly convinced. I am just concerned about the claim that this will have no perceivable costs when you don't really know that to be true. I am very interested in seeing Cr48 results too, but I think we should also think hard about what other performance tests we should have to help us feel more confident here.

-Darin

-darin

On Mar 17, 2011 11:23 PM, "William Chan (陈智昌)" <will...@chromium.org> wrote:

Darin Fisher

unread,

Mar 18, 2011, 5:39:26 PM3/18/11

to Albert Wong (王重傑), William Chan (陈智昌), Chromium-dev

Well, there's UMA if you know what you want to measure :)

-Darin

On Fri, Mar 18, 2011 at 2:29 PM, Albert Wong (王重傑) <ajw...@google.com> wrote:

What about purpose putting in the change in an easily revertable form for a dev channel release and then let users pound on it for a bit? Is there a way to collect more than annecdotal evidence from something like that?

Albert J. Wong (王重傑)

unread,

Mar 18, 2011, 5:42:50 PM3/18/11

to da...@google.com, Darin Fisher, William Chan (陈智昌), Chromium-dev

[ retry from right address ]

On Fri, Mar 18, 2011 at 9:04 PM, Darin Fisher <da...@chromium.org> wrote:

[I didn't mean to take this off list. Bringing it back.]

On Fri, Mar 18, 2011 at 11:13 AM, William Chan (陈智昌) <will...@chromium.org> wrote:

On Fri, Mar 18, 2011 at 10:29 AM, Darin Fisher <da...@google.com> wrote:

This is very interesting, and you make a strong case.

My concern is that our tests may not be conclusive. I worry about being in situations where we are trying to optimize code that is suffering from "death by a thousand cuts", in which things are slow, and there is no low hanging fruit. Atomic reference counting can negatively impact CPU perf, no?

You're absolutely right that atomic refcounting *can* hurt CPU performance. I think that our tests may not be 100% conclusive, but I think the experiment makes a big statement. It shows that even if we converted *all* RefCounted to RefCountedThreadSafe, the results aren't noticeable by our current tests.

Our tests mostly highlight page load time (dominated by WebKit performance) and startup time. The tests don't necessarily cover things that may stress single-threaded Chromium code. Maybe there are some interesting UI actions or history backend processes that would be more sensitive to the change you are proposing here. I don't know exactly what is missing, but I do know that we should be cautious about assuming that just because a change didn't regress our tests doesn't mean that the change was good.

Case in point:

For Chrome 2, the V8 team changed an algorithm that had no visible impact on any of our performance tests. It helped reduce Gmail memory usage IIRC. However, just as we were about to ship Chrome 2, we learned that it greatly impacted page load time for many Google properties as observed by those web sites' own metrics. It was too late to fix. When we uncovered the problem and fixed it, a new page cycler was then added (morejs), so that we would not regress it again.

My point is that absence of regression in our existing tests is not proof that you haven't made the product slower. Moreover, we even know that most of our existing tests probably aren't that relevant since they mainly exercise WebKit. (If you made the same change to WTF::RefCounted<T>, perhaps you would notice a regression on the page cycler tests.)

What about committing the change in an easily revertible form for a dev channel release and then let users pound on it for a bit? That would at least provide a sanity check. And if we're creative, we might be able to do some sort of user studies to show if there is a perceptible speed difference. It's not quite as nice as having hard, quantifiable, metrics for specific actions, etc. But it may be easier to setup, and tracks perception, which is arguable as/more important than numbers.

Going back to the analogy, it shows that even a thousand cuts seems to have any noticeable impact by our existing metrics. I think this makes the strong case that by default, we should choose RefCountedThreadSafe. It may be possible that for certain low-level utility classes, or otherwise performance intensive code (I'm thinking of stuff in base/), we may want to stick with RefCounted.

Also, most of our code is single threaded, so it seems a bit unfortunate to impose these admittedly unknown costs.

Finally, why is this a concern now? What changed? Isn't TSAN solving the problem? We also have assertions in the code that help with many common sources of this issue.

Nothing really changed. TSAN continues to keep catching bugs. That shows that people keep getting this wrong. TSAN also only works when we have appropriate test coverage. So that means we can be pretty sure that there are bugs that aren't be caught by TSAN. TSAN also is too slow for Mac and Windows browser/UI tests, so that means that there's definitely a huge chunk of code that's not being tested. It also happens to be the code where we do the most task posting across threads, so that's unfortunate.

The problem is every time one of these slips through our testing, it takes many engineers many hours to track down the source of the bug, because it causes heap corruption (double free, use-after-free). The source of heap corruption is super difficult to track down, so it takes forever to fix. This happened to us in Chrome 3 where the TCMalloc heap corruption was discovered to be due to a data race on a refcount. It took many weeks with many engineers working full-time to track down the single bug. And they didn't track it down, Timurrrr happened to catch it on TSAN which happened to fix the crashes, but no one was sure that it would fix it.

Currently there's another heap corruption that's causing Chrome 11 crash rates to skyrocket, and many engineers have spent well over a week working on it with no progress. I'm not saying that it's due to refcounting again, but I think that if we can basically eliminate an entire category of bugs that are hugely difficult to track down, for no *perceivable costs*, it's probably worth it.

As I said above, this is a very compelling argument. I'm mostly convinced. I am just concerned about the claim that this will have no perceivable costs when you don't really know that to be true. I am very interested in seeing Cr48 results too, but I think we should also think hard about what other performance tests we should have to help us feel more confident here.

-Darin

-darin

On Mar 17, 2011 11:23 PM, "William Chan (陈智昌)" <will...@chromium.org> wrote:

Darin Fisher

unread,

Mar 18, 2011, 5:49:59 PM3/18/11

to Albert J. Wong (王重傑), William Chan (陈智昌), Chromium-dev

On Fri, Mar 18, 2011 at 2:42 PM, Albert J. Wong (王重傑) <ajw...@chromium.org> wrote:

[ retry from right address ]

On Fri, Mar 18, 2011 at 9:04 PM, Darin Fisher <da...@chromium.org> wrote:

[I didn't mean to take this off list. Bringing it back.]

On Fri, Mar 18, 2011 at 11:13 AM, William Chan (陈智昌) <will...@chromium.org> wrote:

On Fri, Mar 18, 2011 at 10:29 AM, Darin Fisher <da...@google.com> wrote:

This is very interesting, and you make a strong case.

My concern is that our tests may not be conclusive. I worry about being in situations where we are trying to optimize code that is suffering from "death by a thousand cuts", in which things are slow, and there is no low hanging fruit. Atomic reference counting can negatively impact CPU perf, no?

You're absolutely right that atomic refcounting *can* hurt CPU performance. I think that our tests may not be 100% conclusive, but I think the experiment makes a big statement. It shows that even if we converted *all* RefCounted to RefCountedThreadSafe, the results aren't noticeable by our current tests.

Our tests mostly highlight page load time (dominated by WebKit performance) and startup time. The tests don't necessarily cover things that may stress single-threaded Chromium code. Maybe there are some interesting UI actions or history backend processes that would be more sensitive to the change you are proposing here. I don't know exactly what is missing, but I do know that we should be cautious about assuming that just because a change didn't regress our tests doesn't mean that the change was good.

Case in point:

For Chrome 2, the V8 team changed an algorithm that had no visible impact on any of our performance tests. It helped reduce Gmail memory usage IIRC. However, just as we were about to ship Chrome 2, we learned that it greatly impacted page load time for many Google properties as observed by those web sites' own metrics. It was too late to fix. When we uncovered the problem and fixed it, a new page cycler was then added (morejs), so that we would not regress it again.

My point is that absence of regression in our existing tests is not proof that you haven't made the product slower. Moreover, we even know that most of our existing tests probably aren't that relevant since they mainly exercise WebKit. (If you made the same change to WTF::RefCounted<T>, perhaps you would notice a regression on the page cycler tests.)

What about committing the change in an easily revertible form for a dev channel release and then let users pound on it for a bit? That would at least provide a sanity check. And if we're creative, we might be able to do some sort of user studies to show if there is a perceptible speed difference. It's not quite as nice as having hard, quantifiable, metrics for specific actions, etc. But it may be easier to setup, and tracks perception, which is arguable as/more important than numbers.

I'm not sure what this will accomplish. Are there no other performance tests that we could create that might be helpful?

-Darin

Jim Roskind

unread,

Mar 19, 2011, 4:05:25 PM3/19/11

to will...@chromium.org, chromium-dev

I am very concerned about this change, as it continues to lead into what I've called "death by a thousand cuts." At best, each such change will be unmeasurable, but the net impact will be great.

Although we try to measure performance on our bots, the bots are a mild attempt to track major regressions. If you want to look at raw performance, you need a full build, with global (whole program?) optimization. In addition, larger benchmarks need to be used. Without global optimization, subtle inefficiencies, including lock contention, are invisible.

Worse than the fact that a number of ref-counted vars will be protected unnecessarily by locks, future code will tend to more cavalierly rely on this, when our approach to an apartment model for threading can properly isolate code on a thread, and preclude the need for locks. This is indeed the secret behind TCMalloc. Thread caching to avoid locks is a big deal.

I am very much in favor of having debug checking done that prevents accidental use non-thread-safe code across threads, but to move towards this approach in general should create an unrecoverable loss in performance.

I can hear the echo in my ears of Knuth arguing about premature optimization. I've always felt that code should not be prematurely optimized, but it should be written so that it can be optimized. This direction seems to a move to permanently abandon a powerful optimization technique. Thread local variables and data handling, never needing locks, are too valuable to discard IMO. This change precludes efficient reference counting on a thread, an that is a notable loss.

Jim

p.s., There is a famous tale in this regard about Oreo Cookies. Over a series of years Nabisco made changes, each of which was provably impossible for customers to detect, and each saved money for production. After several years, competitors of Oreo started eating Nabisco's lunch (proverbially speaking), and a test against an original recipe revealed a large and measurable loss of quality. The good news for Oreo is that they were able to revert their multi-year changes. Lets not give up the advantages we have built by taking a series of changes that we KNOW degrade the product, even if they are each hard to measure.

--

Darin Fisher

unread,

Mar 19, 2011, 5:58:10 PM3/19/11

to j...@chromium.org, will...@chromium.org, chromium-dev

This echo's my concerns exactly. Sadly, I don't know how to quantify the significance of using non-threadsafe RefCounts when warranted.

-Darin

William Chan (陈智昌)

unread,

Mar 21, 2011, 12:24:52 AM3/21/11

to Jim Roskind, chromium-dev

Thanks for your comments Jim! I'm replying inline.

On Sat, Mar 19, 2011 at 1:05 PM, Jim Roskind <j...@chromium.org> wrote:

I am very concerned about this change, as it continues to lead into what I've called "death by a thousand cuts." At best, each such change will be unmeasurable, but the net impact will be great.

The great thing here is the experiment I ran is equivalent to combining each one of those thousand cuts into one ubercut. And judging by its effects on our existing perf tests, the cutting implement is as sharp as a spoon.

Although we try to measure performance on our bots, the bots are a mild attempt to track major regressions. If you want to look at raw performance, you need a full build, with global (whole program?) optimization. In addition, larger benchmarks need to be used. Without global optimization, subtle inefficiencies, including lock contention, are invisible.

Can you explain this? Why does global optimization matter here? According to http://msdn.microsoft.com/en-us/library/0zza0de8(v=vs.71).aspx, whole program optimization allows visual studio to (1) Optimize the use of registers across function boundaries and (2) Inline a function in a module even when the function is defined in another module. My understanding is this allows for reducing the costs of function calls and the inlining may allow the compiler to reorder instructions more optimally. How does this relate to lock contention? Can you clarify this?

Worse than the fact that a number of ref-counted vars will be protected unnecessarily by locks, future code will tend to more cavalierly rely on this, when our approach to an apartment model for threading can properly isolate code on a thread, and preclude the need for locks. This is indeed the secret behind TCMalloc. Thread caching to avoid locks is a big deal.

Well, note that we use atomic operations, not locks, for our threadsafe refcounting. There is no locking.

I am very much in favor of having debug checking done that prevents accidental use non-thread-safe code across threads, but to move towards this approach in general should create an unrecoverable loss in performance.

I can hear the echo in my ears of Knuth arguing about premature optimization. I've always felt that code should not be prematurely optimized, but it should be written so that it can be optimized. This direction seems to a move to permanently abandon a powerful optimization technique. Thread local variables and data handling, never needing locks, are too valuable to discard IMO. This change precludes efficient reference counting on a thread, an that is a notable loss.

I hear the same echos of premature optimization too. Something about the root of all evil or what not...so why are you arguing for an optimization that has no visible positive impact? :P That said, I think you make a very valid point. As skeptical as I am that it really makes a difference performance wise to the Chromium codebase (bluntly speaking, I don't think this is a powerful optimization technique for our codebase, where I'm not including Webkit since they use their own refcounting implementation, and I'm interested to hear any explanations to the contrary), perhaps it's useful to keep the coding practice as is. What if we still make people use RefCounted and RefCountedThreadUnsafe (I purposely imply that we will still switch the default) in the code, and enforce this with tests, but in the official builds, we switch to the "safe/slow" version? If we ever detect that it is much slower, we then stop doing this switch?

Honestly though, I think our code follows the 80/20 rule, and it's unfortunate to trade off safety for zero-visible performance improvement, since a bug can occur anywhere in the code, yet the performance intensive code is primarily focused in a few regions. I generally prefer to trade off code maintainability / safety for performance unless I think that this is performance intensive code where it matters.

I think the most major concern is to what degree we can rely on the existing perf tests when making a decision like this.

Jim

p.s., There is a famous tale in this regard about Oreo Cookies. Over a series of years Nabisco made changes, each of which was provably impossible for customers to detect, and each saved money for production. After several years, competitors of Oreo started eating Nabisco's lunch (proverbially speaking), and a test against an original recipe revealed a large and measurable loss of quality. The good news for Oreo is that they were able to revert their multi-year changes. Lets not give up the advantages we have built by taking a series of changes that we KNOW degrade the product, even if they are each hard to measure.

This is very similar to your thousands cut analogy above. I believe I've already refuted the general point, since we've combined all of the changes into a single lump change and are evaluating it. Your point about being able to restore the previous changes is acknowledged and addressed above.

William Chan (陈智昌)

unread,

Mar 21, 2011, 1:02:55 AM3/21/11

to Darin Fisher, j...@chromium.org, chromium-dev

On Sat, Mar 19, 2011 at 2:58 PM, Darin Fisher <da...@chromium.org> wrote:

This echo's my concerns exactly. Sadly, I don't know how to quantify the significance of using non-threadsafe RefCounts when warranted.

Just to double check, does your concern being echoed change your stance that you were mostly convinced by my previous arguments? :P

So, I ran a Mac Chromium instance where I recorded the number of RefCounted (not the threadsafe one) ::AddRefs/Releases in the browser process while I tried some standard UI operations (session restore of 5 tabs, open tab, browse to a page, close tabs, quit). In 10 seconds, I counted 1800~ calls (900~ pairs). Now, if you look at http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/Multithreading/ThreadSafety/ThreadSafety.html, where for a Intel-based iMac with a 2 GHz Core Duo processor and 1 GB of RAM running Mac OS X v10.5, an uncontested (These *should* all be uncontested, otherwise we've got a bug and we better use threadsafe refcounting! Minor caveat about two addresses sharing the same cache line, which is somewhat unlikely given the fact that our objects are inheriting from the refcounted base class) atomic compare-and-swap costs approximately 0.05 microseconds. So, 1800*0.05us = 90us. So, over the 10 seconds, we added 90 microseconds of overhead, for a grand total of .001%? Now, maybe people question the accuracy of the Mac numbers and my unscientific test numbers, but do we really think they're 3 orders of magnitude off?

Another bogeyman I've heard people mention before is causing jank in UI operations. Now, I'm going to be generous and assume human perception can notice 1ms. To achieve 1ms of delay, we'd have to have on the order of 20000 atomic operations (10000 AddRef/Release pairs). Given that in my admittedly totally unscientific test, I spent 10 seconds using the browser, and accumulated 1800 atomic operations, I am pretty skeptical this bogeyman exists.

What I would do if I were at my Linux desktop and could run google-perftools, would be I would get a profile of a browser process and show everyone how amazingly little time is spent in refcounting. I'll grab a profile when I get back to SF in a week and post it. What performance intensive browser-process only (I assume we mostly trust perf cyclers for renderer stuff, and have little reason to suspect that a Chromium refcounting change's effect on the renderer process would not show up in the perf cyclers) operation are we worried about? History? Safe browsing? I dunno. But whatever one you come up with, do you think that changing RefCountedBase would seriously affect it? Do you think that said browser process operation is using more than tens of thousands of AddRef/Release() pairs per second?

PS: I forgot that we use a memory barrier in our atomic refcount decrement operation. Let's round up to the time for acquiring an uncontested mutex according to the Mac doc. That's 0.2us instead of 0.05us, so a factor of 4. So, I may be further off by a factor of 4 in the above calculations. Sorry. I don't think it makes much of a difference though.

PPS: Lest anyone forget, all the arguments to the contrary have been about imaginary/hypothetical performance costs. Refcounting data race bugs crop up in reality (refer to my original post).

Jim Roskind

unread,

Mar 21, 2011, 3:26:33 AM3/21/11

to William Chan (陈智昌), chromium-dev

On Sun, Mar 20, 2011 at 9:24 PM, William Chan (陈智昌) <will...@chromium.org> wrote:

Thanks for your comments Jim! I'm replying inline.

On Sat, Mar 19, 2011 at 1:05 PM, Jim Roskind <j...@chromium.org> wrote:

Although we try to measure performance on our bots, the bots are a mild attempt to track major regressions. If you want to look at raw performance, you need a full build, with global (whole program?) optimization. In addition, larger benchmarks need to be used. Without global optimization, subtle inefficiencies, including lock contention, are invisible.

Can you explain this? Why does global optimization matter here? According to http://msdn.microsoft.com/en-us/library/0zza0de8(v=vs.71).aspx, whole program optimization allows visual studio to (1) Optimize the use of registers across function boundaries and (2) Inline a function in a module even when the function is defined in another module. My understanding is this allows for reducing the costs of function calls and the inlining may allow the compiler to reorder instructions more optimally. How does this relate to lock contention? Can you clarify this?

You asked for an explanation... so here's my shot at it:

In my aging discussions with compiler jocks, the giant advantage that appears when you have global optimization results from propagation of information from a call site into a function. This is subtlety described in the MSDN link you provided as "inlining across module boundaries." The decision to inline *can* be made based on the information known at the call site (in another module). As a most extreme example, when calling a function with a fixed argument (such as a boolean "true"), this information can be propagated into the function, and a decision to inline (or not) can be made that *with* that constant. Specifically, if the function becomes small enough to inline, when we propagate the boolean to any/all sites in the inlined function, then the compiler should inline. Another example of a constant propagation involves passing in a constant int, which is used to control the number of iterations in a loop in a called function. When the constant is small, the inlining can even unroll a loop. Constants are often deduced at the call site, such as by deducing what a strlen would be on a string known at compile time.

It was my understanding (historically) that this global propagation and selective inline optimization proved very valuable. (again... this is old info... dating back to the 90's).

In addition, of recent we've (on chromium) taken to moving as many function definitions out of header files as possible. Historically these definitions resided in the header files as the only way to get global inlining. Since we now presume that whole program optimization is provided for our release build, the thought is "why bother putting them in headers?" The (insufficient) answer is that when they are not in headers, a developer's "release flag" build will NOT have any significant inlining. This further accentuates the difference between a "production release" build (which takes 4 hours to link/optimize) and a "release flag" build (which developers regularly build on their desktops).

Bottom line: Genuine production quality releases should run much faster, and much differently from developer grade release builds. This difference should also be growing!?!

As programs run slower (re: developer grade builds), pipeline stalls for an atomic access to main memory become less significant, and *should* be buried in noise. The canonical example of this is the attempt by a programmer to hold a lock for a minimal length of time. This often involves a "quick" access to a data member, which is historically inlined. Without the inlining, the body of the lock runs slower, and the cost of the lock acquisition and release, relative to executed code, becomes less measurable. IMO, if you want to look at the impact of such subtle changes, it is best to do so on full grade production builds.

I will add that some of your numbers give me pause. <yeah... I do read what ya' write ;-) > Unfortunately I don't have a good picture in my head of how various grades of processors respond to all these subtle changes. I'm always nervous about deriving too much data from the bleeding edge processors we have on dev boxes inside Google.

Uber Bottom Line: I'd like to see some ongoing perf test comparing complete production quality builds. I'd like to see these tests run on a variety of machine, ranging from single core, to dual core ,to our crazy dual-quad-core-hyperthreaded monster boxes.

Jim

jbates

unread,

Mar 21, 2011, 1:43:53 PM3/21/11

to Chromium-dev

Has anyone done hotspot analysis of chromium to prioritize our
performance optimization discussions? All this hypothetical talk is
making my head hurt :)

In general though, I thought chromium prioritizes safe coding
practices over performance for the most part. (Isn't that the point of
the RefCounted base class? There are faster and less safe ways of
managing object lifetimes...)

John

On Mar 21, 12:26 am, Jim Roskind <j...@chromium.org> wrote:
> On Sun, Mar 20, 2011 at 9:24 PM, William Chan (陈智昌)

> <willc...@chromium.org>wrote:

>
>
>
>
>
>
>
>
>
> > Thanks for your comments Jim! I'm replying inline.
>
> > On Sat, Mar 19, 2011 at 1:05 PM, Jim Roskind <j...@chromium.org> wrote:
>
> >> Although we try to measure performance on our bots, the bots are a mild
> >> attempt to track major regressions. If you want to look at raw performance,
> >> you need a full build, with global (whole program?) optimization. In
> >> addition, larger benchmarks need to be used. Without global optimization,
> >> subtle inefficiencies, including lock contention, are invisible.
>
> > Can you explain this? Why does global optimization matter here? According

> > tohttp://msdn.microsoft.com/en-us/library/0zza0de8(v=vs.71).aspx, whole

William Chan (陈智昌)

unread,

Mar 21, 2011, 1:53:17 PM3/21/11

to jba...@chromium.org, Chromium-dev

On Mon, Mar 21, 2011 at 10:43 AM, jbates <jba...@chromium.org> wrote:

Has anyone done hotspot analysis of chromium to prioritize our
performance optimization discussions? All this hypothetical talk is
making my head hurt :)

Yes. jamesr@ has done work profiling the renderer. At least erg@ and I have both profiled the browser process before. I think davemoore@ may have as well. I'll let them correct me if I'm wrong, but I do not recall refcounting ever appearing in the profiles. Of course, it could be because the individual cost is so small that the sampling mechanism will never catch a sample within a refcount, but the accumulated cost ("death by a thousand cuts") is significant.

In general though, I thought chromium prioritizes safe coding
practices over performance for the most part. (Isn't that the point of
the RefCounted base class? There are faster and less safe ways of
managing object lifetimes...)

I wouldn't say there's a general convention on this. It varies by which engineer you talk to. But many will reference premature optimization (http://c2.com/cgi/wiki?PrematureOptimization) and the Pareto Principle (80/20 rule, http://en.wikipedia.org/wiki/Pareto_principle). The reason why Jim and Darin are concerned is because RefCounted is fairly low level, so it makes its performance implications more difficult to reason about. I've attempted to provide back of the envelope calculations to give a ballpark idea of its implications, which according to my calculations, show that it is several orders of magnitude away from being significant.

Darin Fisher

unread,

Mar 23, 2011, 1:15:54 PM3/23/11

to William Chan (陈智昌), j...@chromium.org, chromium-dev

On Sun, Mar 20, 2011 at 10:02 PM, William Chan (陈智昌) <will...@chromium.org> wrote:

On Sat, Mar 19, 2011 at 2:58 PM, Darin Fisher <da...@chromium.org> wrote:

This echo's my concerns exactly. Sadly, I don't know how to quantify the significance of using non-threadsafe RefCounts when warranted.

Just to double check, does your concern being echoed change your stance that you were mostly convinced by my previous arguments? :P

Nope.

So, I ran a Mac Chromium instance where I recorded the number of RefCounted (not the threadsafe one) ::AddRefs/Releases in the browser process while I tried some standard UI operations (session restore of 5 tabs, open tab, browse to a page, close tabs, quit). In 10 seconds, I counted 1800~ calls (900~ pairs).

~900 pairs over 10 seconds seems rather insignificant. This test doesn't seem to be sensitive to AddRef/Release cost. Maybe there are other types of operations that could be?

Now, if you look at http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/Multithreading/ThreadSafety/ThreadSafety.html, where for a Intel-based iMac with a 2 GHz Core Duo processor and 1 GB of RAM running Mac OS X v10.5, an uncontested (These *should* all be uncontested, otherwise we've got a bug and we better use threadsafe refcounting! Minor caveat about two addresses sharing the same cache line, which is somewhat unlikely given the fact that our objects are inheriting from the refcounted base class) atomic compare-and-swap costs approximately 0.05 microseconds. So, 1800*0.05us = 90us. So, over the 10 seconds, we added 90 microseconds of overhead, for a grand total of .001%? Now, maybe people question the accuracy of the Mac numbers and my unscientific test numbers, but do we really think they're 3 orders of magnitude off?

Another bogeyman I've heard people mention before is causing jank in UI operations. Now, I'm going to be generous and assume human perception can notice 1ms. To achieve 1ms of delay, we'd have to have on the order of 20000 atomic operations (10000 AddRef/Release pairs). Given that in my admittedly totally unscientific test, I spent 10 seconds using the browser, and accumulated 1800 atomic operations, I am pretty skeptical this bogeyman exists.

What I would do if I were at my Linux desktop and could run google-perftools, would be I would get a profile of a browser process and show everyone how amazingly little time is spent in refcounting. I'll grab a profile when I get back to SF in a week and post it. What performance intensive browser-process only (I assume we mostly trust perf cyclers for renderer stuff, and have little reason to suspect that a Chromium refcounting change's effect on the renderer process would not show up in the perf cyclers) operation are we worried about? History? Safe browsing? I dunno. But whatever one you come up with, do you think that changing RefCountedBase would seriously affect it? Do you think that said browser process operation is using more than tens of thousands of AddRef/Release() pairs per second?

PS: I forgot that we use a memory barrier in our atomic refcount decrement operation. Let's round up to the time for acquiring an uncontested mutex according to the Mac doc. That's 0.2us instead of 0.05us, so a factor of 4. So, I may be further off by a factor of 4 in the above calculations. Sorry. I don't think it makes much of a difference though.

PPS: Lest anyone forget, all the arguments to the contrary have been about imaginary/hypothetical performance costs. Refcounting data race bugs crop up in reality (refer to my original post).

It occurred to me that there is one other interesting consequence of making RefCounted threadsafe. Or, rather I should say that threadsafe RefCounted implies that the destructor of the reference counted object is safe to call on any thread. This is not true for many classes. It seems like if you avoid a data race problem by making RefCounted threadsafe that in many cases you won't actually help anything because you'll still have problems if the destructor is run on the wrong thread.

Stepping back, maybe we should be looking into ways to make more of our classes be NonThreadSafe by default, and have people opt-in to using a class instance across threads? Maybe we could make RefCounted extend from NonThreadSafe, and then change any users that violate that to use RefCountedThreadSafe instead?

I'm saying that RefCounted data race bugs are just the canary in the coal mine. Papering over that canary by switching RefCounted to be threadsafe still leaves a set of impending problems.

-Darin

Paweł Hajdan, Jr.

unread,

Mar 23, 2011, 1:39:34 PM3/23/11

to da...@google.com, Darin Fisher, William Chan (陈智昌), j...@chromium.org, chromium-dev

On Wed, Mar 23, 2011 at 18:15, Darin Fisher <da...@chromium.org> wrote:

It occurred to me that there is one other interesting consequence of making RefCounted threadsafe. Or, rather I should say that threadsafe RefCounted implies that the destructor of the reference counted object is safe to call on any thread. This is not true for many classes. It seems like if you avoid a data race problem by making RefCounted threadsafe that in many cases you won't actually help anything because you'll still have problems if the destructor is run on the wrong thread.

We have the DeleteOnUIThread, DeleteOnIOThread and so on traits to avoid this problem.

William Chan (陈智昌)

unread,

Mar 23, 2011, 7:56:31 PM3/23/11

to Darin Fisher, j...@chromium.org, chromium-dev

On Wed, Mar 23, 2011 at 10:15 AM, Darin Fisher <da...@chromium.org> wrote:

On Sun, Mar 20, 2011 at 10:02 PM, William Chan (陈智昌) <will...@chromium.org> wrote:

On Sat, Mar 19, 2011 at 2:58 PM, Darin Fisher <da...@chromium.org> wrote:

This echo's my concerns exactly. Sadly, I don't know how to quantify the significance of using non-threadsafe RefCounts when warranted.

Just to double check, does your concern being echoed change your stance that you were mostly convinced by my previous arguments? :P

Nope.

So, I ran a Mac Chromium instance where I recorded the number of RefCounted (not the threadsafe one) ::AddRefs/Releases in the browser process while I tried some standard UI operations (session restore of 5 tabs, open tab, browse to a page, close tabs, quit). In 10 seconds, I counted 1800~ calls (900~ pairs).

~900 pairs over 10 seconds seems rather insignificant. This test doesn't seem to be sensitive to AddRef/Release cost. Maybe there are other types of operations that could be?

I can't think of any. Can you? I am skeptical that within the chrome browser process they exist.

Now, if you look at http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/Multithreading/ThreadSafety/ThreadSafety.html, where for a Intel-based iMac with a 2 GHz Core Duo processor and 1 GB of RAM running Mac OS X v10.5, an uncontested (These *should* all be uncontested, otherwise we've got a bug and we better use threadsafe refcounting! Minor caveat about two addresses sharing the same cache line, which is somewhat unlikely given the fact that our objects are inheriting from the refcounted base class) atomic compare-and-swap costs approximately 0.05 microseconds. So, 1800*0.05us = 90us. So, over the 10 seconds, we added 90 microseconds of overhead, for a grand total of .001%? Now, maybe people question the accuracy of the Mac numbers and my unscientific test numbers, but do we really think they're 3 orders of magnitude off?

Another bogeyman I've heard people mention before is causing jank in UI operations. Now, I'm going to be generous and assume human perception can notice 1ms. To achieve 1ms of delay, we'd have to have on the order of 20000 atomic operations (10000 AddRef/Release pairs). Given that in my admittedly totally unscientific test, I spent 10 seconds using the browser, and accumulated 1800 atomic operations, I am pretty skeptical this bogeyman exists.

What I would do if I were at my Linux desktop and could run google-perftools, would be I would get a profile of a browser process and show everyone how amazingly little time is spent in refcounting. I'll grab a profile when I get back to SF in a week and post it. What performance intensive browser-process only (I assume we mostly trust perf cyclers for renderer stuff, and have little reason to suspect that a Chromium refcounting change's effect on the renderer process would not show up in the perf cyclers) operation are we worried about? History? Safe browsing? I dunno. But whatever one you come up with, do you think that changing RefCountedBase would seriously affect it? Do you think that said browser process operation is using more than tens of thousands of AddRef/Release() pairs per second?

PS: I forgot that we use a memory barrier in our atomic refcount decrement operation. Let's round up to the time for acquiring an uncontested mutex according to the Mac doc. That's 0.2us instead of 0.05us, so a factor of 4. So, I may be further off by a factor of 4 in the above calculations. Sorry. I don't think it makes much of a difference though.

PPS: Lest anyone forget, all the arguments to the contrary have been about imaginary/hypothetical performance costs. Refcounting data race bugs crop up in reality (refer to my original post).

It occurred to me that there is one other interesting consequence of making RefCounted threadsafe. Or, rather I should say that threadsafe RefCounted implies that the destructor of the reference counted object is safe to call on any thread. This is not true for many classes. It seems like if you avoid a data race problem by making RefCounted threadsafe that in many cases you won't actually help anything because you'll still have problems if the destructor is run on the wrong thread.

Stepping back, maybe we should be looking into ways to make more of our classes be NonThreadSafe by default, and have people opt-in to using a class instance across threads? Maybe we could make RefCounted extend from NonThreadSafe, and then change any users that violate that to use RefCountedThreadSafe instead?

Sounds great to me! Then we can also enable the AddRef() and Release() DCHECK(CalledOnValidThread()) assertions rather than rely on the thread collision warner (which is currently turned off). I suspect this will require a large number of RefCounted objects to become RefCountedThreadSafe, because it's not uncommon in our code to create a RefCounted object on one thread and then post it to another thread, where it lives henceforth and dies. This indeed is why the thread collision warner is not enabled (I read maruel's comment saying he couldn't enable it. I thought that was foolish and tried myself and lost steam quite quickly). If we're willing to just switch these guys to RefCountedThreadSafe and make RefCounted inherit from NonThreadSafe by default and assert in AddRef()/Release(), I think that will be a great first step.

I'm saying that RefCounted data race bugs are just the canary in the coal mine. Papering over that canary by switching RefCounted to be threadsafe still leaves a set of impending problems.

Heh, an interesting analogy. I think that often it's the refcount that is the main issue. Having a race on the refcount can cause varying degrees of problems, many related to heap corruption. I would consider it *lucky* if it blew up due to invoking the destructor on the wrong thread, because the crash report will then often show the crash in the destructor. So, while you call it papering over the canary, I still think that usually it's address the main issue.

Why is it so easy to have a race on a refcount? One reason is because of scoped_refptrs. We may think we're only accessing / refcounting on one thread, but if we store a scoped_refptr in another object, and that object is copied/passed across threads (resulting in copying the scoped_refptr in a different thread, or destroying the scoped_refptr in a different thread), then it becomes harder to reason about.

Reply all

Reply to author

Forward