Are there any existing cases where you do believe that
RefCountedThreadUnsafe is warranted?
Are all of those cases warranted because the problem cannot be
refactored away from that type of ref-counting, or would they allow
for alternate solutions which did not need RefCountedThreadUnsafe?
Developers are constantly writing code which can't afford not to use
the fastest possible calls. But this problem space has proven to be
beyond even well-qualified developers, and when a mistake is made a
LOT of developer time is wasted. Perhaps we'd be better off to simply
not provide a non-threadsafe version at all, and be done with it.
-scott
I'd like to point out, in favor of these rather large changes, that the performance gains of using RefCounted instead of RefCountedThreadSafe in Chromium code are apparently negligible. On the other hand, the costs are very high. The kinds of bugs caused by data races on refcounts are very hard to discover (TSan helps a lot, but that only works if you have enough test coverage), and they cause heap corruption, which makes it very hard to track down the source of the problem.
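For context, the essential difference between the two flavors is small. Roughly the following (an illustrative sketch only, not the actual base/memory/ref_counted.h code):

// Illustrative sketch only -- not the actual base/memory/ref_counted.h code.
// The non-thread-safe flavor bumps a plain int; the thread-safe flavor uses
// atomic ops, with release/acquire ordering so the deleting thread sees all
// writes made by the other owners.
#include <atomic>

class RefCountedBaseSketch {
 public:
  void AddRef() const { ++ref_count_; }               // plain increment
  bool Release() const { return --ref_count_ == 0; }  // true => caller deletes
 private:
  mutable int ref_count_ = 0;
};

class RefCountedThreadSafeBaseSketch {
 public:
  void AddRef() const { ref_count_.fetch_add(1, std::memory_order_relaxed); }
  bool Release() const {
    return ref_count_.fetch_sub(1, std::memory_order_acq_rel) == 1;
  }
 private:
  mutable std::atomic<int> ref_count_{0};
};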
If for some reason you keep the RefCountedThreadUnsafe class, please
make sure to un-comment the DFAKE_SCOPED_LOCK_THREAD_LOCKED lines.
This will "sort of" protect us from inappropriate usage of this class.
Thanks for doing this!
btw, why did you start this effort? Have you found some particular bug recently?
I landed http://src.chromium.org/viewvc/chrome?view=rev&revision=78649 which switched base::RefCounted to use RefCountedThreadSafeBase instead of RefCountedBase (which I deleted). Preliminary examination of the perf dashboard indicates no negative impact. I'm leaving this running overnight. Feel free to check http://build.chromium.org/f/chromium/perf/dashboard/overview.html to see if you see any impact. I don't.
On Thu, Mar 17, 2011 at 11:07 PM, William Chan (陈智昌) <will...@chromium.org> wrote:
I landed http://src.chromium.org/viewvc/chrome?view=rev&revision=78649 which switched base::RefCounted to use RefCountedThreadSafeBase instead of RefCountedBase (which I deleted). Preliminary examination of the perf dashboard indicates no negative impact. I'm leaving this running overnight. Feel free to check http://build.chromium.org/f/chromium/perf/dashboard/overview.html to see if you see any impact. I don't.

Are any of these measured on an in-order core? E.g. a CR-48? I expect the relative cost of atomic operations to be higher there. Since it's also our weakest platform, I'd like to see measurements there...
[ +chase, +maruel to help with questions about bots ]

On Thu, Mar 17, 2011 at 11:07 PM, William Chan (陈智昌) <will...@chromium.org> wrote:

I landed http://src.chromium.org/viewvc/chrome?view=rev&revision=78649 which switched base::RefCounted to use RefCountedThreadSafeBase instead of RefCountedBase (which I deleted). Preliminary examination of the perf dashboard indicates no negative impact. I'm leaving this running overnight. Feel free to check http://build.chromium.org/f/chromium/perf/dashboard/overview.html to see if you see any impact. I don't.

Are any of these measured on an in-order core? E.g. a CR-48? I expect the relative cost of atomic operations to be higher there. Since it's also our weakest platform, I'd like to see measurements there...

I do not believe so. I think you've pointed out an important hole in our performance testing coverage. I have no idea how to run the perf tests on a CR-48. I think if we care about this platform (and I believe we do), we should add bots for it to the waterfall. Are the Linux ChromiumOS/ARM bots using hardware similar to a CR-48? Do we simply need to turn on perf tests for them?
On Fri, Mar 18, 2011 at 10:29 AM, Darin Fisher <da...@google.com> wrote:
This is very interesting, and you make a strong case.
My concern is that our tests may not be conclusive. I worry about being in situations where we are trying to optimize code that is suffering from "death by a thousand cuts", in which things are slow, and there is no low hanging fruit. Atomic reference counting can negatively impact CPU perf, no?
You're absolutely right that atomic refcounting *can* hurt CPU performance. I think that our tests may not be 100% conclusive, but I think the experiment makes a big statement. It shows that even if we converted *all* RefCounted to RefCountedThreadSafe, the results aren't noticeable by our current tests.
Going back to the analogy, it shows that even a thousand cuts doesn't seem to have any noticeable impact by our existing metrics. I think this makes a strong case that, by default, we should choose RefCountedThreadSafe. It may be possible that for certain low-level utility classes, or otherwise performance-intensive code (I'm thinking of stuff in base/), we may want to stick with RefCounted.

Also, most of our code is single threaded, so it seems a bit unfortunate to impose these admittedly unknown costs.
Finally, why is this a concern now? What changed? Isn't TSAN solving the problem? We also have assertions in the code that help with many common sources of this issue.
Nothing really changed. TSAN keeps catching bugs, which shows that people keep getting this wrong. TSAN also only works when we have appropriate test coverage, so we can be pretty sure there are bugs that aren't being caught by TSAN. TSAN is also too slow for the Mac and Windows browser/UI tests, which means there's definitely a huge chunk of code that's not being tested. It also happens to be the code where we do the most task posting across threads, so that's unfortunate.

The problem is that every time one of these slips through our testing, it takes many engineers many hours to track down the source of the bug, because it causes heap corruption (double free, use-after-free). The source of heap corruption is super difficult to track down, so it takes forever to fix. This happened to us in Chrome 3, where the TCMalloc heap corruption was discovered to be due to a data race on a refcount. It took many weeks with many engineers working full-time to track down that single bug. And they didn't actually track it down; Timurrrr happened to catch it with TSAN, and fixing it happened to stop the crashes, but no one was sure beforehand that it would.

Currently there's another heap corruption that's causing Chrome 11 crash rates to skyrocket, and many engineers have spent well over a week working on it with no progress. I'm not saying that it's due to refcounting again, but I think that if we can basically eliminate an entire category of bugs that are hugely difficult to track down, for no *perceivable costs*, it's probably worth it.
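For anyone who hasn't debugged one of these, the bug class looks roughly like the following (a hypothetical Widget, with std::thread used purely to make the interleaving explicit):

// Hypothetical repro of the bug class, for illustration only. Under TSan this
// reports a data race on the refcount; in the wild it surfaces later as a leak
// or as heap corruption far away from the actual race.
#include <thread>
#include "base/memory/ref_counted.h"

class Widget : public base::RefCounted<Widget> {
  // ... state owned by the object ...
};

void RaceOnRefcount() {
  Widget* w = new Widget();
  w->AddRef();  // owner #1
  w->AddRef();  // owner #2
  std::thread t1([w] { w->Release(); });  // plain --count, no atomics
  std::thread t2([w] { w->Release(); });  // races with t1
  t1.join();
  t2.join();
  // With the non-atomic count, both decrements can observe the same value, so
  // the count never reaches zero and the Widget leaks. The mirror-image race
  // (an AddRef() racing a Release()) can instead drop the count to zero early,
  // freeing the object while another thread still uses it: heap corruption.
}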
-darin
On Mar 17, 2011 11:23 PM, "William Chan (陈智昌)" <will...@chromium.org> wrote:
What about purposely putting the change in an easily revertible form for a dev channel release and then letting users pound on it for a bit? Is there a way to collect more than anecdotal evidence from something like that?
[I didn't mean to take this off list. Bringing it back.]

On Fri, Mar 18, 2011 at 11:13 AM, William Chan (陈智昌) <will...@chromium.org> wrote:
On Fri, Mar 18, 2011 at 10:29 AM, Darin Fisher <da...@google.com> wrote:
This is very interesting, and you make a strong case.
My concern is that our tests may not be conclusive. I worry about being in situations where we are trying to optimize code that is suffering from "death by a thousand cuts", in which things are slow, and there is no low hanging fruit. Atomic reference counting can negatively impact CPU perf, no?
You're absolutely right that atomic refcounting *can* hurt CPU performance. I think that our tests may not be 100% conclusive, but I think the experiment makes a big statement. It shows that even if we converted *all* RefCounted to RefCountedThreadSafe, the results aren't noticeable by our current tests.

Our tests mostly highlight page load time (dominated by WebKit performance) and startup time. The tests don't necessarily cover things that may stress single-threaded Chromium code. Maybe there are some interesting UI actions or history backend processes that would be more sensitive to the change you are proposing here. I don't know exactly what is missing, but I do know that we should be cautious: just because a change didn't regress our tests doesn't mean that the change was good.

Case in point: for Chrome 2, the V8 team changed an algorithm that had no visible impact on any of our performance tests. It helped reduce Gmail memory usage IIRC. However, just as we were about to ship Chrome 2, we learned that it greatly impacted page load time for many Google properties as observed by those web sites' own metrics. It was too late to fix. When we uncovered the problem and fixed it, a new page cycler was added (morejs) so that we would not regress it again.

My point is that the absence of regression in our existing tests is not proof that you haven't made the product slower. Moreover, we know that most of our existing tests probably aren't that relevant here since they mainly exercise WebKit. (If you made the same change to WTF::RefCounted<T>, perhaps you would notice a regression on the page cycler tests.)
Going back to the analogy, it shows that even a thousand cuts doesn't seem to have any noticeable impact by our existing metrics. I think this makes a strong case that, by default, we should choose RefCountedThreadSafe. It may be possible that for certain low-level utility classes, or otherwise performance-intensive code (I'm thinking of stuff in base/), we may want to stick with RefCounted.

Also, most of our code is single threaded, so it seems a bit unfortunate to impose these admittedly unknown costs.
Finally, why is this a concern now? What changed? Isn't TSAN solving the problem? We also have assertions in the code that help with many common sources of this issue.
Nothing really changed. TSAN keeps catching bugs, which shows that people keep getting this wrong. TSAN also only works when we have appropriate test coverage, so we can be pretty sure there are bugs that aren't being caught by TSAN. TSAN is also too slow for the Mac and Windows browser/UI tests, which means there's definitely a huge chunk of code that's not being tested. It also happens to be the code where we do the most task posting across threads, so that's unfortunate.

The problem is that every time one of these slips through our testing, it takes many engineers many hours to track down the source of the bug, because it causes heap corruption (double free, use-after-free). The source of heap corruption is super difficult to track down, so it takes forever to fix. This happened to us in Chrome 3, where the TCMalloc heap corruption was discovered to be due to a data race on a refcount. It took many weeks with many engineers working full-time to track down that single bug. And they didn't actually track it down; Timurrrr happened to catch it with TSAN, and fixing it happened to stop the crashes, but no one was sure beforehand that it would.

Currently there's another heap corruption that's causing Chrome 11 crash rates to skyrocket, and many engineers have spent well over a week working on it with no progress. I'm not saying that it's due to refcounting again, but I think that if we can basically eliminate an entire category of bugs that are hugely difficult to track down, for no *perceivable costs*, it's probably worth it.

As I said above, this is a very compelling argument. I'm mostly convinced. I am just concerned about the claim that this will have no perceivable costs when you don't really know that to be true. I am very interested in seeing CR-48 results too, but I think we should also think hard about what other performance tests we should have to help us feel more confident here.

-Darin
On Mar 17, 2011 11:23 PM, "William Chan (陈智昌)" <will...@chromium.org> wrote:
[ retry from right address ]

On Fri, Mar 18, 2011 at 9:04 PM, Darin Fisher <da...@chromium.org> wrote:
[I didn't mean to take this off list. Bringing it back.]

On Fri, Mar 18, 2011 at 11:13 AM, William Chan (陈智昌) <will...@chromium.org> wrote:
On Fri, Mar 18, 2011 at 10:29 AM, Darin Fisher <da...@google.com> wrote:
This is very interesting, and you make a strong case.
My concern is that our tests may not be conclusive. I worry about being in situations where we are trying to optimize code that is suffering from "death by a thousand cuts", in which things are slow, and there is no low hanging fruit. Atomic reference counting can negatively impact CPU perf, no?
You're absolutely right that atomic refcounting *can* hurt CPU performance. I think that our tests may not be 100% conclusive, but I think the experiment makes a big statement. It shows that even if we converted *all* RefCounted to RefCountedThreadSafe, the results aren't noticeable by our current tests.

Our tests mostly highlight page load time (dominated by WebKit performance) and startup time. The tests don't necessarily cover things that may stress single-threaded Chromium code. Maybe there are some interesting UI actions or history backend processes that would be more sensitive to the change you are proposing here. I don't know exactly what is missing, but I do know that we should be cautious: just because a change didn't regress our tests doesn't mean that the change was good.

Case in point: for Chrome 2, the V8 team changed an algorithm that had no visible impact on any of our performance tests. It helped reduce Gmail memory usage IIRC. However, just as we were about to ship Chrome 2, we learned that it greatly impacted page load time for many Google properties as observed by those web sites' own metrics. It was too late to fix. When we uncovered the problem and fixed it, a new page cycler was added (morejs) so that we would not regress it again.

My point is that the absence of regression in our existing tests is not proof that you haven't made the product slower. Moreover, we know that most of our existing tests probably aren't that relevant here since they mainly exercise WebKit. (If you made the same change to WTF::RefCounted<T>, perhaps you would notice a regression on the page cycler tests.)

What about committing the change in an easily revertible form for a dev channel release and then letting users pound on it for a bit? That would at least provide a sanity check. And if we're creative, we might be able to do some sort of user study to show whether there is a perceptible speed difference. It's not quite as nice as having hard, quantifiable metrics for specific actions, etc., but it may be easier to set up, and it tracks perception, which is arguably as or more important than the numbers.
I am very concerned about this change, as it continues to lead into what I've called "death by a thousand cuts." At best, each such change will be unmeasurable, but the net impact will be great.
Although we try to measure performance on our bots, the bots are a mild attempt to track major regressions. If you want to look at raw performance, you need a full build, with global (whole program?) optimization. In addition, larger benchmarks need to be used. Without global optimization, subtle inefficiencies, including lock contention, are invisible.
Worse than the fact that a number of ref-counted vars will be protected unnecessarily by locks, future code will tend to more cavalierly rely on this, when our approach to an apartment model for threading can properly isolate code on a thread, and preclude the need for locks. This is indeed the secret behind TCMalloc. Thread caching to avoid locks is a big deal.
I am very much in favor of having debug checking that prevents accidental use of non-thread-safe code across threads, but moving towards this approach in general would create an unrecoverable loss in performance.

I can hear the echo in my ears of Knuth arguing about premature optimization. I've always felt that code should not be prematurely optimized, but it should be written so that it can be optimized. This direction seems to be a move to permanently abandon a powerful optimization technique. Thread-local variables and data handling, never needing locks, are too valuable to discard IMO. This change precludes efficient reference counting on a thread, and that is a notable loss.
Jim

p.s. There is a famous tale in this regard about Oreo cookies. Over a series of years Nabisco made changes, each of which was provably impossible for customers to detect, and each of which saved money in production. After several years, competitors of Oreo started eating Nabisco's lunch (proverbially speaking), and a test against the original recipe revealed a large and measurable loss of quality. The good news for Oreo is that they were able to revert their multi-year changes. Let's not give up the advantages we have built by taking a series of changes that we KNOW degrade the product, even if they are each hard to measure.
This echoes my concerns exactly. Sadly, I don't know how to quantify the significance of using non-threadsafe RefCounts when warranted.
Thanks for your comments Jim! I'm replying inline.

Although we try to measure performance on our bots, the bots are a mild attempt to track major regressions. If you want to look at raw performance, you need a full build, with global (whole program?) optimization. In addition, larger benchmarks need to be used. Without global optimization, subtle inefficiencies, including lock contention, are invisible.

Can you explain this? Why does global optimization matter here? According to http://msdn.microsoft.com/en-us/library/0zza0de8(v=vs.71).aspx, whole program optimization allows Visual Studio to (1) optimize the use of registers across function boundaries and (2) inline a function in one module even when the function is defined in another module. My understanding is that this reduces the cost of function calls, and the inlining may allow the compiler to reorder instructions more optimally. How does this relate to lock contention? Can you clarify?
Has anyone done hotspot analysis of chromium to prioritize our
performance optimization discussions? All this hypothetical talk is
making my head hurt :)
In general though, I thought Chromium prioritizes safe coding
practices over performance for the most part. (Isn't that the point of
the RefCounted base class? There are faster and less safe ways of
managing object lifetimes...)
On Sat, Mar 19, 2011 at 2:58 PM, Darin Fisher <da...@chromium.org> wrote:

This echoes my concerns exactly. Sadly, I don't know how to quantify the significance of using non-threadsafe RefCounts when warranted.

Just to double check, does your concern being echoed change your stance that you were mostly convinced by my previous arguments? :P
So, I ran a Mac Chromium instance where I recorded the number of RefCounted (not the threadsafe one) AddRef()/Release() calls in the browser process while I tried some standard UI operations (session restore of 5 tabs, open tab, browse to a page, close tabs, quit). In 10 seconds, I counted ~1800 calls (~900 pairs).
Now, if you look at http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/Multithreading/ThreadSafety/ThreadSafety.html, for an Intel-based iMac with a 2 GHz Core Duo processor and 1 GB of RAM running Mac OS X v10.5, an uncontested atomic compare-and-swap costs approximately 0.05 microseconds. (These *should* all be uncontested; otherwise we've got a bug and we'd better use threadsafe refcounting! Minor caveat about two addresses sharing the same cache line, which is somewhat unlikely given that our objects inherit from the refcounted base class.) So, 1800 * 0.05us = 90us. Over the 10 seconds, we added 90 microseconds of overhead, for a grand total of 0.001%? Now, maybe people question the accuracy of the Mac numbers and my unscientific test numbers, but do we really think they're 3 orders of magnitude off?

Another bogeyman I've heard people mention before is causing jank in UI operations. Now, I'm going to be generous and assume human perception can notice 1ms. To achieve 1ms of delay, we'd have to have on the order of 20000 atomic operations (10000 AddRef/Release pairs). Given that in my admittedly totally unscientific test I spent 10 seconds using the browser and accumulated 1800 atomic operations, I am pretty skeptical this bogeyman exists.

What I would do if I were at my Linux desktop and could run google-perftools would be to get a profile of a browser process and show everyone how amazingly little time is spent in refcounting. I'll grab a profile when I get back to SF in a week and post it. What performance-intensive, browser-process-only operation are we worried about? (I assume we mostly trust perf cyclers for renderer stuff, and have little reason to suspect that a Chromium refcounting change's effect on the renderer process would not show up in the perf cyclers.) History? Safe browsing? I dunno. But whatever one you come up with, do you think that changing RefCountedBase would seriously affect it? Do you think that said browser-process operation is using more than tens of thousands of AddRef/Release pairs per second?

PS: I forgot that we use a memory barrier in our atomic refcount decrement operation. Let's round up to the time for acquiring an uncontested mutex according to the Mac doc. That's 0.2us instead of 0.05us, so a factor of 4. So, I may be further off by a factor of 4 in the above calculations. Sorry. I don't think it makes much of a difference though.

PPS: Lest anyone forget, all the arguments to the contrary have been about imaginary/hypothetical performance costs. Refcounting data race bugs crop up in reality (refer to my original post).
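If anyone wants to sanity-check the per-operation cost on their own hardware (e.g. something CR-48-class), a crude microbenchmark along these lines would do. Illustrative only; the absolute numbers will vary a lot by CPU:

// Crude microbenchmark sketch: per-op cost of a plain vs. an atomic increment.
// Build with optimizations and run on the hardware you actually care about.
#include <atomic>
#include <chrono>
#include <cstdio>

int main() {
  const int kIters = 10000000;

  volatile int plain = 0;
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < kIters; ++i)
    plain = plain + 1;  // volatile keeps the loop from being optimized away
  auto t1 = std::chrono::steady_clock::now();

  std::atomic<int> atomic_count(0);
  for (int i = 0; i < kIters; ++i)
    atomic_count.fetch_add(1, std::memory_order_relaxed);
  auto t2 = std::chrono::steady_clock::now();

  auto ns_per_op = [&](std::chrono::steady_clock::time_point a,
                       std::chrono::steady_clock::time_point b) {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(b - a).count() /
           static_cast<double>(kIters);
  };
  std::printf("plain increment : %.2f ns/op\n", ns_per_op(t0, t1));
  std::printf("atomic fetch_add: %.2f ns/op\n", ns_per_op(t1, t2));
  return 0;
}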
It occurred to me that there is one other interesting consequence of making RefCounted threadsafe. Or, rather I should say that threadsafe RefCounted implies that the destructor of the reference counted object is safe to call on any thread. This is not true for many classes. It seems like if you avoid a data race problem by making RefCounted threadsafe that in many cases you won't actually help anything because you'll still have problems if the destructor is run on the wrong thread.
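One common mitigation for the destructor constraint is the Traits parameter of base::RefCountedThreadSafe<T, Traits>, whose Destruct() hook can bounce the delete back to the owning thread. A rough sketch (DeleteOnOwnerThread, HistoryEntry, and owner_loop() are hypothetical names, just to show the shape):

// Rough sketch only. The Traits mechanism exists on RefCountedThreadSafe, but
// the names below are illustrative, not actual Chromium classes.
struct DeleteOnOwnerThread {
  template <typename T>
  static void Destruct(const T* object) {
    if (object->owner_loop()->BelongsToCurrentThread()) {
      delete object;
    } else {
      // The final Release() happened on some other thread; post the actual
      // destruction back to the thread that owns the object's state.
      object->owner_loop()->DeleteSoon(FROM_HERE, object);
    }
  }
};

class HistoryEntry
    : public base::RefCountedThreadSafe<HistoryEntry, DeleteOnOwnerThread> {
  // Members that must only be touched -- and destroyed -- on the owner thread.
};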
On Sun, Mar 20, 2011 at 10:02 PM, William Chan (陈智昌) <will...@chromium.org> wrote:

On Sat, Mar 19, 2011 at 2:58 PM, Darin Fisher <da...@chromium.org> wrote:

This echoes my concerns exactly. Sadly, I don't know how to quantify the significance of using non-threadsafe RefCounts when warranted.

Just to double check, does your concern being echoed change your stance that you were mostly convinced by my previous arguments? :P

Nope.

So, I ran a Mac Chromium instance where I recorded the number of RefCounted (not the threadsafe one) AddRef()/Release() calls in the browser process while I tried some standard UI operations (session restore of 5 tabs, open tab, browse to a page, close tabs, quit). In 10 seconds, I counted ~1800 calls (~900 pairs).

~900 pairs over 10 seconds seems rather insignificant. This test doesn't seem to be sensitive to AddRef/Release cost. Maybe there are other types of operations that could be?
Now, if you look at http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/Multithreading/ThreadSafety/ThreadSafety.html, for an Intel-based iMac with a 2 GHz Core Duo processor and 1 GB of RAM running Mac OS X v10.5, an uncontested atomic compare-and-swap costs approximately 0.05 microseconds. (These *should* all be uncontested; otherwise we've got a bug and we'd better use threadsafe refcounting! Minor caveat about two addresses sharing the same cache line, which is somewhat unlikely given that our objects inherit from the refcounted base class.) So, 1800 * 0.05us = 90us. Over the 10 seconds, we added 90 microseconds of overhead, for a grand total of 0.001%? Now, maybe people question the accuracy of the Mac numbers and my unscientific test numbers, but do we really think they're 3 orders of magnitude off?

Another bogeyman I've heard people mention before is causing jank in UI operations. Now, I'm going to be generous and assume human perception can notice 1ms. To achieve 1ms of delay, we'd have to have on the order of 20000 atomic operations (10000 AddRef/Release pairs). Given that in my admittedly totally unscientific test I spent 10 seconds using the browser and accumulated 1800 atomic operations, I am pretty skeptical this bogeyman exists.

What I would do if I were at my Linux desktop and could run google-perftools would be to get a profile of a browser process and show everyone how amazingly little time is spent in refcounting. I'll grab a profile when I get back to SF in a week and post it. What performance-intensive, browser-process-only operation are we worried about? (I assume we mostly trust perf cyclers for renderer stuff, and have little reason to suspect that a Chromium refcounting change's effect on the renderer process would not show up in the perf cyclers.) History? Safe browsing? I dunno. But whatever one you come up with, do you think that changing RefCountedBase would seriously affect it? Do you think that said browser-process operation is using more than tens of thousands of AddRef/Release pairs per second?

PS: I forgot that we use a memory barrier in our atomic refcount decrement operation. Let's round up to the time for acquiring an uncontested mutex according to the Mac doc. That's 0.2us instead of 0.05us, so a factor of 4. So, I may be further off by a factor of 4 in the above calculations. Sorry. I don't think it makes much of a difference though.

PPS: Lest anyone forget, all the arguments to the contrary have been about imaginary/hypothetical performance costs. Refcounting data race bugs crop up in reality (refer to my original post).

It occurred to me that there is one other interesting consequence of making RefCounted threadsafe. Or, rather I should say that threadsafe RefCounted implies that the destructor of the reference counted object is safe to call on any thread. This is not true for many classes. It seems like if you avoid a data race problem by making RefCounted threadsafe that in many cases you won't actually help anything because you'll still have problems if the destructor is run on the wrong thread.

Stepping back, maybe we should be looking into ways to make more of our classes NonThreadSafe by default, and have people opt in to using a class instance across threads? Maybe we could make RefCounted extend NonThreadSafe, and then change any users that violate that to use RefCountedThreadSafe instead?
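A rough sketch of what that could look like (this is just the proposal, not landed code; base::NonThreadSafe's checks compile away in release builds):

// Sketch of the proposal, not landed code: keep the cheap non-atomic count but
// DCHECK (debug-only) that every AddRef/Release happens on the thread that
// first used the object.
#include "base/logging.h"
#include "base/threading/non_thread_safe.h"

class CheckedRefCountedBase : public base::NonThreadSafe {
 public:
  void AddRef() const {
    DCHECK(CalledOnValidThread());
    ++ref_count_;
  }
  bool Release() const {
    DCHECK(CalledOnValidThread());
    return --ref_count_ == 0;  // true => the caller deletes the object
  }
 private:
  mutable int ref_count_ = 0;
};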
I'm saying that RefCounted data race bugs are just the canary in the coal mine. Papering over that canary by switching RefCounted to be threadsafe still leaves a set of impending problems.