SECURITY_DCHECK to SECURITY_CHECK in casting.h ?


Daniel Vogelheim

unread,
Feb 15, 2023, 7:58:08 AM2/15/23
to Kent Tamura, Nico Weber, Kentaro Hara, platform-architecture-dev, Alex Gough
Hello, friends & owners of platform/wtf/casting.h,

I help out in Chrome's VRP program, and we noticed a steady stream of memory corruption bugs that trip over the SECURITY_DCHECKs in casting.h (e.g. in ASAN builds). In release builds, execution continues since these are DCHECKs, which makes many of these bugs exploitable.

In 2022 alone, the VRP paid out $62.5k in rewards for 16 memory corruption bugs that the checks in casting.h would have caught. There are already 3 fresh bugs for 2023.

I wonder why these are DCHECKs in the first place, and whether they could be turned into proper CHECKs. If these were to crash the tab in release builds, the issues would become unexploitable, or certainly much more difficult to exploit. The issues would also gain more visibility in crash reports, making it more likely that the underlying root cause gets fixed. The default reason for DCHECK vs. CHECK seems to be performance concerns; a trial CL shows a low but measurable effect.
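For illustration, here is a rough sketch of what the helper in question does (simplified; the real wtf/casting.h signatures and traits differ) and what the upgrade would change:

  // Minimal sketch, not the actual wtf/casting.h implementation.
  template <typename Derived, typename Base>
  Derived& To(Base& from) {
    // Today: SECURITY_DCHECK(IsA<Derived>(from)); -- compiled out in release
    // builds, so a bad cast silently proceeds and the resulting type confusion
    // is often exploitable.
    // Proposed: SECURITY_CHECK(IsA<Derived>(from)); -- always compiled in, so a
    // bad cast crashes the renderer instead of continuing with a mistyped pointer.
    SECURITY_CHECK(IsA<Derived>(from));
    return static_cast<Derived&>(from);
  }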

I am not sure how to weigh this. Would it make sense to upgrade the two checks in casting.h to SECURITY_CHECKs? 

Sincerely,
Daniel



dan...@chromium.org

unread,
Feb 15, 2023, 9:12:11 AM2/15/23
to Daniel Vogelheim, Peter Boström, Kent Tamura, Nico Weber, Kentaro Hara, platform-architecture-dev, Alex Gough
+Peter Boström

In general we are shifting our take on DCHECKs and leaning toward CHECKs, though a wholesale transition without care risks stability problems. That doesn't mean we should not convert DCHECKs to CHECKs, but that we want to put effort into finding the mistakes they would catch before they catch them in stable.

Some very good context here: https://www.youtube.com/watch?v=YDCNICYiq08

So, yes, it would make sense to make these CHECKs. It would be worth attempting to flush out bugs and fix them first with DumpWithoutCrashing. You can sync up with pbos@ on how to go about this cleanly.
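For concreteness, a minimal sketch of what such a flush-out variant could look like; the helper name here is hypothetical and the actual rollout mechanics would be up to pbos@:

  #include "base/debug/dump_without_crashing.h"

  // Hypothetical interim helper: report bad casts via crash dumps without
  // killing the renderer, so existing bugs can be found and fixed before the
  // assertion is hardened into a CHECK on stable.
  template <typename Derived, typename Base>
  Derived& ToWithDumpOnBadCast(Base& from) {
    if (!IsA<Derived>(from)) {
      base::debug::DumpWithoutCrashing();  // uploads a report; execution continues
    }
    return static_cast<Derived&>(from);
  }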



Peter Boström

unread,
Feb 15, 2023, 9:23:42 AM2/15/23
to dan...@chromium.org, Daniel Vogelheim, Kent Tamura, Nico Weber, Kentaro Hara, platform-architecture-dev, Alex Gough
It's hard to tell without more Blink expertise, but it looks to me like that code may be a hotspot, and if so, checking IsA<T> may be a performance issue, as it doesn't look trivially cheap.

I don't think I've ever seen that ::To signature show up in our DCHECK builds, so I would assume it's fairly rare and that the slower DumpWithoutCrashing rollout would be unnecessary. If I'm wrong, then (hand-wavily) I'd prefer a crash over the UB that's just about to happen.

If you can show that we can take the performance hit or that this doesn't move the benchmarks, then I think you should just replace this with a SECURITY_CHECK without the flush-out attempt.

Peter Boström

unread,
Feb 15, 2023, 9:32:01 AM2/15/23
to dan...@chromium.org, Daniel Vogelheim, Kent Tamura, Nico Weber, Kentaro Hara, platform-architecture-dev, Alex Gough
Sorry, I only read Dana's email, not your original one. I'd run a Speedometer job against the M1 bots as well, with a ~100 repeat count. That CL shows a 1.2% increase in Speedometer runtime, which, if it holds over more repeats, is seen as significant. I've had CLs reverted in that range.

"The regression was ~1% (maybe a bit less) on the overall score of Speedometer2. That's pretty substantial and usually takes up many person weeks to recover"

I'm not sure about the security gains here. This also rules out rolling this out with base::debug::DumpWithoutCrashing(), which has at least comparable performance overhead to a CHECK.

dan...@chromium.org

unread,
Feb 15, 2023, 9:43:16 AM2/15/23
to Peter Boström, Daniel Vogelheim, Kent Tamura, Nico Weber, Kentaro Hara, platform-architecture-dev, Alex Gough
If the performance is an issue we can consider introducing two paths.

A commonly used path that does CHECK, and a clearly marked dangerous path that does DCHECK and is only used in places that are extremely hot. We should then make an effort to encapsulate the pointers that go through the DCHECK path such that it's possible to actually reason through their types.
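A rough sketch of that split (names are placeholders here; UnsafeTo/ToUnsafe naming comes up later in this thread), assuming the current check shape:

  // Common path: always-on check, crashes on a bad cast.
  template <typename Derived, typename Base>
  Derived& To(Base& from) {
    SECURITY_CHECK(IsA<Derived>(from));
    return static_cast<Derived&>(from);
  }

  // Clearly-marked dangerous path: debug-only check, reserved for audited,
  // extremely hot call sites whose pointers are encapsulated so their types
  // can still be reasoned about.
  template <typename Derived, typename Base>
  Derived& UnsafeTo(Base& from) {
    SECURITY_DCHECK(IsA<Derived>(from));
    return static_cast<Derived&>(from);
  }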

Daniel Vogelheim

unread,
Feb 15, 2023, 9:46:02 AM2/15/23
to Peter Boström, dan...@chromium.org, Kent Tamura, Nico Weber, Kentaro Hara, platform-architecture-dev, Alex Gough
On Wed, Feb 15, 2023 at 3:31 PM Peter Boström <pb...@chromium.org> wrote:
Sorry I only read Dana's email not your original email. I'd run a Speedometer job against the M1 bots as well, with ~100 repeat count.

In progress. I'll report back when I have results.

Peter Boström

unread,
Feb 15, 2023, 9:49:31 AM2/15/23
to dan...@chromium.org, Daniel Vogelheim, Kent Tamura, Nico Weber, Kentaro Hara, platform-architecture-dev, Alex Gough
That seems reasonable to me (and way better than Just Giving Up). ::ToUnsafe<T> would not be bad naming if you need ideas.

Daniel Vogelheim

unread,
Feb 15, 2023, 1:02:03 PM2/15/23
to Peter Boström, dan...@chromium.org, Kent Tamura, Nico Weber, Kentaro Hara, platform-architecture-dev, Alex Gough

The perf delta is uneven across platforms: M1: ~0.3%; x64/Win10: ~1.5%

Peter Boström

unread,
Feb 15, 2023, 1:08:46 PM2/15/23
to Daniel Vogelheim, dan...@chromium.org, Kent Tamura, Nico Weber, Kentaro Hara, platform-architecture-dev, Alex Gough
I'd say the ~1.5% regression is probably significant enough that you want to figure out what's up and see if you can have an unsafe version of ::To that only DCHECKs (if that regression is attributable to only a few call sites that can be well audited). 0.3% is probably "not nothing" either.

Dave Tapuska

unread,
Feb 15, 2023, 1:16:44 PM2/15/23
to Peter Boström, Daniel Vogelheim, dan...@chromium.org, Kent Tamura, Nico Weber, Kentaro Hara, platform-architecture-dev, Alex Gough
I know this is a public forum, so I'm just discussing in terms of percentages. From my understanding, 84% of the bugs were found via clusterfuzz, weren't they? Clusterfuzz already runs with the DCHECK on, so enabling this wouldn't change that source of bugs, leaving roughly ~11% of real bugs coming from external reports.

dave.


dan...@chromium.org

unread,
Feb 15, 2023, 1:30:19 PM2/15/23
to Dave Tapuska, Peter Boström, Daniel Vogelheim, Kent Tamura, Nico Weber, Kentaro Hara, platform-architecture-dev, Alex Gough
On Wed, Feb 15, 2023 at 1:16 PM Dave Tapuska <dtap...@chromium.org> wrote:
I know this is a public forum, so I'm just discussing in terms of percentages. From my understanding 84% of the bugs were found via clusterfuzz weren't they? So that already has the DCHECK on and enabling this wouldn't change that source of bugs. Leaving roughly ~11% of real bugs coming from external reports. 

I think the real problem is the rest of the bugs that are not represented in either number, since they are not known but are being shipped in Chromium. The high quantity here suggests there are many others that aren't known.

Alex Gough

unread,
Feb 15, 2023, 1:56:36 PM2/15/23
to dan...@chromium.org, Dave Tapuska, Peter Boström, Daniel Vogelheim, Kent Tamura, Nico Weber, Kentaro Hara, platform-architecture-dev
I'm going to push back a bit here on labelling clusterfuzz bugs as less real: those are exactly the ones that well-resourced adversaries can also find by fuzzing.

Thanks, Daniel, for giving this a go. 0.5-1.0% is exactly in the difficult area of "is the security worth the perf", but getting that down a bit should be enough to land some hardening.

Alex

Daniel Cheng

unread,
Feb 15, 2023, 5:19:49 PM2/15/23
to Alex Gough, dan...@chromium.org, Dave Tapuska, Peter Boström, Daniel Vogelheim, Kent Tamura, Nico Weber, Kentaro Hara, platform-architecture-dev
To give some context here: I've long wanted to change these to be CHECK instead of DCHECK. The main reason they haven't been changed is the potential perf regressions (as noted in this thread).

Ideally, we would profile and see where the hotspots are and relax (hopefully) just a few places to use UnsafeTo<> and then allow To<> to safely CHECK() by default.

Daniel

Dave Tapuska

unread,
Feb 15, 2023, 5:29:12 PM2/15/23
to Daniel Cheng, Alex Gough, dan...@chromium.org, Peter Boström, Daniel Vogelheim, Kent Tamura, Nico Weber, Kentaro Hara, platform-architecture-dev
We should likely change the DynamicTo to use UnsafeTo to avoid two virtual calls to IsA<...> in any prototype. 
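Roughly, as a sketch (assuming the usual shape of DynamicTo; not the exact Blink code):

  // Written in terms of a checked To<T>, DynamicTo<T> would run the IsA<T>
  // predicate twice; routing the cast through UnsafeTo<T> (or a plain
  // static_cast) keeps it to a single IsA<T> call.
  template <typename Derived, typename Base>
  Derived* DynamicTo(Base* from) {
    if (!from || !IsA<Derived>(*from))
      return nullptr;
    return &UnsafeTo<Derived>(*from);  // already verified above; skip the recheck
  }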

I wasn't saying clusterfuzz bugs weren't real; I was just saying that changing this wouldn't impact that source of bugs, since DCHECKs were already on there. But I do agree with Dana that there could be a source of undiscovered bugs that these could catch.

dave.

Kentaro Hara

unread,
Feb 15, 2023, 6:46:37 PM2/15/23
to Dave Tapuska, Daniel Cheng, Alex Gough, dan...@chromium.org, Peter Boström, Daniel Vogelheim, Kent Tamura, Nico Weber, platform-architecture-dev
My bet:

Replace DCHECK with CHECK in all directories other than core/{dom,css,style,html}, and measure performance :D


--
Kentaro Hara, Tokyo

Daniel Vogelheim

unread,
Feb 16, 2023, 7:44:04 AM2/16/23
to Kentaro Hara, Dave Tapuska, Daniel Cheng, Alex Gough, dan...@chromium.org, Peter Boström, Kent Tamura, Nico Weber, platform-architecture-dev
Hi all,

I was going to take Kentaro's bet, but then noticed that the two To<.> variants in question have around 300 usages each, and that very many of them are in generated code. So my first try is allowing unsafe use from code generators only. (Patch set 2 of 4249403.) I'll report back when I have results.

In the meantime: Would anyone have an explanation why the performance penalty would be greater on one platform vs another? If I take the pinpoint numbers at face value, then the checks are 5x more expensive on x64/Win vs M1/Mac (+1.5% vs +0.3% increase; relative to the base performance). I find that rather strange.

Daniel

Daniel Cheng

unread,
Feb 16, 2023, 8:15:14 AM2/16/23
to Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, dan...@chromium.org, Peter Boström, Kent Tamura, Nico Weber, platform-architecture-dev
On Thu, 16 Feb 2023 at 04:44, Daniel Vogelheim <voge...@google.com> wrote:
Hi all,

I was going to take Kentaro's bet, but then noticed that the two To<.> variants in question have around 300 usages each, but that very many of them are in generated code. So my first try is allowing unsafe use from code generators only. (Patch set 2 of 4249403.) I'll report back when I have results.

This sounds like a reasonable approach to me. I would actually feel bad about excluding so much non-generated code otherwise, because I'd wonder how many type confusion bugs we've had in those directories given their heavy use of our manual RTTI hierarchy.
 

In the meantime: Would anyone have an explanation why the performance penalty would be greater on one platform vs another? If I take the pinpoint numbers at face value, then the checks are 5x more expensive on x64/Win vs M1/Mac (+1.5% vs +0.3% increase; relative to the base performance). I find that rather strange.

I don't know exactly the underlying reasons, but others have made observations along similar lines: when running Meet and sharing a window, benchmarks on non-M1 platforms show a fairly sizable drop in score, while an M1 Pro was apparently minimally affected.

K. Moon

unread,
Feb 16, 2023, 11:44:44 AM2/16/23
to Daniel Cheng, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, dan...@chromium.org, Peter Boström, Kent Tamura, Nico Weber, platform-architecture-dev
One guess: M1 has a pretty highly optimized memory system. One source claims this is due to a high level of parallelism in issuing memory accesses:
https://lemire.me/blog/2021/01/06/memory-access-on-the-apple-m1-processor/

Daniel Vogelheim

unread,
Feb 16, 2023, 1:32:12 PM2/16/23
to Kentaro Hara, Dave Tapuska, Daniel Cheng, Alex Gough, dan...@chromium.org, Peter Boström, Kent Tamura, Nico Weber, platform-architecture-dev
On Thu, Feb 16, 2023 at 1:43 PM Daniel Vogelheim <voge...@google.com> wrote:
I was going to take Kentaro's bet, but then noticed that the two To<.> variants in question have around 300 usages each, but that very many of them are in generated code. So my first try is allowing unsafe use from code generators only. (Patch set 2 of 4249403.) I'll report back when I have results.


So about 1/3 of the cost was generated code.

I found some more generated usages and am running those, too, but I'm not expecting those to move the needle by a lot.

Michael Lippautz

unread,
Feb 20, 2023, 5:00:47 AM2/20/23
to Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Daniel Cheng, Alex Gough, dan...@chromium.org, Peter Boström, Kent Tamura, Nico Weber, platform-architecture-dev
Thanks for checking performance on these changes upfront. 

You can use crossbench to find the hot paths that are affected by e.g. Speedometer2. The tool can spit out a pprof profile, similar to this one (sorry Google-only), which you can nicely search for bottlenecks. This can give you hints for where to place exceptions for hot paths in the DOM. Just make sure to compile with `symbol_level=2` to have symbols for inlined frames.

A regression for more usage of `IsA<>()` is not surprising to me as these already show up in the profile today as something that we spend significant time in.

-Michael 

Daniel Vogelheim

unread,
Feb 21, 2023, 7:46:58 AM2/21/23
to Michael Lippautz, Kentaro Hara, Dave Tapuska, Daniel Cheng, Alex Gough, dan...@chromium.org, Peter Boström, Kent Tamura, Nico Weber, platform-architecture-dev
On Mon, Feb 20, 2023 at 11:00 AM Michael Lippautz <mlip...@chromium.org> wrote:
Thanks for checking performance on these changes upfront. 

You can use crossbench to find the hot paths that are affected by e.g. Speedometer2. The tool can spit out a pprof profile, similar to this one (sorry Google-only), which you can nicely search for bottlenecks. This can give you hints for where to place exceptions for hot paths in the DOM. Just make sure to compile with `symbol_level=2` to have symbols for inlined frames.

A regression for more usage of `IsA<>()` is not surprising to me as these already show up in the profile today as something that we spend significant time in.

Thanks, I didn't know about crossbench. 
A cursory look at the profile suggests there's a handful of large users. I tried excluding those from the checks, but that hasn't moved the needle. (Still at ~+1% for x64/Win.)

I think I've about exhausted the easy options. The initial VRP analysis suggests to me that solving this would be worth several SWE-months. I'll bring this up with my manager (currently ooo) to see whether my team has an interest in taking this on.

Will Harris

unread,
May 23, 2023, 5:17:56 PM5/23/23
to platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Daniel Cheng, Alex Gough, danakj, Peter Boström, Kent Tamura, Nico Weber, platform-architecture-dev, Michael Lippautz
This same SECURITY_DCHECK would have caught another security bug; see https://bugs.chromium.org/p/chromium/issues/detail?id=1448032

Given we've made quite a few improvements in perf through other projects, can we "spend" some of these "perfcoins" and change the occurrence in casting.h to a CHECK?


I think this would be a meaningful security improvement. Who needs to approve this sort of change?

Will 

On Tuesday, February 21, 2023 at 4:46:58 AM UTC-8 Daniel Vogelheim wrote:
On Mon, Feb 20, 2023 at 11:00 AM Michael Lippautz <mlip...@chromium.org> wrote:
Thanks for checking performance on these changes upfront. 

You can use crossbench to find the hot paths that are affected by e.g. Speedometer2. The tool can spit out a pprof profile, similar to this one (sorry Google-only), which you can nicely search for bottlenecks. This can give you hints for where to place exceptions for hot paths in the DOM. Just make sure to compile with `symbol_level=2` to have symbols for inlined frames.

A regression for more usage of `IsA<>()` is not surprising to me as these already show up in the profile today as something that we spend significant time in.

Thanks, I didn't know about crossbench. 
A cursory look at the profile suggests there's a handful of large users. I tried excluding those from the checks, but that hasn't moved the needle. (Still at ~+1% for x64/Win.)

I think I've about exhausted the easy options. The initial VRP analysis suggests to me that solving this would be worth several SWE-months. I'll bring this up with my manager (currently ooo) to see whether my team has an interest in taking this on.

Daniel



dan...@chromium.org

unread,
May 24, 2023, 11:41:22 AM5/24/23
to Will Harris, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Daniel Cheng, Alex Gough, Peter Boström, Kent Tamura, Nico Weber, Michael Lippautz
On Tue, May 23, 2023 at 5:18 PM Will Harris <w...@chromium.org> wrote:
this same SECURITY_DCHECK would have caught another security bug see https://bugs.chromium.org/p/chromium/issues/detail?id=1448032

Given we've made quite a few improvements in perf through other projects, can we "spend" some of these "perfcoins" and change the occurrence in casting.h to a CHECK?


I think this would be a meaningful security improvement. Who needs to approve this sort of change?

There is no one who can approve it, which is a problem.

We keep having security bugs so we should feel empowered to do this IMO.

Preventing security bugs is of huge importance to our product; for each bug that happens, we should determine a way to prevent that same bug from ever happening again in another place. This is our way to do it for these casts, and we don't know of any better way.
 


Peter Boström

unread,
May 24, 2023, 12:19:31 PM5/24/23
to dan...@chromium.org, Will Harris, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Daniel Cheng, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz
This seems to me like either an ATL question, or something the ATLs should be able to help with by providing an approval path.

Will Harris

unread,
May 24, 2023, 1:18:06 PM5/24/23
to platform-architecture-dev, Peter Boström, Will Harris, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Daniel Cheng, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, danakj
Okay, that seems reasonable. I put a CL up for review and sent it to rbyers@, who seems like the ATL most responsible for the web platform.

Thanks,

Will


Rick Byers

unread,
May 24, 2023, 2:03:31 PM5/24/23
to platform-architecture-dev, Will Harris, Peter Boström, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Daniel Cheng, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, danakj, Scott Violet, chrome-atls-discuss, Mike Wittman
Hey folks,
I agree with the principle that we should be willing to trade some perf for concrete security wins (as we obviously do with sandboxing etc.). I find the security argument here pretty compelling given the VRP payouts and I like that metric as a concrete way to get a rough measure of security value to be traded off against other sorts of value.

But to do that tradeoff properly, I think we need to dive a little deeper into the perf costs. +Scott has asked for updated pinpoint jobs. Scott, once we have that, can you own putting the perf costs in perspective (e.g. relative to the benchmark gains y'all have gotten)? I wonder if we also need to do some analysis of in-the-wild perf costs too; e.g. if this meaningfully regresses LCP for a non-trivial fraction of the web, that would be a tough cost to eat as well. +wittman for any thoughts. Any chance we could Finch this, even though that would also add a bunch of overhead? Or is a binary A/B test the only practical option?

Also do we know if anyone has gone searching for micro-optimization opportunities in the IsA path yet? Best case we find some way to speed that up which more than offsets the cost here.

Rick

Daniel Cheng

unread,
May 24, 2023, 3:27:21 PM5/24/23
to Rick Byers, platform-architecture-dev, Will Harris, Peter Boström, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, danakj, Scott Violet, chrome-atls-discuss, Mike Wittman
I implemented these helpers a long time ago to replace the macro versions and never had a chance to circle back to it, but the eventual hope was always to change them to CHECK.

Here's what I had originally imagined:
- we would have a plugin that consistently enforces that people use IsA<T>(), To<T>(), and DynamicTo<T>() for Blink types that have these traits defined, and never allows use of static_cast<T>.
- we would add an UnsafeTo<T> escape hatch for places that really cannot be burdened with the cost of a CHECK
- The default To<T> implementation would switch to CHECK().
- Hot spots based on profiling would revert to UnsafeTo<T>.

IsA<T> largely delegates to the template specializations; we could try to see if there are optimization opportunities, but there might be a lot of noise in the profiling; I guess we would probably just have to focus on the ones that are called a lot.
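For illustration, a self-contained sketch of that delegation with a hypothetical type (the exact Blink trait signatures may differ):

  template <typename T>
  struct DowncastTraits;  // primary template; specialized per type

  // IsA<T> just forwards to the per-type predicate, so the cost (and any
  // optimization opportunity) lives in each AllowFrom() implementation.
  template <typename T, typename U>
  bool IsA(const U& from) {
    return DowncastTraits<T>::AllowFrom(from);
  }

  // Hypothetical hierarchy standing in for a Blink type.
  class Shape {
   public:
    virtual ~Shape() = default;
    virtual bool IsCircle() const { return false; }
  };
  class Circle : public Shape {
   public:
    bool IsCircle() const override { return true; }
  };

  template <>
  struct DowncastTraits<Circle> {
    static bool AllowFrom(const Shape& s) { return s.IsCircle(); }
  };

Profiling would then point at the AllowFrom() specializations that are hit most often.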

Daniel

Daniel Cheng

unread,
May 24, 2023, 3:28:05 PM5/24/23
to Rick Byers, platform-architecture-dev, Will Harris, Peter Boström, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, danakj, Scott Violet, chrome-atls-discuss, Mike Wittman
Also, just fyi, this thread is currently cross-posted between external and internal lists. I'm not sure if there's an external equivalent to the ATL list?

Mike Wittman

unread,
May 24, 2023, 3:38:35 PM5/24/23
to Daniel Cheng, Rick Byers, platform-architecture-dev, Will Harris, Peter Boström, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, danakj, Scott Violet, chrome-atls-discuss
With regard to in-the-wild perf costs: doing a Finch experiment (if possible) or binary A/B and reviewing impacts on guardrail metrics would be the gold standard for understanding impact.

Alternatively, we could temporarily land the change in canary/dev and use Stack Sampled Metrics in-the-wild profiling to understand the low-level impact and address hot spots. If the low-level impact is marginal, or can be made so, it's not likely to significantly affect the top-level metrics.

-Mike

dan...@chromium.org

unread,
May 24, 2023, 3:41:45 PM5/24/23
to Mike Wittman, Daniel Cheng, Rick Byers, platform-architecture-dev, Will Harris, Peter Boström, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, Scott Violet, chrome-atls-discuss
On Wed, May 24, 2023 at 3:38 PM Mike Wittman <wit...@google.com> wrote:
With regard to in-the-wild perf costs: doing a Finch experiment (if possible) or binary A/B and reviewing impacts on guardrail metrics would be the gold standard for understanding impact.

The change here is adding an ([[unlikely]]) branch. A Finch experiment would add another branch. I don't think there's any way to measure this with Finch; the Finch branch itself is already worse than the change being tested.
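To make that concrete, a sketch with a hypothetical feature name (not an actual flag) of why a Finch gate can't cleanly measure the cost of the check:

  #include "base/feature_list.h"

  // Hypothetical feature flag for this experiment.
  BASE_FEATURE(kHardenedBlinkCasts, "HardenedBlinkCasts",
               base::FEATURE_DISABLED_BY_DEFAULT);

  template <typename Derived, typename Base>
  Derived& To(Base& from) {
    // The Finch gate is itself a branch (plus a feature-state lookup) executed
    // on every cast, i.e. at least as much overhead as the [[unlikely]] check
    // it is supposed to measure.
    if (base::FeatureList::IsEnabled(kHardenedBlinkCasts)) {
      SECURITY_CHECK(IsA<Derived>(from));
    }
    return static_cast<Derived&>(from);
  }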

dan...@chromium.org

unread,
May 24, 2023, 3:43:20 PM5/24/23
to Mike Wittman, Daniel Cheng, Rick Byers, platform-architecture-dev, Will Harris, Peter Boström, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, Scott Violet, chrome-atls-discuss
On Wed, May 24, 2023 at 3:38 PM Mike Wittman <wit...@google.com> wrote:
With regard to in-the-wild perf costs: doing a Finch experiment (if possible) or binary A/B and reviewing impacts on guardrail metrics would be the gold standard for understanding impact.

Alternately we could temporarily land the change in canary/dev and use Stack Sampled Metrics in-the-wild profiling to understand low-level impact and address hot spots. If the low-level impact is marginal, or can be made so, it's not likely to significantly impact the top-level metrics.

This, combined with dcheng's UnsafeTo<T> suggestion, would allow cleaning up any hot spots that need it (which is precisely what such an unsafe back door is for, where it can be properly explained and encapsulated).

Charles Harrison

unread,
May 24, 2023, 3:47:47 PM5/24/23
to dan...@chromium.org, Mike Wittman, Daniel Cheng, Rick Byers, platform-architecture-dev, Will Harris, Peter Boström, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, Scott Violet, chrome-atls-discuss
A Finch branch is more overhead than the proposed change, but do we think the results from a Finch measurement would be invalid? If the perf impact of this is too high, we could run the experiment only on early channels.


Will Harris

unread,
May 24, 2023, 4:05:03 PM5/24/23
to Charles Harrison, dan...@chromium.org, Mike Wittman, Daniel Cheng, Rick Byers, platform-architecture-dev, Peter Boström, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, Scott Violet, chrome-atls-discuss
I think to get realistic results here we'd have to run a binary experiment (likely on Dev channel) with a synthetic Finch trial group being reported because, as others have said, the added check for whether the client is in the control or treatment group would upset the data. Running a binary experiment is possible, but it takes more work than just a Finch experiment backed by a Feature flag.

The pinpoint data is in; you can see it on the CL https://chromium-review.googlesource.com/c/chromium/src/+/4563433 if someone can interpret it and decide whether it's good or bad. It seems some results are green and some are red, but I'm not sure which are more important than others...?

Given this is a known source of security bugs, are we able to land something simple now and then recover the perf/size costs in future CLs, e.g. by implementing the solution that dcheng@ expects? As per our guidance, we have to assume that these vulnerabilities are being actively exploited, so I think we should try to land these additional checks as a matter of urgency.

Will

Rick Byers

unread,
May 24, 2023, 4:33:43 PM5/24/23
to Will Harris, Charles Harrison, dan...@chromium.org, Mike Wittman, Daniel Cheng, platform-architecture-dev, Peter Boström, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, Scott Violet, John Abd-El-Malek
[-chrome-atls-discuss@google, +jam, sorry for the cross-post]

On Wed, May 24, 2023 at 4:05 PM Will Harris <w...@chromium.org> wrote:
I think to get realistic results here we'd have to run a binary experiment (likely, on Dev channel) with a synthetic finch trial group being reported, because, as others have said, the added check for whether or not the client is in the control or treatment group would upset the data. Running a binary experiment is possible, but takes more work than just a finch experiment backed by a Feature flag.

The pinpoint data is in, you can see if on the CL https://chromium-review.googlesource.com/c/chromium/src/+/4563433 if someone can interpret it and decide whether or not it's good or bad, it seems some are green and some are red but I'm not sure which are more important than others...?

Given this is a known source of security bugs, are we able to land something simple now, then recover the perf/size costs in future CLs e.g. implement the solution that dcheng@ expects?  As per our guidance, we have to assume that these vulnerabilities are being actively exploited so I think we should try and land these additional checks as a matter of urgency.

Given we have evidence of non-trivial perf regressions, I don't think we can just ram this in and worry about the performance later. Performance and security are both top-line goals for chromium, and just like we often ask performance teams to delay their work for many months while they work with security teams to evaluate and mitigate security risks, I think it's reasonable to expect security teams to work hard on analyzing and mitigating performance costs before landing security improvements. I think this is the path we've followed with the really big security / performance tradeoffs (eg. Oilpan, site isolation), so I don't think smaller cases should be any different. Given this is an important polarity to be managed across our product and teams, it's IMHO unproductive to take a perspective that strictly places one side over the other.

Daniel Cheng

unread,
May 24, 2023, 4:57:00 PM5/24/23
to Rick Byers, Will Harris, Charles Harrison, dan...@chromium.org, Mike Wittman, platform-architecture-dev, Peter Boström, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, Scott Violet, John Abd-El-Malek
Would an approach like this be reasonable?

- Add UnsafeTo<T>.
- Profile jetstream/speedometer/motionmark runs locally (presumably we have some good steps on how to do this?) to find and mitigate hotspots.
- 0 regression is ideal; if we can't achieve that, how do we determine what amount of regression is permissible?
- Possibly: binary experiment (this should, at least, be relatively easy from a configuration perspective; logistically, I've always had the impression that binary experiments are rather painful and very much a manual process. Has this changed?)
- After we actually release a binary, use stack samples to find any remaining hotspots to focus on?

Daniel

dan...@chromium.org

unread,
May 24, 2023, 5:01:12 PM5/24/23
to Rick Byers, Will Harris, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Peter Boström, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, Scott Violet, John Abd-El-Malek
On Wed, May 24, 2023 at 4:27 PM Rick Byers <rby...@chromium.org> wrote:
[-chrome-atls-discuss@google, +jam, sorry for the cross-post]

On Wed, May 24, 2023 at 4:05 PM Will Harris <w...@chromium.org> wrote:
I think to get realistic results here we'd have to run a binary experiment (likely, on Dev channel) with a synthetic finch trial group being reported, because, as others have said, the added check for whether or not the client is in the control or treatment group would upset the data. Running a binary experiment is possible, but takes more work than just a finch experiment backed by a Feature flag.

The pinpoint data is in, you can see if on the CL https://chromium-review.googlesource.com/c/chromium/src/+/4563433 if someone can interpret it and decide whether or not it's good or bad, it seems some are green and some are red but I'm not sure which are more important than others...?

Given this is a known source of security bugs, are we able to land something simple now, then recover the perf/size costs in future CLs e.g. implement the solution that dcheng@ expects?  As per our guidance, we have to assume that these vulnerabilities are being actively exploited so I think we should try and land these additional checks as a matter of urgency.

Given we have evidence of non-trivial perf regressions, I don't think we can just ram this in and worry about the performance later. Performance and security are both top-line goals for chromium, and just like we often ask performance teams to delay their work for many months while they work with security teams to evaluate and mitigate security risks, I think it's reasonable to expect security teams to work hard on analyzing and mitigating performance costs before landing security improvements. I think this is the path we've followed with the really big security / performance tradeoffs (eg. Oilpan, site isolation), so I don't think smaller cases should be any different. Given this is an important polarity to be managed across our product and teams, it's IMHO unproductive to take a perspective that strictly places one side over the other.

I took a look at the benchmarks and I see < 1% regression on all 3, if I am reading this correctly. I believe we set the "revert" threshold, above which we don't accept follow-up work to address performance, at 1%. Am I misreading things? It's quite possible.

If we follow up with stack profiling to understand where the hot paths are in the wild (after PGO is applied) and work to eliminate the overhead there, does that address this need? Is there some risk to landing this < 1% change and following up on the hot paths that makes it too risky a strategy?

Rick Byers

unread,
May 24, 2023, 5:17:07 PM5/24/23
to dan...@chromium.org, Will Harris, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Peter Boström, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, Scott Violet, John Abd-El-Malek
On Wed, May 24, 2023 at 4:57 PM Daniel Cheng <dch...@chromium.org> wrote:
Would an approach like this be reasonable?

- Add UnsafeTo<T>.
- Profile jetstream/speedometer/motionmark runs locally (presumably we have some good steps on how to do this?) to find and mitigate hotspots.
- 0 regression is ideal; if we can't achieve that, how do we determine what amount of regression is permissible?
- Possibly: binary experiment (this should, at least, be relatively easy from a configuration perspective; logically, I've always had the impression that binary experiments are rather painful and very much a manual process. Has this changed?)
- After we actually release a binary, use stack samples to find any remaining hotspots to focus on?

This sounds good to me FWIW. If a binary experiment seems like overkill, I'm also supportive of Mike's idea to just land and look for hotspots with SSM. In general I expect micro perf concerns like this to have a much lower impact on metrics in the wild than on benchmarks. 

Daniel


On Wed, May 24, 2023 at 5:01 PM <dan...@chromium.org> wrote:
On Wed, May 24, 2023 at 4:27 PM Rick Byers <rby...@chromium.org> wrote:
[-chrome-atls-discuss@google, +jam, sorry for the cross-post]

On Wed, May 24, 2023 at 4:05 PM Will Harris <w...@chromium.org> wrote:
I think to get realistic results here we'd have to run a binary experiment (likely, on Dev channel) with a synthetic finch trial group being reported, because, as others have said, the added check for whether or not the client is in the control or treatment group would upset the data. Running a binary experiment is possible, but takes more work than just a finch experiment backed by a Feature flag.

The pinpoint data is in, you can see if on the CL https://chromium-review.googlesource.com/c/chromium/src/+/4563433 if someone can interpret it and decide whether or not it's good or bad, it seems some are green and some are red but I'm not sure which are more important than others...?

Given this is a known source of security bugs, are we able to land something simple now, then recover the perf/size costs in future CLs e.g. implement the solution that dcheng@ expects?  As per our guidance, we have to assume that these vulnerabilities are being actively exploited so I think we should try and land these additional checks as a matter of urgency.

Given we have evidence of non-trivial perf regressions, I don't think we can just ram this in and worry about the performance later. Performance and security are both top-line goals for chromium, and just like we often ask performance teams to delay their work for many months while they work with security teams to evaluate and mitigate security risks, I think it's reasonable to expect security teams to work hard on analyzing and mitigating performance costs before landing security improvements. I think this is the path we've followed with the really big security / performance tradeoffs (eg. Oilpan, site isolation), so I don't think smaller cases should be any different. Given this is an important polarity to be managed across our product and teams, it's IMHO unproductive to take a perspective that strictly places one side over the other.

I took a look at the benchmarks and I see < 1% regression on all 3 if I am reading this correctly. I believe we set a "revert" threshold where we don't accept followup work to address performance at 1%. Am I misreading things? It's quite possible.

I defer to Scott on how to reason about the magnitude of the impact for the benchmarks. I don't interpret his e-mail to quite mean "feel free to knowingly regress benchmark performance as much as you like as long as you keep it to <1% per CL" 😉. Hopefully our strategy here isn't just to do DCHECK->CHECK conversions in small enough batches that each CL flies under the detection radar?

Speaking of which, would it be reasonable to document that CHECKs are supposed to be side-effect free? IMHO we should make it easy to periodically do benchmark runs with a build that disables all CHECKs just to make sure we're aware of the total costs we're paying in aggregate.

If we follow up with stack profiling to understand in the wild (after PGO is applied) where the hot paths are and work to eliminate the overhead there, does that address this need? Is there some risk to landing this < 1% change, and following up on the hot paths that makes it too risky of a strategy?

If Scott is happy with that, then I am too. 

dan...@chromium.org

unread,
May 24, 2023, 5:24:53 PM5/24/23
to Rick Byers, Will Harris, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Peter Boström, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, Scott Violet, John Abd-El-Malek
On Wed, May 24, 2023 at 5:17 PM Rick Byers <rby...@chromium.org> wrote:
On Wed, May 24, 2023 at 4:57 PM Daniel Cheng <dch...@chromium.org> wrote:
Would an approach like this be reasonable?

- Add UnsafeTo<T>.
- Profile jetstream/speedometer/motionmark runs locally (presumably we have some good steps on how to do this?) to find and mitigate hotspots.
- 0 regression is ideal; if we can't achieve that, how do we determine what amount of regression is permissible?
- Possibly: binary experiment (this should, at least, be relatively easy from a configuration perspective; logically, I've always had the impression that binary experiments are rather painful and very much a manual process. Has this changed?)
- After we actually release a binary, use stack samples to find any remaining hotspots to focus on?

This sounds good to me FWIW. If a binary experiment seems like overkill, I'm also supportive of Mike's idea to just land and look for hotspots with SSM. In general I expect micro perf concerns like this to have a much lower impact on metrics in the wild than on benchmarks. 

Binary experiments are very difficult to do, take a lot of eng time, and produce unreliable results (on Android?). They are occasionally useful for things that are extremely risky and can't be reverted, but they should not be something we reach for as a standard practice.
 

Daniel


On Wed, May 24, 2023 at 5:01 PM <dan...@chromium.org> wrote:
On Wed, May 24, 2023 at 4:27 PM Rick Byers <rby...@chromium.org> wrote:
[-chrome-atls-discuss@google, +jam, sorry for the cross-post]

On Wed, May 24, 2023 at 4:05 PM Will Harris <w...@chromium.org> wrote:
I think to get realistic results here we'd have to run a binary experiment (likely, on Dev channel) with a synthetic finch trial group being reported, because, as others have said, the added check for whether or not the client is in the control or treatment group would upset the data. Running a binary experiment is possible, but takes more work than just a finch experiment backed by a Feature flag.

The pinpoint data is in, you can see if on the CL https://chromium-review.googlesource.com/c/chromium/src/+/4563433 if someone can interpret it and decide whether or not it's good or bad, it seems some are green and some are red but I'm not sure which are more important than others...?

Given this is a known source of security bugs, are we able to land something simple now, then recover the perf/size costs in future CLs e.g. implement the solution that dcheng@ expects?  As per our guidance, we have to assume that these vulnerabilities are being actively exploited so I think we should try and land these additional checks as a matter of urgency.

Given we have evidence of non-trivial perf regressions, I don't think we can just ram this in and worry about the performance later. Performance and security are both top-line goals for chromium, and just like we often ask performance teams to delay their work for many months while they work with security teams to evaluate and mitigate security risks, I think it's reasonable to expect security teams to work hard on analyzing and mitigating performance costs before landing security improvements. I think this is the path we've followed with the really big security / performance tradeoffs (eg. Oilpan, site isolation), so I don't think smaller cases should be any different. Given this is an important polarity to be managed across our product and teams, it's IMHO unproductive to take a perspective that strictly places one side over the other.

I took a look at the benchmarks and I see < 1% regression on all 3 if I am reading this correctly. I believe we set a "revert" threshold where we don't accept followup work to address performance at 1%. Am I misreading things? It's quite possible.

I defer to Scott on how to reason about the magnitude of the impact for the benchmarks. I don't interpret his e-mail to quite mean "feel free to knowingly regress benchmark performance as much as you like as long as you keep it to <1% per CL" 😉. Hopefully our strategy here isn't just to do DCHECK->CHECK conversions in small enough batches that each CL flies under the detection radar?

I mean, certainly not :) These "2 DCHECKs" are called a lot, as we can see, and together are a way to close a class of memory safety bugs for Blink. We continually see security bugs that they would have prevented, which implies there are many more security bugs in our code today that they would prevent if we turned them into CHECKs, and they are our only known strategy to do so.

Daniel's idea for a performance back door is important for any mechanism of this nature, so thanks for that Daniel.
 
Speaking of which, would it be reasonable to document that CHECKs are supposed to be side-effect free? IMHO we should make it easy to periodically do benchmark runs with a build that disables all CHECKs just to make sure we're aware of the total costs we're paying in aggregate.

That would be great to add to the documentation if it's not there already. +Peter Boström 

Scott Violet

unread,
May 24, 2023, 6:32:24 PM5/24/23
to Rick Byers, Hannes Payer, dan...@chromium.org, Will Harris, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Peter Boström, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek

Hello,

I feel like I'm late to the party!

I was pointed to Will's CL without knowing the scope of the problems we are trying to prevent. Is there a summary of the problem we are trying to prevent, and of what it would mean if we did not make this change?

As to the performance threshold: it's 1% overall, or 2% on any subtest. Will's pinpoint runs show more than a 2% regression on a couple of subtests. What matters is the PGO bots. It's currently not possible to run a try job that builds a new PGO profile and uses it as part of the try jobs.

After looking at the assembly for a similar change in more detail (specifically Peter's change here), my suspicion is that you will not find a single place that is the culprit, but rather death by a thousand paper cuts: this code is used in many places, and it's the sum total of all of those places getting slightly slower. That Will's patch increases binary size by 85k is a pretty good signal that the patch changes a lot of code.

  -Scott

Peter Boström

unread,
May 24, 2023, 6:52:31 PM5/24/23
to Scott Violet, Rick Byers, Hannes Payer, dan...@chromium.org, Will Harris, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
To specifically address the "no-op CHECKs" mentioned above: we're really not in a place where CHECKs are no-ops, and code readability may also suffer if we need that to hold true (unless we provide a different macro for the side-effectful ones).

Specifically, if you turn CHECKs into no-ops, we don't even compile because of a lot of -Wuninitialized warnings. Trivially, here's an example: https://godbolt.org/z/3rb9x4cMx. If we were to turn that into a "CHECKs need to be no-ops" version, we'd need to turn:

int foo;
CHECK(ParseInt(str, &foo));
into
int foo;
const bool foo_parsed = ParseInt(str, &foo);
CHECK(foo_parsed);

and hope that we don't incorrectly use foo_parsed later. To even get this to compile we'd probably need to go through a few thousand call sites, and we'd have no idea whether that build was correct. I'm not sure this is worth figuring out vs. generally profiling the code for hotspots regardless of whether they're CHECK code or not.

I think forcing CHECKs to be side-effect free would be more interesting if we could actually determine whether they are side-effect free at compile time (and then split out a CHECK_WITH_SIDE_EFFECT that may have side effects). This isn't really possible right now. The closest thing I know of is that __builtin_assume() will warn with -Wassume on a lot of things we consider side-effect free (such as container.find(x) != container.end()). https://godbolt.org/z/GbG1db8x5
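A small sketch of the behaviour being referred to (clang's wording paraphrased):

  #include <set>

  // clang discards (and, with -Wassume, warns about) assumptions whose
  // expressions it cannot prove side-effect free, even ones we'd consider
  // harmless, like a container lookup:
  //   warning: the argument to '__builtin_assume' has side effects that will
  //   be discarded [-Wassume]
  void Example(const std::set<int>& container, int x) {
    __builtin_assume(container.find(x) != container.end());
  }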

In short, I don't know that trying to make CHECKs guaranteed no-ops is worth it unless we can find a way for the compiler to know what side-effect-free means (that we can at least roughly agree with). And if we can't enforce it, we can sometimes write more concise code if we allow side-effectful CHECKs.

Peter

Will Harris

unread,
May 24, 2023, 7:06:25 PM5/24/23
to Scott Violet, Rick Byers, Hannes Payer, dan...@chromium.org, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Peter Boström, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
see inline.

On Wed, May 24, 2023 at 3:32 PM Scott Violet <s...@chromium.org> wrote:

Hello,

I feel like I'm late to the party!

I was pointed to Will's cl without knowing the scope of the problems we are trying to prevent. Is there a summary of the problem we are trying to prevent, and what it would mean if we did not do this change?

I think the original post by Daniel gives an accurate summary: we see an average of ~2 bugs a month found by clusterfuzz and a few a year reported externally, the most recent of which was CVE-2023-1215, released in this update: [$7000][1417176] High CVE-2023-1215: Type Confusion in CSS. Reported by Anonymous on 2023-02-17. Given our knowledge of variants and bug collisions, I think it is reasonable to assume that there are attackers who are using these type confusions to harm our users.


As to the performance threshold. It's 1% overall, or 2% to any subtest. Will's pinpoint runs show more than 2% regression to a couple of subtests. What matters is the pgo bots. It's currently not possible to run a try job that builds a new pgo profile and uses that as part of the try jobs. 

After looking at the assembly for a similar change in more detail (specifically Peter's change here) my suspicion is that you will not find a single place that is the culprit, but rather death by a thousand paper cuts, meaning that this code is used in many places and it's the sum total of all of these places getting slightly slower. That Will's patch increases binary size by 85k is a pretty good signal the patch changes a lot of code.

I'm curious about your view on the proposal from Rick to land something now and then use stack profiling in the wild to determine the impact and iterate on improvements. Also, I should add that if we see regressions we can always revert before the branch point, which is a few weeks away, as we just branched today.

Scott Violet

unread,
May 24, 2023, 7:36:40 PM5/24/23
to Will Harris, Rick Byers, Hannes Payer, dan...@chromium.org, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Peter Boström, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
On Wed, May 24, 2023 at 4:06 PM Will Harris <w...@chromium.org> wrote:
see inline.

On Wed, May 24, 2023 at 3:32 PM Scott Violet <s...@chromium.org> wrote:

Hello,

I feel like I'm late to the party!

I was pointed to Will's cl without knowing the scope of the problems we are trying to prevent. Is there a summary of the problem we are trying to prevent, and what it would mean if we did not do this change?

I think the original post by Daniel gives an accurate summary, we see avg of ~2 bugs a month found by clusterfuzz and a few a year reported externally, of which the most recent was CVE-2023-1215, released in this update. [$7000][1417176] High CVE-2023-1215: Type Confusion in CSS. Reported by Anonymous on 2023-02-17. I think given our knowledge of variants and bug collisions it is reasonable to assume that there are attackers who are using these type confusions to harm our users.

Thanks for the details. I'm still trying to understand how severe this type of exploit is. I would assume the issue is with things we don't catch via clusterfuzz; do I have that right? Can an exploit in the code you are trying to patch be used by itself to do harm, or does it require additional exploits? How severe are two exploits a year of this type? Sorry for all the questions; I'm trying to understand how we weigh the different tradeoffs.
 


As to the performance threshold. It's 1% overall, or 2% to any subtest. Will's pinpoint runs show more than 2% regression to a couple of subtests. What matters is the pgo bots. It's currently not possible to run a try job that builds a new pgo profile and uses that as part of the try jobs. 

After looking at the assembly for a similar change in more detail (specifically Peter's change here) my suspicion is that you will not find a single place that is the culprit, but rather death by a thousand paper cuts, meaning that this code is used in many places and it's the sum total of all of these places getting slightly slower. That Will's patch increases binary size by 85k is a pretty good signal the patch changes a lot of code.

I'm curious for your view on the proposal from Rick to land something now, and then use stack profiling in the wild to determine the impact, or iterate on improvements? Also, I should add that if we see regressions we can always revert before branch point, which is a few weeks away, as we just branched today.

This is a low-level change; I don't think stack profiling will be useful in this situation. Further, we know it impacts the benchmarks.

Have you investigated other avenues to achieving the hardening you desire without impacting performance?

  -Scott

John Abd-El-Malek

unread,
May 25, 2023, 2:32:18 AM5/25/23
to Scott Violet, Will Harris, Rick Byers, Hannes Payer, dan...@chromium.org, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Peter Boström, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz
Just had a chance to read this. My thoughts are aligned with Rick's, and +1 to Daniel's latest email, namely:
- this looks like a great security fix, and has a lot of data from VRP/fuzzing, which is awesome
- it would be helpful to know what the current impact on benchmarks is (PGO builds)
- if it's really small, then it sounds good to just land it
- if it's >1%, then it seems it would justify the cost of checking whether it's a small number of hotspots that can be excluded (e.g. like the libc++ hardening work)
- a binary experiment seems unreasonable for this (they're already noisy/hard enough as is for things that _really absolutely_ need them)

Michael Lippautz

unread,
May 25, 2023, 2:39:43 AM5/25/23
to John Abd-El-Malek, Scott Violet, Will Harris, Rick Byers, Hannes Payer, dan...@chromium.org, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Peter Boström, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz
Semi-related: the bug that triggered this discussion again is known to be in dead code that doesn't seem to have active maintainers or a plan forward. Wouldn't it be simpler to remove such code than to treat issues in it as security issues (why treat them as security issues if the code is not shipped)?

As for profiling: at least for benchmarks, you can relatively easily find out what the hot paths are with flame graphs generated by go/crossbench. The best way to do this is to put the CHECK in another fully inlined helper. With the gn arg `symbol_level=2` on release builds, those will show up in pprof, and you can create a filter or a pivot on them. Then just work left-to-right on the flame graph :)

+1 to Daniel's suggestion of adding an UnsafeTo. I think otherwise people would just move back to static_cast, avoiding even the DCHECKs.

-Michael

Hannes Payer

unread,
May 25, 2023, 3:07:48 AM5/25/23
to Michael Lippautz, John Abd-El-Malek, Scott Violet, Will Harris, Rick Byers, dan...@chromium.org, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Peter Boström, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber
Taking a step back:
Looking through a few issues it looks like fuzzing could work really well to flush out these bugs. And it seems that we don't have the right fuzzers since the bug reporter m.co...@gmail.com is able to find many before us. How about significantly improving fuzzing in this area and avoiding the performance regression?

-Hannes

dan...@chromium.org

unread,
May 25, 2023, 10:10:44 AM5/25/23
to Hannes Payer, Michael Lippautz, John Abd-El-Malek, Scott Violet, Will Harris, Rick Byers, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Peter Boström, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber
Fuzzing helps us determine the shape of bugs we have in our code so that we can then find ways to stop them; it will not close a class of security bugs like this.

There is no way to prevent type confusion bugs in C++ without runtime checks, unfortunately, because the nature of them is such that the compiler does not have more information. We don't use RTTI; we have this lighter-weight system instead. We have to have an understanding that removing the classes of memory-unsafety bugs that are the root cause of 0-days requires some performance tradeoff. The way we manage that polarity is by providing a back door through the memory safety, so that we can scope our (users') risk to explicit places that we can see in the code, instead of the entire codebase, and then do our best to encapsulate and manage that risk there. This gives us the tools to continue building on performance work while managing the security risk.

Scott, the 2 bugs a year reported externally don't really mean anything to adversaries building 0-days; the number isn't helpful for measuring severity. What it means is that we consistently find them, so there are more of these bugs in our code right now. Type confusion bugs are much like UAFs, minus the free-and-reallocate step: they allow treating pointers as data, which then leads to uncontrolled memory access. Memory safety bugs like these are used to compromise the process, then the device, and all of the user's data.

Dave Tapuska

unread,
May 25, 2023, 10:52:09 AM5/25/23
to dan...@chromium.org, Hannes Payer, Michael Lippautz, John Abd-El-Malek, Scott Violet, Will Harris, Rick Byers, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Peter Boström, Daniel Vogelheim, Kentaro Hara, Alex Gough, Kent Tamura, Nico Weber
I put together https://chromium-review.googlesource.com/c/chromium/src/+/4567464, which adds UnsafeTo<T> but leaves the To<T> change separate. I also adjusted a bindings generator script to use UnsafeTo<T>. The bindings script change avoids the check on window property accesses that the security check change would have added. This might help mitigate some of the performance penalties. wfh@, you might try rebasing your change on top of mine and seeing if the outcome is better.
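For readers skimming the thread, a rough sketch (not the exact CL; the bodies are assumed) of what an UnsafeTo<T> escape hatch could look like, mirroring the To<T> overloads in casting.h but keeping only the debug-build check (per later discussion in this thread, it keeps the SECURITY_DCHECK so debug/ASAN builds still flag bad casts while release builds pay nothing):

template <typename Derived, typename Base>
const Derived& UnsafeTo(const Base& from) {
  SECURITY_DCHECK(IsA<Derived>(from));  // Debug-only; no release cost.
  return static_cast<const Derived&>(from);
}

template <typename Derived, typename Base>
Derived& UnsafeTo(Base& from) {
  SECURITY_DCHECK(IsA<Derived>(from));
  return static_cast<Derived&>(from);
}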

dave.

Hannes Payer

unread,
May 25, 2023, 11:32:20 AM5/25/23
to Dave Tapuska, dan...@chromium.org, Michael Lippautz, John Abd-El-Malek, Scott Violet, Will Harris, Rick Byers, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Peter Boström, Daniel Vogelheim, Kentaro Hara, Alex Gough, Kent Tamura, Nico Weber
I 100% agree that there will be more bugs in the future. But if our fuzzing finds them in practice immediately and we stop paying bug bounties then we are ~good. Regardless of this, we should improve our fuzzing in this area.

-Hannes

dan...@chromium.org

unread,
May 25, 2023, 11:48:47 AM5/25/23
to Hannes Payer, Dave Tapuska, Michael Lippautz, John Abd-El-Malek, Scott Violet, Will Harris, Rick Byers, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Peter Boström, Daniel Vogelheim, Kentaro Hara, Alex Gough, Kent Tamura, Nico Weber
On Thu, May 25, 2023 at 11:32 AM Hannes Payer <hpa...@google.com> wrote:
I 100% agree that there will be more bugs in the future. But if our fuzzing finds them in practice immediately and we stop paying bug bounties then we are ~good. Regardless of this, we should improve our fuzzing in this area.

I hear you, that fuzzing is important. To prevent 0-days in C++ by addressing their root cause will require us to close off classes of security bugs, not just to stop paying bounties. The first implies the second but not the inverse. We've seen through time that product areas with excellent fuzzing, when they are targeted, still have 0-days occur through memory safety bugs. The libc++ hardening work and DCHECK canaries have shown that the product has orders of magnitude more execution states than we are able to find in tests or fuzzing, roughly like: tests < fuzzing < canary << stable. Fuzzing is an excellent tool for finding bugs in bug classes we can't or haven't closed yet, but has never been capable of closing bug classes. We increasingly have the tools (raw_ptr/DPD, -Wbounds, safe numerics, To<T>, etc) available to close bug classes if we are willing or able to use them. It's really important that we do so in ways that provide pathways for performance too, through explicit memory-unsafe code paths where that risk can be carefully managed.

Will Harris

unread,
May 25, 2023, 12:02:29 PM5/25/23
to dan...@chromium.org, Hannes Payer, Michael Lippautz, John Abd-El-Malek, Scott Violet, Rick Byers, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Peter Boström, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber
I should just add that the "two a year" are only those bugs that received CVEs, i.e. they were in code that was enabled and actively shipping to users on our Stable channel. In the past year, we've had ~16 rewardable, valid bugs¹ (each reward varies between $5,000 and $9,000, depending on the report quality) that would have been prevented from being exploitable by landing CHECKs in this code. These bugs are in features that are not yet shipping, so we reward them as we might have shipped the code had the bug not been reported.

Will

¹ - I never think counting bugs is a good idea, in general, but I did want to correct my previous statement.

Peter Boström

unread,
May 25, 2023, 2:41:48 PM5/25/23
to Will Harris, dan...@chromium.org, Hannes Payer, Michael Lippautz, John Abd-El-Malek, Scott Violet, Rick Byers, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber
Maybe someone knows how to interpret these better, but I think making some attempt to profile before/after before submitting could be useful:

Applying the diff:

$ git diff main
diff --git a/third_party/blink/renderer/platform/wtf/casting.h b/third_party/blink/renderer/platform/wtf/casting.h
index 5d25937b46a6e..8cb9e0d5bc9a0 100644
--- a/third_party/blink/renderer/platform/wtf/casting.h
+++ b/third_party/blink/renderer/platform/wtf/casting.h
@@ -112,7 +112,7 @@ bool IsA(Base* from) {
 // pointer overloads, returns nullptr if the input pointer is nullptr.
 template <typename Derived, typename Base>
 const Derived& To(const Base& from) {
-  SECURITY_DCHECK(IsA<Derived>(from));
+  CHECK(IsA<Derived>(from));
   return static_cast<const Derived&>(from);
 }
 
@@ -123,7 +123,7 @@ const Derived* To(const Base* from) {
 
 template <typename Derived, typename Base>
 Derived& To(Base& from) {
-  SECURITY_DCHECK(IsA<Derived>(from));
+  CHECK(IsA<Derived>(from));
   return static_cast<Derived&>(from);
 }
 template <typename Derived, typename Base>


Using these gn args:

# Set build arguments here. See `gn help buildargs`.
use_goma = true                                      
dcheck_always_on = false                            
is_chrome_branded = true                            
is_debug = false                                    
is_official_build = true                                      

I ran perf_5.15 record -g out/official/chrome https://browserbench.org/Speedometer2.0/ and immediately hit Run Test. Once that completed I got back to my terminal and hit Ctrl+C to stop perf recording. I don't know if these were the right flags to use; I'm not an expert. Looks like this goes from ~140 to ~135 on my machine, but with a sample count of 1 and other apps possibly running, so take it with a grain of salt.


If these can be used to figure out any new hotspots where we'd want to use UnsafeTo<T> and then try again to make To<T> CHECKy, I think that's worth doing. It's not given that all the performance overhead is equally scattered, so if we're lucky we can pay 10% of the cost for 90% of the security win, or something. If these profiles didn't work I think we should find a way to do them with someone who knows what they're doing to see if we can exclude hotspots and try the M1 perf bots again.

Just throwing them out there, I don't know how to diff them effectively. Maybe someone does, thanks! :)
Peter

K. Moon

unread,
May 25, 2023, 2:59:19 PM5/25/23
to Peter Boström, Will Harris, dan...@chromium.org, Hannes Payer, Michael Lippautz, John Abd-El-Malek, Scott Violet, Rick Byers, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber
Perf has a "diff" subcommand, but I've found interpreting the results to be more art than science.

Michael Lippautz

unread,
May 26, 2023, 4:09:42 AM5/26/23
to K. Moon, Peter Boström, Will Harris, dan...@chromium.org, Hannes Payer, Michael Lippautz, John Abd-El-Malek, Scott Violet, Rick Byers, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber
I would suggest collecting the profiles through go/crossbench to reuse the runners and configurations that others have been working on.

Here's the workflow using the profiles from above:

pprofng supports a diff view where you can just add the profile id. So for the two profiles above that's: https://pprofng.corp.google.com/?id=0ce19da2fdd6fbb65953033f4ef0db6c&pivot=blink::To$&id0=93aed5b3545e0e09b4d13a79b8ccce33&tab=flame

I already added a pivot for "blink::To$".

Baseline has 0 samples as that's just a static_cast. Second profile seems to have 675M samples. That looks like it's something to work with, yay \o/

I like to use the flame graph to navigate around, but the info is also available via top-down (outgoing) + bottom-up (incoming).

Cycle order for bottom up:
193M: CSSProperty::Get is the most expensive incoming caller together with its DowncastTraits::AllowFrom
185M: DynamicTo<>() still calls To<>() so there's 2x IsA<>() in your patch. That's an oversight and for whatever reasons the compiler couldn't optimize away the duplicate all the time. Low-hanging fruit :) (see the sketch below)
121M: LocalDOMWindow::GetFrame

That's already 499M/675M = 73% of samples for the profiles. The next 5 callers are another 100M cycles which gets you another 15%. 

Not arguing which calls are necessary to actually keep for the security argument :) 
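On the DynamicTo<>()-calls-To<>() item above, a rough sketch (the bodies here are assumed; only the names mirror casting.h) of why the patch checks twice, plus one illustrative way to reclaim the duplicate:

// Assumed shape: DynamicTo<>() already establishes IsA<Derived>() before
// casting, so routing the cast through the now-CHECKing To<>() evaluates the
// same predicate a second time.
template <typename Derived, typename Base>
const Derived* DynamicTo(const Base* from) {
  return IsA<Derived>(from) ? To<Derived>(from) : nullptr;
}

// Illustrative fix (hypothetical name): cast directly once the predicate has
// been established on this path.
template <typename Derived, typename Base>
const Derived* DynamicToSingleCheck(const Base* from) {
  return IsA<Derived>(from) ? static_cast<const Derived*>(from) : nullptr;
}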

Does that help?

-Michael


Rick Byers

unread,
May 26, 2023, 11:09:43 AM5/26/23
to Peter Boström, Scott Violet, Hannes Payer, Dana Jansens, Will Harris, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
[Forking this thread for the more general conversation]

Thanks Peter, that makes sense. Then it seems to me like converting DCHECKs to CHECKs throws away an important bit of information - whether the test was side-effect free (and so something we could try turning off in order to evaluate performance). To prevent losing this information, should we perhaps add a simple variant of CHECK to be used when we know it's free from side effects? E.g. CHECKI (idempotent) or CHECKP (pure)?

As much as I absolutely want to harden security, I'm worried about the loss of information about what work could be skipped if necessary for performance. I expect we'll achieve the best balance long-term when we make it as easy as possible to explore the tradeoff space. I think if we had this then folks would be less afraid about bulk DCHECK->CHECK conversions knowing they could easily be converted back if needed. WDYT?

Rick

Dave Tapuska

unread,
May 26, 2023, 11:56:52 AM5/26/23
to Michael Lippautz, K. Moon, Peter Boström, Will Harris, dan...@chromium.org, Hannes Payer, John Abd-El-Malek, Scott Violet, Rick Byers, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Alex Gough, Kent Tamura, Nico Weber
A few of my patches [1, 2, 3, 4] show there are roughly 222M blink::To samples now. See  https://pprofng.corp.google.com/?id=60cb950c88d7af7fe7c39a281372bbe8&pivot=blink::To$

I wasn't able to get go/crossbench to work on my glinux machine. Is there a guide I can follow? It seems to be crashing in the renderer startup script. I just used perf record, and then converted the perf.data to a proto file and uploaded it to pprofng... 

dave.

Peter Boström

unread,
May 26, 2023, 12:03:32 PM5/26/23
to Rick Byers, Scott Violet, Hannes Payer, Dana Jansens, Will Harris, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
These thoughts are fresh and haven't solidified, but I'm wary of having another CHECK macro without compiler enforcement that one is pure and not the other. I suspect we'll end up with more CHECKs than CHECKP/CHECKI just because it's what people are used to typing (and the short one is the default one). We also end up with zero coverage that CHECKP and CHECKI can actually be removed unless we regularly ship builds and run tests without them.

If I could have things just by wishing for them, C++ would have the attribute [[pure]] and it would be widely applied in standard libraries; we'd apply it in Chromium and have a decent shot at this. Presumably we'd also generate better code. Afaik [[gnu::pure]] isn't widely applied (but I could be wrong), and it certainly isn't widely used within Chromium. Then we'd mark the CHECK_WITH_SIDE_EFFECTS() as the non-default one. Maybe we could split existing CHECKs based on if they are [[pure]], rinse and repeat.

The other option I thought of is that we actually want to be able to tell how much our CHECKs cost and look at profiling through that lens. If we could add a pivot on CHECK (that includes the conditional) we could profile better, then we could look at "are these costly CHECKs actually holding their weight". I haven't found a way of doing so. I tried putting the conditional inside a lambda so that I could call it from another stack frame (that could be force inlined and hopefully generate the same code), but that bit me as clang lock annotations don't understand that calling [&]() {foo_ = bar;}(); means that foo_ is accessed only under the lock where the lambda is called (and that the lambda doesn't leak out otherwise). If there's a way to annotate debug code so that CHECKs can be pivoted on in profiling I think that would be great.
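A hedged sketch of that lambda experiment (made-up class and member names; the warning behavior is as reported above, not re-verified here):

#include "base/check.h"
#include "base/synchronization/lock.h"
#include "base/thread_annotations.h"

class Counter {
 public:
  void Set(int bar) {
    base::AutoLock auto_lock(lock_);
    // Plain form: the analysis is satisfied, but the condition has no frame
    // of its own to pivot on in profiles.
    CHECK(bar >= 0);
    foo_ = bar;
    // Lambda form: gives the condition its own (inlineable) frame, but the
    // lock annotations reportedly warn that |foo_| is accessed without
    // holding |lock_|, since the requirement doesn't propagate into the
    // lambda body.
    [&]() { CHECK(foo_ == bar); }();
  }

 private:
  base::Lock lock_;
  int foo_ GUARDED_BY(lock_) = 0;
};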

Hope some of those are conversation starters/continuations. I wish I wrote more of my thinking / experimentation down when I thought about measuring CHECK overhead before. If you have a way to annotate CHECKs as a stackframe or anything else we can filter on in profiling I am really interested. I think that may be more tractable than splitting our existing CHECK inventory as is.
Peter

Peter Boström

unread,
May 26, 2023, 12:52:07 PM5/26/23
to Michael Lippautz, K. Moon, Will Harris, dan...@chromium.org, Hannes Payer, John Abd-El-Malek, Scott Violet, Rick Byers, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber
That's extremely helpful. Thank you.

Scott Violet

unread,
May 26, 2023, 2:41:59 PM5/26/23
to Dave Tapuska, Camillo Bruni, Michael Lippautz, K. Moon, Peter Boström, Will Harris, dan...@chromium.org, Hannes Payer, John Abd-El-Malek, Rick Byers, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Alex Gough, Kent Tamura, Nico Weber
On Fri, May 26, 2023 at 8:56 AM Dave Tapuska <dtap...@chromium.org> wrote:
After a few of my patchs [1, 2, 3, 4] show there are roughly 222M blink:To's now. See  https://pprofng.corp.google.com/?id=60cb950c88d7af7fe7c39a281372bbe8&pivot=blink::To$

I wasn't able to get go/crossbench to work on my glinux machine. Is there a guide I can follow? It seems to be crashing in the renderer startup script. I just used perf record, and then converted the perf.data to a proto file and uploaded it to pprofng... 

The docs on the site for go/crossbench describe the commands to use. I was running it yesterday without problem. The specific command I used was:

./cb.py speedometer_2.1 --iterations=5 --browser=$C2/out/Release/chrome --probe=profiling

To upload perf data with go/crossbench you need to run gcert first. No idea if that's the issue (I believe crossbench warns in this case). Did you get a stack for the crash?

+Camillo Bruni as he is the crossbench author and expert.

  -Scott

Camillo Bruni

unread,
May 30, 2023, 8:37:49 AM5/30/23
to Scott Violet, Dave Tapuska, Michael Lippautz, K. Moon, Peter Boström, Will Harris, dan...@chromium.org, Hannes Payer, John Abd-El-Malek, Rick Byers, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Alex Gough, Kent Tamura, Nico Weber
Sorry for the late reply (OOO on Fridays + public holiday on Monday).
I've done the bisection now and the culprit CL is https://chromium-review.googlesource.com/c/chromium/src/+/4558609 and it's currently under investigation by sroettger@.
Camillo Bruni | Software Engineer, V8 | Google Germany GmbH

Steinar H. Gunderson

unread,
May 30, 2023, 11:49:03 AM5/30/23
to Michael Lippautz, K. Moon, Peter Boström, Will Harris, dan...@chromium.org, Hannes Payer, John Abd-El-Malek, Scott Violet, Rick Byers, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, fut...@chromium.org, and...@chromium.org
On Fri, May 26, 2023 at 10:09:10AM +0200, Michael Lippautz wrote:
> Cycle order for bottom up:
> 193M: CSSProperty::Get is the most expensive incoming caller together with
> its DowncastTraits::AllowFrom

Our team is responsible for this, and we've had a discussion IRL. For
reference, roughly what the code does (after stripping away macros and some
irrelevant stuff):

DCHECK(i >= 0);
DCHECK(i < 529);
const Foo *foo = static_table[i];
DCHECK(foo->some_bits & kSomeConstant);
return static_cast<const DescendsFromFoo *>(foo);

Note particularly that all of these objects are compile-time constants.
This means that unless an attacker can corrupt a read-only data segment,
there is no way any Foo * in that table can ever _not_ have kSomeConstant
(and the downcast is thus always safe). However, if you manage to provoke
a read _outside_ the table by means of some bug, that's of course not so
safe.

Given this, one would assume having the range checks would be much more
valuable security-wise than checking the bits in the (potentially already
invalid) pointers. I wonder if we could do this even more cheaply, though;
what do people think about extending static_table to 1024 elements (with
some dummy elements) and doing i & 1023, keeping the DCHECK? That would
still make everything fairly controlled, incurring nearly zero overhead.
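A hedged sketch of that masking idea, reusing the names from the simplified snippet above (Foo, static_table, some_bits and kSomeConstant come from that snippet; GetEntry and the padding scheme are made up):

constexpr unsigned kTableSize = 1024;  // 529 real entries plus dummy padding.
extern const Foo* const static_table[kTableSize];

const DescendsFromFoo* GetEntry(unsigned i) {
  DCHECK_LT(i, 529u);  // Still catches logic bugs in debug builds.
  // Masking hard-bounds the read at essentially zero release cost.
  const Foo* foo = static_table[i & (kTableSize - 1)];
  DCHECK(foo->some_bits & kSomeConstant);
  return static_cast<const DescendsFromFoo*>(foo);
}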

Note that we'd also like to get rid of static_table[] entirely; it could
essentially be replaced with (some_address + i * 24), which would keep
a ~6 kB array entirely away from polluting our L1 cache. Doing that would
probably also make the code slightly safer. Slightly. But it would require
us to find some safe way of putting heterogeneous objects into the same
array...

/* Steinar */
--
Homepage: https://www.sesse.net/

dan...@chromium.org

unread,
May 30, 2023, 12:01:42 PM5/30/23
to Steinar H. Gunderson, Michael Lippautz, K. Moon, Peter Boström, Will Harris, Hannes Payer, John Abd-El-Malek, Scott Violet, Rick Byers, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, fut...@chromium.org, and...@chromium.org
Without fully understanding your domain space: std::variant, but your size/alignment is the max of them all. Or a bunch of single-type arrays with a std::variant of pointers to them?

Steinar H. Gunderson

unread,
May 30, 2023, 12:24:14 PM5/30/23
to dan...@chromium.org, Michael Lippautz, K. Moon, Peter Boström, Will Harris, Hannes Payer, John Abd-El-Malek, Scott Violet, Rick Byers, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, fut...@chromium.org, and...@chromium.org
On Tue, May 30, 2023 at 11:53:10AM -0400, dan...@chromium.org wrote:
> Without fully understanding your domain space: std::variant but your
> size/alignment is the max of them all. or a bunch of single-type arrays
> with a std::variant of pointers to them?

I don't understand how std::variant would help me. I need to get back a
pointer to the base class (they all inherit from CSSUnresolvedProperty).
Also, I don't think I can afford to waste space on the type tag that
std::variant brings.

dan...@chromium.org

unread,
May 30, 2023, 12:35:53 PM5/30/23
to Steinar H. Gunderson, Michael Lippautz, K. Moon, Peter Boström, Will Harris, Hannes Payer, John Abd-El-Malek, Scott Violet, Rick Byers, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, fut...@chromium.org, and...@chromium.org
I was responding to "But it would require us to find some safe way of putting heterogeneous objects into the same array..." for which variant is the tool (or union for extra unsafety).

Steinar H. Gunderson

unread,
May 30, 2023, 12:41:52 PM5/30/23
to dan...@chromium.org, Michael Lippautz, K. Moon, Peter Boström, Will Harris, Hannes Payer, John Abd-El-Malek, Scott Violet, Rick Byers, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, fut...@chromium.org, and...@chromium.org
On Tue, May 30, 2023 at 12:35:32PM -0400, dan...@chromium.org wrote:
>> I don't understand how std::variant would help me. I need to get back a
>> pointer to the base class (they all inherit from CSSUnresolvedProperty).
>> Also, I don't think I can afford to waste space on the type tag that
>> std::variant brings.
> I was responding to "But it would require us to find some safe way of
> putting heterogeneous objects into the same array..." for which variant is
> the tool (or union for extra unsafety).

I understood, but std::variant won't give me the pointer to the common base
class. I'm not sure if union can in a safe way; I'd have to check.
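For concreteness, a hedged illustration of the std::variant route under discussion (Longhand and Shorthand here are placeholder types, not the real property classes): std::visit can hand back a pointer to the common base, though each entry still carries the variant's type tag, which is the space cost raised above.

#include <array>
#include <cstddef>
#include <variant>

struct CSSUnresolvedProperty {};
struct Longhand : CSSUnresolvedProperty {};   // Placeholder derived types.
struct Shorthand : CSSUnresolvedProperty {};

using AnyProperty = std::variant<Longhand, Shorthand>;

const std::array<AnyProperty, 2> kTable = {Longhand{}, Shorthand{}};

const CSSUnresolvedProperty* Get(std::size_t i) {
  // std::visit sees the concrete alternative and can return the base pointer.
  return std::visit(
      [](const auto& p) -> const CSSUnresolvedProperty* { return &p; },
      kTable.at(i));
}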

Scott Violet

unread,
May 30, 2023, 2:14:11 PM5/30/23
to Peter Boström, Rick Byers, Hannes Payer, Dana Jansens, Will Harris, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
I already have a hard time reasoning about when to use CHECK vs SECURITY_CHECK. Adding a third variant would make it even harder to reason about which one to use.

  -Scott

dan...@chromium.org

unread,
May 30, 2023, 2:38:05 PM5/30/23
to Scott Violet, Peter Boström, Rick Byers, Hannes Payer, Will Harris, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
Is there still a good time to use SECURITY_CHECK instead of CHECK? I only understand the difference for SECURITY_DCHECK at this point, which is to tell fuzzers to keep the DCHECK because we hit DCHECKs all the time. Those SECURITY_DCHECKs should probably also be CHECKs?

Peter Boström

unread,
May 30, 2023, 2:46:09 PM5/30/23
to dan...@chromium.org, Scott Violet, Rick Byers, Hannes Payer, Will Harris, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
I would assume that the SECURITY_CHECK and CHECK distinction is historical (WebKit not depending on //base seems likely) and we know that a lot of existing CHECKs have security implications if we removed them. https://codereview.chromium.org/1841363003 introduces SECURITY_CHECK.

I think making most SECURITY_DCHECKs CHECK is reasonable, but I would suggest running an M1 Speedometer run (there are 118 of them according to git grep) to see if they are in known hot spots before doing so. UnsafeTo<T> still holds a SECURITY_DCHECK if we want to retain those semantics, because whenever that DCHECK fails we have a memory-safety bug, and it's intentionally a DCHECK since we don't want to pay the CHECK cost where it's used.

Scott Violet

unread,
May 30, 2023, 7:12:34 PM5/30/23
to Peter Boström, Rick Byers, Hannes Payer, Dana Jansens, Will Harris, Charles Harrison, Mike Wittman, Daniel Cheng, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
On Fri, May 26, 2023 at 9:03 AM Peter Boström <pb...@chromium.org> wrote:
Thoughts fresh and haven't solidified but I'm wary of having another CHECK macro without compiler enforcement that one is pure and not the other. I suspect we'll end up with more CHECKs than CHECKP/CHECKI just because it's what people are used to typing (and short = default one). We also end up with zero coverage that CHECKP and CHECKI can actually be removed unless we regularly ship builds and run tests without them.

If I could have things by just wishing for them then C++ would have the attribute [[pure]] and it would be widely applied in standard libraries, we'd apply it in Chromium and have a decent shot at this. Presumably we'd also generate better code. Afaik [[gnu:pure]] isn't widely applied (but I could be wrong) and it certainly isn't widely used within Chromium. Then we'd mark the CHECK_WITH_SIDE_EFFECTS() as the non-default one. Maybe we could split existing CHECKs based on if they are [[pure]], rinse and repeat.

The other option I thought of is that we actually want to be able to tell how much our CHECKs cost and look at profiling through that lens. If we could add a pivot on CHECK (that includes the conditional) we could profile better, then we could look at "are these costly CHECKs actually holding their weight". I haven't found a way of doing so. I tried putting the conditional inside a lambda so that I could call it from another stack frame (that could be force inlined and hopefully generate the same code), but that bit me as clang lock annotations don't understand that calling [&]() {foo_ = bar;}(); means that foo_ is accessed only under the lock where the lambda is called (and that the lambda doesn't leak out otherwise). If there's a way to annotate debug code so that CHECKs can be pivoted on in profiling I think that would be great.

Would something like a CHECK_IS_ON macro fix this?

  -Scott 

Daniel Cheng

unread,
May 30, 2023, 7:38:11 PM5/30/23
to Scott Violet, Peter Boström, Rick Byers, Hannes Payer, Dana Jansens, Will Harris, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
FWIW, I don't think CHECKs should be allowed to have side effects.

While I would like an enforcement mechanism from the compiler, one doesn't currently exist. clang/gcc have the pure attribute, but:
- it's apparently UB if you mark something as pure but it has side effects
- the compiler doesn't actually enforce the correctness of pure

I think we can actually get pretty far by simply skipping CHECKs half the time. I threw some ideas into crbug.com/1450090 and will give it a try.
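One possible shape of that idea (purely a guess at the mechanics, not necessarily what the bug proposes): decide per process whether CHECKs are evaluated at all, so the average cost drops while fleet-wide coverage remains. This only works if CHECK conditions are side-effect free.

#include "base/check.h"
#include "base/rand_util.h"

namespace {
// Hypothetical: decided once per process; could instead be a field-trial rate.
const bool g_evaluate_checks = base::RandInt(0, 1) == 0;
}  // namespace

// Hypothetical macro: skips both evaluation and enforcement when this process
// lost the coin flip.
#define SAMPLED_CHECK(condition)  \
  do {                            \
    if (g_evaluate_checks) {      \
      CHECK(condition);           \
    }                             \
  } while (false)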

Daniel

Scott Violet

unread,
May 30, 2023, 7:40:50 PM5/30/23
to Daniel Cheng, Peter Boström, Rick Byers, Hannes Payer, Dana Jansens, Will Harris, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
This seems similar to DCHECKs, insofar as putting code you expect to execute inside a DCHECK can cause problems. We generally rely on bots to catch this sort of thing. So, if we had a bot that built with CHECKs disabled, presumably that would help catch cases, right?

  -Scott

Daniel Cheng

unread,
May 30, 2023, 7:51:49 PM5/30/23
to Scott Violet, Peter Boström, Rick Byers, Hannes Payer, Dana Jansens, Will Harris, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
Oh that's true; the random sample percentage could be 0%. But there's some pre-work needed to enable this (specifically, tests that are looking for CHECK deaths always need to enable CHECKs, as noted in the bug).

Daniel

Daniel Cheng

unread,
May 30, 2023, 10:00:59 PM5/30/23
to Michael Lippautz, K. Moon, Peter Boström, Will Harris, dan...@chromium.org, Hannes Payer, John Abd-El-Malek, Scott Violet, Rick Byers, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber
crossbench is neat. I wish I had known about it earlier. If I have random questions about its usage, is there a good mailing list / chat to ask on? For example, I don't have an M1 Mac, but I do have a Linux cloudtop: is collecting profiling information on a cloudtop still mostly valid? Or does not having a real GPU throw off results in other ways?

(Also is this an officially-maintained tool? Mostly wondering how likely I should expect it to keep working in the future.)

Daniel

Daniel Cheng

unread,
May 30, 2023, 11:52:37 PM5/30/23
to Michael Lippautz, K. Moon, Peter Boström, Will Harris, dan...@chromium.org, Hannes Payer, John Abd-El-Malek, Scott Violet, Rick Byers, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber
Here's a profile from my attempt to run it:

Using the command:
~/src/chrome/src/testing/xvfb.py ./cb.py --iterations=5 --probe=profiling --browser=~/src/chrome/src/out/release2/chrome


So I'm wondering what I'm doing wrong or if I'm missing some flags/running the wrong number of iterations or something else altogether.

Daniel

Dave Tapuska

unread,
May 31, 2023, 10:21:03 AM5/31/23
to Daniel Cheng, Michael Lippautz, K. Moon, Peter Boström, Will Harris, dan...@chromium.org, Hannes Payer, John Abd-El-Malek, Scott Violet, Rick Byers, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Alex Gough, Kent Tamura, Nico Weber
Did you compile the CHECKs in? https://chromium-review.googlesource.com/c/chromium/src/+/4563433 hasn't landed yet.

I think yours is more correct. I couldn't get crossbench to work because of the pkey issue Camillo identified above, so I just captured all processes. So the top level flame graph is made up of noise (but is correct when pivoting on blink::To$ to look at # of counts).

dave.

Peter Boström

unread,
May 31, 2023, 11:10:42 AM5/31/23
to Daniel Cheng, Scott Violet, Rick Byers, Hannes Payer, Dana Jansens, Will Harris, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
If we're serious about getting a CHECK-disabled build to work I think we should decide a few things:

* Is this a dev/benchmark-only build solely used to measure how expensive CHECKs are in various benchmarks and to profile CHECK-y hotspots to see if we can validate assumptions more cheaply? There are significant security implications from disabling CHECKs here.

* Do we need CHECK-disabled coverage?

* What do we do with existing CHECKs with intentional side effects? I think for instance:

CHECK(ParseInt(&value));

Reads a whole lot better than the following, which may have shadowing issues and, without the scope, lets variables bleed all over:
{
  const bool result = ParseInt(&value);
  CHECK(result);
}

Should we in this case instead have ParseIntOrDie(), and if so may ParseIntOrDie() CHECK internally to die, or is that now forbidden (as a CHECK-off build means it won't die) and it needs to use base::ImmediateCrash() if parsing fails?

If we have a macro for the existing semantics:

CHECK_WITH_SIDE_EFFECT(ParseInt(&value)); or CHECK_RESULT(ParseInt(&value)); (do computation regardless, but CHECK the result).

then we could say "this shouldn't fail, so in CHECK-off builds we run ParseInt(&value) but we don't evaluate the result but rather assume it's true".

I think resolving what to do with CHECKs that intentionally have side effects would be a good thing before we start doing any cleanup. Then the first steps are probably to get base_unittests to compile with CHECKs disabled. This is probably a substantial undertaking if we want to migrate existing code, and probably meaningless if we don't end up with some automated coverage of a build where CHECKs are disabled. Shipping a CHECK-disabled build would presumably be out of the question or a loooong way from now, since test coverage << canary coverage << stable coverage, we'll have no idea that the CHECK-disabled build behaves even remotely OK without releasing it to a significant population (but it would presumably still be useful on Speedometer if we can get it to pass, for benchmarking purposes).

Most urgently I think would be to decide how CHECKs with side effects should canonically be written, add CHECK_IS_ON(), then document that and that regular CHECKs should be side-effect free in the style guide. After that we can fix code piecemeal (get base_unittests to compile without CHECKs enabled, get same tests to pass, etc), and we wouldn't make code worse because of missing facilities.

* Maybe we also want to try this out on a relatively small target and report back, rather than decide that we want this to hold true no matter the cost? WDYT?

Thanks,
Peter

Dave Tapuska

unread,
May 31, 2023, 11:25:38 AM5/31/23
to Daniel Cheng, Michael Lippautz, K. Moon, Peter Boström, Will Harris, dan...@chromium.org, Hannes Payer, John Abd-El-Malek, Scott Violet, Rick Byers, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Alex Gough, Kent Tamura, Nico Weber
Before: https://pprofng.corp.google.com/?id=4838a5f38fb3a900bdc5aaaa7e909283


We now have about 4.5M vs the original 675M. The majority seem to be in Layout/Paint now, not in bindings as before. Vlad has some changes in flight to address some of these I think too.

dave

Daniel Cheng

unread,
May 31, 2023, 12:23:29 PM5/31/23
to Dave Tapuska, Michael Lippautz, K. Moon, Peter Boström, Will Harris, dan...@chromium.org, Hannes Payer, John Abd-El-Malek, Scott Violet, Rick Byers, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Alex Gough, Kent Tamura, Nico Weber
Yes, locally I have changed it to CHECK(), and I was trying to test a CL to reduce some counts.

What command are you using to capture all processes? perf or something else? Using the documentation at https://chromium.googlesource.com/chromium/src/+/HEAD/docs/profiling.md#Profiling-on-Linux?

Daniel

Scott Violet

unread,
Jun 1, 2023, 12:06:41 PM6/1/23
to Peter Boström, Daniel Cheng, Rick Byers, Hannes Payer, Dana Jansens, Will Harris, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
Great feedback Peter.

The meta concern is: how do we prevent death by a thousand paper cuts? The suggestion to prevent this is a bot that builds with CHECKs disabled, runs a suite (most likely Speedometer), and compares to the normal release build. This means that we're never really shipping a build with CHECKs disabled; it's purely for testing. That simplifies some of your concerns, but we still need to deal with side effects in the CHECKs. I don't have a feel for how prevalent that is. Your suggestion of trying this in a particular part of the code would be enlightening.

Perhaps the more interesting question is what happens if the bot goes red because the difference between CHECKs and no-CHECKs is significant? It most likely isn't from a recent commit, but rather the accumulation of changes that added CHECKs. This would likely require a deeper analysis to understand hotspots that are calling to CHECK and if they can be changed.

  -Scott

Rick Byers

unread,
Jun 1, 2023, 1:37:52 PM6/1/23
to Scott Violet, Peter Boström, Daniel Cheng, Hannes Payer, Dana Jansens, Will Harris, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
On Thu, Jun 1, 2023 at 12:06 PM Scott Violet <s...@chromium.org> wrote:
Great feedback Peter.

The meta concern is how do we prevent death by a 1000 paper cuts? The suggestion to prevent this is a bot that builds with CHECKs disabled, runs a suite (most likely speedometer) and compares to the normal release build. This means that we're never really shipping a build with checks disabled, it's purely for testing. That simplifies some of your concerns, but we still need to deal with side effects in the CHECKs. I don't have a feel for how prevalent it is. Your suggestion of trying this in a particular part of the code would be enlightening.

Perhaps the more interesting question is what happens if the bot goes red because the difference between CHECKs and no-CHECKs is significant? It most likely isn't from a recent commit, but rather the accumulation of changes that added CHECKs. This would likely require a deeper analysis to understand hotspots that are calling to CHECK and if they can be changed.

Right. Or possibly a friendly escalation to Jim to ask him to make a security / performance tradeoff (which we have done in the past several times). IMHO we shouldn't be afraid of doing that, but if we're going to do it then we should do our homework first so we can show how hard we've worked together to find win-win solutions.

Peter Boström

unread,
Jun 1, 2023, 5:56:09 PM6/1/23
to Rick Byers, Scott Violet, Daniel Cheng, Hannes Payer, Dana Jansens, Will Harris, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
I think we can have a limited scope of "make this build pass M1 Speedometer pinpoint" (after getting this to build), ignoring all test failures. If we can get that far then presumably we have a rough upper bound for how much our CHECKs cost us on this benchmark currently and can diff with profiles (we may be overcounting the cost since some of the CHECKs should've run and we're incorrect but faster).

I'll be out on pat leave very soon, so I'll just write down my thoughts here if this is being picked up (maybe Daniel has different ideas though).

The way I'd approach that is to add a CHECK_WITH_SIDE_EFFECT (because it can trivially be reversed by s/CHECK_WITH_SIDE_EFFECT/CHECK if we decide this was a bad idea, as opposed to other rewrites), add a check_is_on gn flag that when set to false removes CHECKs (but not CHECK_WITH_SIDE_EFFECT), and then try to get that to build Chrome. Every compile failure we get should probably be s/CHECK/CHECK_WITH_SIDE_EFFECT until we're able to compile. Then try to run Chrome, and run Speedometer locally. Presumably more CHECKs have to turn into CHECK_WITH_SIDE_EFFECT here as well. It's probably a good idea to do a good chunk of this under msan/asan as we're likely to have initialization bugs due to not running stuff inside CHECKs.

If we get this far send out a pinpoint run for M1 Speedometer. Decide where to go from here. Depending on the amount of work (and rebase hell) this may be doable as a local CL and not need to be committed until we get this far. It may also be impossible to approach Speedometer pinpointability without intermediate checkins (hence the reversible CHECK_WITH_SIDE_EFFECT macro).

WDYT? Would that roughly make sense as a plan?

Matt Denton

unread,
Jun 1, 2023, 6:17:51 PM6/1/23
to Peter Boström, Rick Byers, Scott Violet, Daniel Cheng, Hannes Payer, Dana Jansens, Will Harris, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
My 2 cents: running a CHECK-free build will disable the Linux sandbox [1, 2]; most of the sandbox code is written with the assumption that CHECKs can have side effects, and I find CHECKs with side effects much more readable. However, if this is to profile the perf effects of all of our CHECKs, that seems a noble cause.

Daniel Cheng

unread,
Jun 1, 2023, 7:13:00 PM6/1/23
to Matt Denton, Peter Boström, Rick Byers, Scott Violet, Hannes Payer, Dana Jansens, Will Harris, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
Per my earlier reply, I still stand by the statement that CHECKs should not have side effects; otherwise, we can't easily measure the performance effects. For the sandbox code, that may mean having to pull something with a side effect out into another statement, which will require a few manual fixes, but I think that's fine.

I guess if we really want CHECK_WITH_SIDE_EFFECT to make for an easier migration, maybe that's OK, but I don't think we should keep it around long-term.

Daniel

Hannes Payer

unread,
Jun 2, 2023, 4:57:37 AM6/2/23
to Daniel Cheng, Matt Denton, Peter Boström, Rick Byers, Scott Violet, Dana Jansens, Will Harris, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
+1 to CHECK_WITH_SIDE_EFFECT

I think it would be nice to keep it around long-term and run a builder with a small set of tests and a perf bot. That would allow us to sleep well from a security and performance perspective :)

-Hannes

Will Harris

unread,
Jun 2, 2023, 12:23:42 PM6/2/23
to Daniel Cheng, Dave Tapuska, Michael Lippautz, K. Moon, Peter Boström, dan...@chromium.org, Hannes Payer, John Abd-El-Malek, Scott Violet, Rick Byers, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Alex Gough, Kent Tamura, Nico Weber
Just to wrap up the original post/discussion on this thread, the CL that changes SECURITY_DCHECK -> CHECK landed in https://chromium-review.googlesource.com/c/chromium/src/+/4563433. Thanks to all those who contributed and reviewed, in particular Dave for all the CLs to improve Blink casting semantics and for some preemptive perf improvements before the CL even landed, and those who participated on this thread.

I think we can continue to iterate on this by changing particularly hot paths to UnsafeTo<>, or by updating templates to use UnsafeTo<> where we know the usage is safe, and this will likely provide some performance improvements. But the security benefit from strengthening this check in the immediate term is HUGE, so we very much appreciate the flexibility and pragmatic approach of ATL and Blink owners in approving the landing of this CL.

Will

Daniel Cheng

unread,
Jun 2, 2023, 1:06:13 PM6/2/23
to Hannes Payer, Matt Denton, Peter Boström, Rick Byers, Scott Violet, Dana Jansens, Will Harris, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
I guess I don't understand the motivation for keeping CHECK_WITH_SIDE_EFFECT permanently. Any CHECK that has a side effect can be trivially rewritten so that it does not, right?

Daniel

Dave Tapuska

unread,
Jun 2, 2023, 1:17:06 PM6/2/23
to Will Harris, Daniel Cheng, Michael Lippautz, K. Moon, Peter Boström, dan...@chromium.org, Hannes Payer, John Abd-El-Malek, Scott Violet, Rick Byers, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Alex Gough, Kent Tamura, Nico Weber, Steinar H Gunderson
This is a great middle ground we reached, but there is still more work to be done.

Steinar commented that CSS perf went down...  I'm commenting here (instead of on the bug) to include as many people as possible.

Before the change landed and from pprof graphs I identified some CSS callsites as being problematic.

Back in WebKit, CSSValue was "devirtualized" because of the number of instances. See https://source.chromium.org/chromium/chromium/src/+/ba054bec3f788e3965a48715e54466a0c1285d84


To address the perf concerns, I think there are three options that the CSS team could maybe explore:
1) Are the original reasons for devirtualizing CSSValue still valid?
2) These could be converted to UnsafeTo (relatively easy, but we really don't want to put a lot of UnsafeTo<>s in the code).
3) Can we simulate virtualization with a function pointer table lookup instead of a gigantic switch and cast?
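A hedged sketch of option 3 (placeholder types and names, not the real CSSValue code): keep per-object storage vtable-free, but dispatch through a table of function pointers indexed by the type tag instead of a gigantic switch that downcasts in every arm.

#include <cstdint>

struct CSSValueLike { uint8_t class_type = 0; };  // Type tag, no vtable.
struct ColorValueLike : CSSValueLike { /* ... */ };
struct LengthValueLike : CSSValueLike { /* ... */ };

using CssTextFn = void (*)(const CSSValueLike&);

void ColorCssText(const CSSValueLike& v) { /* downcasts to ColorValueLike inside */ }
void LengthCssText(const CSSValueLike& v) { /* downcasts to LengthValueLike inside */ }

// One entry per class_type value; the "vtable" lives in this one table rather
// than in every object instance. A range check or mask on class_type would
// still be needed, as in the earlier static-table discussion.
constexpr CssTextFn kCssTextTable[] = {ColorCssText, LengthCssText};

void CssText(const CSSValueLike& v) {
  kCssTextTable[v.class_type](v);
}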

dave.

Steinar H. Gunderson

unread,
Jun 2, 2023, 1:58:19 PM6/2/23
to Dave Tapuska, Will Harris, Daniel Cheng, Michael Lippautz, K. Moon, Peter Boström, dan...@chromium.org, Hannes Payer, John Abd-El-Malek, Scott Violet, Rick Byers, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Alex Gough, Kent Tamura, Nico Weber
On Fri, Jun 02, 2023 at 01:16:51PM -0400, Dave Tapuska wrote:
> Steinar commented
> <https://bugs.chromium.org/p/chromium/issues/detail?id=1448838#c6> that CSS
> perf went down... I'm commenting here (instead of on the bug) to include
> as many people as possible.

FWIW, I think it's maybe less bad than I thought. The losses are down to
more like 2–3% if I turn off PGO (which I inadvertently had on). So it might
be just because the PGO profile applies very poorly to the new code.

Still, 2–3% is problematic for us. I'm trying to go through and see if we can
identify some places, but there's nothing immediately obvious that we can
flip and gain it all back.

> To address the perf concerns I think there are three options and maybe the
> CSS team could explore them:

I'll need to discuss this with the team next week. I don't think CSSValue is
a particular hotspot for us here, but I would assume that it's devirtualized
to save on the 8 bytes for the vtable (so memory/cache reasons).

Rick Byers

unread,
Jun 2, 2023, 5:47:00 PM6/2/23
to Steinar H. Gunderson, Dave Tapuska, Will Harris, Daniel Cheng, Michael Lippautz, K. Moon, Peter Boström, dan...@chromium.org, Hannes Payer, John Abd-El-Malek, Scott Violet, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Alex Gough, Kent Tamura, Nico Weber
Thank you Steinar for looking into this, and sorry for the surprise! This increases the risk in my mind that we might detect a noticeable LCP regression in the wild with this change - in which case we'd likely have to back it out while we evaluate mitigations and better quantify the impact on search ranking, etc. Please let us know what you figure out based on the style perf tests next week and we can go from there.

Regardless, 100% I love the security/performance/blink collaboration here, and especially appreciate Dave volunteering to jump in land improvements. Let's keep it up!

Rick

Scott Violet

unread,
Jun 2, 2023, 6:39:57 PM6/2/23
to Rick Byers, Steinar H. Gunderson, Dave Tapuska, Will Harris, Daniel Cheng, Michael Lippautz, K. Moon, Peter Boström, dan...@chromium.org, Hannes Payer, John Abd-El-Malek, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Alex Gough, Kent Tamura, Nico Weber
+1

I'm glad we were able to find a path forward and appreciate folks helping out.

  -Scott

Steinar H. Gunderson

unread,
Jun 5, 2023, 4:11:45 AM6/5/23
to Rick Byers, Dave Tapuska, Will Harris, Daniel Cheng, Michael Lippautz, K. Moon, Peter Boström, dan...@chromium.org, Hannes Payer, John Abd-El-Malek, Scott Violet, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Alex Gough, Kent Tamura, Nico Weber
On Fri, Jun 02, 2023 at 05:46:46PM -0400, Rick Byers wrote:
> Thank you Steinar for looking into this, and sorry for the surprise! This
> increases the risk in my mind that we might detect a noticeable LCP
> regression in the wild with this change - in which case we'd likely have to
> back it out while we evaluate mitigations and better quantify the impact on
> search ranking etc.. Please let us know what you figure out based on the
> style perf tests next week and we can go from there.

FWIW, it wasn't really a surprise to us; this was announced well. Otherwise,
I probably wouldn't have checked. :-)

None of us were around for the CSSValue changes, but it seems fairly clear
that it's at least partially about memory (adding back a vtable would bloat
CSSValue, and we have lots of them). But I also don't think this is about
CSSValue, since I haven't been seeing that show up in the profiles -- is
there a particular reason why that example came up?

There doesn't seem to be a simple hotspot where we can turn To into UnsafeTo,
but I'll keep on looking and send some reviews with smaller things we have
identified.

Dave Tapuska

unread,
Jun 5, 2023, 9:54:53 AM6/5/23
to Steinar H. Gunderson, Rick Byers, Will Harris, Daniel Cheng, Michael Lippautz, K. Moon, Peter Boström, dan...@chromium.org, Hannes Payer, John Abd-El-Malek, Scott Violet, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Alex Gough, Kent Tamura, Nico Weber
The CSSValue callers came from the pprof taken while running Speedometer. That doesn't mean they actually cost a lot of time, because we've largely mitigated the impact with other CLs, but they were useful for finding repeated patterns and binary size growth. 

dave.

Steinar H. Gunderson

unread,
Jun 5, 2023, 10:45:40 AM6/5/23
to Dave Tapuska, Rick Byers, Will Harris, Daniel Cheng, Michael Lippautz, K. Moon, Peter Boström, dan...@chromium.org, Hannes Payer, John Abd-El-Malek, Scott Violet, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Alex Gough, Kent Tamura, Nico Weber
On Mon, Jun 05, 2023 at 09:54:39AM -0400, Dave Tapuska wrote:
> The CSSValue was from the pprof
> <https://pprofng.corp.google.com/?id=4838a5f38fb3a900bdc5aaaa7e909283&pivot=blink::To$&id0=b004667a31a122d24631c227e0a6050c&tab=flame)>
> taken running speedometer. It doesn't mean these actually cost a lot of
> time because we've largely mitigated the impact of that with other CLs, but
> they were useful in terms of finding repeated patterns and binary size
> growth.

I am a bit confused; shouldn't the compiler pretty easily manage to remove
the repeated checks through VRP (value range propagation)? I mean, this is effectively the same as

switch (type_tag) {
  case 1:
    CHECK(type_tag == 1);
    foo();
    break;
  case 2:
    CHECK(type_tag == 2);
    bar();
    break;
  // etc.
}

where I would assume the CHECKs are readily removable.

Will Harris

unread,
Jun 5, 2023, 2:07:48 PM6/5/23
to Daniel Cheng, Hannes Payer, Matt Denton, Peter Boström, Rick Byers, Scott Violet, Dana Jansens, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
I agree with Daniel; I think we should just do a pass on all CHECKs to make sure that they never have side effects (perhaps we can automate detection of this, as well as prevent folks accidentally introducing new ones, using a clang plugin), and then once that's done CHECK can optionally be a no-op in some configurations. I don't think we should introduce a new construct here; there are already enough of these!

Will

Peter Boström

unread,
Jun 5, 2023, 2:23:31 PM6/5/23
to Will Harris, Daniel Cheng, Hannes Payer, Matt Denton, Rick Byers, Scott Violet, Dana Jansens, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
I think CHECK_WITH_SIDE_EFFECT is useful for a transitional period (where we can just s/CHECK/CHECK_WITH_SIDE_EFFECT/ where we have build failures etc., and then go through them later). Then, once we are reasonably confident that we've gotten >50% of CHECKs with side effects into the separate macro, we can put out a CL to rewrite CHECK_WITH_SIDE_EFFECT as regular CHECKs and see if we bleed too many "const bool result" variables for this rewrite to be ugly / error-prone. I am a little concerned about bleeding result variables, as we may accidentally CHECK the wrong one later.
 
crrev.com/c/4583347 has gotten far enough to have an asan build building (nowhere near running). I haven't yet gotten an msan/official build building. For these I just added CHECK_WITH_SIDE_EFFECT in any files where I found errors, so for actual landing this would need to be iterated on. No idea when I'll disappear on parental leave, so I'm just dumping these as I go in case they happen to be useful. They at least let us know where some side-effectful CHECKs are currently present, but this is probably the tip of the tip of the tip of the iceberg.

dan...@chromium.org

unread,
Jun 6, 2023, 11:06:25 AM6/6/23
to Peter Boström, Will Harris, Daniel Cheng, Hannes Payer, Matt Denton, Rick Byers, Scott Violet, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
If CHECK_WITH_SIDE_EFFECT is just for clang tools to see the difference, a comment will also suffice (see implicit constructor comments for prior art).

Peter Boström

unread,
Jun 6, 2023, 11:51:16 AM6/6/23
to dan...@chromium.org, Will Harris, Daniel Cheng, Hannes Payer, Matt Denton, Rick Byers, Scott Violet, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
It's really just for readability. There are two downsides with comments that I see: one is that comments tend to get stale and may disconnect from the following statement as additional code gets inserted over time. The other is that you need a temporary variable to CHECK the result of something (bool result = FooCall(); CHECK(result);), which bleeds out into the rest of the function. This mostly matters if you CHECK a lot of things with side effects, for instance:

CHECK_NE(SIG_ERR, signal(SIGPIPE, SIG_IGN));
CHECK_EQ(0, sigemptyset(&empty_signal_set));
CHECK_EQ(0, sigprocmask(SIG_SETMASK, &empty_signal_set, nullptr));
for (...)
  CHECK_EQ(0, sigaction(signal_to_reset, &sigact, nullptr));

in this case you need either a lot of temporary variables with bogus names like signal_result, sigemptyset_result, etc. that are only there for CHECKing, or a non-const variable that you hopefully remember to set after every call before CHECKing it again. Arguably it's also more readable when glancing through code to see CHECK_WITH_SIDE_EFFECT(FooCall()); as a single "this call to FooCall must succeed" statement, vs. FooCall() plus CHECKing the result in a later statement. Not sure that the macro would hold its weight on its own, but storing the result in a variable just for CHECKing isn't that nice. Maybe that means that instead of CHECK_WITH_SIDE_EFFECT these calls eventually get wrapped in a bunch of FooCallOrDie() helpers, to get rid of CHECK_WITH_SIDE_EFFECT and avoid the extra variables. Or we just take the cost of additional variables and the potential for reusing the wrong result variable.
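
A minimal sketch of what such a wrapper could look like, reusing the signal example above (the helper name is made up; PCHECK is assumed here as the existing crash-on-failure macro that also logs errno):

#include <csignal>

#include "base/check.h"

// Hypothetical helper: installs a signal handler and crashes (with the
// errno value logged) if the call fails, so callers never need a
// throwaway result variable.
void InstallSignalHandlerOrDie(int sig, void (*handler)(int)) {
  PCHECK(signal(sig, handler) != SIG_ERR);
}

// Usage: InstallSignalHandlerOrDie(SIGPIPE, SIG_IGN);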

I would also assume that other OS code where we check HRESULTs would see a readability benefit if we don't need to store temporaries. For me it's not a thing for clang tools but for people. Not dying on this hill, but I do think it's worth having during the transition.

dan...@chromium.org

unread,
Jun 6, 2023, 12:03:02 PM6/6/23
to Peter Boström, Will Harris, Daniel Cheng, Hannes Payer, Matt Denton, Rick Byers, Scott Violet, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
Ok thanks for the explanation.

The other pitfall I will call out here is the idea of having tooling tell whether something has side effects.

clang::FunctionDecl has isPure() to tell you this, but it's always going to be false unless manual annotations are applied. There's a great amount of codegen optimization that could be improved if Clang were able to tell whether a function has side effects, yet it is completely unable to. This can be seen in the warnings generated by putting anything other than primitive values into __builtin_assume(). I don't know what that indicates about the difficulty of the problem, but it's a red flag.
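
For concreteness, this is the kind of warning I mean (the function name is made up; any argument clang can't prove free of side effects triggers it):

int NextSequenceNumber();  // Hypothetical; not annotated pure/const.

void Example() {
  // clang warns that the argument has side effects that will be
  // discarded, and the assumption is ignored.
  __builtin_assume(NextSequenceNumber() > 0);
}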

dan...@chromium.org

unread,
Jun 6, 2023, 12:07:17 PM6/6/23
to Peter Boström, Will Harris, Daniel Cheng, Hannes Payer, Matt Denton, Rick Byers, Scott Violet, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
self-nit: isPure() on FunctionDecl is about pure virtual functions; I don't know what API there is to tell you about __attribute__((pure)) instead.

K. Moon

unread,
Jun 6, 2023, 3:38:06 PM6/6/23
to dan...@chromium.org, Peter Boström, Will Harris, Daniel Cheng, Hannes Payer, Matt Denton, Rick Byers, Scott Violet, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
Probably want the "attrs" methods on clang::Decl.
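
Something like this, assuming the tool already has the FunctionDecl in hand (hasAttr<> is the Decl-level query; PureAttr / ConstAttr correspond to __attribute__((pure)) / __attribute__((const))):

#include "clang/AST/Attr.h"
#include "clang/AST/Decl.h"

// Only returns true when the function is explicitly annotated; most
// functions carry no annotation at all, so a false result does not
// prove the function has side effects.
bool IsAnnotatedSideEffectFree(const clang::FunctionDecl& decl) {
  return decl.hasAttr<clang::PureAttr>() || decl.hasAttr<clang::ConstAttr>();
}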

Stefan Zager

unread,
Jun 6, 2023, 3:56:36 PM6/6/23
to K. Moon, dan...@chromium.org, Peter Boström, Will Harris, Daniel Cheng, Hannes Payer, Matt Denton, Rick Byers, Scott Violet, Charles Harrison, Mike Wittman, platform-architecture-dev, Daniel Vogelheim, Kentaro Hara, Dave Tapuska, Alex Gough, Kent Tamura, Nico Weber, Michael Lippautz, John Abd-El-Malek
This is a digression, but...

Do we have a strategy for evaluating, for a specific s/DCHECK/CHECK/ conversion, whether the security benefit is worth the performance hit?

I have seen at least one conversion site on a hot path that caused notable perf regressions (as manifested by Pinpoint bugs), but has zero reported crashes. How do we feel about this? What's the methodology for quantifying the perf hit and the security benefit?
