Re: Layout tests should not use threaded compositing, part XIII

Ojan Vafai

unread,

Nov 4, 2015, 11:40:11 AM11/4/15

to Adrienne Walker, Elliott Sprehn, blink-dev

Thread hijacking this to blink-dev@chromium.

At a high level, I want to +1 that we should be moving our testing away from high level regression tests into more deterministic tests. There's a good (not quite finished) alternative to layout tests that I think we should pursue (an evolution of Elliott's C++-based Sim framework).

I'll be adding this to the Blink infra team's long-term roadmap shortly. Coincidentally, Elliott and I hashed out a basic idea yesterday. At some point we'll turn that into a doc and a proposal, but mostly it needs people to do the engineering work to make it possible. So, if you're possibly interested in 5%-80%ing on this, ping me.

On Tue, Nov 3, 2015 at 4:04 PM 'Adrienne Walker' via paint-dev <pain...@chromium.org> wrote:

(I sent this previously as an email reply to a few people about a particular patch, but as this continues to come up on a monthly basis, here is a slightly edited copy of this rant as a reference of my biased position on this topic.)

We should *not* make layout tests use threaded compositing.

We shouldn't do it because it's a rat hole. jamesr and I both tried to make this happen on separate occasions. We spent a lot of time and fought with even more flakiness. I'm sure somebody could heroically burn a lot of time making this happen and succeed where we couldn't, but in my opinion, it's a waste of time. Threaded compositing increases flakiness and loses determinism in terms of when events happen, how repaints line up with frames, and how animation happens, especially. We already have a ton of problems with flakiness in layout tests, so why make this worse?

If there are features that are only enabled with threaded compositing (compositor animations?) that somehow still need Blink layout tests, we should enable those features for non-threaded compositing. I would support removing the virtual/threaded directory if somebody made it possible to test those features in other ways. If there are tests that have different behavior in threaded mode vs not threaded mode, then they need to be deflaked or tested more narrowly via unit tests or esprehn's simulation testing framework.

Single vs threaded compositing is generally well-tested within cc unit tests. There should be very little difference at the Blink level, and I would question the need to test it there. We have ironed out a small number of message ordering differences over time in cc. We should consider these differences bugs and add cc unit tests for them to guarantee the same ordering of messages. In the long term, I think the Blimp/Mandoline work has the potential to merge single/thread proxy with an abstracted glue layer between them.

So, sure, we should test what we ship, but on the other hand every test should not be an integration test. I have have seen exactly zero bugs ever that would have been caught if a layout test had been using threaded compositing instead of single threaded compositing. Really, zero.

Personally, I think our problems as a team lie more on the "tests are flaky and hard to reason about because they're integration tests" end of the spectrum and far less on the "we are shipping huge hidden regressions because of holes in our testing".

On Tue, Nov 3, 2015 at 5:17 PM Adrienne Walker <en...@chromium.org> wrote:

2015-11-03 16:30 GMT-08:00 Alexey <lo...@chromium.org>:
Can we run all of the virtual/threaded tests in SingleThreadProxy mode?

Currently, no!

There are some things (e.g. composited animations) that have behavior differences in code depending on threaded vs single thread. It seems to me that these are wrinkles that could be ironed out. It seems like there's no reason we couldn't create and run composited animations in single-threaded mode.

It looks like other things have been added there over time (e.g. idle callbacks), that I don't understand as well. Naively, it seems like fundamentally nothing should depend on threaded compositing.

Rick Byers

unread,

Nov 4, 2015, 12:33:23 PM11/4/15

to Ojan Vafai, Adrienne Walker, Elliott Sprehn, blink-dev

I agree with this big picture direction - i.e. unit tests are generally better than integration tests.

However I HAVE seen examples in input where we've missed issues because we had only unit tests (content, cc and blink layout tests), not good integration test that included the effect of threaded input handling. One case in particular was pretty embarrassing - IIRC we broke scrolling completely in most real-world cases yet not a single test failed :-). Still, I think this is, at best, an argument for having a couple sanity integration tests (eg. as content browser tests).

My bigger concern here is around interoperability testing. IMHO the only way we're going to substantially reduce developer pain around interoperability is by sharing more tests with the other rendering engines. Mozilla is much better at this than us. This, by its nature, is impossible to do with unit tests. So when new simple web platform features are being added (especially highly deterministic JS APIs), I've recently started encouraging my team to try focusing on web-platform-tests. Obviously more complex / non-deterministic features should still prefer to focus on unit tests. But the exact balance here seems very non-obvious to me. But like with everything, we should probably just try things, evaluate how we're doing, and iterate. Any concerns?

Rick

Adrienne Walker

unread,

Nov 4, 2015, 2:29:52 PM11/4/15

to Rick Byers, Ojan Vafai, Elliott Sprehn, blink-dev

Re: interoperability tests. Looking at the guidelines for the web platform tests, I think if all of our layout tests had been written with the guidelines of short and minimal in mind, we would be in a much better situation than we are today.

My biggest concerns about layout tests are: (1) they get used to test very internal parts of Blink, which then get exposed via awful plumbing via internals or dumping the state of the world into a text file. (2) they "over test", especially in terms of pixel and text output, which adds a rebaselining burden as these things churn over time and over different platforms, even though that doesn't affect what the test was trying to verify. (3) they rely on so much of the stack that they end up being flaky.

I think there's plenty of room for html / javascript tests that are cross-platform and test web platform features that don't fall into these traps.

Ojan Vafai

unread,

Nov 4, 2015, 7:58:29 PM11/4/15

to Adrienne Walker, Rick Byers, Elliott Sprehn, blink-dev

+1 on all points. I hope this new (currently half-baked) solution will help improve all three of these to varying degrees. It's not panacea, but it should be a lot better.

Stefan Zager

unread,

Jun 9, 2023, 2:25:08 PM6/9/23

to blink-dev, Ojan Vafai, Elliott Sprehn, blink-dev, Adrienne Walker, Rick Byers

Resurrecting this thread, because yotha@ are launching the first step in a project that completely

ignores the previous consensus (i.e., that we should *not* enable threaded compositing for all

web_tests).

I will paraphrase the high-level arguments made previously by enne@ and ojan@, neither of whom

are still active on chromium:

- Enabling threaded compositing will increase test flakiness

- Previous efforts to fix the flakiness have been really difficult and largely unsuccessful

- Buggy behavior in the blink/compositor thread boundary is best covered by focused unit tests

- There are vanishingly few bugs in that area; maybe none?

My reponse...

web_tests, by their very nature, are a massive set of integration tests. That is not the most efficient

organization of test coverage. But the beauty of web_tests is that they catch a *ton* of bugs and

prevent many bad patches from landing. The large majority of the bugs are in core blink areas,

but not all of them. web_tests provide effective, if inefficient, coverage of almost the entire chromium

code base. The lack of coverage for threaded compositing is an anomaly and a blink spot. "It's too

hard" is not, in my opinion, a convincing justification.

Yes, turning on threaded compositing *does* make web_tests flakier. And that is evidence of real bugs!

It's oxymoronic to acknowledge the flakiness but claim that there are few or no bugs in the

blink/compositor thread boundary. There are definitely bugs in there! They can be excruciatingly hard

to root-cause, but they definitely exist. A couple of examples...

https://chromium-review.googlesource.com/c/chromium/src/+/3958364

https://bugs.chromium.org/p/chromium/issues/detail?id=1447868

Once these bugs are root-caused, I certainly agree that we should prefer unit tests for regression

testing. But they key point is that we only discovered these bugs because they caused failures in

web_tests.

Again: these are real bugs!

Because bugs in the thread boundary are usually race conditions they can be extremely difficult to

reproduce, which means that they're very likely under-reported. But they still degrade user experience.

Beyond crbug.com and flaky web_tests, there are also crash reports that very likely indicate bugs in this

area.

Ken Rockot

unread,

Jun 9, 2023, 2:38:43 PM6/9/23

to Stefan Zager, blink-dev, Ojan Vafai, Elliott Sprehn, Adrienne Walker, Rick Byers

On Fri, Jun 9, 2023 at 11:25 AM Stefan Zager <sza...@chromium.org> wrote:

Resurrecting this thread, because yotha@ are launching the first step in a project that completely
ignores the previous consensus (i.e., that we should *not* enable threaded compositing for all
web_tests).

This is exciting to hear! Is there a good place / bug to follow progress?

I will paraphrase the high-level arguments made previously by enne@ and ojan@, neither of whom
are still active on chromium:

- Enabling threaded compositing will increase test flakiness
- Previous efforts to fix the flakiness have been really difficult and largely unsuccessful
- Buggy behavior in the blink/compositor thread boundary is best covered by focused unit tests
- There are vanishingly few bugs in that area; maybe none?

My reponse...

web_tests, by their very nature, are a massive set of integration tests. That is not the most efficient
organization of test coverage. But the beauty of web_tests is that they catch a *ton* of bugs and
prevent many bad patches from landing. The large majority of the bugs are in core blink areas,
but not all of them. web_tests provide effective, if inefficient, coverage of almost the entire chromium
code base. The lack of coverage for threaded compositing is an anomaly and a blink spot. "It's too
hard" is not, in my opinion, a convincing justification.

Yes, turning on threaded compositing *does* make web_tests flakier. And that is evidence of real bugs!
It's oxymoronic to acknowledge the flakiness but claim that there are few or no bugs in the
blink/compositor thread boundary. There are definitely bugs in there! They can be excruciatingly hard
to root-cause, but they definitely exist. A couple of examples...

https://chromium-review.googlesource.com/c/chromium/src/+/3958364
https://bugs.chromium.org/p/chromium/issues/detail?id=1447868

Once these bugs are root-caused, I certainly agree that we should prefer unit tests for regression
testing. But they key point is that we only discovered these bugs because they caused failures in
web_tests.

Again: these are real bugs!

+1000

--
You received this message because you are subscribed to the Google Groups "blink-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to blink-dev+...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/blink-dev/4d8dd468-b3c4-4f65-8ac7-1b4ca194e08an%40chromium.org.

Stefan Zager

unread,

Jun 9, 2023, 3:18:42 PM6/9/23

to Ken Rockot, Stefan Zager, blink-dev, Ojan Vafai, Elliott Sprehn, Adrienne Walker, Rick Byers

On Fri, Jun 9, 2023 at 11:38 AM Ken Rockot <roc...@chromium.org> wrote:

On Fri, Jun 9, 2023 at 11:25 AM Stefan Zager <sza...@chromium.org> wrote:
Resurrecting this thread, because yotha@ are launching the first step in a project that completely
ignores the previous consensus (i.e., that we should *not* enable threaded compositing for all
web_tests).

This is exciting to hear! Is there a good place / bug to follow progress?

Main tracking bug.

Elliott Sprehn

unread,

Jun 9, 2023, 6:08:53 PM6/9/23

to Stefan Zager, Ken Rockot, blink-dev, Ojan Vafai, Adrienne Walker, Rick Byers

Woah 2015 super necro thread!

There were a couple meta angles here:

- Explicitly testing non-web exposed behavior is probably better done with a SimTest or other unit test. In the past folks kept adding more methods to Internals, or special behavior to the compositor, to make it easier to write web_tests. So not having the threaded compositor discourages some classes of that. I see folks are still adding new internals methods unfortunately, but I still think SimTests (or something else) is a better option.

- The more "complete" the browser run over web_tests, the more bugs are caught, but (at least at the time) browser_tests were very flaky and there was a strong desire to avoid that fate within blink. For example you could run the full chrome binary for web_tests and catch more bugs too. Why stop at just the threaded compositor? The thinking was making web_tests very stable was better than catching a small number of real bugs at the cost of a flake tax across the hundreds of contributors.

The failure mode I've seen of efforts like this is the SingleThreadedTests file grows each time someone finds some flake, so we never really find the bugs like 1447868 because folks just keep hiding them each time something unusual happens.

All that said, if you can crank up the "realism" of the test runner and not introduce flake and convince folks not to hide all the bugs with exceptions, that sounds worth it. I hope you all succeed!

- E

Stefan Zager

unread,

Jun 9, 2023, 6:28:20 PM6/9/23

to Elliott Sprehn, Stefan Zager, Ken Rockot, blink-dev, Ojan Vafai, Adrienne Walker, Rick Byers

On Fri, Jun 9, 2023 at 3:07 PM Elliott Sprehn <esp...@google.com> wrote:

Woah 2015 super necro thread!

There were a couple meta angles here:

- Explicitly testing non-web exposed behavior is probably better done with a SimTest or other unit test. In the past folks kept adding more methods to Internals, or special behavior to the compositor, to make it easier to write web_tests. So not having the threaded compositor discourages some classes of that. I see folks are still adding new internals methods unfortunately, but I still think SimTests (or something else) is a better option.

Agreed: SimTests are *always* a better option, if you can get the right coverage. The problem we've had is that SimTests test blink right up to the blink/compositor boundary; and cc::LayerTreeTest test right up to the same boundary, from the other side. But we don't have a unit test framework for the renderer process that tests across that boundary. It's been proposed in the past, and it's still probably a good idea, to connect these two test frameworks so we can get better unit test coverage across the boundary. But that only helps for new tests; we still miss out on the coverage provided by the huge corpus of existing web_tests.

- The more "complete" the browser run over web_tests, the more bugs are caught, but (at least at the time) browser_tests were very flaky and there was a strong desire to avoid that fate within blink. For example you could run the full chrome binary for web_tests and catch more bugs too. Why stop at just the threaded compositor? The thinking was making web_tests very stable was better than catching a small number of real bugs at the cost of a flake tax across the hundreds of contributors.

Yep, very true: destabilizing the whole system is not worth it. Our first experiments with enabling threaded compositing by default were pretty broken, but we discovered a few fixes that greatly reduced the number of new flakes and failures, and we discovered that a relatively small number of tests were responsible for just about all of the new flakiness. So we placed those tests in quarantine: they will continue to run with the single-threaded compositor, while the large majority of tests (>95%), and all new tests, will use the threaded compositor. The best time to fix a flaky test is right when the test is added, and that holds true even if threaded compositing is the root cause of the flakiness.

Steve Kobes

unread,

Jun 9, 2023, 8:43:16 PM6/9/23

to Stefan Zager, Elliott Sprehn, Ken Rockot, blink-dev, Ojan Vafai, Adrienne Walker, Rick Byers

I think the strongest argument for threaded web tests is that web tests are the foundation for WPT. We now prefer web tests to be WPT when possible, and wpt.fyi runs against the full browser. So WPT tests need to work with threaded compositing, and it's harder to ensure this if Chrome infra runs them differently. We do have a large legacy of non-WPT web tests, but we shouldn't let that hold us back from moving toward threaded-by-default.

SimTests are better for poking at Chrome-specific internal state. Now that SimTest has a real InputHandlerProxy it is not as limited by the blink/compositor boundary and can fill some of the gap left by single-threaded web tests.

To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/blink-dev/CAHOQ7J-0kFrERxC_PZdyfA6yoh_53%2Bu50gm_vuS-hzd1pAXUTg%40mail.gmail.com.

Reply all

Reply to author

Forward