Demoting High-Recall Tests From CQ


Jeffrey Yu

Feb 18, 2026, 12:49:18 PM
to chromi...@chromium.org

tl;dr: Starting next week (Feb 23), a select set of test suites will no longer run on the CQ

What's happening?

The Chrome Dev Infrastructure Team is removing a number of high-recall tests from the CQ and making them CI-only. This is in response to the team recently hitting capacity limits, which caused CQ runs to take longer as tests sat pending in queues waiting for available machines.


To mitigate some of this capacity crunch, we will demote to CI-only the top test suites by resource usage that analysis shows have a 100% recall rate. With this, we expect to save about 3.33% of CQ test run time across all builders and all tests, in addition to capacity savings and speed gains from reduced pending times.


What are high-recall tests?

The Regression Test Selection (RTS) analysis tool analyzes CQ test runs and calculates a recall rate for every test suite: the percentage of failed CLs in the given time period that would still have failed if the test suite had not run on that builder (successful CLs do not factor into this number).
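As a rough sketch of the definition above (the data layout and field names below are illustrative assumptions, not the actual RTS schema), per-suite recall could be computed like this:

```python
# Sketch of the per-suite recall computation described above.
# (builder, suite) pairs and field names are illustrative assumptions.

def recall_rate(runs, builder, suite):
    """Fraction of failed CQ runs that would still have failed if
    `suite` had not run on `builder`. Successful runs are ignored."""
    target = (builder, suite)
    failed = [r for r in runs if r["cl_failed"]]
    if not failed:
        return None
    still_failing = sum(
        1 for r in failed
        # The run still fails if anything other than the target failed,
        # or if it failed for a non-test reason (e.g. a compile error).
        if any(f != target for f in r["suite_failures"])
        or not r["suite_failures"]
    )
    return still_failing / len(failed)
```

Under this reading, a builder/suite pair is a demotion candidate only when the value is exactly 1.0 over the analysis window.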


To be conservative and ensure no regression in test coverage, this change will only remove top test suites by resource usage that have a 100.00% recall rate. 


What is the list of test suites to be demoted?


Builder                     | Test Suite                            | Savings (Total CQ test time)
--------------------------- | ------------------------------------- | ----------------------------
linux-rel                   | blink_web_tests                       | 1.26%
linux-rel                   | content_browsertests                  | 1.14%
linux_chromium_tsan_rel_ng  | components_unittests                  | 0.17%
linux-rel                   | not_site_per_process_blink_wpt_tests  | 0.16%
linux_chromium_asan_rel_ng  | components_browsertests               | 0.14%
linux_chromium_tsan_rel_ng  | content_unittests                     | 0.10%
linux_chromium_asan_rel_ng  | headless_browsertests                 | 0.10%
linux_chromium_tsan_rel_ng  | headless_browsertests                 | 0.07%
win-rel                     | trace_test                            | 0.07%
linux_chromium_tsan_rel_ng  | cc_unittests                          | 0.06%
win-rel                     | updater_tests                         | 0.06%
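For what it's worth, the ~3.33% total quoted earlier appears to be simply the sum of the per-suite savings in this table:

```python
# Sanity check using the figures from the table above: the quoted
# ~3.33% total saving is the sum of the per-suite savings.
savings = [1.26, 1.14, 0.17, 0.16, 0.14, 0.10, 0.10, 0.07, 0.07, 0.06, 0.06]
print(round(sum(savings), 2))  # 3.33
```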


Will my projects lose test coverage?

We expect this change to maintain coverage: the listed suites will still run on other builders, and the RTS analysis shows that any CQ run failing those tests would still have failed without that specific builder/test suite combination. To be conservative about not losing test coverage, only builder/suite combinations with a 100% recall rate are being demoted.


What about other builders?

Removing other high-recall test suites could save additional resources and reduce CQ time without reducing test coverage. However, the scope of this current round of demotions is intended to address the capacity issues for Linux and Windows builders. 


To be conservative and avoid disruption, we will first demote suites on the Linux and Windows builders and monitor for test coverage regressions or increased CQ failure rates. If subsequent RTS analysis of future CQ runs shows substantial additional savings on other builders, another round of CQ updates may follow.


David Baron

Feb 18, 2026, 1:15:28 PM
to yu...@google.com, chromi...@chromium.org
Will this change affect what happens if I explicitly choose only a "linux-rel" test run in the "Choose Tryjobs" UI?  I frequently do this for code changes that have no platform-specific parts, as a way to get a complete run of tests on a single platform, as a cheaper alternative to a full CQ+1.

-David


Jeffrey Yu

Feb 18, 2026, 1:19:52 PM
to David Baron, chromi...@chromium.org
Yes, this will affect your workflow. However, you can still choose to run CI-only tests by adding a Gerrit footer to your CL:

  • Include-Ci-Only-Tests: true will enable all CI-only tests mirrored by any try builders.
  • Include-Ci-Only-Tests: builder1,builder2 will enable CI-only tests for the specified builders.
  • Include-Ci-Only-Tests: test1,test2 will enable CI-only tests for the specified tests.
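For example (purely illustrative; the subject line, bug number, and Change-Id below are placeholders), a CL description opting back in to linux-rel's CI-only tests might end with:

```
Fix scrollbar invalidation on overlay scrollbars

Bug: 0000000
Include-Ci-Only-Tests: linux-rel
Change-Id: I0000000000000000000000000000000000000000
```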

Ian Kilpatrick

Feb 18, 2026, 1:28:21 PM
to yu...@google.com, David Baron, chromi...@chromium.org
I'm very concerned about removing blink_web_tests from the linux-rel bot for Blink-related patches. I personally (and I'm guessing other engineers who work on rendering) rely on this test suite, and use it to verify correctness for rendering-related patches.

If this gets removed, engineers risk losing productivity when things do fail after landing, having to spend time relanding.

Have you investigated scoping test suites more narrowly to run when particular directories are touched? Or allowing various directories to opt in to running more suites?

Thanks,
Ian

Jeffrey Yu

Feb 18, 2026, 2:01:38 PM
to Ian Kilpatrick, David Baron, chromi...@chromium.org
Hi Ian,

The RTS data shows that the blink_web_tests don't fail in isolation, based on recent historical CQ data. Are there any use cases where those tests would fail only on linux-rel, but not on other platforms/builders, and where these platform-specific breakages would also not be caught by any unit tests?

David Baron

Feb 18, 2026, 2:13:05 PM
to Ian Kilpatrick, yu...@google.com, chromi...@chromium.org
A few other questions:

Just to clarify, what's proposed to be disabled here is just the blink_web_tests step and *not* any of:
  • vulkan_swiftshader_blink_web_tests
  • high_dpi_blink_web_tests
  • not_site_per_process_blink_web_tests
  • anything with *wpt* rather than *web* (except for not_site_per_process_blink_wpt_tests as explicitly listed)
Second, is the Recall Rate calculated in order to choose to disable multiple test suites at the same time?  For example, if the recall rate for suite A is 100% and the recall rate for suite B is 100%, it seems like that could happen because some failures were caught by only A and B and no other test suites.  Was there any check run relative to the particular set of suites being disabled that together they still have a 100% recall rate?

Also, along those lines, is the Recall Rate calculation based only on CQ+2 runs, or also on earlier CQ+1 runs and runs of individual try jobs?

Third, another interesting set of issues here is that for web tests and wpt tests, we have explicitly chosen to run a bunch of the tests only on Linux and not on other platforms.  In particular, many VirtualTestSuites are configured to run only on Linux (as documented in the VirtualTestSuites configuration), and the flag-specific configurations are all run only on Linux (although with other *_blink_web_tests names that I *think* are not being disabled in this change).  So it seems like this is dropping our only CQ testing for a bunch of our VirtualTestSuites configurations (in so far as they're testing web tests and not wpt tests, which I think are not being disabled here).  That might then influence our future choices about what platforms to run virtual test suites on.  If we have a particular shortage of Windows and Linux test machines right now, should we switch our default for one-platform virtual test suites from Linux to Mac?

-David

Ian Kilpatrick

Feb 18, 2026, 2:20:03 PM
to Jeffrey Yu, David Baron, chromi...@chromium.org
Hi Jeffrey,

While these failures rarely occur in isolation, the data they generate is useful. E.g. I regularly run jobs against the CQ to test what various codepaths are doing, and use the failures from blink_web_tests to inform my work on the patch.

For example, one complex patch I was working on was:

Here you can see there are platform specific failures within the blink_web_tests paint/invalidation suite; I used these failures to inform the patch which I eventually landed. It would cause a productivity loss if I only found out about them after they landed on the CQ (and were reverted, for example).

Yes they didn't fail in isolation, but that's because I fixed everything in one batch.

Additionally, there are whole test suites that only have coverage in blink_web_tests (e.g. paint/invalidation, some virtual suites). I suspect the data you've collected has a time-window sampling bias: currently we don't have any projects actively working on paint invalidation (though I will start one soon), and such work could potentially cause failures in isolation.

Ian

Jeffrey Yu

Feb 18, 2026, 3:33:02 PM
to David Baron, Ian Kilpatrick, chromi...@chromium.org
Replies inline below:

On Wed, Feb 18, 2026 at 11:10 AM David Baron <dba...@chromium.org> wrote:
A few other questions:

Just to clarify, what's proposed to be disabled here is just the blink_web_tests step and *not* any of:
  • vulkan_swiftshader_blink_web_tests
  • high_dpi_blink_web_tests
  • not_site_per_process_blink_web_tests
  • anything with *wpt* rather than *web* (except for not_site_per_process_blink_wpt_tests as explicitly listed)
Correct.
 
Second, is the Recall Rate calculated in order to choose to disable multiple test suites at the same time?  For example, if the recall rate for suite A is 100% and the recall rate for suite B is 100%, it seems like that could happen because some failures were caught by only A and B and no other test suites.  Was there any check run relative to the particular set of suites being disabled that together they still have a 100% recall rate?

No, the recall rate is calculated individually for each test suite. In your example, failures not caught by A would be caught by B, so either one could be removed at 100% recall, but removing both would reduce test coverage. However, it's also possible that suite A's recall rate is 100% not because its failures would also be caught by suite B, but because suite A also runs on multiple other platforms, and none of its tests are platform-specific, so a failure on any platform is duplicated across all platforms.

In this case, the removal is only from Linux and Windows, and the test suites to be demoted aren't duplicated between the two platforms. For headless_browsertests, which is listed under two different builders, one will be demoted first and the other monitored for some time to see if RTS recall rates change before deciding whether to demote it too.
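The individual-vs-joint recall distinction above can be sketched in a few lines (hypothetical data layout, not the real RTS schema): two suites can each have 100% recall on their own while demoting both together loses coverage.

```python
# Sketch of joint recall for a *set* of demoted (builder, suite)
# pairs; the data layout is an illustrative assumption.

def recall(runs, demoted):
    """Fraction of failed runs that would still fail with every
    (builder, suite) pair in `demoted` removed."""
    failed = [r for r in runs if r["cl_failed"]]
    still_failing = sum(
        1 for r in failed
        if any(f not in demoted for f in r["suite_failures"])
        or not r["suite_failures"]  # e.g. compile failures still fail
    )
    return still_failing / len(failed)

A = ("linux-rel", "suite_a")
B = ("linux-rel", "suite_b")
runs = [{"cl_failed": True, "suite_failures": [A, B]}]

print(recall(runs, {A}))     # 1.0: removing A alone loses nothing
print(recall(runs, {B}))     # 1.0: removing B alone loses nothing
print(recall(runs, {A, B}))  # 0.0: removing both misses the failure
```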
 
Also, along those lines, is the Recall Rate calculation based only on CQ+2 runs, or also on earlier CQ+1 runs and runs of individual try jobs?

The Recall Rate calculation is based on the past 3 months of historical CQ runs (for failed runs).
 
Third, another interesting set of issues here is that for web tests and wpt tests, we have explicitly chosen to run a bunch of the tests only on Linux and not on other platforms.  In particular, many VirtualTestSuites are configured to run only on Linux (as documented in the VirtualTestSuites configuration), and the flag-specific configurations are all run only on Linux (although with other *_blink_web_tests names that I *think* are not being disabled in this change).  So it seems like this is dropping our only CQ testing for a bunch of our VirtualTestSuites configurations (in so far as they're testing web tests and not wpt tests, which I think are not being disabled here).  That might then influence our future choices about what platforms to run virtual test suites on.  If we have a particular shortage of Windows and Linux test machines right now, should we switch our default for one-platform virtual test suites from Linux to Mac?
 
In a case like this, might it be worth setting up a separate test suite for these platform-specific tests? Configure them to run only on Linux (or potentially even run them on a separate builder targeting changes in specific directories) rather than running them in a large, multi-platform test suite. A lot of test suites are very large; breaking them into smaller suites would offer several benefits, allowing for more granular and smarter test selection and filtering.

Dirk Pranke

Feb 18, 2026, 3:39:55 PM
to Jeffrey Yu, David Baron, chromi...@chromium.org, ikilp...@chromium.org
Jeffrey,

Can you go into a little more detail about how you came up with this proposal? For example, are you only looking at situations where someone ran a full CQ attempt (all of the configurations at once) rather than just individual builders? And by 100% recall you're saying that e.g., blink_web_tests *never* failed on linux without also failing on another platform (e.g., windows) in the same attempt? 

I would not be too surprised if so; there's not a lot of platform-specific code in the blink tests (compared to many of the other suites, at least), and linux is probably the most well-tested config, either by hand or by people testing just one config or platform before doing a full CQ attempt.

For the test suites that are being removed, can you say which other configurations they will still be run on (or at least some of the other configurations)? E.g., if blink_web_tests will still be run on linux ASAN, then not running on linux (non-ASAN) is probably less of an issue.

As long as we have sufficient coverage of tests on other platforms and I have the above right, then I wouldn't feel too bad about removing that configuration. As you say, it sounds like it is giving us no distinctive information at all. 

If, to David's point, there are a bunch of test suites that are only run in that configuration and not in any others (though I would expect most of the virtual test suites that were linux-only to be caught by, e.g., the ASAN config, unless they were explicitly skipped in ASAN for some reason, e.g., too slow), then I would wonder what the fact that they never fail is telling us.

-- Dirk

Ian Kilpatrick

Feb 18, 2026, 4:04:25 PM
to Dirk Pranke, Jeffrey Yu, David Baron, chromi...@chromium.org
I don't think we are currently running the full linux blink_web_tests on a similar bot at the moment.

One example I've recently been working on (which was very lucky not to trigger the rule above) is:

E.g. this patch currently fails a single test (which I wouldn't have known about except for running the CQ), and fails in both not_site_per_process_blink_web_tests & blink_web_tests.
While it wouldn't have triggered the rule, it's likely accidental that the test is running in the not_site_per_process_blink_web_tests suite, given the number of skipped tests in the other test suite.

Ian 

Jeffrey Yu

Feb 18, 2026, 4:50:39 PM
to Dirk Pranke, David Baron, chromi...@chromium.org, ikilp...@chromium.org
Replies inline below.

On Wed, Feb 18, 2026 at 12:37 PM Dirk Pranke <dpr...@chromium.org> wrote:
Jeffrey,

Can you go into a little more detail about how you came up with this proposal? For example, are you only looking at situations where someone ran a full CQ attempt (all of the configurations at once) rather than just individual builders? And by 100% recall you're saying that e.g., blink_web_tests *never* failed on linux without also failing on another platform (e.g., windows) in the same attempt? 

100% recall means that for failed CQ attempts in the past 3 months, all of those failed attempts would still have failed with that test suite removed. There are multiple reasons this can happen, including:
  • the same test suite running identical tests on a different builder (as in your example)
  • other test suites testing the same logic, which always co-fail and catch the same breakages

Re: why these test suites were chosen. The capacity issues are currently happening in Linux and Windows builders, so tests for those platforms were examined, and the top targets by resource usage were chosen. There were 900+ builder/test pairs with a 100% recall rate, but many of those use fewer resources and wouldn't contribute significant savings if demoted. Test suites with less than a 100% recall rate were also excluded.

I would not be too surprised if so; there's not a lot of platform-specific code in the blink tests (compared to many of the other suites, at least), and linux is probably the most well-tested config, either by hand or by people testing just one config or platform before doing a full CQ attempt.

For the test suites that are being removed, can you say which other configurations they will still be run on (or at least some of the other configurations)? E.g., if blink_web_tests will still be run on linux ASAN, then not running on linux (non-ASAN) is probably less of an issue.
 
As long as we have sufficient coverage of tests on other platforms and I have the above right, then I wouldn't feel too bad about removing that configuration. As you say, it sounds like it is giving us no distinctive information at all. 

If, to David's point, there are a bunch of test suites that are only run in that configuration and not in any others (though I would expect most of the virtual test suites that were linux-only to be caught by, e.g., the ASAN config, unless they were explicitly skipped in ASAN for some reason, e.g., too slow), then I would wonder what the fact that they never fail is telling us.

  • try/win-rel
  • try/mac15-arm64-rel
  • try/mac-rel
  • try/win11-blink-rel
  • try/linux-blink-rel
  • try/linux-blink-web-tests-force-accessibility-rel
  • try/linux-blink-msan-rel
  • try/linux-blink-asan-rel
  • try/mac14.arm64-blink-rel
  • try/linux-blink-leak-rel
  • try/win10.20h2-blink-rel
  • try/win11-arm64-blink-rel
  • try/mac15.arm64-blink-rel
  • try/mac14-blink-rel
  • try/mac13-blink-rel
  • try/linux_chromium_dbg_ng
  • try/mac15-blink-rel
  • try/mac12.0-blink-rel
  • try/mac13.arm64-blink-rel
  • try/mac12.0.arm64-blink-rel
  • try/linux-oi-rel
  • try/mac13-tests
  • try/win-arm64-rel
  • try/win11-rel
  • try/mac12-tests
  • try/mac12-arm64-rel
  • try/mac14-arm64-rel
  • try/linux-bfcache-rel
  • try/mac13-arm64-rel
  • try/mac15-x64-rel-tests
  • try/linux-dcheck-off-rel
Not every builder will run for every CQ run, but even so, if there's nothing too platform-specific in these tests, there should be a fair amount of redundancy.

Rick Byers

Feb 18, 2026, 5:02:12 PM
to Jeffrey Yu, Deepak Ravichandran, Ben Pastene, Dirk Pranke, David Baron, Chromium-dev, Ian Kilpatrick
Adding @Deepak Ravichandran and @Ben Pastene, as I think this is a specific instance of a larger tradeoff they've been discussing. May be worth a small meeting at this point?

Ian Kilpatrick

Feb 18, 2026, 5:10:16 PM
to Jeffrey Yu, Dirk Pranke, David Baron, chromi...@chromium.org
Most of these bots don't run on the CQ - AFAIK only linux-rel/mac-rel/win-rel do.

> if there's nothing too platform-specific in these tests

blink_web_tests are platform-specific due to:
 - font rendering differences between platforms
 - pixel ref tests differing between platforms
 - invalidation differences
(and also the virtual test suite issue indicated above).

We should have all the tests within blink_web_tests & blink_wpt_tests run at least once on linux/mac/win.

Ian

Steve Kobes

Feb 18, 2026, 6:22:36 PM
to ikilp...@chromium.org, Jeffrey Yu, Dirk Pranke, David Baron, chromi...@chromium.org
Hi, I am chiming in to echo the request for the commit queue to run all blink_web_tests and blink_wpt_tests on each of Linux, Mac, and Windows.

Earlier in the thread, Dirk wrote: "And by 100% recall you're saying that e.g., blink_web_tests *never* failed on linux without also failing on another platform (e.g., windows) in the same attempt?"

I think answering that question in the affirmative is the only way we could sensibly remove blink_web_tests from linux-rel. But a "yes" to that question would greatly surprise me based on my experience in Blink. It's sounding as if "100% recall" may mean something different and more artificial.

Blink has a lot of platform-specific visible behavior, and a lot of platform-specific implementation details. Web tests validate those things even if the tests themselves do not contain platform-specific code. But also, web tests do often have platform-specific code, platform-specific baselines, or platform-specific expectations.

Scrollbars are an obvious example. Blink has a lot of code to manage scrollbars that look and act different on each platform. Scrollbars have many subtle behaviors that can easily regress. Web tests on the commit queue often catch platform-specific scrollbar regressions before a patch lands.

I hope we can tackle the commit queue's capacity challenges in a way that preserves our commitment to product excellence on all supported platforms.

Jeffrey Yu

Feb 18, 2026, 6:41:12 PM
to Ian Kilpatrick, Dirk Pranke, David Baron, chromi...@chromium.org
Hi Chrome devs,

Thank you for the initial round of feedback. After an offline chat with Ian and Deepak, and based on the replies here, I've updated the list of suites to demote to exclude "blink_web_tests" and "trace_tests", keeping the suites that dev teams would like to retain on the CQ.

If you have any additional concerns or questions about any of the other test suites in the list, please let me know!

