testing unlaunched flags

Steve Kobes

Nov 20, 2015, 6:11:44 PM
to blin...@chromium.org
For new flags that are still in development, it's useful to have a record of which layout tests pass with the flag enabled.  Virtual test suites are one way to do this, but they don't scale to flags that make deep architectural changes that potentially impact all of the tests.

In http://crrev.com/1469433002 I'm proposing "flag-specific test expectations" which are used by run-webkit-tests when additional flags are passed.

This makes it easier to see the flag's progress over time, and provides a way for patches that fix existing tests to include test coverage (since they can remove the flag-specific failure expectation instead of just saying "TESTED=foo/bar.html now passes with the flag" in the description).
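
To make this concrete, a run with flag-specific expectations might look roughly like this (the flag and test names below are purely illustrative, and the command-line option shown is just whatever run-webkit-tests uses to pass extra flags to content_shell):

    # Run the layout tests with an extra flag; run-webkit-tests also
    # applies the matching flag-specific expectations file.
    run-webkit-tests --additional-driver-flag=--my-experimental-feature

    # LayoutTests/FlagExpectations/my-experimental-feature
    foo/still-broken-under-flag.html [ Failure ]
    foo/crashes-under-flag.html [ Crash ]

A patch that fixes foo/still-broken-under-flag.html under the flag would then simply delete that line from the expectations file.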

Note that flag-specific expectation files won't automatically cause the bots to test your flag.  But if you are setting up a dedicated bot it should be quite easy to use them.

Let me know if you have any feedback on this idea.

Thanks,
Steve

Walter Korman

Nov 21, 2015, 1:27:03 AM
to Steve Kobes, blin...@chromium.org, Xianzhu Wang
Interesting! During the SlimmingPaint v1 run-up to launch, wangxianzhu@ manually created and maintained a set of one-off scripts and patches to track things on an ongoing basis, similar to what this is aimed at. I think your proposal could have made this easier.

Also somewhat related is http://crbug.com/537764, wherein we plan to auto-generate what's needed for RuntimeEnabledFeatures to have test-available JS setters. Today this has to be done manually on a per-feature basis if you want to write tests that fiddle with REFs one way or another (or use a virtual suite). I think this is worth doing (it was an output of a discussion with ojan@), but it is only in addition to what you're proposing.
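
To sketch the idea (the names here are hypothetical, and today you'd have to wire up the setter by hand for each feature), a layout test would be able to do something like:

    <script>
    // Toggle a RuntimeEnabledFeature from within the test itself,
    // via an auto-generated setter exposed on window.internals.
    if (window.internals)
      internals.runtimeFlags.myExperimentalFeatureEnabled = true;
    </script>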

For virtual test suites I understand there are performance ramifications as they don't/didn't shard as effectively. Would that be an issue here at all?

Steve Kobes

Nov 22, 2015, 6:37:56 PM
to Walter Korman, blin...@chromium.org, Xianzhu Wang
On Fri, Nov 20, 2015 at 10:26 PM, Walter Korman <wko...@google.com> wrote:
Also somewhat related is http://crbug.com/537764, wherein we plan to auto-generate what's needed for RuntimeEnabledFeatures to have test-available JS setters. Today this has to be done manually on a per-feature basis if you want to write tests that fiddle with REFs one way or another (or use a virtual suite). I think this is worth doing (it was an output of a discussion with ojan@), but it is only in addition to what you're proposing.

Agreed, this is a great idea for tests that are specific to the feature, but it's orthogonal to my proposal.
 
For virtual test suites I understand there are performance ramifications as they don't/didn't shard as effectively. Would that be an issue here at all?

As I understand it, the performance issues with virtual test suites are due to not wanting the main waterfall and trybots to run every test n times for n configurations, plus having to restart the binary when flags change.  If you have a dedicated bot for your flag, or if you are manually maintaining the flag-specific expectations, then neither of those issues applies.

Xianzhu Wang

Nov 23, 2015, 12:00:33 AM
to Walter Korman, Steve Kobes, blink-dev
On Fri, Nov 20, 2015 at 10:26 PM, Walter Korman <wko...@google.com> wrote:
Interesting! During the SlimmingPaint v1 run-up to launch, wangxianzhu@ manually created and maintained a set of one-off scripts and patches to track things on an ongoing basis, similar to what this is aimed at. I think your proposal could have made this easier.

Agreed. I think it's important to get full test coverage for features with system-wide impact before launch. In practice, virtual test suites can only provide a small slice of that coverage. Steve's proposal lets us get full test coverage on trybots or dedicated buildbots.

Dirk Pranke

Nov 24, 2015, 10:28:46 PM
to Steve Kobes, Walter Korman, blink-dev, Xianzhu Wang
Right. To add a bit more detail ...

I believe we still run tests sharded by directory by default, in which case virtual test suites shard just as well as non-virtual test suites (i.e., exactly the same). You do have to restart content_shell when you want to change the list of arguments, and depending on the number of virtual test suites and their size, and the number of threads you have, the restarting can become a significant penalty. In the extreme case, one of the reasons we don't run --fully-parallel is because we don't have a good way of binding tests to content_shells running w/ the same arguments (i.e., preserving affinity), and so you end up restarting every few tests, which is really slow.

[ This is all stuff that could probably be fixed if one wanted to do so. ]

Assuming I understand Steve's proposal correctly, though, the flag-specific expectations are an all-or-nothing thing: during the test run, every test gets the same arguments, and so the same penalty doesn't exist.

-- Dirk
 

Dirk Pranke

Nov 24, 2015, 10:28:46 PM
to Steve Kobes, blink-dev
On Fri, Nov 20, 2015 at 3:11 PM, Steve Kobes <sko...@chromium.org> wrote:
For new flags that are still in development, it's useful to have a record of which layout tests pass with the flag enabled.  Virtual test suites are one way to do this, but they don't scale to flags that make deep architectural changes that potentially impact all of the tests.

In http://crrev.com/1469433002 I'm proposing "flag-specific test expectations" which are used by run-webkit-tests when additional flags are passed.

I like the proposal.

Have you thought much about how you imagine people using this feature, how bot coverage would work, etc.? 

We'll probably need to do a little more leg work to figure out how to make the other tools play nicely with this without requiring additional bots to be set up (like we had to do w/ oilpan, which is a compile-time flag instead of a runtime flag).

-- Dirk

Steve Kobes

Nov 25, 2015, 6:22:09 PM
to Dirk Pranke, blink-dev
On Tue, Nov 24, 2015 at 7:28 PM, Dirk Pranke <dpr...@chromium.org> wrote:
Have you thought much about how you imagine people using this feature, how bot coverage would work, etc.? 

I am using it for --root-layer-scrolling, and I imagine other people developing behind flags may find it similarly useful.  For example, I am working on a fix for scrollbar placement in RTL documents.  Using flag-specific expectations I was able to discover that

(1) there is an existing test which is fixed by the patch (compositing/rtl/rtl-overflow-invalidation.html), which saves me the trouble of writing a new one
(2) there are a handful of tests which are broken by the patch, which I need to investigate

Without flag-specific expectations, it would be much more work to discover these diffs since they would be lost in the noise of expected failures.  Plus the patch can modify FlagExpectations/root-layer-scrolls to indicate test coverage instead of using the change description.
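
Concretely, the patch's change to FlagExpectations/root-layer-scrolls would look something like this (the exact expectation keywords and the newly-failing test names are placeholders):

    -compositing/rtl/rtl-overflow-invalidation.html [ Failure ]
    +fast/scrollbars/some-newly-broken-test.html [ Failure ]
    +fast/scrollbars/another-newly-broken-test.html [ Failure ]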

We'll probably need to do a little more leg work to figure out how to make the other tools play nicely with this without requiring additional bots to be set up (like we had to do w/ oilpan, which is a compile-time flag instead of a runtime flag).

Just to be clear, this does not give us fully automated continuous testing of unlaunched flags without dedicated bots... I don't know of a good way to do that (but if anyone has ideas please chime in).

Dirk Pranke

Nov 25, 2015, 6:33:00 PM
to Steve Kobes, blink-dev
On Wed, Nov 25, 2015 at 3:21 PM, Steve Kobes <sko...@chromium.org> wrote:
On Tue, Nov 24, 2015 at 7:28 PM, Dirk Pranke <dpr...@chromium.org> wrote:
Have you thought much about how you imagine people using this feature, how bot coverage would work, etc.? 

I am using it for --root-layer-scrolling, and I imagine other people developing behind flags may find it similarly useful.  For example, I am working on a fix for scrollbar placement in RTL documents.  Using flag-specific expectations I was able to discover that

(1) there is an existing test which is fixed by the patch (compositing/rtl/rtl-overflow-invalidation.html), which saves me the trouble of writing a new one
(2) there are a handful of tests which are broken by the patch, which I need to investigate

Without flag-specific expectations, it would be much more work to discover these diffs since they would be lost in the noise of expected failures.  Plus the patch can modify FlagExpectations/root-layer-scrolls to indicate test coverage instead of using the change description.

Right, makes sense.
 
We'll probably need to do a little more leg work to figure out how to make the other tools play nicely with this without requiring additional bots to be set up (like we had to do w/ oilpan, which is a compile-time flag instead of a runtime flag).

Just to be clear, this does not give us fully automated continuous testing of unlaunched flags without dedicated bots... I don't know of a good way to do that (but if anyone has ideas please chime in).

Right. 

If we did want to leverage this on the bots, I think perhaps the route to take would be to add new test steps and be able to configure the flag per step. I think that would make the flakiness dashboard work, at least. We'd probably have to also modify the steps that archive the test results to archive them into a different bucket (or key the location off of the step name).

If we wanted to have feature-specific baselines, we could extend the patch to look in additional directories much like the way --additional-platform-directory works and then either run multiple rebaseline-o-matic jobs with different flags, or modify it to loop over the different flags.
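
For reference, the existing mechanism is roughly:

    # Prepend an extra directory to the baseline search path for this run.
    run-webkit-tests --additional-platform-directory=/path/to/local/baselines

and the extension here would presumably be a per-flag directory searched the same way, e.g. something like LayoutTests/FlagBaselines/root-layer-scrolls/foo/bar-expected.txt (the directory name is purely hypothetical; the patch would have to pick one).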

Of course, if each flag really needed to run all of the tests, that would create a lot of load and increase cycle time significantly, so I wouldn't want to add this lightly, at least not before we had swarming support so we could run steps in parallel.

(though at least in some cases this might be nearly net-neutral to cycle time if we could then delete various virtual test suites).

-- Dirk

Steve Kobes

Nov 30, 2015, 1:24:25 PM
to Dirk Pranke, blink-dev
On Wed, Nov 25, 2015 at 3:32 PM, Dirk Pranke <dpr...@chromium.org> wrote:
If we wanted to have feature-specific baselines, we could extend the patch to look in additional directories much like the way --additional-platform-directory works

Yes, I will probably end up doing this too.
 
Of course, if each flag really needed to run all of the tests, that would create a lot of load and increase cycle time significantly, so I wouldn't want to add this lightly, at least not before we had swarming support so we could run steps in parallel.

If you don't need to run all of the tests, you are probably running a subset that is directly related to the feature.  For example, with --enable-fast-text-autosizing we only cared about the tests under fast/text-autosizing/.  I think in those cases, virtual test suites are a better solution and we should continue using them.
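
For that kind of feature, an entry in LayoutTests/VirtualTestSuites along these lines covers it (the prefix and field values here are illustrative):

    {
      "prefix": "text-autosizing",
      "base": "fast/text-autosizing",
      "args": ["--enable-fast-text-autosizing"]
    }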

Ojan Vafai

Nov 30, 2015, 4:13:18 PM
to Steve Kobes, Dirk Pranke, blink-dev
We should probably document this. Maybe add a section to https://www.chromium.org/developers/testing/webkit-layout-tests#TOC-Virtual-Test-Suites explaining how these work and when to use this option vs. when to use a VirtualTestSuites? I agree that it's mostly just a question of number of tests. 

Steve Kobes

Dec 1, 2015, 1:39:40 PM
to Ojan Vafai, Dirk Pranke, blink-dev
On Mon, Nov 30, 2015 at 1:13 PM, Ojan Vafai <oj...@chromium.org> wrote:
We should probably document this. Maybe add a section to https://www.chromium.org/developers/testing/webkit-layout-tests#TOC-Virtual-Test-Suites explaining how these work and when to use this option vs. when to use a VirtualTestSuites? I agree that it's mostly just a question of number of tests.

I have now refactored this page to explain flag-specific expectations and contrast them with virtual test suites.