A quick survey on setting up fuzzy match rules to identify resolvable flaky web tests

Vivian Zhi (支文文)

Jul 29, 2022, 8:20:20 PM
to blin...@chromium.org, Shirley Ji, Weizhong Xia, Chrome-Blink-EngProd
Hi blink-dev,

I would like to let you know that blink-engprod has added fuzzy matching support for non-WPT tests. Both non-WPT reftests and pixel tests can now use the same fuzzy matching meta tags as WPT tests. results.html also now shows two image diff stats: the max color channel difference and the total number of different pixels. With these capabilities in place, we would like to investigate whether we can set up some general fuzzy match rules to help Blink developers identify flaky tests that could potentially be resolved by adjusting fuzzy matching rules. Currently, quite a few web tests are flaky due to a slight image mismatch that should have been tolerated. We could set up a general fuzzy matching rule, something like:

 <meta name="fuzzy" content="0-1;0-1000">

This would instruct the image comparison in web tests to ignore the diff and pass the test if the max color channel difference and the number of differing pixels both fall within the ranges of the rule. This way we can reduce test flakiness while still maintaining test accuracy and without missing real bugs.
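
A minimal sketch of how such a rule could be evaluated, assuming the two ranges are, in order, the max per-channel difference and the total number of differing pixels, and that both bounds are inclusive. The names `parse_fuzzy` and `is_within_fuzzy` are hypothetical and are not the actual web test runner code; only the `low-high;low-high` shorthand shown above is handled.

```python
from typing import NamedTuple, Tuple


class FuzzyRule(NamedTuple):
    # Inclusive range for the largest per-channel color difference.
    max_difference: Tuple[int, int]
    # Inclusive range for the number of pixels that differ.
    total_pixels: Tuple[int, int]


def _parse_range(text: str) -> Tuple[int, int]:
    """Parse a 'low-high' string into a pair of ints."""
    low, high = text.strip().split("-")
    return int(low), int(high)


def parse_fuzzy(content: str) -> FuzzyRule:
    """Parse the shorthand content value, e.g. '0-1;0-1000'."""
    max_difference_text, total_pixels_text = content.split(";")
    return FuzzyRule(_parse_range(max_difference_text),
                     _parse_range(total_pixels_text))


def is_within_fuzzy(rule: FuzzyRule, max_difference: int,
                    total_pixels: int) -> bool:
    """Return True only if both measured values fall inside the rule."""
    low_d, high_d = rule.max_difference
    low_p, high_p = rule.total_pixels
    return (low_d <= max_difference <= high_d
            and low_p <= total_pixels <= high_p)


rule = parse_fuzzy("0-1;0-1000")
print(is_within_fuzzy(rule, max_difference=1, total_pixels=500))
print(is_within_fuzzy(rule, max_difference=2, total_pixels=500))
```

Under these assumptions the first call is tolerated and the second is not, because a channel difference of 2 falls outside the 0-1 range even though the pixel count is in range.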

We would like to ask a few quick survey questions to help us make design decisions: does it make sense to set up a universal, across-the-board fuzzy match tolerance rule for all Blink web tests, or should we make the rules more specific to individual tests or test sets?

1. Is a universal fuzzy match tolerance rule acceptable for the web tests in your area?

    a) If the answer is yes, what is the acceptable range of max color channel difference and pixel diff for your tests?
    b) If the answer is no, please share your reasons.

2. Would you prefer to adjust fuzzy matching rules at a per-test or per-test-set level, based on the pixel difference numbers shown in results.html?

Here is some sample data to help you decide. We recently collected data from blink_web_tests results on the linux-test builder, showing the distribution of color channel maxDifference and totalPixels diff for failing/flaky blink_web_tests.
(Note: over 70% of the tests in the color channel maxDifference 0-10 range have maxDifference=1.)

Color channel maxDifference range    Failing test count
0-10                                 98
11-100                               31
101-200                              28
201-260                              111

totalPixels diff range               Failing test count
0-100                                30
100-1000                             57
1000-10,000                          99
10,000-100,000                       66
100,000-1,000,000                    16

Let me know if you have any questions. Looking forward to hearing from you!


Vivian
on behalf of Chrome-Blink-EngProd

Stephen Chenney

Aug 1, 2022, 7:25:46 AM
to Vivian Zhi (支文文), blin...@chromium.org, Shirley Ji, Weizhong Xia, Chrome-Blink-EngProd
Thanks for investigating the potential for fuzzy matching.

Rendering Core continues to oppose a single fuzzy match rule across all web_tests. We have some tests where single pixel differences matter (related to pixel snapping, for example) and a universal fuzzy match would fail to identify problems with those. This came up in practice recently when the GPU team enabled fuzzy matching without telling us, and expected failing tests started passing when they shouldn't.

Maybe specific sub teams have directories they could apply default fuzzy matching to. My guess is that the same directories where it will work will be directories with few failing tests, limiting the impact of a per-directory approach.

Is there a way to reproduce the sampling below with a side-by-side comparison of the images? I would find it helpful to look through some of the cases that would pass with <meta name="fuzzy" content="0-1;0-1000">, for example.

Stephen.

--
You received this message because you are subscribed to the Google Groups "blink-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to blink-dev+...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/blink-dev/CAPCqkTs-L5u22-Xp5U_LeBdLP%3D%2BTDH1KGv8MTmtKQFRcANCZJg%40mail.gmail.com.

Xianzhu Wang

Aug 1, 2022, 11:40:09 AM
to Stephen Chenney, Vivian Zhi (支文文), blink-dev, Shirley Ji, Weizhong Xia, Chrome-Blink-EngProd
On Mon, Aug 1, 2022 at 4:25 AM Stephen Chenney <sche...@chromium.org> wrote:
> Thanks for investigating the potential for fuzzy matching.
>
> Rendering Core continues to oppose a single fuzzy match rule across all web_tests. We have some tests where single pixel differences matter (related to pixel snapping, for example) and a universal fuzzy match would fail to identify problems with those. This came up in practice recently when the GPU team enabled fuzzy matching without telling us, and expected failing tests started passing when they shouldn't.

I think a key difference between the original fuzzy matching rule and the rule proposed by Vivian is the ranges. With maxDifference=0-1, we should be able to catch most visible single pixel differences. What I'm not sure about is whether a difference like rgb(1, 0, 0) vs rgb(0, 0, 0) (each component in the range of 0-255) should be treated as a failure in some cases.
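
To make the rgb(1, 0, 0) vs rgb(0, 0, 0) case concrete, here is a minimal sketch of how the two stats could be computed, assuming images are flat lists of (r, g, b) tuples of equal length and ignoring alpha. `diff_stats` is a hypothetical helper, not the actual image diff tool.

```python
from typing import Sequence, Tuple

Pixel = Tuple[int, int, int]


def diff_stats(actual: Sequence[Pixel],
               expected: Sequence[Pixel]) -> Tuple[int, int]:
    """Return (max per-channel difference, number of differing pixels)."""
    if len(actual) != len(expected):
        raise ValueError("images must have the same number of pixels")
    max_difference = 0
    total_pixels = 0
    for actual_pixel, expected_pixel in zip(actual, expected):
        # Largest absolute difference across the r, g and b channels.
        pixel_diff = max(abs(a - e)
                         for a, e in zip(actual_pixel, expected_pixel))
        if pixel_diff:
            total_pixels += 1
            max_difference = max(max_difference, pixel_diff)
    return max_difference, total_pixels


# One pixel differs by 1 in the red channel; the other is identical.
print(diff_stats([(1, 0, 0), (5, 5, 5)], [(0, 0, 0), (5, 5, 5)]))
```

In this sketch the pair differs by 1 in a single channel of a single pixel, so a rule with maxDifference=0-1 would tolerate it as long as the pixel count is also in range.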

> Maybe specific sub teams have directories they could apply default fuzzy matching to. My guess is that the same directories where it will work will be directories with few failing tests, limiting the impact of a per-directory approach.
>
> Is there a way to reproduce the sampling below with a side-by-side comparison of the images? I would find it helpful to look through some of the cases that would pass with <meta name="fuzzy" content="0-1;0-1000">, for example.

A filter by actual maxDifference and totalPixels in results.html would be helpful. I can add it when I get time.

Vivian Zhi (支文文)

Aug 1, 2022, 6:16:37 PM
to Xianzhu Wang, Stephen Chenney, blink-dev, Shirley Ji, Weizhong Xia, Chrome-Blink-EngProd
Thanks for the valuable feedback, Stephen and Xianzhu! We will see if we can add a filter in results.html to grab the tests in a given range.

Xianzhu Wang

Aug 2, 2022, 12:04:00 PM
to Vivian Zhi (支文文), Stephen Chenney, blink-dev, Shirley Ji, Weizhong Xia, Thorben Troebst
On Mon, Aug 1, 2022 at 10:36 AM Vivian Zhi (支文文) <viv...@google.com> wrote:
> Thanks for the valuable feedback, Stephen and Xianzhu! We will see if we can add a filter in results.html to grab the tests in a given range.

The CL adding a pixel diff filter to results.html has landed. Thanks, Thorben!

In this example results.html, you can examine the pixel results of tests that produced pixel differences matching a particular fuzzy rule by following these steps:
1. Enter a pixel difference filter, e.g. "channel_max:1-1", in the filter input box;
2. Click the "All" button (as only regressions are shown by default).
You might want to switch to "side-by-side view" and click the image to examine the pixel values.
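
The effect of such a filter can be sketched as follows. This is only an illustration of the "channel_max:low-high" syntax mentioned above, not the results.html implementation; the result records, their field names, and the test names are all hypothetical.

```python
from typing import Dict, List, Tuple


def parse_channel_max_filter(text: str) -> Tuple[int, int]:
    """Parse 'channel_max:low-high' into an inclusive (low, high) pair."""
    key, _, value = text.partition(":")
    if key != "channel_max":
        raise ValueError(f"unsupported filter: {text!r}")
    low, high = value.split("-")
    return int(low), int(high)


def filter_results(results: List[Dict], filter_text: str) -> List[str]:
    """Return names of results whose channel max difference is in range."""
    low, high = parse_channel_max_filter(filter_text)
    return [r["name"] for r in results if low <= r["channel_max"] <= high]


# Hypothetical records; real results carry more fields than this.
results = [
    {"name": "test-a.html", "channel_max": 1, "total_pixels": 420},
    {"name": "test-b.html", "channel_max": 37, "total_pixels": 12},
    {"name": "test-c.html", "channel_max": 1, "total_pixels": 90000},
]
print(filter_results(results, "channel_max:1-1"))
```

With "channel_max:1-1" only the records whose largest channel difference is exactly 1 are kept, regardless of how many pixels differ.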

With "channel_max:1-1" we can see all tests that produced pixel differences that could be tolerated with a fuzzy rule like <meta name="fuzzy" content="0-1;0-1000000">. There are 70 such tests in the example results.html. All of them look benign to me. So perhaps a universal rule (for non-WPT tests) is appropriate?

On the other hand, even with such a universal rule, we would recover only 70 tests. Instead of applying the rule automatically, we could also manually modify these tests to include a meta fuzzy rule.
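
For the manual route, a rough sketch of a helper that adds a fuzzy meta tag to a test's HTML source might look like the following. `add_fuzzy_meta` is hypothetical, the insertion point (right after the doctype) is an assumption, and each test would still need its ranges chosen and reviewed individually.

```python
import re

# Matches an existing fuzzy meta tag, with or without quotes around the name.
FUZZY_META_RE = re.compile(r"<meta\s+name=[\"']?fuzzy[\"']?", re.IGNORECASE)
# Matches a leading doctype declaration and the line break after it.
DOCTYPE_RE = re.compile(r"\s*<!DOCTYPE[^>]*>[ \t]*\r?\n?", re.IGNORECASE)


def add_fuzzy_meta(html: str, max_difference: str = "0-1",
                   total_pixels: str = "0-1000") -> str:
    """Insert a fuzzy meta tag unless the document already has one."""
    if FUZZY_META_RE.search(html):
        return html  # Leave tests with an explicit rule untouched.
    tag = f'<meta name="fuzzy" content="{max_difference};{total_pixels}">\n'
    match = DOCTYPE_RE.match(html)
    insert_at = match.end() if match else 0
    return html[:insert_at] + tag + html[insert_at:]


source = "<!DOCTYPE html>\n<title>example</title>\n<div>content</div>\n"
print(add_fuzzy_meta(source))
```

The helper is idempotent: running it again on its own output leaves the file unchanged, which keeps a bulk update over a list of tests safe to repeat.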

Philip Rogers

Aug 2, 2022, 12:10:54 PM
to blink-dev, Xianzhu Wang, Stephen Chenney, blink-dev, Shirley Ji, weiz...@google.com, Thorben Troebst, Vivian Zhi (支文文)
How much of a problem is flakiness caused by minor pixel differences compared to overall flakiness? I looked at the top 10 flaky tests here and none of them were minor pixel differences.

70 tests is a manageable number and it seems reasonable to add fuzzy matching to them.