looking at CQ false rejections

Paweł Hajdan, Jr.

Sep 29, 2014, 11:02:17 AM
to infr...@chromium.org
I wrote a quick script (attached) to list more details about CQ false rejections, i.e. cases where we failed a CQ attempt for one patchset, but another attempt for the same patchset actually succeeded.
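
A minimal sketch of that definition (not the attached script): it assumes each CQ attempt is available as a dict with hypothetical keys `issue`, `patchset` and `success`, which is an illustrative shape rather than anything the CQ actually exposes.

```python
from collections import defaultdict


def find_false_rejections(attempts):
    """Return failed attempts whose patchset also has a successful attempt.

    `attempts` is an iterable of dicts with the hypothetical keys
    'issue', 'patchset' and 'success' (bool).
    """
    # Group all attempts that belong to the same patchset.
    by_patchset = defaultdict(list)
    for attempt in attempts:
        by_patchset[(attempt["issue"], attempt["patchset"])].append(attempt)

    false_rejections = []
    for group in by_patchset.values():
        # A failure only counts as a false rejection if the very same
        # patchset passed in some other attempt.
        if any(a["success"] for a in group):
            false_rejections.extend(a for a in group if not a["success"])
    return false_rejections


attempts = [
    {"issue": 1001, "patchset": 1, "success": False},  # false rejection
    {"issue": 1001, "patchset": 1, "success": True},
    {"issue": 1002, "patchset": 1, "success": False},  # never succeeded
]
print(find_false_rejections(attempts))
```

Only the first attempt is reported; the failure for issue 1002 never had a passing attempt on the same patchset, so it may well be a genuine rejection.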

Note this is different from chromium-try-flakes in that the latter does not take into account whether the tryjob failure caused the attempt to fail or only triggered an internal retry within the same attempt. Of course it's still good to work on these flakes and minimize their occurrences, since every one increases CQ latency.

Now back to the false rejections, here's the data I got. Note that it only looks at the first page of CQ rejections for now, so the sample size is small.

Still, what I wonder is e.g. whether we should increase the number of retries for blink CQ to 2...

Try jobs failed on following builders:
  Test-Ubuntu13.10-GCE-NoGPU-x86_64-Debug-Trybot on tryserver.skia (http://108.170.220.120:10117/builders/Test-Ubuntu13.10-GCE-NoGPU-x86_64-Debug-Trybot/builds/891)

Paweł
find_false_rejections.py

John Abd-El-Malek

Sep 29, 2014, 11:33:43 AM
to Paweł Hajdan, Jr., infr...@chromium.org
I can't help but notice that most of the links below are from win_blink_rel. Looking at Sergey's weekly emails to chromium-dev, there's a section about top flaky builders. I ran the numbers for last week. win_blink_rel was flaky 26% of the time. win_blink_dbg was 22%. The next blink bot was only at 3%. So something is much flakier on the blink win bots.

For comparison, on the chromium CQ, the two win bots (32 & 64 bit) are at 3%. Note that when we started the CY, the win_chromium bots were around 20%. We got this decrease mostly by disabling the very small number of tests that were responsible for most of the flakiness. Some of the disabled tests were fixed, but the important thing is that they were disabled immediately. chromium-try-flakes has helped me to get this list.

Instead of making the CQ retry twice for blink and wasting cycles rerunning win_blink_rel so much, it seems that whatever is running on that bot should be examined closely to bring the flakiness rate from a very high 30% down to something reasonable.
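
As a rough back-of-the-envelope sketch of that trade-off: assume each run of a builder flakes independently with a fixed probability (an assumption; real flakes are often correlated), and that a builder is rerun only after a failure. The function names are made up for illustration.

```python
def attempt_failure_probability(flake_rate, retries):
    """Probability that the first run and every retry all flake."""
    return flake_rate ** (retries + 1)


def expected_runs(flake_rate, retries):
    """Expected builder runs per attempt, rerunning only after a failure."""
    # Run k (0-based) happens only if the k previous runs all flaked.
    return sum(flake_rate ** k for k in range(retries + 1))


for rate in (0.26, 0.03):
    for retries in (1, 2):
        print(
            f"flake rate {rate:.0%}, {retries} retries: "
            f"P(false rejection) = {attempt_failure_probability(rate, retries):.4f}, "
            f"expected runs = {expected_runs(rate, retries):.4f}"
        )
```

Under these assumptions, a second retry at a 26% flake rate cuts the per-builder false rejection chance from about 6.8% to about 1.8%, while bringing the flake rate down to 3% gets it to about 0.09% with a single retry and fewer reruns.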

--
You received this message because you are subscribed to the Google Groups "infra-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to infra-dev+...@chromium.org.
To post to this group, send email to infr...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/infra-dev/CAATLsPbhQC%3DgLjdHa4cJ2t_WH9miL4ERVEq30xow2ZaVqUSOOQ%40mail.gmail.com.

Dirk Pranke

Sep 29, 2014, 4:42:35 PM
to John Abd-El-Malek, Paweł Hajdan, Jr., infr...@chromium.org
win_blink_rel is indeed known to be very flaky. enne@ landed some changes at the end of the week last week that I'm hoping fix the worst of the problems, but I haven't looked at recent builds to see if things are better.

Unfortunately the intersection of {blink developers} and {people who regularly develop on windows} is nearly zero, and the non-zero few are usually quite busy, so there are very few people regularly feeling this pain *and* able and motivated to fix it.

Volunteers to work on issues are welcome :). We could theoretically be willing to turn the bot off as an alternative, but I'm not sure how that would be helpful.

-- Dirk

John Abd-El-Malek

Sep 29, 2014, 4:50:27 PM
to Dirk Pranke, Paweł Hajdan, Jr., infr...@chromium.org
On Mon, Sep 29, 2014 at 1:42 PM, Dirk Pranke <dpr...@chromium.org> wrote:
win_blink_rel is indeed known to be very flaky. enne@ landed some changes at the end of the week last week that I'm hoping fix the worst of the problems, but I haven't looked at recent builds to see if things are better.

Unfortunately the intersection of {blink developers} and {people who regularly develop on windows} is nearly zero, and the non-zero few are usually quite busy, so there are very few people regularly feeling this pain *and* able and motivated to fix it.

nit: everyone working on Blink is feeling the pain, since it appears the high flakiness is slowing down the Blink CQ by 20-30 minutes :)

Dirk Pranke

Sep 29, 2014, 5:02:50 PM
to John Abd-El-Malek, Paweł Hajdan, Jr., infr...@chromium.org
Which is why I emphasized the "and" part ...

Julie Parent

Sep 29, 2014, 7:32:44 PM
to Dirk Pranke, e...@chromium.org, John Abd-El-Malek, Paweł Hajdan, Jr., infr...@chromium.org
+eae

Not volunteering him, but Emil is the only person I know who falls in the intersection of {blink developers} and {people who regularly develop on windows}, and has indicated willingness to help with CY.

Emil A Eklund

Sep 30, 2014, 11:25:21 AM
to Julie Parent, Dirk Pranke, John Abd-El-Malek, Paweł Hajdan, Jr., infr...@chromium.org
On Mon, Sep 29, 2014 at 4:32 PM, Julie Parent <jpa...@chromium.org> wrote:
> +eae
>
> Not volunteering him, but Emil is the only person I know who falls in the
> intersection of {blink developers} and {people who regularly develop on
> windows}, and has indicated willingness to help with CY.

Sad but true, that intersection is surprisingly small given that
windows is still by far our most popular platform.

What specifically am I (not) being volunteered for?

--
Emil

Julie Parent

Sep 30, 2014, 8:40:14 PM
to e...@chromium.org, Dirk Pranke, John Abd-El-Malek, Paweł Hajdan, Jr., infr...@chromium.org
Getting the 26% flaky rate of win_blink tests down to something more reasonable, by investigating and disabling tests as necessary, like jam@ did with chromium tests.  https://groups.google.com/a/chromium.org/d/msgid/infra-dev/CALhVsw34wQrF%2BY8SsxiA2DX-yp1Rm-ZQzqtG0VOXMaLUp5nt2g%40mail.gmail.com should have the full context

Ojan Vafai

Sep 30, 2014, 9:18:54 PM
to Julie Parent, Emil A Eklund, Dirk Pranke, John Abd-El-Malek, Paweł Hajdan, Jr., infr...@chromium.org
IMO, the focus here should be on exposing flakiness on the main waterfall bots in sheriff-o-matic so we can disable/fix/delete/etc these tests in a long-term sustainable way. I have a plan for this, but I've been having trouble finding someone to work on this.

That said, it'd probably be good for someone to check a random sample of the false rejections we see on the win blink try bot and see if they're flaky on the main waterfall.


Emil A Eklund

Sep 30, 2014, 9:19:15 PM
to Julie Parent, Dirk Pranke, John Abd-El-Malek, Paweł Hajdan, Jr., infr...@chromium.org
On Tue, Sep 30, 2014 at 5:40 PM, Julie Parent <jpa...@chromium.org> wrote:
> Getting the 26% flaky rate of win_blink tests down to something more
> reasonable, by investigating and disabling tests as necessary, like jam@ did
> with chromium tests.
> https://groups.google.com/a/chromium.org/d/msgid/infra-dev/CALhVsw34wQrF%2BY8SsxiA2DX-yp1Rm-ZQzqtG0VOXMaLUp5nt2g%40mail.gmail.com
> should have the full context

Ah, that certainly sounds like something I'd be able to help with.

Ojan Vafai

Sep 30, 2014, 9:50:44 PM
to Emil A Eklund, Julie Parent, Dirk Pranke, John Abd-El-Malek, Paweł Hajdan, Jr., infr...@chromium.org
To clarify my previous comment, if you're willing to work on actually fixing the flakiness on windows, that's totally crucial and you should do that. Anyone on the team is well qualified to do the sheriff-o-matic work I mentioned, but you're one of the few well-qualified to fix the flakiness.


John Abd-El-Malek

Oct 1, 2014, 1:13:08 AM
to Ojan Vafai, Emil A Eklund, Julie Parent, Dirk Pranke, Paweł Hajdan, Jr., infr...@chromium.org
Is it known if it's a number of flaky tests vs something structural that affects all tests?

If the former, I had been blacklisting blink tryjobs from chromium-try-flakes. Enabling it is trivial (removing the if statement in line 273). This would show which tests fail and pass in different tryjobs for the same patchset.
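
For illustration, the "fail and pass in different tryjobs for the same patchset" check could look roughly like this. It is a sketch only, not the chromium-try-flakes code: the `(patchset_id, test_name, passed)` tuple shape and the sample values are assumptions.

```python
from collections import defaultdict


def find_flaky_tests(tryjob_results):
    """Return tests that both passed and failed on the same patchset.

    `tryjob_results` is an iterable of (patchset_id, test_name, passed)
    tuples, a hypothetical shape chosen for this example.
    """
    # Collect the set of outcomes seen for each (patchset, test) pair.
    outcomes = defaultdict(set)
    for patchset_id, test_name, passed in tryjob_results:
        outcomes[(patchset_id, test_name)].add(passed)

    # The code under test is identical within a patchset, so seeing both
    # True and False means the test itself is flaky.
    flaky = {test for (_, test), seen in outcomes.items() if seen == {True, False}}
    return sorted(flaky)


results = [
    ("issue1/ps1", "fast/css/a.html", False),
    ("issue1/ps1", "fast/css/a.html", True),
    ("issue1/ps1", "fast/css/b.html", False),
    ("issue2/ps1", "fast/css/b.html", True),  # different patchset, not counted
]
print(find_flaky_tests(results))
```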

Dirk Pranke

Oct 1, 2014, 1:18:44 AM
to John Abd-El-Malek, Ojan Vafai, Emil A Eklund, Julie Parent, Paweł Hajdan, Jr., infr...@chromium.org
I've not seen anything to think that there is structural flakiness that affects *all* tests.

I think there are bugs that affect certain subsets of tests. I don't think we know exactly what they all are, but the bug that enne fixed last week was a good example of such things.

I think the current level of flakiness is an example of what happens when the tools get good enough so that you can ignore things without feeling the pain: if no one is inclined to run the tests locally, and things eventually pass in a couple hours, it's easy enough to ignore the problem (i.e., I think there are qualitative differences between try jobs that complete in an hour and jobs that complete in ten hours).

-- Dirk

Ojan Vafai

Oct 1, 2014, 1:19:41 AM
to Dirk Pranke, John Abd-El-Malek, Emil A Eklund, Julie Parent, Paweł Hajdan, Jr., infr...@chromium.org
Historically, the http tests on windows have been very flaky and that's clearly something structural. But I haven't looked recently to see if that's still the case.

Dirk Pranke

Oct 1, 2014, 1:21:04 AM
to Ojan Vafai, John Abd-El-Malek, Emil A Eklund, Julie Parent, Paweł Hajdan, Jr., infr...@chromium.org
I believe switching to apache fixed that; I haven't seen any issues the past few months.

-- Dirk

John Abd-El-Malek

Oct 1, 2014, 1:27:37 AM
to Dirk Pranke, Ojan Vafai, Emil A Eklund, Julie Parent, Paweł Hajdan, Jr., infr...@chromium.org
ok, I've made chromium-try-flakes start showing blink try flakes. We'll see what data it shows by the morning.

Ilya Tikhonovsky

Oct 1, 2014, 2:37:51 AM
to John Abd-El-Malek, Dirk Pranke, Ojan Vafai, Emil A Eklund, Julie Parent, Paweł Hajdan, Jr., infr...@chromium.org
I ran a script against the win_blink_rel try bot logs and got the following stats for 200 runs (only failures were counted):


name | total | text | timeouts | crashes | image | missing
media/encrypted-media/encrypted-media-playback-multiple-sessions.html | 38 | 38 | 0 | 0 | 0 | 0
virtual/antialiasedtext/fast/text/orientation-sideways.html | 27 | 0 | 0 | 0 | 27 | 0
fast/writing-mode/english-lr-text.html | 27 | 0 | 0 | 0 | 27 | 0
virtual/antialiasedtext/fast/text/international/vertical-text-glyph-test.html | 27 | 0 | 0 | 0 | 27 | 0
virtual/antialiasedtext/fast/text/decorations-with-text-combine.html | 27 | 0 | 0 | 0 | 27 | 0
virtual/antialiasedtext/fast/text/justify-ideograph-vertical.html | 27 | 0 | 0 | 0 | 27 | 0
virtual/antialiasedtext/fast/text/international/text-combine-image-test.html | 27 | 0 | 0 | 0 | 27 | 0
fast/css/font-weight-1.html | 27 | 0 | 0 | 0 | 27 | 0
http/tests/w3c/webperf/submission/Intel/user-timing/test_user_timing_measure_associate_with_navigation_timing.html | 23 | 23 | 0 | 0 | 0 | 0
inspector/tracing/timeline-receive-response-event.html | 20 | 20 | 0 | 0 | 0 | 0
fast/pagination/div-x-horizontal-bt-ltr.html | 20 | 0 | 0 | 20 | 0 | 0
virtual/deferred/inspector/tracing/timeline-receive-response-event.html | 20 | 20 | 0 | 0 | 0 | 0
virtual/implsidepainting/inspector/tracing/timeline-receive-response-event.html | 20 | 20 | 0 | 0 | 0 | 0
http/tests/media/media-source/mediasource-play-then-seek-back.html | 18 | 18 | 0 | 0 | 0 | 0
fast/css/fontfaceset-add-remove-while-loading.html | 16 | 16 | 0 | 0 | 0 | 0
fast/pagination/div-x-horizontal-bt-rtl.html | 15 | 0 | 0 | 15 | 0 | 0
fast/multicol/newmulticol/compare-with-old-impl/div-x-horizontal-bt-ltr.html | 14 | 0 | 0 | 14 | 0 | 0
fast/multicol/newmulticol/compare-with-old-impl/div-x-horizontal-bt-rtl.html | 11 | 0 | 0 | 11 | 0 | 0
virtual/regionbasedmulticol/fast/pagination/div-x-horizontal-bt-ltr.html | 8 | 0 | 0 | 8 | 0 | 0
media/encrypted-media/encrypted-media-needkey.html | 7 | 7 | 0 | 0 | 0 | 0
.......

 



Ojan Vafai

Oct 1, 2014, 2:45:40 AM
to Ilya Tikhonovsky, John Abd-El-Malek, Dirk Pranke, Emil A Eklund, Julie Parent, Paweł Hajdan, Jr., infr...@chromium.org
This list is just failures, not flakes, right? 

Ilya Tikhonovsky

Oct 1, 2014, 2:58:03 AM
to Ojan Vafai, John Abd-El-Malek, Dirk Pranke, Emil A Eklund, Julie Parent, Paweł Hajdan, Jr., infr...@chromium.org
yep

the table of flakes is quite different

name | total | flaky text | flaky timeouts | flaky crashes | flaky image
virtual/gpu/fast/canvas/check-stale-putImageData.html | 175 | 0 | 0 | 175 | 0
battery-status/page-visibility.html | 128 | 0 | 0 | 128 | 0
http/tests/plugins/interrupted-get-url.html | 71 | 0 | 71 | 0 | 0
http/tests/appcache/offline-access.html | 67 | 0 | 67 | 0 | 0
http/tests/appcache/video.html | 54 | 0 | 54 | 0 | 0
http/tests/security/link-crossorigin-subresource-use-credentials.html | 36 | 0 | 36 | 0 | 0
http/tests/inspector/extensions-ignore-cache.html | 34 | 0 | 34 | 0 | 0
http/tests/security/cross-frame-access-frameelement.html | 30 | 0 | 30 | 0 | 0
http/tests/security/img-crossorigin-no-credentials-prompt.html | 26 | 0 | 26 | 0 | 0
http/tests/security/mime-type-execute-as-html-16.html | 24 | 0 | 24 | 0 | 0
http/tests/pointer-lock/pointerlockelement-different-origin.html | 21 | 0 | 21 | 0 | 0
http/tests/security/script-onerror-crossorigin-same-origin.html | 21 | 0 | 21 | 0 | 0
http/tests/w3c/webperf/approved/UserTiming/test_user_timing_mark.htm | 18 | 18 | 0 | 0 | 0
(name missing) | 18 | 18 | 0 | 0 | 0
(name missing) | 15 | 0 | 0 | 15 | 0
(name missing) | 14 | 0 | 0 | 14 | 0
(name missing) | 11 | 0 | 0 | 11 | 0
(name missing) | 8 | 0 | 0 | 8 | 0
(name missing) | 8 | 0 | 0 | 8 | 0
(name missing) | 8 | 0 | 0 | 8 | 0
(name missing) | 8 | 0 | 0 | 8 | 0
(name missing) | 7 | 7 | 0 | 0 | 0
(name missing) | 7 | 7 | 0 | 0 | 0
(name missing) | 7 | 7 | 0 | 0 | 0
(name missing) | 7 | 7 | 0 | 0 | 0
inspector/timeline/timeline-bound-function.html | 7 | 0 | 7 | 0 | 0
http/tests/security/host-compare-case-insensitive.html | 7 | 0 | 7 | 0 | 0
http/tests/media/video-buffered-range-contains-currentTime.html | 6 | 0 | 6 | 0 | 0
media/track/track-css-matching-timestamps.html | 6 | 6 | 0 | 0 | 0
http/tests/plugins/cross-frame-object-access.html | 6 | 1 | 5 | 0 | 0
fast/css/fontfaceset-add-remove-while-loading.html | 6 | 6 | 0 | 0 | 0
http/tests/media/media-source/mediasource-play-then-seek-back.html | 6 | 6 | 0 | 0 | 0
http/tests/security/local-video-source-from-remote.html | 6 | 0 | 6 | 0 | 0
http/tests/loading/preload-picture-sizes.html | 6 | 0 | 6 | 0 | 0
http/tests/misc/selectionAsMarkup.html | 6 | 0 | 6 | 0 | 0
media/track/track-css-matching.html | 6 | 4 | 2 | 0 | 0
svg/custom/resource-client-removal.svg | 6 | 6 | 0 | 0 | 0
http/tests/media/text-served-as-text.html | 6 | 0 | 6 | 0 | 0
media/media-fragments/TC0039.html | 6 | 6 | 0 | 0 | 0
http/tests/security/cross-frame-access-set-window-properties.html | 6 | 0 | 6 | 0 | 0
fast/dom/HTMLImageElement/image-srcset-w-onerror.html | 6 | 6 | 0 | 0 | 0
inspector/elements/styles/styles-add-blank-property.html | 6 | 6 | 0 | 0 | 0
http/tests/loading/dont-preload-non-img-srcset.html | 6 | 0 | 6 | 0 | 0
media/track/track-cues-seeking.html | 6 | 6 | 0 | 0 | 0
printing/ellipsis-printing-style.html | 6 | 0 | 0 | 0 | 6

Eric Seidel

Oct 2, 2014, 1:05:05 PM
to Ilya Tikhonovsky, Ojan Vafai, John Abd-El-Malek, Dirk Pranke, Emil A Eklund, Julie Parent, Paweł Hajdan, Jr., infr...@chromium.org
You might be able to find help by roping in blink-dev.  Anyone can triage flaky test lists and disable tests.

Ilya Tikhonovsky

Oct 7, 2014, 2:31:32 PM
to Eric Seidel, Ojan Vafai, John Abd-El-Malek, Dirk Pranke, Emil A Eklund, Julie Parent, Paweł Hajdan, Jr., infr...@chromium.org
I created a small page which fetches stdout from try bot runs and shows the list of flaky and failing tests.
https://x20web.corp.google.com/~loislo/try_bot_flakiness.html
I think it needs to be converted/replaced with flakiness dashboard because it is impossible to detect the type of failure from the webkit_test step stdout and it takes too much time to get the data from the full stdout. But try bots don't publish the results to test-results server at the moment.

It seems that about half of the flaky tests on windows are from inspector/tracing/ tests.
I've marked them as flaky and found && fixed one of the causes of the flakiness.

BTW: It is interesting that some tests always fail/crash/timeout in the batch and always pass on retry. So they look almost always green in test-results server.

Dirk Pranke

Oct 7, 2014, 3:19:42 PM
to Ilya Tikhonovsky, Eric Seidel, Ojan Vafai, John Abd-El-Malek, Emil A Eklund, Julie Parent, Paweł Hajdan, Jr., infr...@chromium.org
On Tue, Oct 7, 2014 at 11:31 AM, Ilya Tikhonovsky <loi...@google.com> wrote:
I created a small page which fetches stdout from try bot runs and shows the list of flaky and failing tests.
https://x20web.corp.google.com/~loislo/try_bot_flakiness.html
I think it needs to be converted/replaced with flakiness dashboard because it is impossible to detect the type of failure from the webkit_test step stdout and it takes too much time to get the data from the full stdout. But try bots don't publish the results to test-results server at the moment.

The try jobs do publish their results to google storage, so you could download them from there. 

It seems that about half of the flaky tests on windows are from inspector/tracing/ tests.
I've marked them as flaky and found && fixed one of the causes of the flakiness.


Yup, that matches what I saw on Friday. Good to hear it's been addressed.
 
BTW: It is interesting that some tests always fail/crash/timeout in the batch and always pass on retry. So they look almost always green in test-results server.

That's actually the exact opposite of what should be happening. If a test fails initially and passes on the retry, then test-results should show the failure, not the pass; put differently, test-results ignores the retries completely. 

If that's not what we're seeing, either that's a bug or someone changed how test-results works for the worse :).
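
To make the intended behaviour concrete, here is a small sketch of the rule described above, where only the initial run is recorded and retries are ignored. The outcome strings and the `classify` helper are illustrative assumptions, not the actual test-results code.

```python
def classify(results):
    """Classify one test from its outcomes in run order.

    `results` is a non-empty list such as ["FAIL", "PASS"], where the
    first entry is the initial run and later entries are retries.
    Returns (recorded_outcome, is_flaky).
    """
    first = results[0]
    # Retries are ignored for recording: the initial outcome is what shows up.
    recorded = first
    # A test is flaky if it failed initially but passed on some retry.
    is_flaky = first != "PASS" and "PASS" in results[1:]
    return recorded, is_flaky


print(classify(["FAIL", "PASS"]))        # recorded as FAIL, flaky
print(classify(["PASS"]))                # recorded as PASS, not flaky
print(classify(["TIMEOUT", "TIMEOUT"]))  # recorded as TIMEOUT, consistent failure
```

With this rule, a test that always fails in the batch and always passes on retry would show up as failing, not green.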

-- Dirk