Tree status, end of PDT sheriff stint

Peter Kasting

Mar 21, 2013, 4:05:54 AM
to Anton Muhin, webkit-gardening
As usual, the PDT sheriffing stint was fairly chaotic.

I tried to roll to r146382, but a problematic WebKit revision (initially misdiagnosed as flakiness) caused people to near-simultaneously revert the roll and disable the failing test on the Chromium side.  With the test disabled, I stuck a re-do of the roll in the CQ, but that's been chewing on it for hours and hasn't landed it yet: https://codereview.chromium.org/12442010/  (I rarely use the CQ for these things, but I was gunshy after the first failure.)

Meanwhile, there's lots of redness everywhere:

* The Chromium Test Win7 Dbg Canary ( http://build.chromium.org/p/chromium.webkit/builders/Win7%20%28dbg%29 ) shows a persistent crash in a test (interactive_ui_tests:ManyMessageCenterNotifications) that was also failing on the main waterfall (I didn't look to see whether the main waterfall failure was also a crash).  Someone apparently marked the test FLAKY, but that isn't making the redness of the crashed test disappear.  Worth checking to see what the main waterfall status is, but probably not an issue.

* The Chromium Test Linux Aura Canary ( http://build.chromium.org/p/chromium.webkit/builders/Linux%20Aura ) had a test failing for a while (ui_unittests:DisplayRectShowsCursorLTR) which was not failing on the main waterfall.  Then it stopped, then it started again.  I checked all the change log ranges and can't see any obvious reason for this behavior.  This one worries me.

* On the Win7 Perf bot ( http://build.chromium.org/p/chromium.webkit/builders/Win7%20Perf ), in addition to the "maybe we need to update the reference build" dromaeo_domcoremodify test (why is the perf sheriff not doing that?), startup_test is now persistently failing by timing out after 20 minutes with no output.  Bot needs a kick?  Who knows.

* The Layout Test Canaries are in mediocre shape; a couple people (including me) landed tons of rebaselines, and there's been some fallout from that (maybe bad rebaselining due to non-cycled bots?) which I don't know if I've cleaned up completely yet.  There are also some crashes in some video-related compositing tests which I think are due to a Chromium-side change.  I sent a separate mail to the change author about this and CCed Anton (the non-PDT sheriff).  I had hoped that somewhere in the 146435 - 146444 range or so would be good enough to do another roll, but now it looks like there are some failures in shadow-related tests popping up everywhere, so that's probably not going to fly.

On the plus side, a lot of lines of test expectations got nuked and many "unexpected pass" outputs disappeared.  So at least some cleanup was accomplished.

Wearily,
PK

Peter Kasting

Mar 22, 2013, 12:39:37 AM
to Yury Semikhatsky, webkit-gardening, Marcus Bulach, John Sheu, Alok Priyadarshi
Easier day today.

The Chromium Test canaries are basically green save for the >24-hr.-old "DisplayRectShowsCursorLTR" failure on the Linux Aura bot ( http://build.chromium.org/p/chromium.webkit/builders/Linux%20Aura ), which is tracked by https://code.google.com/p/chromium/issues/detail?id=222606 .

The Win7 Perf bot ( http://build.chromium.org/p/chromium.webkit/builders/Win7%20Perf ) is still hosed, but the problems on it aren't new.  I'm still hoping that a perf sheriff decides to do something; I mailed Marcus Bulach about this earlier.

The Layout Test canaries are in fairly good shape other than the video/compositing-related failures I identified last night, which should be fixed by John Sheu's  https://codereview.chromium.org/12998003/ , but that's still in the CQ.  In retrospect, this should have gotten its own bug and layout test expectations if it was going to be broken this long (or at least I should have pushed for it to be landed manually -- the CQ adding 8 hours of latency to a crash fix is not really acceptable).  Otherwise, there's still some flakiness on the bots, but I managed to trim that down a bit with several bugs earlier today.

Earlier, Alok Priyadarshi landed a larger-impact change on the tree and attempted to suppress its failures before he had to leave; however, neither he nor I realized that in the new expectations syntax, "Failure" explicitly doesn't cover "ImageOnlyFailure"; so I had to modify some expectations to be wider.  More critically, a lot of the affected tests already had other failure expectations.  I am worried that in trying to clean up the fallout of his own change, Alok might rebaseline some tests into "passing" when they really shouldn't be, but I don't know if he has a lot of good choices.  The relevant bugs for anyone else who wants to suggest something are http://webkit.org/b/109507 , http://crbug.com/145087 , http://crbug.com/192172 , and http://webkit.org/b/111199 .
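To illustrate the keyword mismatch described above, here is a minimal Python sketch. It assumes each expectation keyword is matched literally against the actual result; `is_covered` and the example sets are hypothetical names for illustration only, not the real layout-test tooling.

```python
# Hypothetical sketch: each expectation keyword is compared literally,
# so "Failure" alone does not cover an "ImageOnlyFailure" result.
# This is an assumption for illustration, not the actual tooling code.

def is_covered(expected_keywords, actual_result):
    """Return True if the actual result is one of the listed keywords."""
    return actual_result in expected_keywords


# An expectation that lists only "Failure".
narrow = {"Failure"}

# The same expectation widened to also list "ImageOnlyFailure".
wide = narrow | {"ImageOnlyFailure"}

print(is_covered(narrow, "ImageOnlyFailure"))
print(is_covered(wide, "ImageOnlyFailure"))
```

Under that assumption, a test that fails only on its image comparison still shows up as an unexpected failure until the expectation is widened.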

I have just landed a slightly-optimistic (but should be pretty safe) update to r146554, which also clears the local overrides in skia_expectations.txt.  Too many test expectation files!

See you tomorrow,
PK

Peter Kasting

Mar 23, 2013, 2:35:44 AM
to Yury Semikhatsky, webkit-gardening
The tree is mostly in good shape right now, but there are issues:

* The Android test canary ( http://build.chromium.org/p/chromium.webkit/builders/Android%20Tests%20%28dbg%29 ) is hosed again because the devices connected to it have run out of battery (again).  I sent mail to the troopers and infrastructure mailing lists asking someone to fix this, hopefully on a permanent basis.

* The Win7 perf bot ( http://build.chromium.org/p/chromium.webkit/builders/Win7%20Perf ), seemingly a persistent annoyance, was failing nearly every test with output like:

INFO:root:Running file:///../../../../data/page_cycler/moz/start.html?iterations=10&auto=1
WARNING:root:Timed out waiting for reply on file:///../../../../data/page_cycler/moz/start.html?iterations=10&auto=1. This is unusual.
Exception WindowsError: (5, 'Access is denied') in <bound method TemporaryHTTPServer.__del__ of <telemetry.core.temporary_http_server.TemporaryHTTPServer object at 0x029DEA50>> ignored
WARNING:root:Failed pages: file:///../../../../data/page_cycler/moz/start.html?iterations=10&auto=1

I sent the bot a clobber request, but I don't feel confident that will fix it.  If not, someone may have to log in to the bot and clean up whatever's hung/broken.

* The Mac 10.6 Perf bot ( http://build.chromium.org/p/chromium.webkit/builders/Mac10.6%20Perf ) is failing dromaeo_domcoremodify, not because of size differences, but because the test is crashing:

WARNING:root:Tab crashed: file:///../../../../chrome/test/data/dromaeo/index.html?dom-modify&automated
Standard Output:
********************************************************************************
Chromium Helper(6232,0xa0496540) malloc: *** error for object 0x3c6280: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
Terminating process due to a potential for future heap corruption: errno=009
********************************************************************************
WARNING:root:Failed pages: file:///../../../../chrome/test/data/dromaeo/index.html?dom-modify&automated
2 new files were left in /tmp: Fix the tests to clean up themselves.

This looks like a real failure, and since it doesn't appear on the main waterfall's 10.6 Perf bot, I think it's due to something in the following range: http://trac.webkit.org/log/?verbose=on&rev=146686&stop_rev=146672 .  Anyone have any good ideas?  (Anyone able to get onto the bot or otherwise run the test in such a way as to get more info about what crashed?)

Until this is fixed, I'm not willing to bring down any new WebKit revisions.

* The Win XP Layout Test canary ( http://build.chromium.org/p/chromium.webkit/builders/WebKit%20XP ) keeps failing a test ( fast/text/international/bold-bengali.html ) that's almost certainly due to the Skia changes today and almost certainly just needs a rebaseline, but the bot refuses to provide me with new output files so I can actually rebaseline it.  Mystified.  I had to leave a Bug(pkasting) expectation sitting at the bottom of the TestExpectations file to cover this for now.

Other than the bolded line above, we'd probably be good to go up to WebKit r146717.  If anyone who can help with the issues above is around over the weekend, I'll be happy to try to do a few updates.

PK

Stephen White

Mar 23, 2013, 2:04:46 PM
to Peter Kasting, Tony Gentilcore, Chase Phillips, Yury Semikhatsky, webkit-gardening
+tonyg, +cmp for perf bot expertise.  I'm seeing "telemetry" in the Win7 callstack above (although it also seems to implicate page_cycler, so I'm not sure what's to blame).

Stephen

Peter Kasting

Mar 23, 2013, 2:08:40 PM
to Stephen White, Tony Gentilcore, Chase Phillips, Yury Semikhatsky, webkit-gardening
On Sat, Mar 23, 2013 at 11:04 AM, Stephen White <senor...@chromium.org> wrote:
+tonyg, +cmp for perf bot expertise.  I'm seeing "telemetry" in the Win7 callstack above (although it also seems to implicate page_cycler, so I'm not sure what's to blame).

Not page-cycler-related.  Here's another similar output from one of the dromaeo tests:

INFO:root:Running file:///../../../../chrome/test/data/dromaeo/index.html?jslib-traverse-prototype&automated
WARNING:root:Timed out waiting for reply on file:///../../../../chrome/test/data/dromaeo/index.html?jslib-traverse-prototype&automated. This is unusual.
Exception WindowsError: (5, 'Access is denied') in <bound method TemporaryHTTPServer.__del__ of <telemetry.core.temporary_http_server.TemporaryHTTPServer object at 0x02AB4A70>> ignored
WARNING:root:Failed pages: file:///../../../../chrome/test/data/dromaeo/index.html?jslib-traverse-prototype&automated
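
For anyone comparing several of these logs, a rough Python sketch like the following could pull out which pages timed out, crashed, or failed. The patterns are assumptions based only on the log lines quoted in this thread, and `summarize` is a hypothetical helper, not part of Telemetry.

```python
import re

# Patterns are assumptions derived only from the log lines quoted in this
# thread; real perf-test output may contain other variants.
PATTERNS = (
    ("timed_out", re.compile(
        r"^WARNING:root:Timed out waiting for reply on (\S+)\. This is unusual\.$")),
    ("crashed", re.compile(r"^WARNING:root:Tab crashed: (\S+)$")),
    ("failed", re.compile(r"^WARNING:root:Failed pages: (.+)$")),
)


def summarize(log_text):
    """Group page URLs from a perf-test log by the kind of warning logged."""
    summary = {"timed_out": [], "crashed": [], "failed": []}
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        for key, pattern in PATTERNS:
            match = pattern.match(line)
            if match:
                summary[key].append(match.group(1))
                break  # Each line matches at most one pattern.
    return summary


if __name__ == "__main__":
    page = ("file:///../../../../chrome/test/data/dromaeo/index.html"
            "?jslib-traverse-prototype&automated")
    sample = "\n".join([
        "INFO:root:Running " + page,
        "WARNING:root:Timed out waiting for reply on " + page + ". This is unusual.",
        "WARNING:root:Failed pages: " + page,
    ])
    for kind, pages in summarize(sample).items():
        print(kind, pages)
```

Lines that match none of the patterns (such as the WindowsError exception line) are simply skipped, so the sketch only distinguishes timeouts from crashes; it says nothing about the underlying cause.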

PK 

Chase Phillips

Mar 23, 2013, 7:44:53 PM
to Peter Kasting, Stephen White, Tony Gentilcore, Yury Semikhatsky, webkit-gardening
The 'access is denied' error reminds me of crbug.com/222435.

Yury Semikhatsky

Mar 25, 2013, 4:42:10 AM
to Peter Kasting, webkit-gardening, ant...@chromium.org
On Sat, Mar 23, 2013 at 10:35 AM, Peter Kasting <pkas...@google.com> wrote:
The tree is mostly in good shape right now, but there are issues:

* The Android test canary ( http://build.chromium.org/p/chromium.webkit/builders/Android%20Tests%20%28dbg%29 ) is hosed again because the devices connected to it have run out of battery (again).  I sent mail to the troopers and infrastructure mailing lists asking someone to fix this, hopefully on a permanent basis.

* The Win7 perf bot ( http://build.chromium.org/p/chromium.webkit/builders/Win7%20Perf ), seemingly a persistent annoyance, was failing nearly every test with output like:

INFO:root:Running file:///../../../../data/page_cycler/moz/start.html?iterations=10&auto=1
WARNING:root:Timed out waiting for reply on file:///../../../../data/page_cycler/moz/start.html?iterations=10&auto=1. This is unusual.
Exception WindowsError: (5, 'Access is denied') in <bound method TemporaryHTTPServer.__del__ of <telemetry.core.temporary_http_server.TemporaryHTTPServer object at 0x029DEA50>> ignored
WARNING:root:Failed pages: file:///../../../../data/page_cycler/moz/start.html?iterations=10&auto=1

I sent the bot a clobber request, but I don't feel confident that will fix it.  If not, someone may have to log in to the bot and clean up whatever's hung/broken.

* The Mac 10.6 Perf bot ( http://build.chromium.org/p/chromium.webkit/builders/Mac10.6%20Perf ) is failing dromaeo_domcoremodify, not because of size differences, but because the test is crashing:

WARNING:root:Tab crashed: file:///../../../../chrome/test/data/dromaeo/index.html?dom-modify&automated
Standard Output:
********************************************************************************
Chromium Helper(6232,0xa0496540) malloc: *** error for object 0x3c6280: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
Terminating process due to a potential for future heap corruption: errno=009
********************************************************************************
WARNING:root:Failed pages: file:///../../../../chrome/test/data/dromaeo/index.html?dom-modify&automated
2 new files were left in /tmp: Fix the tests to clean up themselves.

This looks like a real failure, and since it doesn't appear on the main waterfall's 10.6 Perf bot, I think it's due to something in the following range: http://trac.webkit.org/log/?verbose=on&rev=146686&stop_rev=146672 .  Anyone have any good ideas?  (Anyone able to get onto the bot or otherwise run the test in such a way as to get more info about what crashed?)

The same test was crashing on Win7 much earlier (e.g.  WebKit r146022) with a message like the following: 

INFO:root:Running file:///../../../../chrome/test/data/dromaeo/index.html?dom-modify&automated

WARNING:root:Tab crashed: file:///../../../../chrome/test/data/dromaeo/index.html?dom-modify&automated
Standard Output:
********************************************************************************

********************************************************************************

This makes me think it is an old problem that was previously observed only on Win7 bots and has now shown up on Mac 10.6 Perf as well. Based on this, I don't think we should block WK rolls unless we have a quick fix for the problem.

Yury Semikhatsky

Mar 25, 2013, 5:58:42 AM
to Peter Kasting, webkit-gardening, ant...@chromium.org
Filed a bug to track the dromaeo_domcoremodify failure: https://code.google.com/p/chromium/issues/detail?id=223521

Marcus Bulach

Mar 25, 2013, 10:55:02 AM
to Yury Semikhatsky, Peter Kasting, webkit-gardening, ant...@chromium.org
FYI, I filed this to try to get more information out when there's a crash on a perf test under Telemetry:

On Android we have a buildbot step to collect all tombstones (aka coredumps) and symbolize them; it'd be nice to have that on other platforms too.