Android Tests bot post mortem

Erik Arvidsson

Sep 8, 2014, 4:20:42 PM

to hackabi...@chromium.org

This morning when I came in the Android Test bot [1] was still broken. The bot had ~3 failing tests (but I now see that it was up to 5 test failures at one point). These tests have blocked the Blink roll at least since Friday.

Today I manually went through the build logs to find when these tests started to fail. Since the waterfall and build pages do not list the failures correctly [2], I had to look at the stdout of each build to work out where these failures started.

I ended up reverting/disabling 4 CLs/tests

Disable:

crbug.com/412004

crbug.com/411931

Revert:

crbug.com/380349

https://codereview.chromium.org/549433004

[1] http://build.chromium.org/p/chromium.webkit/builders/Android%20Tests%20%28dbg%29

[2] crbug.com/412023

--
erik

Ojan Vafai

Sep 8, 2014, 8:41:46 PM

to Erik Arvidsson, hackability-cy

Were these tests not failing on the main chromium waterfall?

--
You received this message because you are subscribed to the Google Groups "Chromium Hackability Code Yellow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hackability-c...@chromium.org.
To post to this group, send email to hackabi...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/hackability-cy/CAJ8%2BGoge4y1fu-ZCoF%3DLvDYYSr7vbSDrJBm_tbGDTHjS_sJa5A%40mail.gmail.com.

John Abd-El-Malek

Sep 8, 2014, 9:02:54 PM

to Ojan Vafai, Erik Arvidsson, hackability-cy

They were (I hadn't noticed that, as I was only looking at trybots).

Note that this outage lasted more than 3 days. In the old days, the tree would have been closed, which would have forced folks to track this down. But because there were only a few failures, the tree stayed open: https://codereview.chromium.org/521583003/ was in the tree for three working days last week, while https://codereview.chromium.org/479873002/ was there all Friday.

It seems quite problematic that the tree is staying open with failures that aren't being addressed. Should we have a 'timeout' where the tree closes if a failure stays on for longer than a given period (6 hours?)? That way there's a forcing function to investigate and track down failures, instead of sheriffs apparently ignoring them.
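
A minimal sketch of the timeout idea, assuming we can record when a step first started failing. The function name, the constant, and the way the timestamp is obtained are all hypothetical; this is not how gatekeeper or the tree status app works today.

```python
from datetime import datetime, timedelta

# Hypothetical threshold, taken from the "6 hours?" suggestion.
FAILURE_TIMEOUT = timedelta(hours=6)


def should_close_tree(first_failure_time, now, timeout=FAILURE_TIMEOUT):
    """Return True if a failure has been open for longer than `timeout`.

    `first_failure_time` is None when the step is currently passing.
    """
    if first_failure_time is None:
        return False
    return now - first_failure_time > timeout
```

Note that a check like this only helps for failures that stay red; a step that flips between red and green would keep resetting the clock unless the first failure time is tracked per test rather than per build.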

Ojan Vafai

Sep 8, 2014, 9:13:50 PM

to John Abd-El-Malek, Erik Arvidsson, hackability-cy

TL;DR: Unless I'm reading the waterfall wrong, this is not a problem related to the tree closure policy. As best I can tell, sheriffs are greening the tree as consistently as they used to.

I don't see failures on the main waterfall android bots. contentshell_instrumentation_tests failed for three disconnected runs (with different tests each time), but the bots were completely green otherwise. I'm not sure why they were failing consistently on the blink waterfall.

http://build.chromium.org/p/chromium.linux/builders/Android%20Tests%20%28dbg%29/builds/22754

http://build.chromium.org/p/chromium.linux/builders/Android%20Tests%20%28dbg%29/builds/22750

http://build.chromium.org/p/chromium.linux/builders/Android%20Tests/builds/15620

http://build.chromium.org/p/chromium.linux/builders/Android%20Tests%20(dbg)

http://build.chromium.org/p/chromium.linux/builders/Android%20Tests

On Mon, Sep 8, 2014 at 6:02 PM, John Abd-El-Malek <j...@chromium.org> wrote:

They were (I hadn't noticed that, as I was only looking at trybots).

Am I just looking at the wrong bots?

John Abd-El-Malek

Sep 8, 2014, 10:01:59 PM

to Ojan Vafai, Erik Arvidsson, hackability-cy

See http://build.chromium.org/p/chromium.linux/builders/Android%20Tests%20%28dbg%29?numbuilds=200. There were about 23 failures of "contentshell_instrumentation_tests" while this outage was happening.

Ojan Vafai

Sep 8, 2014, 10:08:44 PM

to John Abd-El-Malek, ser...@chromium.org, Erik Arvidsson, hackability-cy

Oh, I see. It's super flaky. This would be a really hard thing to detect and close the tree for.

The only solution I can see for this is to expose flakiness in sheriff-o-matic. That work isn't progressing super quickly. We could use more help on that work.

Open to other suggestions for how to address this.

Brett Wilson

Sep 8, 2014, 11:21:18 PM

to Ojan Vafai, John Abd-El-Malek, ser...@chromium.org, Erik Arvidsson, hackability-cy

Before, the tree was normally closed all weekend for some random flake.
This was quite frustrating and I often found myself opening the tree
on the weekend to land a patch. My perception is that the state of the
tree during the week has improved a lot.

However, I wouldn't want to trade that for three days of uncaught
regressions that require a lot of work for the Monday sheriffs and a
lot of reverts to untangle.

We should keep a close eye on this kind of thing. Maybe this is
unusual. If it happens a lot, I wonder what the old tree closure
policy would look like with the new much-less-flaky tree (since most
of the flakiest tests have been disabled). I'm personally unclear on
how much of my perceived improvement of the state of the world is due
to this, and how much is due to the new tree closure policy.

Brett

Ojan Vafai

Sep 8, 2014, 11:39:15 PM

to Brett Wilson, John Abd-El-Malek, ser...@chromium.org, Erik Arvidsson, hackability-cy

Have a lot of flaky tests been disabled in the past week and a half?

John Abd-El-Malek

Sep 9, 2014, 12:10:53 AM

to Ojan Vafai, ser...@chromium.org, Erik Arvidsson, hackability-cy

On Mon, Sep 8, 2014 at 7:08 PM, Ojan Vafai <oj...@chromium.org> wrote:

Oh, I see. It's super flaky. This would be a really hard thing to detect and close the tree for.

The only solution I can see for this is to expose flakiness in sheriff-o-matic. That work isn't progressing super quickly. We could use more help on that work.

http://chromium-try-flakes.appspot.com/ is now working. It uses the data from chromium-cq-status to match passed and failed runs for the same patchset and groups similar failures. This can be used by a dashboard to alert a sheriff when something wrong starts happening.
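
The matching described here can be sketched roughly as follows. This is only an illustration of the idea, not the actual chromium-try-flakes code; the `(patchset, step, passed)` tuple shape and the function name are hypothetical stand-ins for the real chromium-cq-status data.

```python
from collections import defaultdict


def find_flaky_failures(runs):
    """Group failures that also passed on the same patchset.

    `runs` is an iterable of (patchset, step, passed) tuples; this shape is
    a hypothetical stand-in for the real chromium-cq-status data.
    Returns {step: sorted list of patchsets where the step both passed
    and failed}.
    """
    outcomes = defaultdict(set)
    for patchset, step, passed in runs:
        outcomes[(patchset, step)].add(passed)

    flaky = defaultdict(list)
    for (patchset, step), seen in outcomes.items():
        # Same patchset, same step, both a pass and a failure: the code
        # did not change between runs, so the failure is likely a flake.
        if seen == {True, False}:
            flaky[step].append(patchset)
    return {step: sorted(patchsets) for step, patchsets in flaky.items()}
```

A dashboard could then alert when the number of patchsets listed for a single step rises above some threshold.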

(I've been ironing out bugs in this over the last week)

John Abd-El-Malek

Sep 9, 2014, 12:32:26 AM

to Ojan Vafai, Brett Wilson, ser...@chromium.org, Erik Arvidsson, hackability-cy

Let's not look at this as binary: either old policy or keeping tree open always. Both have issues.

We have switched from a situation where everyone pays price of closed tree through slower CQ, which sucked for obvious reasons. However at least that suckiness incentivized some people to reopen tree. How can we incentivize people now? Just saying welcome to new world & volunteers welcome to undo the suckiness from the new system is somewhat lacking.

Ojan Vafai

Sep 9, 2014, 1:08:48 AM

to John Abd-El-Malek, Brett Wilson, ser...@chromium.org, Erik Arvidsson, hackability-cy

On Mon, Sep 8, 2014 at 9:32 PM, John Abd-El-Malek <j...@chromium.org> wrote:

Let's not look at this as binary: either old policy or keeping tree open always. Both have issues.

I think we can keep the new policy and address these sorts of issues by building better tooling. For example, if we really wanted, we could have gatekeeper change the tree status message and require the sheriff to fix it without closing the tree. That's a bit silly, but it would achieve the same goal without stopping chromium development in the process.

We have switched from a situation where everyone pays price of closed tree through slower CQ, which sucked for obvious reasons. However at least that suckiness incentivized some people to reopen tree. How can we incentivize people now?

We have a good plan for showing flakiness in sheriff-o-matic. The problem is just that bad flakiness can now go unnoticed (whereas before only mild flakiness went unnoticed). If we get to a point where we expect sheriffs to use sheriff-o-matic and we expose this in sheriff-o-matic as something the sheriffs need to address in order to consider the tree green, then I don't think we need further incentives or policy changes. We're not there yet, obviously.

If we just got to a point where sheriffs used sheriff-o-matic without the dedicated flakiness UI, I think a lot of problems like this would get caught because sheriffs would see the same failure repeating and would address it, similar to how they did before when the tree would close.
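
To illustrate what "seeing the same failure repeating" could mean in a tool, here is a small sketch that counts how often each step failed across recent builds, whether or not the failures were consecutive. The input shape, the function name, and the default threshold are assumptions made for the example and are not taken from sheriff-o-matic.

```python
from collections import Counter


def repeated_failures(builds, min_failures=3):
    """Return steps that failed in at least `min_failures` of the builds.

    `builds` is a list of sets, one per build, each holding the names of
    the steps that failed in that build (a hypothetical input shape).
    """
    counts = Counter(step for failed_steps in builds for step in failed_steps)
    return {step: n for step, n in counts.items() if n >= min_failures}
```

Because the count ignores green builds in between, a step that fails in scattered runs would still show up, which is the pattern that was hard to spot on the waterfall in this incident.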

Just saying welcome to new world & volunteers welcome to undo the suckiness from the new system is somewhat lacking.

Who is saying that? That's certainly not what I said. To clarify, right now I'm focusing my efforts on making sheriff-o-matic better so that sheriffs actually use it for non-flaky failures. Once that's working well, I'll focus on better-exposing flaky failures. We could do the work in parallel and do both parts of this faster if we had more help.

That said, I do wish more of the people complaining about the policy change would help improve things. I've wasted too much time arguing with people about the policy change rather than actually working on improving tooling to avoid problems like this one. To be clear, I'm not talking about John or Brett here. John is obviously helping a lot with related issues and Brett was just making an observation.

On Mon, Sep 8, 2014 at 8:38 PM, Ojan Vafai <oj...@chromium.org> wrote:
Have a lot of flaky tests been disabled in the past week and a half?

I'm still curious about the answer to this.

I'm also curious how often this happens. Hard to get data on that. So far this is the only instance we know of?

John Abd-El-Malek

Sep 9, 2014, 1:25:35 AM

to Ojan Vafai, Brett Wilson, ser...@chromium.org, Erik Arvidsson, hackability-cy

On Mon, Sep 8, 2014 at 10:08 PM, Ojan Vafai <oj...@chromium.org> wrote:

On Mon, Sep 8, 2014 at 9:32 PM, John Abd-El-Malek <j...@chromium.org> wrote:
Let's not look at this as binary: either old policy or keeping tree open always. Both have issues.

I think we can keep the new policy and address these sorts of issues by building better tooling.

I agree with this.

What worries me is that we switched tree policy without this tooling being done. If we keep having issues like the one we had over the last 5 days, then that'll be a big time sink that isn't measured.

For example, if we really wanted, we could have gatekeeper change the tree status message and require the sheriff to fix it without closing the tree. That's a bit silly, but it would achieve the same goal without stopping chromium development in the process.

We have switched from a situation where everyone pays price of closed tree through slower CQ, which sucked for obvious reasons. However at least that suckiness incentivized some people to reopen tree. How can we incentivize people now?

We have a good plan for showing flakiness in sheriff-o-matic. The problem is just that bad flakiness can now go unnoticed (whereas before only mild flakiness went unnoticed). If we get to a point where we expect sheriffs to use sheriff-o-matic and we expose this in sheriff-o-matic as something the sheriffs need to address in order to consider the tree green, then I don't think we need further incentives or policy changes. We're not there yet, obviously.

If we just got to a point where sheriffs used sheriff-o-matic without the dedicated flakiness UI, I think a lot of problems like this would get caught because sheriffs would see the same failure repeating and would address it, similar to how they did before when the tree would close.

Btw, I often hear that all the new tooling will be in sheriff-o-matic; however, anecdotal reports from chromium sheriffs suggest that most aren't using it. Do we have stats on what percentage of sheriffs are using it? Is anything being done to encourage sheriffs to try this out and collect feedback about what can be done to make it their preferred tool?

Just saying welcome to new world & volunteers welcome to undo the suckiness from the new system is somewhat lacking.

Who is saying that? That's certainly not what I said. To clarify, right now I'm focusing my efforts on making sheriff-o-matic better so that sheriffs actually use it for non-flaky failures. Once that's working well, I'll focus on better-exposing flaky failures. We could do the work in parallel and do both parts of this faster if we had more help.

That probably came off as not exactly what I meant, sorry.

It's just not clear to me (either way, I'm unsure myself) if we should change the tree opening policy before we have ways to cope with the resulting tragedy of the commons.

Dirk Pranke

Sep 9, 2014, 1:15:51 PM

to John Abd-El-Malek, Ojan Vafai, Brett Wilson, ser...@chromium.org, Erik Arvidsson, hackability-cy

On Mon, Sep 8, 2014 at 10:25 PM, John Abd-El-Malek <j...@chromium.org> wrote:

It's just not clear to me (either way, I'm unsure myself) if we should change the tree opening policy before we have ways to cope with the resulting tragedy of the commons.

We were in a bad place with the prior tree closing policy, so I think there is a real danger of letting the perfect be the enemy of the good.

As far as this incident goes, it seems to me that we've had plenty of times where the tree was mostly or totally broken all weekend before. Often, in that case, we relied on people doing work on their own on the weekend to try and clean things up, which is at best a somewhat painful recipe for success.

It seems to me the main downside of having the tree open while the failures persisted is that more changes landed in the meantime. It wasn't clear to me from Arv's initial note that that actually made his job harder.

So far I haven't heard any real complaints with the new policy besides this thread, and lots of people are happy that the tree is open more and the CQ is faster, so it seems to me that we made the right call to switch before all the tools were perfect.

Obviously, we should keep watching and get more data, though. And keep improving things!

-- Dirk

Julie Parent

Sep 9, 2014, 1:16:41 PM

to John Abd-El-Malek, Ojan Vafai, Brett Wilson, ser...@chromium.org, Erik Arvidsson, hackability-cy

Re: engaging with sheriffs.

When the policy change was made, we discussed having Ojan/Karen reach out directly to sheriffs to remind them of the new tooling, ask for feedback during/after their shifts, and, more importantly, stress that it is still their *job* to keep the tree green, even if it is open. Is that still being done? A cultural change like this will take some reminding to make happen. The first few sheriffs post-change gave a lot of great feedback.

I'm not sure if anyone reads the email, but we should add a note to the sheriff reminder e-mail with a link to the sheriff-o-matic documentation page and a link to easily file bugs. And, at least until we feel confident that sheriffs have switched, we should follow up directly with sheriffs after their shifts and ask for feedback.

Ojan Vafai

Sep 9, 2014, 2:27:59 PM

to Julie Parent, kar...@chromium.org, John Abd-El-Malek, Brett Wilson, ser...@chromium.org, Erik Arvidsson, hackability-cy

I have been engaging with some of the sheriffs. There were a number of serious bugs in the tool that came out of sheriffs actually using it. I've been working hard on getting those fixed. I plan on doing another round of encouraging sheriffs to give it a try soon.

I'd like to get to a point where the new tool is unambiguously better before pushing too hard for sheriffs to use it. I don't think we're far from that point.

On Tue, Sep 9, 2014 at 10:16 AM, Julie Parent <jpa...@chromium.org> wrote:

Re: engaging with sheriffs.

When the policy change was made, we discussed having Ojan/Karen reach out directly to sheriffs to remind them of the new tooling, ask for feedback during/after their shifts, and, more importantly, stress that it is still their *job* to keep the tree green, even if it is open. Is that still being done? A cultural change like this will take some reminding to make happen. The first few sheriffs post-change gave a lot of great feedback.

This particular problem doesn't seem like a cultural issue to me so much as a problem with the tooling (both waterfall and sheriff-o-matic) not surfacing the flakiness more prominently. But, I agree we should continue reaching out to sheriffs to try it out and give feedback. The feedback so far has been very helpful.

I'm not sure if anyone reads the email, but we should add a note to the sheriff reminder e-mail with a link to the sheriff-o-matic documentation page and a link to easily file bugs.

That's a good idea. Who owns that script?

And, at least until we feel confident that sheriffs have switched, we should follow up directly with sheriffs after their shifts and ask for feedback.

I've been doing this some, which has gotten me the feedback about things we need to fix. I should do it more proactively though, particularly now that most of the P0 issues are fixed. Karen, maybe you could help out with this? Or maybe you're still doing this?

On Mon, Sep 8, 2014 at 10:25 PM, John Abd-El-Malek <j...@chromium.org> wrote:

That probably came off as not exactly what I meant, sorry.

Heh. No worries. I'm probably being too defensive. Email is hard.

It's just not clear to me (either way, I'm unsure myself) if we should change the tree opening policy before we have ways to cope with the resulting tragedy of the commons.

You'll recall that I had a much more conservative plan for when we'd change the policy and only pushed it earlier due to encouragement from the other CY leads. That said, I agree with Dirk that we're doing overall much better with the new policy even though there are some legitimate issues like this.

Erik Arvidsson

Sep 9, 2014, 4:00:57 PM

to Dirk Pranke, John Abd-El-Malek, Ojan Vafai, Brett Wilson, ser...@chromium.org, hackability-cy

On Tue, Sep 9, 2014 at 1:15 PM, Dirk Pranke <dpr...@chromium.org> wrote:

It seems to me the main downside of having the tree open while the failures persisted is that more changes landed in the meantime. It wasn't clear to me from Arv's initial note that that actually made his job harder.

I don't think it made my job harder.

My main takeaway from this incident is that the results from the Android bots are hard to interpret, and at first I could not decipher what was wrong with the bot. Later, when another test started to fail, the error was still the same cryptic message, so I continued to ignore the failure. Fixing the error reporting from the bot would have caused me to notice the issues as they happened instead of 3 days later.

--
erik
