Reliable CQ, Code Yellow Update


Eric Seidel
Sep 9, 2014, 1:23:20 PM
to hackability-cy
TL;DR: I’m retiring from the Code Yellow effort and declaring the
“Reliable CQ” area (which I lead) as having failed to meet its
expected Q3 exit.

I’ve written out a long (but Google-only, sorry!) post-mortem of my
involvement with the CY for the curious:
https://docs.google.com/a/google.com/document/d/1elvxzGkK0jRDk8WOUQpyOThzQj3dVVPVLTu2yFgEf20/edit

I will be leaving on vacation next week and returning to my Blink
duties (instead of Infrastructure) upon my return in Oct.


Sergey Berezin of the CQ team has offered to take on the “Reliable CQ”
task, and would do so with my full support! However (as discussed in
my post-mortem) I believe larger changes are needed to the Code Yellow
should it continue. We've been in this "emergency" state for 3 months
as of this Friday.

John Abd-El-Malek
Sep 9, 2014, 3:14:59 PM
to Eric Seidel, hackability-cy
On Tue, Sep 9, 2014 at 10:23 AM, Eric Seidel <ese...@chromium.org> wrote:
TL;DR: I’m retiring from the Code Yellow effort and declaring the
“Reliable CQ” area (which I lead) as having failed to meet its
expected Q3 exit.

There's a lot to discuss in the document. One thing that I want to talk about directly is this due date. As I've mentioned before, I don't think any arbitrary date should be used. We have serious problems in our developer infrastructure, and the Code Yellow should end when they are solved. The google3 Code Yellow for a similar scenario took longer than the initial estimate, but it finished when the initial goals were met.

 


--
You received this message because you are subscribed to the Google Groups "Chromium Hackability Code Yellow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hackability-c...@chromium.org.
To post to this group, send email to hackabi...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/hackability-cy/CA%2Bbb4fk87vRMU2f_tB_9zV1Oqj7AjdA7gLs37Kvwhng%3DhbzsEg%40mail.gmail.com.

Chase Phillips
Sep 9, 2014, 5:56:00 PM
to John Abd-El-Malek, Eric Seidel, hackability-cy
Thanks for sharing your feedback on your CY experience as a lead, Eric.

About the "due date" specifically, I did not expect any pillar to exit by the end of Q3.  Leadership did not set a time-based pressure for exiting the CY, either.

AFAIK Eric's Q3 expected exit was only ever an expectation he set for himself independently of the CY effort.


sergey...@chromium.org
Sep 10, 2014, 3:13:34 PM
to hackabi...@chromium.org
I'm happy to take over as tech lead for the "Reliable CQ" CY pillar. I've de facto been working on it anyway, and I hope my direct involvement will help the overall effort.

Thanks, Eric, for being very supportive, and for writing the postmortem - I find it illuminating.

The way I look at the "Reliable CQ" pillar, the largest chunk of it boils down to flakiness, whatever the cause. jam@'s new app chromium-try-flakes.appspot.com is a great step toward monitoring and exposing the various kinds of flakiness, and driving down the sources of flakiness will bring us much closer to the exit criteria on multiple pillars.

Deeper understanding of the CQ behavior is another key effort, and alancutter@ has been instrumental in creating chromium-cq-status.appspot.com. For the first time, we now have historical logs of what CQ did for each CL (similar to chrome-build-extract.appspot.com for waterfall / try job builds). We just need to build tooling for analyzing and displaying this data in trooper-o-matic, like chrome-monitor.appspot.com and chrome-infra-stats.appspot.com do based on chrome-build-extract.

I don't expect us to meet the exit criteria by the end of Q3, but I believe we can learn a lot about our systems in Q3, which will help us understand how to make them more robust in Q4.

Sergey.

Paweł Hajdan, Jr.
Sep 11, 2014, 6:20:14 AM
to Sergey Berezin, hackability-cy
Eric, thank you for writing the doc. I appreciate your contributions and leadership in the Code Yellow, and the doc you produced is a very important one in my opinion. I do not consider the effort a failure; whether or not it's truly a postmortem, it contains lots of insights I'm still digesting.

Sergey Berezin, Alan Cutter, Sergiy Byelozyorov and John Abd-El-Malek have been working on better understanding of CQ rejections. Anything more we can learn is very useful, including just to convince people about changes to address the problems.

If I can make a suggestion, I'd advocate for more emphasis on addressing the problems we know about. John did a ton of work to improve trybot speed, and it paid off - we're seeing much improved numbers in the stats. As far as I know most other work was on gathering more data.

If my understanding is correct, the top issues are:

1. compile failures
2. browser_tests failures (or test flakes in general)

Just wanted to check, what is our strategy/roadmap to tackle them? Here's my understanding of current efforts:

1a. we'll add a "compile (without patch)" step to see whether the tree was broken at a given revision
1b. we'll also distinguish between failures we're sure about and tryjob results that say "I don't know", and retry the latter more times; we already do this for infra failures
2. chromium-try-flakes.appspot.com is flagging specific tests, so I assume John is disabling them; is this enough, or could we study the flakiness of tests over a longer period of time, possibly using the BigQuery infrastructure that Sergiy was working on? Do we know exactly what kind of data we currently lack but would like to have, and what we would do with that data that we can't do with the current data?
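For what it's worth, the longer-horizon study could start very simply. A sketch over a made-up (test, passed) record shape, not the actual BigQuery schema:

```python
from collections import defaultdict

def flake_rates(runs):
    """runs: iterable of (test_name, passed) tuples collected over some
    period, e.g. weeks of tryjob results. Reports the failure rate per
    test so persistent offenders stand out from one-off failures."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for name, passed in runs:
        totals[name] += 1
        if not passed:
            failures[name] += 1
    # A steady 5-20% failure rate over a long window suggests flakiness
    # rather than a genuine regression, which tends to fail near 100%.
    return {name: failures[name] / totals[name] for name in totals}
```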

Paweł


John Abd-El-Malek
Sep 11, 2014, 11:32:07 AM
to Paweł Hajdan, Jr., Sergey Berezin, hackability-cy, Marc-Antoine Ruel, Vadim Shtayura
As part of looking at CQ time, I have spent a lot of the last month looking at the various failures that cause it to retry (and be slow as a result). We now have enough known issues to start tackling them. Here's a list of issues that I think we should fix as part of the Code Yellow. Some of these depend on sysadmin work (e.g. upgrading the Mac/Win bots).
-upgrade Mac bots to 10.9 to avoid the random-process-killing bug (https://code.google.com/p/chromium/issues/detail?id=313077)
-automatically detect bot running out of disk space and at least take the machine offline (https://code.google.com/p/chromium/issues/detail?id=410088)
-switch Windows 32-bit bots to new hardware: fixes OOM while fetching, simplifies bot management by having a big enough disk to put the Windows configs in one pool, and is about 20% faster: https://code.google.com/p/chromium/issues/detail?id=406960 (being worked on by sysadmins AFAIK)
-the network switch in the lab needs to be upgraded to avoid hangs during bot update: https://code.google.com/p/chromium/issues/detail?id=398229 (planned; I think sysadmins are waiting for hardware. See these 137 failures over the last 9 days: http://chromium-try-flakes.appspot.com/search?q=update_scripts)
-uninstall VS2010 & VS2012 from bots, since the C: partition is too small and runs out of space: https://code.google.com/p/chromium/issues/detail?id=410567
-swarming: restart bots every day to work around leaks in Windows (https://code.google.com/p/chromium/issues/detail?id=413005)
-swarming: internally retry when 'bot died' (there are 284 occurrences over the last 9 days according to chromium-try-flakes; these are the ones with 'TEST RESULTS WERE INVALID' in the flake name): (https://code.google.com/p/chromium/issues/detail?id=401124)
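The retry items above boil down to separating "the test failed" from "the run told us nothing". A minimal sketch of that policy (hypothetical result values, not the actual swarming API):

```python
def run_with_retry(run_once, max_attempts=3):
    """run_once() -> "pass", "fail", or "invalid" ("bot died", results
    missing). Only "invalid" outcomes are retried, so genuine test
    failures are returned immediately and never masked by retries."""
    result = "invalid"
    for _ in range(max_attempts):
        result = run_once()
        if result != "invalid":
            return result
    return result  # infra never produced a usable result
```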

In general, one area where we could really benefit is automatically detecting when a bot is in a bad state. In this scenario, the bot fails every run with the same error. This happens for a wide variety of reasons; some examples I've seen over the last few weeks:
-disk full
-ninja or NaCl build_nexe bugs, or a missing gyp dependency, which leave the output directory in a bad state and needing a clobber
-systemic problems affecting the whole config (like the android_aosp outage we have experienced over the last few days)
-git checkout being corrupted

We need to detect that a bot is borked and stop sending jobs to it. The detection part is not going to be easy: we have to guard against tree failures triggering it, or someone's bad patch being tried on a single VM a few times in a row, etc. As a start, before taking bots out of commission, simple heuristics (e.g. 3 failed builds in a row with the same error while other jobs for the same config are passing) plus a dashboard of suspected 'bad machines' would help tell troopers and sheriffs where to look.
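A minimal sketch of a detection heuristic along these lines (all names hypothetical; a real version would consume the build result feed):

```python
from collections import deque

class BadBotDetector:
    """Sketch of the 'bad machine' heuristic: flag a bot after `window`
    consecutive failures with the same error signature, provided other
    bots running the same config are passing."""

    def __init__(self, window=3):
        self.window = window
        self.history = {}  # bot name -> deque of (passed, signature)

    def record(self, bot, passed, signature=None):
        self.history.setdefault(bot, deque(maxlen=self.window)).append(
            (passed, signature))

    def is_suspect(self, bot, config_passing_elsewhere):
        runs = self.history.get(bot, ())
        if len(runs) < self.window:
            return False  # not enough evidence yet
        all_failed = not any(passed for passed, _ in runs)
        same_error = len({sig for _, sig in runs}) == 1
        # Same repeated error while sibling bots are green -> likely a
        # borked machine rather than a broken tree or a bad patch.
        return all_failed and same_error and config_passing_elsewhere
```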

On Thu, Sep 11, 2014 at 3:19 AM, Paweł Hajdan, Jr. <phajd...@chromium.org> wrote:
Eric, thank you for writing the doc. I appreciate your contributions and leadership in the Code Yellow, and the doc you produced is a very important one in my opinion. I do not consider the effort a failure; whether or not it's truly a postmortem, it contains lots of insights I'm still digesting.

Sergey Berezin, Alan Cutter, Sergiy Byelozyorov and John Abd-El-Malek have been working on better understanding of CQ rejections. Anything more we can learn is very useful, including just to convince people about changes to address the problems.

If I can make a suggestion, I'd advocate for more emphasis on addressing the problems we know about.

I couldn't +1 this more.
 
John did a ton of work to improve trybot speed, and it paid off - we're seeing much improved numbers in the stats. As far as I know most other work was on gathering more data.

If my understanding is correct, the top issues are:

1. compile failures
2. browser_tests failures (or test flakes in general)
 
I'm not sure when this analysis was done, but I'd be curious to see the scripts and/or updated results.


Just wanted to check, what is our strategy/roadmap to tackle them? Here's my understanding of current efforts:

1a. we'll add a "compile (without patch)" step to see whether the tree was broken at a given revision

Looking at compile failures: everything I see points to 'tree broken at head' not being the primary cause. It does happen, but rarely. What is more common are the issues linked above (I won't duplicate them here).
 

Sergiy Byelozyorov
Sep 12, 2014, 6:22:14 AM
to John Abd-El-Malek, Paweł Hajdan, Jr., Sergey Berezin, hackability-cy, Marc-Antoine Ruel, Vadim Shtayura
One of the efforts is to implement an auto-bisect bot, which will automatically find and eventually revert patches that introduce flakiness. This is arguably better than just disabling the test, but we will need to evaluate how reliably the bisect bot can find the right commit.
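One wrinkle worth noting: plain bisection is unreliable for flakes, since a single green run proves little; each candidate revision needs many runs. A toy sketch of that idea (hypothetical run_test callback, not our actual recipe code):

```python
def bisect_flake(revisions, run_test, runs_per_rev=20):
    """Binary-search for the first revision where a test became flaky.
    run_test(rev) -> True if one execution of the test passed at rev.
    A revision counts as flaky if any of `runs_per_rev` executions
    fail; the repetition is what makes bisecting a flaky test
    meaningful at all."""
    def is_flaky(rev):
        return any(not run_test(rev) for _ in range(runs_per_rev))

    # Precondition: revisions[0] is known-good, revisions[-1] is flaky.
    lo, hi = 0, len(revisions) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_flaky(revisions[mid]):
            hi = mid
        else:
            lo = mid + 1
    return revisions[lo]
```

Even with repetition there is a residual false-negative rate per revision, which is why we need to measure how often the bot lands on the right commit before letting it auto-revert.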

John Abd-El-Malek
Sep 12, 2014, 10:39:47 AM
to Sergiy Byelozyorov, Paweł Hajdan, Jr., Sergey Berezin, hackability-cy, Marc-Antoine Ruel, Vadim Shtayura
This is something I'm really looking forward to :)

Kenneth Russell
Sep 12, 2014, 3:50:18 PM
to John Abd-El-Malek, Sergiy Byelozyorov, Paweł Hajdan, Jr., Sergey Berezin, hackability-cy, Marc-Antoine Ruel, Vadim Shtayura
Yes, a bot like this would be immensely helpful.

Depending on how much hardware would be required, it would be great if one configuration of this bot (Win, Mac, or Linux) could use physical machines, so that it can run the tests currently run on the tryserver.chromium.gpu waterfall.


Sergiy Byelozyorov
Sep 15, 2014, 3:39:04 AM
to Kenneth Russell, John Abd-El-Malek, Paweł Hajdan, Jr., Sergey Berezin, hackability-cy, Marc-Antoine Ruel, Vadim Shtayura
Currently the auto-bisect bot effort is sufficiently staffed. The necessary parts should be finished this week, but we'll probably need some time to make sure they work together.