Hackability Code Yellow Meeting Notes - October 29


Julie Parent

Oct 29, 2014, 7:25:00 PM10/29/14
to hackability-cy

Wednesday, October 29

Summary

Code yellow work is still ongoing. Please continue to reach out to the leads if you'd like to be involved.

Area updates

Tree closers and CQ configs match (phajdan)

  • Fixed 1 blocking issue, but found another, so no change in coverage %

  • Considering changing the metric to: CQ lands a change that breaks the main waterfall config (since this is the symptom we really care about).  Caveat: main waterfall flakes can give false signals here.

  • ng trybot conversion in progress (issue: runtimes seem to be longer, investigating this)

  • Working on increasing GPU builder capacity (moving to swarming), re-using compile step, adding more coverage


Tree open > 80% (ojan)

  • agable will be helping out keeping builder_alerts running

  • AI: Karen to ping Raman on src-internal bug and follow up on crbug.com/417405

  • ojan fixed a major bug and things are running better now, but it still stops running sometimes

  • mac bots (main waterfall) are finally on 10.9, and … the horrible bug is now gone! Tree open time much improved over the weekends.  Next up: try server.

    • jam@ notes that the specific bug is fixed, but what about similar issues that could cause the tree to stay closed all weekend? Should we do something to re-open automatically? ojan@ says yes, there are speculative things we could do, but he's not sure it's worth the investment.

Reduce CQ false rejection rate (sergeyberezin)

  • The false rejection rate this week was 7.4%, mostly due to an outage on Wednesday.

  • sheyang@ and I are looking into ways to automatically monitor the overall health of the tryserver and adjust CQ's behavior based on it, so that CQ can survive such outages without requiring developers to re-click the button.


Lower Chromium bot cycle time (jam)

  • Looking more into GPU bots, working with ken russell on increasing utilization.  GPU bots have long queues and spend a lot of time rebooting, which is affecting cycle times.  Moving to swarming will improve everything.  This is a priority for the team, being worked on now.


Dashboards/monitoring (jparent)

  • https://trooper-o-matic.appspot.com/cq/chromium now shows flakiness rates. Click a point to dig into results.

  • jam@ notes that the CQ patch time breakdown isn't a good metric - it includes time waiting for LGTM, tree closed, etc.  We want this info somewhere (the total time graph encompasses it), but it makes the single-run graph less actionable. [ AI(jparent): ask alancutter to tighten this metric ]

  • AI(jparent): circulate graphs about trooper-o so others can get their eyes on it

Mike Stipicevic

Oct 29, 2014, 8:05:48 PM10/29/14
to Julie Parent, hackability-cy, Sergey Berezin, she...@chromium.org
Sergey and Sheng, can you talk more about "looking into ways to automatically monitor overall health of the tryserver and adjust CQ's behavior based on that"?

I worry that CQ is a complicated beast as-is, and simplicity and resiliency work in its favor (the end-to-end principle factors in here). For example, phajdan.jr@ removed CQ's polling of the TS, which greatly increased its reliability. What were you considering?

- Mike


Sergey Berezin

Oct 29, 2014, 8:28:16 PM10/29/14
to Mike Stipicevic, Julie Parent, hackability-cy, she...@chromium.org
Hi Mike,

Here's my vision, in a short paragraph. First, motivation: if I, as a human, am trying to land a change and discover that my tryjobs are failing because of an outage, I'd wait for the system to be fixed before retrying. That's natural, intelligent behavior. So I want CQ's artificial intelligence to step up a notch toward that.

Now, implementation. Say we build a separate monitoring system (not part of CQ) that can summarize the overall health of the tryserver / tree as a single number in [0..100], roughly representing the probability of a tryjob failing for the wrong reasons (any reason at all). Then on the CQ side, all it has to do is:
  1. Try the jobs as usual (unless the system is really, really broken).
  2. If some jobs fail, check whether the overall system is healthy before retrying. If the score is below a certain threshold, simply hold, possibly with a timeout, until things are better.
  3. Retry the jobs as usual.
CQ already has similar logic for tree closure, and it's not hard to add another check.
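To make the steps above concrete, here is a minimal sketch of the hold-before-retry decision. All names here (HEALTH_THRESHOLD, next_action, the 'hold'/'retry' labels) are illustrative assumptions, not actual CQ code; the health score is assumed to come from the external monitoring system described above.

```python
# Hypothetical sketch of the proposed CQ retry policy. The health score
# (0 = tryserver totally dead, 100 = fully healthy) is produced by a
# separate monitoring system, not by CQ itself.

HEALTH_THRESHOLD = 50  # the "magic number"; where to draw the line is up to CQ devs


def next_action(job_failed, health_score):
    """Decide what CQ should do after a tryjob finishes.

    Returns 'done' when the job passed, 'hold' when the system looks
    unhealthy (wait, possibly with a timeout, before retrying), and
    'retry' when the failure is worth retrying immediately.
    """
    if not job_failed:
        return 'done'
    if health_score < HEALTH_THRESHOLD:
        # Likely an infrastructure outage: don't burn a retry on a run
        # that will probably fail for the wrong reasons.
        return 'hold'
    return 'retry'
```

The point of keeping the score continuous is that the threshold (or a richer policy) can be tuned without touching the monitoring side.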

Using a continuous health number instead of a binary good/bad may allow CQ to make more intelligent decisions. For instance, 0 would mean "tryserver is totally dead", so CQ should not even attempt any tryjobs, while 50 might be worth the risk if only one builder failed. Or, for simplicity, we may just draw the line at 50 - that's up to the CQ devs to decide.

The key here is the magic number for the overall health summary (possibly per builder), and that is what we are evaluating for feasibility.

Sergey.

Mike Stipicevic

Oct 29, 2014, 8:41:30 PM10/29/14
to Sergey Berezin, Mike Stipicevic, Julie Parent, hackability-cy, she...@chromium.org
So this adds almost no code to the CQ process itself, right? You're just going to have two 'tree closure' verifiers instead of one? That sounds fine to me. I was worried you were going to add more logic to the process itself, which would add extra complexity to an already unpredictable system :). I would leave the CQ data binary and make the 'go/no go' decision elsewhere.
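Mike's point - keep what CQ sees binary and draw the line elsewhere - could be sketched roughly like this. The class and method names are hypothetical, chosen only to mirror the shape of a tree-closure-style check:

```python
# Hypothetical sketch: a second tree-closure-style verifier that
# collapses the continuous health score to a go/no-go answer before it
# ever reaches CQ. The threshold lives here, outside the CQ process.


class TryserverHealthVerifier:
    """Mirrors the existing tree-closure check: CQ only sees a boolean."""

    def __init__(self, threshold=50):
        self.threshold = threshold

    def can_proceed(self, health_score):
        # All the "intelligence" (where to draw the line) stays in this
        # verifier; CQ's own logic remains unchanged.
        return health_score >= self.threshold
```

Under this design CQ's state machine is untouched; only the verifier list grows by one entry.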

Backpressure is an important part of a reliable system so this sounds chill.

John Abd-El-Malek

Dec 3, 2014, 12:02:13 PM12/3/14
to Sergey Berezin, Mike Stipicevic, Julie Parent, hackability-cy, she...@chromium.org
IMO this sounds like too much complexity, and it has all the problems of heuristic-based systems. We should focus on decreasing the time between when bad patches land and when we discover and revert them.
