Builder update & mini post-mortem


Brad Fitzpatrick

Jun 5, 2015, 5:14:46 PM
to golang-dev
tl;dr: The builders are back up and running, and might even be healthy. They also shard tests and should be faster (like 5 minutes).

Long:

First we hit GCE Preemptible bugs last night. https://cloud.google.com/compute/docs/instances/preemptible does say "This is a Beta release of Preemptible Instances. This feature is not covered by any SLA or deprecation policy and may be subject to backward-incompatible changes", but unfortunately the bug was on their side, and it only surfaced during our responsible fallback from Preemptible -> full-price VMs when a zone was out of batch resources.

But in deploying the workaround to the above bug I decided (mostly because we don't use branches) to just deploy the new test-sharding build coordinator that adg and I have been working on for the past week. It had been doing just fine in our isolated dev project environment.

But of course, it had bugs. And one of the crashing child processes was repeatedly hammering git (gerrit), killing our gerrit quota for the day.

We were also running out of GCE quota, and not following it closely. So I finally fixed our GCE quota accounting. Instead of hard-coding 60 max builds, it now accounts for CPU, instances, and network addresses at runtime.
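A minimal sketch of what runtime quota accounting along these lines could look like. Everything here (Resources, Quota, TryAcquire, Release) is a hypothetical name for illustration, not the coordinator's actual code, and the limits are assumed to be supplied by the caller after being looked up at startup.

```go
package main

import "sync"

// Resources is a hypothetical bundle of the three quota dimensions
// mentioned above: CPUs, instances, and network addresses.
type Resources struct {
	CPUs      int
	Instances int
	Addresses int
}

// Quota tracks current usage against limits supplied at runtime,
// instead of a hard-coded maximum number of builds.
type Quota struct {
	mu    sync.Mutex
	limit Resources
	used  Resources
}

// NewQuota returns a tracker with nothing reserved yet.
func NewQuota(limit Resources) *Quota {
	return &Quota{limit: limit}
}

// TryAcquire reserves need only if every dimension still fits.
// On failure nothing is reserved.
func (q *Quota) TryAcquire(need Resources) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.used.CPUs+need.CPUs > q.limit.CPUs ||
		q.used.Instances+need.Instances > q.limit.Instances ||
		q.used.Addresses+need.Addresses > q.limit.Addresses {
		return false
	}
	q.used.CPUs += need.CPUs
	q.used.Instances += need.Instances
	q.used.Addresses += need.Addresses
	return true
}

// Release returns a previous reservation to the pool.
func (q *Quota) Release(r Resources) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.used.CPUs -= r.CPUs
	q.used.Instances -= r.Instances
	q.used.Addresses -= r.Addresses
}
```

A caller would call TryAcquire before creating a VM and Release once the VM is deleted. Checking all three dimensions together means a build is refused as soon as any one of them would go over its limit.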

I requested more GCE quota (now 500 CPUs instead of 200), and with that extra capacity we were killing Gerrit even harder.

So I added caching & coalescing of requests to Gerrit. But we'd already used our 24-hour (not rolling window) allotment, so I stopped and just implemented our own git archive server (git rev -> tar.gz) in the existing watcher process.
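For readers unfamiliar with the idea, here is a rough sketch of caching plus request coalescing, assuming results are keyed by git revision. The Cache type and the fetch callback are hypothetical stand-ins and not the watcher's real code. Concurrent requests for the same key share one upstream fetch, and successful results are kept so later requests never go upstream at all.

```go
package main

import "sync"

// call represents one in-flight upstream fetch that other
// callers for the same key can wait on.
type call struct {
	done chan struct{}
	val  []byte
	err  error
}

// Cache is a hypothetical cache that also coalesces concurrent
// requests for the same key into a single upstream fetch.
type Cache struct {
	mu       sync.Mutex
	cached   map[string][]byte
	inflight map[string]*call
	fetch    func(key string) ([]byte, error)
}

// NewCache wraps fetch, which stands in for the expensive upstream
// request (for example, asking for an archive of a git revision).
func NewCache(fetch func(key string) ([]byte, error)) *Cache {
	return &Cache{
		cached:   make(map[string][]byte),
		inflight: make(map[string]*call),
		fetch:    fetch,
	}
}

// Get returns the cached value, waits for an in-flight fetch of the
// same key, or performs the fetch itself. Errors are not cached.
func (c *Cache) Get(key string) ([]byte, error) {
	c.mu.Lock()
	if v, ok := c.cached[key]; ok {
		c.mu.Unlock()
		return v, nil
	}
	if cl, ok := c.inflight[key]; ok {
		c.mu.Unlock()
		<-cl.done // someone else is already fetching this key
		return cl.val, cl.err
	}
	cl := &call{done: make(chan struct{})}
	c.inflight[key] = cl
	c.mu.Unlock()

	cl.val, cl.err = c.fetch(key) // the only upstream request for key

	c.mu.Lock()
	delete(c.inflight, key)
	if cl.err == nil {
		c.cached[key] = cl.val
	}
	c.mu.Unlock()
	close(cl.done) // wake up everyone who coalesced onto this call
	return cl.val, cl.err
}
```

Because a given revision's contents never change, a cache like this needs no expiry for successful results. Failed fetches are deliberately not stored so that a later request can retry.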

Then I found a deadlock design problem in my test sharding. So I fixed that.

Things are making progress now.

In other news, tests are sharded now.

A build will now use multiple VMs. One runs make.bash, is snapshotted, and then is mirrored to N other machines to distribute out the tests. The first one tries to run tests in order (to make streaming results smooth); the other N do the slowest things first.
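The two ordering policies can be sketched roughly as follows, assuming each test has an estimated duration available. The Test and Scheduler types, the EstSeconds field, and the method names are hypothetical and only illustrate the idea, not the coordinator's real design. The first VM claims the next unclaimed test in the original order, while helper VMs claim the slowest unclaimed test.

```go
package main

import (
	"sort"
	"sync"
)

// Test is a hypothetical unit of work with an estimated duration.
type Test struct {
	Name       string
	EstSeconds int
}

// Scheduler hands out each test exactly once, using a different
// order depending on which kind of VM is asking.
type Scheduler struct {
	mu      sync.Mutex
	inOrder []Test          // original order, for the first VM
	bySlow  []Test          // slowest first, for the helper VMs
	claimed map[string]bool // test names already handed out
}

// NewScheduler copies tests into both orderings.
func NewScheduler(tests []Test) *Scheduler {
	s := &Scheduler{
		inOrder: append([]Test(nil), tests...),
		bySlow:  append([]Test(nil), tests...),
		claimed: make(map[string]bool),
	}
	sort.SliceStable(s.bySlow, func(i, j int) bool {
		return s.bySlow[i].EstSeconds > s.bySlow[j].EstSeconds
	})
	return s
}

// next claims the first unclaimed test in list, if any.
func (s *Scheduler) next(list []Test) (Test, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, t := range list {
		if !s.claimed[t.Name] {
			s.claimed[t.Name] = true
			return t, true
		}
	}
	return Test{}, false
}

// NextForPrimary returns the next unclaimed test in original order,
// which keeps streamed results appearing in a predictable sequence.
func (s *Scheduler) NextForPrimary() (Test, bool) { return s.next(s.inOrder) }

// NextForHelper returns the slowest unclaimed test, so long-running
// tests start early on the other machines.
func (s *Scheduler) NextForHelper() (Test, bool) { return s.next(s.bySlow) }
```

Starting the slowest tests first on the helpers is a common way to keep one long test from being picked up last and stretching the overall wall-clock time.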

all-compile is now misc-compile and only does things not otherwise covered by other trybot builders.

plan9-386-gcepartial is now plan9-386 and runs all the tests, so 0intro's full builder is now retired. (thanks!)

Most builds (and ideally trybots) should take ~5 minutes or less now and should get better later.

This sounds like chaos but things are actually getting better and more reliable.

Andrew Gerrand

Jun 5, 2015, 6:52:42 PM
to Brad Fitzpatrick, golang-dev

On 5 June 2015 at 14:14, Brad Fitzpatrick <brad...@golang.org> wrote:
This sounds like chaos but things are actually getting better and more reliable.

I can confirm that this is true. Thanks for the hard work Brad.

Dave Cheney

Jun 5, 2015, 8:43:40 PM
to Andrew Gerrand, Brad Fitzpatrick, golang-dev
Seconded. This work doesn't just make the build faster to run after the commit lands; it also improves the speed of the pre-commit builders, which in turn encourages you to use the trybots, even for that change you know couldn't possibly break anything.

Thanks Brad, this is a huge step in the battle for quality.



Aram Hăvărneanu

Jun 6, 2015, 6:32:21 AM
to Dave Cheney, Andrew Gerrand, Brad Fitzpatrick, golang-dev
Thanks Brad, for making everything faster and better.

--
Aram Hăvărneanu