tl;dr: The builders are back up and running, and might even be healthy. They also shard tests and should be faster (like 5 minutes).
Long:
First we hit GCE Preemptible bugs last night.
https://cloud.google.com/compute/docs/instances/preemptible does say "This is a Beta release of Preemptible Instances. This feature is not covered by any SLA or deprecation policy and may be subject to backward-incompatible changes", but unfortunately the bug was on their side, and it only surfaced during our responsible fallback from Preemptible -> full-price VMs when a zone was out of batch resources.
But while deploying the workaround for the above bug, I decided (mostly because we don't use branches) to also deploy the new build coordinator that adg and I have been working on for the past week, which does test sharding. It had been doing just fine in our isolated dev project environment.
But of course, it had bugs. And one of the crashing child processes was repeatedly hammering git (gerrit), killing our gerrit quota for the day.
We were also running out of GCE quota, and not following it closely. So I finally fixed our GCE quota accounting. Instead of hard-coding 60 max builds, it now accounts for CPU, instances, and network addresses at runtime.
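As a rough illustration of what accounting for several resources (rather than one hard-coded build count) can look like, here is a minimal sketch. The Quota type, its field names, and the numbers are made up for the example and are not the coordinator's actual code; a real version would presumably fetch the limits from GCE at runtime.

```go
package main

import "fmt"

// Quota is a hypothetical bundle of the three resources mentioned above.
// It is used both for project limits and for current/needed usage.
type Quota struct {
	CPUs, Instances, Addresses int
}

// fits reports whether a build needing `need` can start when `used` is
// already consumed, without exceeding any single limit.
func (limit Quota) fits(used, need Quota) bool {
	return used.CPUs+need.CPUs <= limit.CPUs &&
		used.Instances+need.Instances <= limit.Instances &&
		used.Addresses+need.Addresses <= limit.Addresses
}

func main() {
	// Illustrative numbers only (500 CPUs echoes the new quota in this mail).
	limit := Quota{CPUs: 500, Instances: 100, Addresses: 100}
	used := Quota{CPUs: 496, Instances: 40, Addresses: 40}

	fmt.Println(limit.fits(used, Quota{CPUs: 4, Instances: 1, Addresses: 1})) // true
	fmt.Println(limit.fits(used, Quota{CPUs: 8, Instances: 1, Addresses: 1})) // false
}
```

The point is that a build is admitted only if every resource has room, so whichever quota is tightest at the moment becomes the effective limit.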
I requested more GCE quota (now 500 CPUs instead of 200), but with more builds running we were hitting Gerrit even harder.
So I added caching & coalescing of requests to Gerrit. But we'd already used our 24-hour (not rolling window) allotment, so I stopped and just implemented our own git archive server (git rev -> tar.gz) in the existing watcher process.
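For readers unfamiliar with the caching & coalescing idea, here is a minimal sketch of it, assuming an upstream fetch function keyed by git revision. The Fetcher type and all names in it are hypothetical and not the coordinator's real code: concurrent requests for the same key share one upstream call, and successful results are cached for later callers.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// call tracks one in-flight upstream request that others may wait on.
type call struct {
	done chan struct{} // closed when val and err are set
	val  string
	err  error
}

// Fetcher caches successful results and coalesces concurrent requests
// for the same key into a single upstream call.
type Fetcher struct {
	mu       sync.Mutex
	cache    map[string]string
	inflight map[string]*call
	fetch    func(key string) (string, error) // the expensive upstream call
}

func NewFetcher(fetch func(string) (string, error)) *Fetcher {
	return &Fetcher{
		cache:    map[string]string{},
		inflight: map[string]*call{},
		fetch:    fetch,
	}
}

func (f *Fetcher) Get(key string) (string, error) {
	f.mu.Lock()
	if v, ok := f.cache[key]; ok { // cache hit: no upstream traffic
		f.mu.Unlock()
		return v, nil
	}
	if c, ok := f.inflight[key]; ok { // someone is already fetching: wait
		f.mu.Unlock()
		<-c.done
		return c.val, c.err
	}
	c := &call{done: make(chan struct{})}
	f.inflight[key] = c
	f.mu.Unlock()

	c.val, c.err = f.fetch(key) // done outside the lock

	f.mu.Lock()
	delete(f.inflight, key)
	if c.err == nil {
		f.cache[key] = c.val // errors are not cached, so they can be retried
	}
	f.mu.Unlock()
	close(c.done)
	return c.val, c.err
}

func main() {
	var upstream int32
	f := NewFetcher(func(rev string) (string, error) {
		atomic.AddInt32(&upstream, 1)
		return "tarball-for-" + rev, nil
	})

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			f.Get("abc123") // hypothetical revision
		}()
	}
	wg.Wait()
	fmt.Println("upstream calls:", atomic.LoadInt32(&upstream)) // upstream calls: 1
}
```

The golang.org/x/sync/singleflight package provides the coalescing half of this; the sketch spells it out with the standard library only so it stays self-contained.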
Then I found a deadlock design problem in my test sharding. So I fixed that.
Things are making progress now.
In other news, tests are sharded now.
A build will now use multiple VMs. One runs make.bash, is snapshotted, and the snapshot is then mirrored to N other machines to distribute the tests. The first one tries to run tests in order (to make streaming results smooth); the other N do the slowest things first.
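The two pick orders can be sketched as follows. This is only an illustration under assumed names and timings (Test, Sched, and the durations are invented), not the coordinator's real scheduler, and it skips locking for brevity: the first machine claims the earliest unclaimed test in list order, while helpers claim the slowest unclaimed test.

```go
package main

import (
	"fmt"
	"sort"
)

// Test is a hypothetical test unit with an expected duration.
type Test struct {
	Name string
	Secs int
}

// Sched hands out each test once. Not safe for concurrent use;
// a real scheduler would guard claimed with a mutex.
type Sched struct {
	inOrder   []Test // original order, used by the first machine
	slowFirst []Test // slowest first, used by the helper machines
	claimed   map[string]bool
}

func NewSched(tests []Test) *Sched {
	slow := append([]Test(nil), tests...)
	sort.SliceStable(slow, func(i, j int) bool { return slow[i].Secs > slow[j].Secs })
	return &Sched{inOrder: tests, slowFirst: slow, claimed: map[string]bool{}}
}

// Next returns the next unclaimed test for the caller, or false if none remain.
func (s *Sched) Next(primary bool) (Test, bool) {
	list := s.slowFirst
	if primary {
		list = s.inOrder
	}
	for _, t := range list {
		if !s.claimed[t.Name] {
			s.claimed[t.Name] = true
			return t, true
		}
	}
	return Test{}, false
}

func main() {
	s := NewSched([]Test{{"a", 5}, {"b", 60}, {"c", 10}, {"d", 30}})
	for _, primary := range []bool{true, false, false, true} {
		t, _ := s.Next(primary)
		fmt.Println(primary, t.Name)
	}
	// Prints:
	// true a
	// false b
	// false d
	// true c
}
```

Having helpers take the slowest tests first means the long-running ones start early on other machines, while the first machine's in-order results can be streamed without gaps for as long as possible.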
all-compile is now misc-compile and only does things not otherwise covered by other trybot builders.
plan9-386-gcepartial is now plan9-386 and runs all the tests, so 0intro's full builder is now retired. (thanks!)
Most builds (and ideally trybots) should take ~5 minutes or less now and should get better later.
This sounds like chaos but things are actually getting better and more reliable.