Changing how builders run


Brad Fitzpatrick

Sep 3, 2014, 3:27:49 PM
to golang-dev
I started a doc on recent efforts to modernize and clean up how we run the Go builders:


Comments welcome here by email.


I would absolutely love help migrating more builders to this infrastructure (e.g. FreeBSD, OpenBSD, NetBSD, Plan 9).

If you're good at qemu, Vagrant, Docker, or any of those types of things, let me know.

Thanks!

Aram Hăvărneanu

Sep 3, 2014, 6:38:33 PM
to Brad Fitzpatrick, golang-dev
I really like the trybot mechanism.

Plan 9 doesn't currently work on the flavor of KVM provided by GCE. I
don't see any reason to drag Plan 9 into this, we're fine, really.

I don't think we should do any cross-arch builds on QEMU if we have
the real-hardware alternative. I have found *many* bugs in QEMU, and
QEMU can sometimes run code that would not run on real hardware (QEMU
is often lenient about things explicitly forbidden by the CPU
manufacturer, and it implements a memory coherency model stronger than
real hardware's).

As for Solaris, sorry again, but we're more than fine. Other than
that, anything that makes you happy.

--
Aram Hăvărneanu

andrewc...@gmail.com

Sep 3, 2014, 8:00:52 PM
to golan...@googlegroups.com
I've used headless qemu a fair bit and Docker as well. I'd be happy to help with things.

For cross-arch builds, if qemu isn't accurate enough, is lack of hardware the main issue? Buildroot or something like debootstrap might help make things more reproducible even if real hardware is required.

Brad Fitzpatrick

Sep 3, 2014, 8:20:32 PM
to Andrew Chambers, golang-dev
For ARM, it's mostly how unreliable the machines are. And with enough (x86) CPU in the "cloud", we might even be able to do builds faster than real hardware, given enough sharding.

I wouldn't throw out all our real hardware builders, but having reliable qemu builders that can at least tell us quickly when we've broken obvious things would be nice.

I agree a lighter userspace on our ARM builders would be nice; e.g. apparently Arch Linux makes Docker on ARM easily available: http://sc5.io/blog/2014/07/a-private-raspberry-pi-cloud-with-arm-docker/

More help would be great.

Let me know what part you'd like to work on and I'll file bugs so we can discuss there and not spam this list too much going forward.




andrewc...@gmail.com

Sep 3, 2014, 8:39:24 PM
to golan...@googlegroups.com, andrewc...@gmail.com
I can try to make a dockerfile which creates a repeatable arm linux filesystem and then executes it with qemu to run the builder stuff.
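
The rough shape might be something like this; a sketch only, assuming
debootstrap and qemu-user-static are available, and with the mirror,
suite, kernel image, and machine type as placeholders:

# Sketch only: build a repeatable ARM rootfs with debootstrap, then
# boot it under qemu-system-arm. Mirror, suite, kernel image, and
# machine type are placeholders and would need tuning for a real builder.
sudo debootstrap --arch=armhf --foreign wheezy rootfs http://mirror.example.com/debian
sudo cp /usr/bin/qemu-arm-static rootfs/usr/bin/  # from the qemu-user-static package
sudo chroot rootfs /debootstrap/debootstrap --second-stage

# Pack the rootfs into a disk image (virt-make-fs is from libguestfs)
# and boot it headless with a kernel built for the emulated board.
virt-make-fs --format=raw --type=ext4 rootfs arm-rootfs.img
qemu-system-arm -M versatilepb -kernel zImage-versatile -nographic \
    -hda arm-rootfs.img -append "root=/dev/sda console=ttyAMA0"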

Brad Fitzpatrick

Sep 3, 2014, 8:48:01 PM
to Andrew Chambers, golang-dev

David Symonds

Sep 3, 2014, 9:47:40 PM
to Brad Fitzpatrick, golang-dev
This might be a good time to switch the dashboard to use pull queues
that the coordinator/builder requests (possibly in batches). That'll
make it easier to have multiple builders for a given build target
(especially useful for the slower ones, but also good for the fast
ones when a big pile of commits lands together), and also allow us to
do build retries, whether automatic or human-triggered.

Andrew Gerrand

Sep 3, 2014, 9:55:00 PM
to David Symonds, Chris Manghane, Brad Fitzpatrick, golang-dev
I suggested the same, but I don't think Brad really wants to touch the dashboard code and I don't really have the cycles to take it on right now. Maybe Chris is interested in tying this in with his trybot changes?


Dmitry Vyukov

Sep 4, 2014, 3:26:19 AM
to Brad Fitzpatrick, golang-dev
This is not directly related to builders, but it is somewhat related:
building the race runtime. Currently I need to find a Linux, Windows,
Darwin, and FreeBSD machine, ssh into them, and do some manual steps.
This is painful, non-reproducible, and only I know how to do it (hey,
you'd better hire me a personal driver, for safety). If we had
Linux/Windows/Darwin/FreeBSD VMs on GCE, we could automate the process
of building the race runtime. Basically, you give it a revision
number, and it gives you 4 race_goos.syso files. Is it possible?
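
A rough sketch of what that automation could look like, assuming one
VM per OS reachable over ssh and a compiler-rt checkout on each; the
machine names, paths, and ssh setup are all assumptions:

#!/bin/sh
# Sketch: rebuild the race runtime on one VM per OS and collect the
# resulting .syso files. buildgo.sh is compiler-rt's Go race build
# script; everything else here is made up for illustration.
REV=$1
for goos in linux windows darwin freebsd; do
    ssh "race-builder-$goos" \
        "cd compiler-rt && svn update -r $REV && cd lib/tsan/go && ./buildgo.sh" &&
    scp "race-builder-$goos:compiler-rt/lib/tsan/go/race_${goos}_amd64.syso" .
done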

Brad Fitzpatrick

Sep 4, 2014, 9:36:02 AM
to Dmitry Vyukov, golang-dev
Totally possible.

File a bug.

Dmitry Vyukov

Sep 4, 2014, 9:42:45 AM
to Brad Fitzpatrick, golang-dev

Marc-Antoine Ruel

Sep 4, 2014, 9:53:36 AM
to Brad Fitzpatrick, Dmitry Vyukov, golang-dev
Just FYI,

What Brad describes as the (standalone process) Coordinator + the AppEngine dashboard was implemented as an AppEngine-based app named Swarming. Swarming was written to handle Chromium's workload: the task distributor currently handles up to ~30k tasks/day and the file distributor up to 130 million file cache hits per day.

It implements OAuth2-based authentication, centralized ACLs + an IP whitelist, and Windows support, and it uses a strict-priority stack of FIFO task queues, which is a hard requirement for supporting a Try Server in the first place. Bot selection is done via "dimensions": a list of properties on the TaskRequest that must be matched by the Bot.

Because coordination is done purely in AppEngine's DB, there's no standalone server to manage and everything is done via HTTPS. Nobody likes setting up a certificate for TLS on a standalone server; being on AppEngine saves that hassle.

The main downsides of this approach are slower streaming and slightly slower coordination: total end-to-end overhead is in the ~8s range, and stdout streaming lags by more than 10s. AppEngine is flaky, so the client scripts had to be written to tolerate significant server failure, but I see this as a long-term gain: the server can die for minutes and all tasks will still succeed, albeit more slowly. That is especially important during partial AppEngine downtime, which is quite frequent. We also have monitoring.

In Chromium land, we had to figure out file transfer first, so we're using a content-addressed cache named Isolate Server, a front end to Cloud Storage that acts as the LRU cache manager and a lookup accelerator for hot cache items. This is likely a non-issue for the Go build system, given the small files and small checkouts involved.

.isolate and .isolated files are how we currently manage the isolation (what you describe as using Docker and VM snapshotting). The main advantage of this OS-agnostic approach is that it's easy to reimplement on any OS (someone is currently looking at a native Android client, for example). It's much leakier than Docker, though, so I'd like to go lower-level per OS as we improve.

So in the end the workflow looks like:
- Write your end-to-end build script.
- Write a .isolate file to list everything that needs to be archived so the task is self contained. Conditions (like OS, build type) are supported.
- Archive it on the isolate server. This "compiles" the .isolate file into a .isolated file.
- Trigger a swarming task, along with the hash (currently SHA-1, but that can be changed) of the .isolated file produced.
- Get results either via command line tool or web UI.
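
For a concrete feel, the client side of that workflow is roughly the
following; the tools are the Chromium swarming_client scripts, but the
flags and server URLs here are illustrative rather than exact:

# Illustrative only: flag names and servers are approximate; see the
# swarming_client documentation for the real interface.
isolate.py archive --isolate build.isolate --isolated build.isolated \
    --isolate-server https://isolate.example.com
swarming.py trigger --swarming https://swarming.example.com \
    --isolate-server https://isolate.example.com \
    --dimension os Linux --task-name go-build <isolated-hash>
swarming.py collect --swarming https://swarming.example.com go-build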

That said, I'm not totally sure it's a good fit; I'm just noting that it exists, is used by Chromium, and is staffed and not going away anytime soon. The main caveat is that many APIs are being completely rewritten as we speak, which is still a matter of weeks from settling. We have a fairly good canarying release process. Heck, if Chrome Infra agreed, you could use our fleet (free-riding at low priority) right away. It's designed to be open-source friendly.

Sadly it's written in Python, but I'd love to rewrite it in Go if I had some free time; I even started a Go prototype of the isolate server.

Brad, if you want to take a deeper look, IM or email me and I'll grant you access to the canary server so you can look at the admin pages.

Warning: my html skills really suck.

Thanks,

M-A



Brad Fitzpatrick

Sep 4, 2014, 2:00:08 PM
to Marc-Antoine Ruel, Dmitry Vyukov, golang-dev
Interested. I hadn't heard of that. But it also seems like more Python, App Engine, and abstraction than I have patience for.

I think I'll stick with a Go-based, GCE-based solution that does exactly what Go needs and nothing more.

Russ Cox

Sep 5, 2014, 2:06:48 PM
to Aram Hăvărneanu, Brad Fitzpatrick, golang-dev
On Wed, Sep 3, 2014 at 6:38 PM, Aram Hăvărneanu <ara...@mgk.ro> wrote:
> Plan 9 doesn't currently work on the flavor of KVM provided by GCE. I
> don't see any reason to drag Plan 9 into this, we're fine, really.
> ...
> As for Solaris, sorry again, but we're more than fine. Other than
> that, anything that makes you happy.

Maybe you're fine, but "we" are not. Right now it's actually very hard for a developer who has broken a build to get to a Plan 9 or Solaris and see what's going on. If we can define standard machine images that the builder spins up, we can also make it easy for anyone who needs to try something on one of those systems to spin up their own image. That's a Very Good Thing.

I would love to see Plan 9 and Solaris included in the new builder stuff. It is too bad that Plan 9 doesn't work with GCE. Maybe that can be fixed. What about Solaris?

Russ

Brad Fitzpatrick

Sep 5, 2014, 3:23:03 PM
to Russ Cox, Aram Hăvărneanu, golang-dev
For Solaris, Bryan Cantrill even offered help: https://twitter.com/bcantrill/status/507625265103511552

If Plan 9 won't run on GCE natively, it could surely run on QEMU x86. That's probably not as painfully slow as ARM emulation, I hope.

Aram Hăvărneanu

Sep 5, 2014, 3:30:25 PM
to Russ Cox, Dave Cheney, br...@joyent.com, Brad Fitzpatrick, golang-dev
> Maybe you're fine, but "we" are not. Right now it's actually very hard for a
> developer who has broken a build to get to a Plan 9 or Solaris and see
> what's going on.

He can ask me, or minux, or Dave, or Devon; all that's needed is an ssh
public key. But okay: if the status quo is so terrible and this new
way of doing things becomes a mandatory requirement, we'll implement
it, sure.

> It is too bad that Plan 9 doesn't work with GCE. Maybe that can be fixed.

Bell Labs Plan 9 lacks both network and disk drivers. 9front has a disk
driver. There is a very experimental network driver around, but it
doesn't work very well.

One note: today, David and Nick update the Plan 9 systems daily. It is
hard to see how this very useful property will be preserved without a
lot of work, since GCE doesn't come with a Plan 9 image.

> What about Solaris?

I am sure Solaris would work on GCE; however, the previous point still
stands. Right now updating the system is someone else's job; it would
have to become my job, so I have to pass. I can do whatever is required
to spin up SmartMachine instances on Joyent's public cloud if that's
acceptable. Those are always up to date, and SmartOS was always the
primary target anyway.

--
Aram Hăvărneanu

Russ Cox

Sep 5, 2014, 3:36:04 PM
to Aram Hăvărneanu, Dave Cheney, br...@joyent.com, Brad Fitzpatrick, golang-dev
On Fri, Sep 5, 2014 at 3:30 PM, Aram Hăvărneanu <ara...@mgk.ro> wrote:
>> Maybe you're fine, but "we" are not. Right now it's actually very hard for a
>> developer who has broken a build to get to a Plan 9 or Solaris and see
>> what's going on.

> He can ask me, or minux, or Dave, or Devon; all that's needed is an ssh
> public key. But okay: if the status quo is so terrible and this new
> way of doing things becomes a mandatory requirement, we'll implement
> it, sure.

Yes, well, if we just publish the docker images then there's no need to ask permission or be in the right club. This is all about making things as easy as possible. I understand that you think they are easy enough. I do not.

Russ

Brad Fitzpatrick

Sep 5, 2014, 3:50:29 PM
to Aram Hăvărneanu, Russ Cox, Dave Cheney, br...@joyent.com, golang-dev
Um, we'd automate it.

I was talking to mdempsky the other day about having an openbsd-amd64-nightly builder where the kernel & userspace it runs are compiled from OpenBSD each night.

ron minnich

Sep 5, 2014, 4:46:39 PM
to Brad Fitzpatrick, Russ Cox, golang-dev
tl;dr for anyone not interested in Plan 9.

We've got a semi-automated setup for akaros, and since akaros uses a
lot of plan 9 code (network stack, name space, mnt device, utilities,
etc.) and we will have some of the same issues w.r.t. things like GCE,
I thought I'd mention what we're doing.

I currently run my tests in a docker instance with the standard docker
ubuntu image and kvm on my chromebook, so the experience I'm having
may be applicable to GCE.

To run our tests, we crossbuild, fire up a go9p server (we use
github.com/rminnich/go9p), and boot akaros in qemu optionally with
kvm. [FWIW we're passing almost everything at this point].

We have a script running on the Akaros side that ipconfig's the
network stack, runs srv and does a mount to get the go tree available,
and fires up a listen1 in another script. The listen1 ties incoming
calls to a shell (ash in our case; it would be rc in Plan 9), which in
turn lets us use a simple shell script on the linux side to kick off
go tests in the akaros instance. This script would run just fine on
Plan 9 with very minor changes; there's nothing special to it.

At that point, on the linux side, we can type

  go test

or

  go test whatever

and it looks like a local go test, but it runs on the guest (Akaros).
It takes a little longer, of course. And, trust me, it has worked well
enough to expose all our bugs, races included. We just got past the
one related to TLS server sockets closing in unexpected ways.

Given that all this was built with standard plan 9 tools, name spaces,
and network stack, I suspect it could be made to work without undue
effort on GCE. The key is that you don't need to boot plan 9 as a gce
instance; you fire up a docker in gce and run Plan 9 as a guest in
that. The Plan 9 instance is controlled from the linux side. We use
netcat to issue commands and get text back from the listen1. The
current setup is more plumbing than you want, I'm sure, but it shows
what's possible with a little work. It may be about the level of
effort it took for Windows.

Our setup won't help much if you're hankering to test building under
Plan 9, but the approach of firing up plan 9 in a ubuntu docker
instance is a useful model. You could then control hg commands and the
build sequence from a Go program running on the Linux side, again via
the socket to listen1.
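
For flavor, the Linux-side control channel is essentially just this
(the address, port, and paths are invented for illustration):

# Sketch: drive the guest by piping commands to the shell behind
# listen1. Address, port, and paths are made up.
GUEST=192.168.0.2
PORT=5640
echo 'cd /usr/go/src && ./all.rc' | nc "$GUEST" "$PORT"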

I doubt this makes much sense to non-plan9 folks, but I hope it makes
sense to someone :-)

Short form: Plan 9 is doable, and you don't need GCE to support Plan 9
images directly. An indirect setup is doable. Nested virtualization
will make it perform better, but I don't know if GCE will do that.
Given the number of Plan 9 people who have been using qemu to run Plan
9 for about 10 years now, I suspect that qemu is 'good enough' for
this purpose.

ron

David du Colombier

Sep 5, 2014, 7:06:29 PM
to Brad Fitzpatrick, Aram Hăvărneanu, Russ Cox, Dave Cheney, br...@joyent.com, golang-dev
A QEMU image with Plan 9 and Go has been available
for a while:

http://9legacy.org/download/plan9-go.img.bz2

I've never tried GCE, but I think the issue is the lack of virtio
disk support. Fortunately, a driver is already available as part
of 9front. I don't know about the network, however. Does GCE
provide virtio or Intel GbE?

I think these issues could be worked on and we could have
Plan 9 working on GCE someday.

--
David du Colombier

Anthony Martin

Sep 5, 2014, 8:16:24 PM
to David du Colombier, Brad Fitzpatrick, Aram Hăvărneanu, Russ Cox, Dave Cheney, br...@joyent.com, golang-dev
David du Colombier <0in...@gmail.com> once said:
> I don't know about the network, however. Does GCE provide virtio
> or Intel GbE?

Only the virtio ethernet controller:
https://developers.google.com/compute/docs/images#providedkernel

> I think these issues could be worked on and we could have
> Plan 9 working on GCE someday.

Agreed.

Anthony

ron minnich

Sep 5, 2014, 10:45:41 PM
to Anthony Martin, David du Colombier, Brad Fitzpatrick, Aram Hăvărneanu, Russ Cox, Dave Cheney, br...@joyent.com, golang-dev
I did a virtio console and ethernet driver for the lguest port of
Plan 9; it might be a useful starting point.

ron

Aram Hăvărneanu

Sep 6, 2014, 10:04:05 AM
to ron minnich, Anthony Martin, David du Colombier, Brad Fitzpatrick, Russ Cox, Dave Cheney, br...@joyent.com, golang-dev
> I am sure Solaris would work on GCE

I spoke too soon. It does not: virtio-scsi and virtio-net
drivers are lacking.

Brad on twitter said:
> @bcantrill @aramh Bryan, I actually don't care if it runs on Google stuff.
> Just need to make an RPC to create a fresh container & run code.

I'll deal with this (using manta).

--
Aram Hăvărneanu

andrewc...@gmail.com

Sep 6, 2014, 10:32:40 AM
to golan...@googlegroups.com

One thing that just bit me after testing some stuff: my whole SSD filled up with terminated Docker containers, which I had to remove by hand. I didn't realize Docker holds onto terminated containers until you explicitly remove them.
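
The cleanup is a one-liner, something like the following (assuming a
docker version that supports the status filter; older versions can use
plain 'docker ps -a -q'):

# Remove all exited containers. docker rm will simply refuse to remove
# any container that is still running.
docker rm $(docker ps -a -q -f status=exited)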





gov...@ver.slu.is

Sep 7, 2014, 1:45:48 PM
to golan...@googlegroups.com, andrewc...@gmail.com
On Saturday, September 6, 2014 4:32:40 PM UTC+2, andrewc...@gmail.com wrote:

> One thing that just bit me after testing some stuff: my whole SSD filled up with terminated Docker containers, which I had to remove by hand. I didn't realize Docker holds onto terminated containers until you explicitly remove them.

That bit me too when I first started using Docker. Luckily that won't be a problem with ephemeral cloud machines, though.

David du Colombier

Sep 7, 2014, 4:36:07 PM
to Brad Fitzpatrick, ron minnich, Anthony Martin, Aram Hăvărneanu, Russ Cox, Dave Cheney, br...@joyent.com, golang-dev
I'm now able to build a Plan 9 image for QEMU with
virtio-net-pci and virtio-scsi-pci devices.

I'm booting the disk image with:

qemu-kvm -net user -net nic,model=virtio -m 2048 -vga std \
    -drive if=none,id=hd,file=plan9-gce.img \
    -device virtio-scsi-pci,id=scsi \
    -device scsi-hd,drive=hd

Nick Owens kindly contributed a Virtio Ethernet driver
and I used the disk Virtio driver from 9front.

Basically, one just has to apply the following patches
on Plan 9 (in this order):

http://www.9legacy.org/9legacy/patch/pc-pcflop-sdiahci.diff
http://www.9legacy.org/9legacy/patch/pc-sdvirtio2.diff
http://www.9legacy.org/9legacy/patch/pc-conf-sdvirtio3.diff
http://www.9legacy.org/9legacy/patch/pc-ethervirtio.diff
http://www.9legacy.org/9legacy/patch/pc-conf-ethervirtio.diff

This is still experimental, but it seems to work fine so far.

Next, I will try to run the disk image on GCE.
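
For reference, importing a custom image into GCE is roughly the
following; the bucket and image names are made up, and the exact
flags may vary by gcloud version:

# GCE expects the raw disk to be named disk.raw inside a gzipped
# tarball uploaded to Cloud Storage.
cp plan9-gce.img disk.raw
tar -Szcf plan9-gce.tar.gz disk.raw
gsutil cp plan9-gce.tar.gz gs://my-plan9-images/
gcloud compute images create plan9 \
    --source-uri gs://my-plan9-images/plan9-gce.tar.gz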

--
David du Colombier

David du Colombier

Sep 9, 2014, 8:36:48 AM
to Brad Fitzpatrick, ron minnich, Anthony Martin, Aram Hăvărneanu, Russ Cox, Dave Cheney, br...@joyent.com, golang-dev
After some tweaks to the sdvirtio driver, I've been able to
boot Plan 9 on GCE. However, some work is still needed to get
the network working.

Booting from Hard Disk...
Booting from 0000:7c00
i8042: kbdinit failed
pcirouting: BIOS workaround: PCI.0.1.3 at pin 1 link 96 irq 10 -> 9

no vga; serial console only
disk loader

cpu0: 2599MHz GenuineIntel Core i7 (cpuid: AX 0x206D7 DX 0xF8BFBFF)
ELCR: 0C00
497M memory: 497M kernel data, 0M user, 18M swap
found partition #S/sd01/data 0 20,971,520
disks: sd01
trying sd01....found 9pcf
.1132393..........................................................................................................................................+2031880........................................................................................................................................................................................................................................................+457548=3621821
entry: 0xf0100020

Plan 9
i8042: kbdinit failed
E820: 00000000 0009fc00 memory
E820: 0009fc00 000a0000 reserved
E820: 000f0000 00100000 reserved
E820: 00100000 bfffe000 memory
E820: bfffe000 c0000000 reserved
E820: fffbc000 100000000 reserved
E820: 100000000 130000000 memory
cpu0: 2595MHz GenuineIntel Core i7 (cpuid: AX 0x206D7 DX 0xF8BFBFF)
ELCR: 0C00
#l0: ethervirtio: 100Mbps port 0x0 irq 11: 42010af0f952
3072M memory: 256M kernel data, 2815M user, 3440M swap
usbinit...usbd.../boot/usbd: /dev/usb: no hubs
no /srv/usb...no usb disk...pickmethod...read #e/nobootprompt...pickmethod done
bind #æ...bind #S...partinit...auth...usbinit...usbd.../boot/usbd:
/dev/usb: no hubs
no /srv/usb...no usb disk...mount usbd...boot: can't open /srv/usb:
'/srv/usb' file does not exist
time...
fossil(#S/sd01/fossil)...version...can't stat /srv/partfs.sdXX:
'/srv/partfs.sdXX' file does not exist

init: starting /bin/rc
term%

--
David du Colombier

Sebastien Douche

Sep 9, 2014, 9:36:09 PM
to ron minnich, golang-dev
On Fri, Sep 5, 2014 at 10:46 PM, ron minnich <rmin...@gmail.com> wrote:
> tl;dr for anyone not interested in Plan 9.

Naive question (no offense): is Plan 9 support useful? Also, the
build bot seems to be broken all the time.


--
Sebastien Douche <sdo...@gmail.com>
Twitter: @sdouche / G+: +sdouche