Summary: We currently have 5 CI systems: Travis, Azure, Gitlab, AppVeyor,
and DrDr. I explain what I have done so far in Gitlab and propose
unifying these into a single solution in the future. I welcome concerns,
suggestions and comments.
Long version:
A few years ago GitHub CI through Travis was lacking important features (for a
start it only supported two OS images and one architecture) and many,
many other systems provided much better solutions. I had worked in
industry with many systems including Gitlab[1], buildbot[2] and
Jenkins[3]. Jenkins-pre2 was pretty buggy - I went through the pain
of setting up a large CI system for internal development tools at one of
my previous clients and it scarred me for life. Jenkins2 had just come
out by then and looked better and shinier, but I never gave it a
go. Buildbot OTOH is not a CI system per se but more of a framework from
which you create your system in Python. I wrote a prototype of a
buildbot-based CI system for GCC around 2017[3] and many other
systems have been using buildbot like llvm[4], webkit[5] and gdb[6] to name just a
few. A little later, Gitlab integrated a CI solution and I
started to use it with Racket in a Gitlab fork. Later, when Gitlab
released CI for Github projects, I worked with SamTH to get a Gitlab
solution for the official Racket tree. Nowadays you can see it working under
gitlab.com/racket/racket with its configuration in
https://github.com/racket/racket/blob/master/.gitlab-ci.yml.
Here's what it does at this point (currently all configurations are
running on Linux):
On x86_64:
1. Builds RacketCGC -> Racket3M -> RacketCS
2. Runs tests on all three variants (similar to those run by Travis)
3. Builds all the variants once again and tests them with --enable-ubsan,
keeping a record of the runtime errors (for example:
https://gitlab.com/racket/racket/-/jobs/350271624/artifacts/file/runtime-errors.log)
4. Builds all the variants once again with the LLVM static analyser and
keeps a record of the failures (for example:
https://gitlab.com/racket/racket/-/jobs/350271564/artifacts/file/scan-report_mmm/2019-11-14-032030-5552-1/index.html).
This step requires us to build LLVM with Z3 enabled so we can use the
work from ICSE'19[7].
On armv7l (arm32 with hard float):
1. Builds RacketCGC -> Racket3M
2. Runs tests
The above pipeline takes 1h10m[8] to run through on my machines.
Every night, in addition to the above, we extend the pipeline with:
1. Emulated builds of CGC and 3M on arm64, armel (32-bit little endian,
soft floating point), armhf (32-bit little endian, hard floating point),
i386, mips (32-bit big endian), mips64el (64-bit little endian) and
mipsel (32-bit little endian).
2. Same as above but configured with --enable-generations=no (since
https://github.com/racket/racket/commit/7c3a207f36dc25baaac4afdf7ecedc18bf9ff49c).
The above pipeline takes 4h to run since it also compiles QEMU 3.1.0
beforehand (the QEMU shipped in the Debian container is too old). Both
this build and the LLVM build are cached, so each is only really built
once until the versions are changed. QEMU 4.1.0 shows good results[9] so
I will upgrade soon.
The biggest problem with the Gitlab pipeline is that it worked _really_
well until I started wanting to optimize it. For example, I wanted a
stage-less pipeline where each job waits only for the few jobs it
depends on in previous stages instead of waiting for all of them. Gitlab
is finally catching up to this with the `needs` keyword but the
interface becomes a mess. Over all the time I have spent on Gitlab CI,
I spent almost as much time trying to figure out why some things break
or don't work[10, 11] as I spent configuring Racket. Gitlab CI for simple
projects is great but it just gets harder and harder as the pipeline
complexity grows.
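To make the stage-less idea concrete, here is a minimal sketch in plain Python (it is not Gitlab's scheduler). The job names and durations are made up for illustration, and it assumes there are enough runners for every ready job to start immediately. It compares finish times when stages act as barriers with finish times when each job waits only for the jobs it needs.

```python
from typing import Dict, List

# Hypothetical job durations in minutes (illustrative, not measured).
DURATION: Dict[str, int] = {
    "build-cgc": 10,
    "build-ubsan": 30,
    "test-cgc": 15,
    "test-ubsan": 20,
}

# Stage-based view: every job in a stage waits for the whole previous stage.
STAGES: List[List[str]] = [
    ["build-cgc", "build-ubsan"],
    ["test-cgc", "test-ubsan"],
]

# Dependency-based view: each job lists only the jobs it actually needs.
NEEDS: Dict[str, List[str]] = {
    "build-cgc": [],
    "build-ubsan": [],
    "test-cgc": ["build-cgc"],
    "test-ubsan": ["build-ubsan"],
}


def finish_times_staged(stages: List[List[str]],
                        duration: Dict[str, int]) -> Dict[str, int]:
    """Finish time of each job when each stage is a barrier."""
    finish: Dict[str, int] = {}
    barrier = 0  # time at which the previous stage fully completed
    for stage in stages:
        for job in stage:
            finish[job] = barrier + duration[job]
        barrier = max(finish[job] for job in stage)
    return finish


def finish_times_needs(needs: Dict[str, List[str]],
                       duration: Dict[str, int]) -> Dict[str, int]:
    """Finish time of each job when it starts as soon as its needs are done."""
    finish: Dict[str, int] = {}

    def visit(job: str) -> int:
        if job not in finish:
            start = max((visit(dep) for dep in needs[job]), default=0)
            finish[job] = start + duration[job]
        return finish[job]

    for job in needs:
        visit(job)
    return finish
```

With these made-up numbers, `test-cgc` no longer waits for the slow `build-ubsan` job in the dependency-based view, while the overall pipeline length stays the same because `test-ubsan` is on the critical path either way.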
I have slowly been gathering a few machines to test Racket and other
work-related projects, so I have quite a few machines/boards of varied
architectures (arm32, arm64, x86_64, mipsel). I also got a Windows
machine 2 months ago to test Racket on Windows10 (also something coming
soon [12]). To run Gitlab CI for Racket on your own machine, you simply
need to install `gitlab-runner` and connect it to the project
appropriately. However, I just got an rpi4 with 4GB and found out I
cannot use it because gitlab-runner doesn't run on arm64 yet (Go
apparently doesn't support arm64 yet). So that's another bummer.
Lately I noticed that Gitlab CI for Github projects (what we use for
Racket) doesn't support, afaict, running the pipelines on PRs. And if it
did, it probably wouldn't support running a special faster pipeline so
that the PR author understands whether the PR is breaking something.
All in all, we have outgrown Gitlab CI and I would like to spend more of
my free time working on an improved GC for RacketCS than on fixing
GitlabCI or working around it. I also think it is a waste of resources
to run so many CI systems simultaneously sometimes doing the same thing.
My next CI project was going to be supporting benchmarking and
developing a Racket Dashboard Webapp that displays the important results
of CI in a visually appealing way that's easy to understand. I take some
time every day to look at the Gitlab CI pipeline and ensure that all the
yellows (expected failures) are the ones we already knew were going to
fail instead of some new failure that ends up being categorized in the
same way. Again, the interface doesn't help when you have 20+ jobs.
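The daily check of the yellows could in principle be automated. The following is only a sketch under assumptions: the job names, the status strings ("passed"/"failed") and the idea of keeping a known-failure list are all hypothetical and do not correspond to any existing Gitlab or Buildbot API.

```python
from typing import Dict, Set

# Hypothetical list of jobs we already know are going to fail.
EXPECTED_FAILURES: Set[str] = {"test-mips", "test-armel"}


def unexpected_failures(results: Dict[str, str],
                        expected: Set[str]) -> Set[str]:
    """Jobs that failed without being on the known-failure list."""
    failed = {job for job, status in results.items() if status == "failed"}
    return failed - expected


def unexpected_passes(results: Dict[str, str],
                      expected: Set[str]) -> Set[str]:
    """Known failures that now pass, so the list can be trimmed."""
    return {job for job in expected if results.get(job) == "passed"}
```

A dashboard would then only need to highlight the two returned sets instead of requiring someone to scan every job by eye.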
My proposal is to rewrite the current Gitlab CI pipeline using
Buildbot and take it from there. This means writing Python but maybe
with some luck parts of it can be written in Racket and interfaced with
Python if necessary (can Pycket help here?).
Buildbot runs on all the architectures Python runs on, which covers all
the ones we are interested in, and deploying it is as easy as it is with
Gitlab. Granted, the code won't look like a yaml file anymore, but I am
pretty sure that the Python code would be more readable than the current
~1300-line yaml file we have to configure our pipeline.
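One reason a Python description might end up shorter is that a job matrix can be generated in a loop rather than written out entry by entry. The sketch below is plain Python and deliberately does not use the Buildbot API; the dictionary keys and the job naming scheme are hypothetical. The architectures, variants and the `--enable-generations=no` flag are the ones from the nightly pipeline described earlier.

```python
from itertools import product
from typing import Dict, List

# Emulated architectures and variants from the nightly pipeline.
ARCHS = ["arm64", "armel", "armhf", "i386", "mips", "mips64el", "mipsel"]
VARIANTS = ["cgc", "3m"]
# True: default configuration; False: built with --enable-generations=no.
GENERATIONS = [True, False]


def nightly_jobs() -> List[Dict[str, object]]:
    """Describe every nightly emulated build as a small dictionary."""
    jobs: List[Dict[str, object]] = []
    for arch, variant, generations in product(ARCHS, VARIANTS, GENERATIONS):
        flags = [] if generations else ["--enable-generations=no"]
        suffix = "" if generations else "-nogen"
        jobs.append({
            "name": f"{variant}-{arch}{suffix}",  # hypothetical naming scheme
            "arch": arch,
            "variant": variant,
            "configure_flags": flags,
        })
    return jobs
```

Each dictionary would still have to be turned into a real builder definition, so this only illustrates the shape of the idea, not how much code the final configuration would need.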
Once Buildbot has the same features as Gitlab CI, I will extend it to
ensure architectures tested with Azure, Travis, and AppVeyor are
covered. At this point we could potentially switch off other systems.
DrDr seems to be a different beast and much harder to replace, so
before I go there I will sync with the rest of the team. I still
think that having a unified system and interface could be the way to
go.
If you are happy with my proposal, I will go ahead and start a new
project on GitHub: racket-buildbot. Once we get this to a stable point,
we could merge this into the racket tree and remove .gitlab, etc.
At this point, I welcome any comments and suggestions. Having good CI
means that in the long term we can ensure that Racket keeps running
on all supported platforms (and, once benchmarking is done, track how
Racket's performance changes over time). So having good CI is important.
However, it is only relevant if it is useful to the racket team and
contributors. It would be great if everyone involved could chime in with
what they would like to have/see. Feel free to request whatever you
want; I cannot promise to implement all of it, but I can make a list.
Refs:
[1] https://gitlab.com
[2] https://buildbot.net
[3] https://jenkins.io
[4] http://lab.llvm.org:8011/
[5] https://build.webkit.org/
[6] https://gdb-buildbot.osci.io/#/
[7] https://dl.acm.org/citation.cfm?id=3339673
[8] https://gitlab.com/racket/racket/pipelines/95803835
[9] https://github.com/LinkiTools/racket/tree/pmatos-qemu-410
[10] https://gitlab.com/gitlab-org/gitlab/issues?scope=all&utf8=%E2%9C%93&state=opened&author_username=pmatos
[11] https://gitlab.com/gitlab-org/gitlab-runner/issues?scope=all&utf8=%E2%9C%93&state=opened&author_username=pmatos
[12] https://github.com/LinkiTools/racket/tree/pmatos-ci-win10
Thanks for reading this,
--
Paulo Matos