CI improved for Racket


Paulo Matos

Apr 2, 2019, 4:00:05 AM
to racke...@googlegroups.com
Hello,

Short summary: in commit 35d269c29 [1] I have added cross-architecture testing
using virtualized qemu machines. There are still problems, and we need to fix them.

Long Story:

For months now, I have been wishing I could get cross-arch testing done
on a regular basis for Racket. Initially I had something set up privately
for RISC-V, but I quickly noticed that the framework could be extended to
other architectures.

Thanks to Sam, I got permission to set up gitlab.com/racket/racket and
get things moving. It took a couple of months to get everything right,
not so much because of inherent CI problems, but because I first had to
report a couple of Gitlab issues, debug qemu, and set up a few of my
machines for this.

The important things are:
- With testing running on gitlab, people who would like to contribute
CPU time to Racket can do so by setting up a gitlab runner on their
machine (contact me for help). Because Gitlab CI's free machines have a
maximum timeout that is enough for normal testing but not for
virtualization, I needed to add some extra machines for these specific
jobs. Besides the Gitlab CI machines, we have a 4-CPU x86_64, a 16-CPU
x86_64, and an rpi3 running in my server room. Of course, with more
machines, more tests can run simultaneously, providing quicker feedback.
- Matthew pointed out to me a few archs Racket should support, so I added those:

Testing added for Racket:
Native: armv7l (running on rpi3), x86_64
Emulated: arm64, armel, armhf, i386, mips, mips64el, mipsel, ppc64el, s390x

Testing added for Racket CS:
Native: x86_64
Emulated: i386
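For anyone curious what one of these emulated jobs roughly looks like, here is a minimal sketch of a .gitlab-ci.yml job. The job name, tag, image, and build commands are all hypothetical placeholders; the real configuration is in the commit [1].

```yaml
# Hypothetical sketch, not the actual configuration from 35d269c29.
# Assumes a runner tagged "qemu" on a host with qemu user-mode emulation
# and binfmt_misc configured, so the foreign-arch container runs transparently.
build-mipsel:
  stage: build
  image: multiarch/debian-debootstrap:mipsel-stretch  # placeholder image
  tags:
    - qemu           # route the job to one of the dedicated machines
  script:            # placeholder build/test commands
    - ./configure
    - make
    - make check
```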

- There are problems. Initially, because so many of the architectures
failed either to compile or to test, I assumed this was a qemu bug.
Since I am not a virtualization expert, it took me a few days and some
help from the qemu people to set up an environment to debug qemu inside a
chroot inside a docker container running Racket on a different arch.
After some analysis, it turned out the segfault during compilation was
definitely coming from Racket [5]. In a discussion, Matthew proposed
that I disable the generational GC to ease debugging of the problem.
It turns out that disabling it caused the sigsegv not to occur any
more, so at this point I think we are in the realm of a problem in
Racket. I haven't gotten to the bottom of this yet, but hopefully when I
do we can get all the lights green in the cross-arch testing.

There are a few things I would like to do in the future, like running
benchmarks on a regular basis for Racket and RacketCS and having them
displayed on a dashboard, but those will come later. First I would like
to look into these failures, which might be related to #2018 [2] and #1749 [3].

Lastly, this is another opportunity to help fix some Racket issues and
get involved. If you are into different archs, debugging, and
contributing, take a look at the logs coming out of the pipelines [4].

If you need some help or clarification on any of this, let me know.

[1]
https://github.com/racket/racket/commit/35d269c29eee6f6f7f3f83ea6f01b92ae1db180a
[2] https://github.com/racket/racket/issues/2018
[3] https://github.com/racket/racket/issues/1749
[4] https://gitlab.com/racket/racket/pipelines/
[5] https://gitlab.com/racket/racket/-/jobs/188658454

--
Paulo Matos

Alexis King

Apr 9, 2019, 1:44:52 PM
to Paulo Matos, racke...@googlegroups.com
Hi Paulo,

The work you’re doing is really cool, though I admit most of it is over my head. Thank you for putting in the time to set it all up. One thing I have noticed, however, is that the GitLab pipeline seems to almost always fail or time out, which causes almost every commit on the commits page of the GitHub repo[1] to be marked with a loud, red failure indicator.

I don’t understand what you’re doing well enough to say whether this is because something is going wrong in the CI scripts themselves or because they are (correctly) detecting that Racket doesn’t currently support some of the tested architectures. But in either case, while the testing of those architectures is very nice to have, it seems extreme to cause the whole commit to be marked as a failure every time for things that (correct me if I’m wrong) seem unlikely to be changed/fixed in the immediate future.

For the Travis builds, we have a job that tests RacketCS, which currently always fails, but we have the CI configured to ignore the failure of that particular job when deciding whether or not to say the overall commit passed. Is there some way something similar could be done with the GitLab pipeline? Running all those jobs is valuable, in the same way that the RacketCS build is, it’d just be nice to avoid making the at-a-glance commit status meaningless. And just as we will surely promote the RacketCS job from an “allowed failure” to an ordinary job once it passes consistently, we would of course do the same for the various architecture jobs as well.
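For reference, the GitLab counterpart of Travis's allowed failures is the per-job `allow_failure` key in .gitlab-ci.yml. A minimal sketch, with a hypothetical job name and script:

```yaml
# Hypothetical job: runs and reports its result, but a failure
# does not mark the whole pipeline (and thus the commit) as failed.
test-s390x:
  script:
    - make check
  allow_failure: true
```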

Thanks,
Alexis

[1]: https://github.com/racket/racket/commits/master

Paulo Matos

Apr 10, 2019, 3:15:02 AM
to Alexis King, racke...@googlegroups.com


On 09/04/2019 19:44, Alexis King wrote:
> Hi Paulo,
>

Hi Alexis,

> The work you’re doing is really cool, though I admit most of it is over my head. Thank you for putting in the time to set it all up. One thing I have noticed, however, is that the GitLab pipeline seems to almost always fail or timeout, which causes almost every commit on the commits page of the GitHub repo[1] to be marked with a loud, red failure indicator.
>

Thanks for your email. What you say is correct, and this is an issue
close to my heart that I wanted to see sorted out. This email is finally
the poke that will get me to do it. Apologies for not having done it
earlier.

> I don’t understand what you’re doing well enough to say whether or not this is because something is going wrong in the CI scripts itself or because they are (correctly) detecting that Racket doesn’t currently support some of the tested architectures. But in either case, while the testing of those architectures is very nice to have, it seems extreme to cause the whole commit to be marked as a failure every time for things that (correct me if I’m wrong) seem unlikely to be changed/fixed in the immediate future.
>

There are a few issues with compiling on other archs; that's what these
jobs capture. #2018 is one of the main ones, and I have been looking at
it with Matthew and Sam, but it's turning out to be a major pain. Other
archs reveal similar behaviour.

As you say, this shouldn't cause commits to get the red cross.

> For the Travis builds, we have a job that tests RacketCS, which currently always fails, but we have the CI configured to ignore the failure of that particular job when deciding whether or not to say the overall commit passed. Is there some way something similar could be done with the GitLab pipeline? Running all those jobs is valuable, in the same way that the RacketCS build is, it’d just be nice to avoid making the at-a-glance commit status meaningless. And just as we will surely promote the RacketCS job from an “allowed failure” to an ordinary job once it passes consistently, we would of course do the same for the various architecture jobs as well.
>

Yes, that's partially the solution. Currently I don't have enough
machines or AWS time to dedicate to Racket builds, so I will instead do
the following straight away:
- Mark regularly failing jobs as 'can fail' until they no longer fail,
and then remove the flag.
- Move long-running jobs, or jobs for which I don't have enough machines
immediately available, to run nightly only.
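A sketch of what these two measures could look like in .gitlab-ci.yml (job names are hypothetical; `only: schedules` restricts a job to scheduled pipelines, such as a nightly run):

```yaml
# Hypothetical examples of the two measures above.
test-ppc64el:
  script:
    - make check
  allow_failure: true   # 'can fail': a red job no longer turns the commit red

build-mips64el:
  script:
    - make
  only:
    - schedules         # run only in the nightly scheduled pipeline,
                        # not on every push
```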

In the long term I would like CI jobs to finish in a respectable time:
under 1 hour, or even under 30 minutes. I would like all archs tested
with no failures. This will take some time, but we'll get there.

> Thanks,
> Alexis
>

Thanks for the suggestions and the poke. Now I am off to make Racket
green again.
--
Paulo Matos

Paulo Matos

Apr 10, 2019, 3:53:15 AM
to Alexis King, racke...@googlegroups.com
I have finished pushing the changes I mentioned in my previous message.

Hopefully things will improve. Again, I am sorry for leaving Racket in
such a red state for the past couple of weeks. :)

jackh...@gmail.com

Apr 11, 2019, 5:01:07 AM
to Racket Developers
On Wednesday, April 10, 2019 at 12:15:02 AM UTC-7, Paulo Matos wrote:
> Currently I don't have enough machines or AWS time to dedicate to Racket builds

How much do you need?

Paulo Matos

Apr 11, 2019, 5:21:58 AM
to jackh...@gmail.com, Racket Developers
There are really two types of needs here:
- Different-architecture machines: ppc64, mips32, mips64, arm.
- x86_64 with several cores: speaking in terms of cores, having 16
more cores would be great. The machines also wouldn't need to be on all
the time, if that's not possible. Let's say you have a spare machine: you
could start the gitlab-runner only during certain periods, and gitlab
would schedule the jobs appropriately.
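As a sketch of the part-time idea: assuming gitlab-runner is installed as a service on the spare machine, a crontab could bring it online only overnight, and jobs queued during the day would be picked up once it starts. The times below are made up.

```
# Hypothetical crontab entries on a donor machine.
0 22 * * * gitlab-runner start   # make the runner available at 22:00
0 6  * * * gitlab-runner stop    # take it offline again at 06:00
```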

With regard to the number of cores, the more, the merrier. Running qemu
to emulate the architectures above is slow, and a lot of the jobs have to
wait because I only have 2 machines (totalling 20 cores) assigned to
this, and there are lots of jobs that need to run.

Regards,

--
Paulo Matos

Jack Firth

Apr 11, 2019, 5:56:38 AM
to Paulo Matos, Racket Developers
So about 30-40 total cores for that second category? About how much total RAM is needed? I'm assuming Gitlab owns all persisted data, so there's not much need for these machines to have non-transient disk storage. How much does resource density matter, in terms of ideal cores-and-RAM-per-machine? Can you run the CI tasks effectively across several (>=8) small machines (<=4 cores, 4GB RAM) instead of a few big ones? How bursty is the CI workload, and would the machines spend a lot of time sitting idle?

It might be feasible for me to donate a Kubernetes cluster to the cause (but I make no promises).

Paulo Matos

Apr 11, 2019, 9:32:55 AM
to Jack Firth, Racket Developers


On 11/04/2019 11:56, Jack Firth wrote:
> So about 30-40 total cores for that second category?

That would be awesome! :)

> About how much
> total RAM is needed?

Someone might correct me here, but from what I can see it would be great
to have something like 2G/core; if not, it shouldn't be a problem.
One can always force certain jobs onto certain hosts through the use of
tags, so you can run 2 RAM-hungry jobs, or 1 RAM-hungry job and 4 where
RAM doesn't matter so much.
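The tag routing mentioned here is standard GitLab CI: runners are registered with tags, and jobs list the tags they require. A minimal sketch with hypothetical names:

```yaml
# Hypothetical: a runner on a 64G host is registered with the tag
# "big-ram"; a memory-hungry job then opts into that host.
build-racket-full:
  tags:
    - big-ram
  script:
    - make
```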

> I'm assuming Gitlab owns all persisted data so
> there's not much need for these machines to have non-transient disk
> storage.

No, currently I have a central cache for the gitlab machines locally.
However, I have an S3 bucket in place (Google would also be possible,
but I would need to check) that I will sponsor to hold the cache between
the machines.
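For the curious, a shared cache like this is configured on the runner side, in gitlab-runner's config.toml, rather than in .gitlab-ci.yml. A sketch of an S3-backed distributed cache section; the bucket name and region are placeholders:

```toml
# Hypothetical [runners.cache] fragment for gitlab-runner's config.toml.
[runners.cache]
  Type = "s3"
  Shared = true          # share the cache between all registered runners
  [runners.cache.s3]
    ServerAddress = "s3.amazonaws.com"
    BucketName = "racket-ci-cache"   # placeholder bucket name
    BucketLocation = "eu-central-1"  # placeholder region
```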

> How much importance does resource density play, in terms of
> ideal cores-and-RAM-per-machine? Can you run the CI tasks effectively
> across several (>=8 nodes) small machines (<=4 cores, 4GB RAM) instead
> of a few big ones? How bursty is the CI workload, and would the machines
> spend a lot of time sitting idle?
>

So here it depends. There are 14 jobs at the moment that cannot go to
the free gitlab machines. These are building Racket in one way or
another. I would say it is best to build Racket with at least 4 cores,
so depending on the available hardware I would split the jobs the best
possible way. With a single 20-core machine with 64G RAM I could run 5
builds/jobs in parallel, but if I only had 16G RAM then it might be
better not to do 5 in parallel. If RAM is limited per machine, it's
better to have more machines. The machines will sit idle sometimes, but
not often, because some jobs just take a long time. So if Matthew pushes
something every two hours, jobs start to queue up. They might sit idle
during weekends, though, when people don't generally work.

If you have a specific suggestion, I could tell you what would work
and what wouldn't, but in any case I already want to say thank you for
your interest in helping the CI process.

> It might be feasible for me to donate a Kubernetes cluster to the cause
> (but I make no promises).
>

Thanks. I don't know much about kubernetes (still at the docker level...
:)), but I saw the name thrown around in the gitlab docs, so it should
be fine integrating such a cluster with CI.

Paulo


--
Paulo Matos

Paulo Matos

Apr 11, 2019, 10:58:52 AM
to Jack Firth, Racket Developers


On 11/04/2019 15:32, 'Paulo Matos' via Racket Developers wrote:
>
> Thanks. I don't know much about kubernetes (still at the docker level...
> :)) but I saw the name thrown around in the gitlab docs so it should be
> fine integrating such a cluster with CI.
>

Now I know where I keep seeing the kubernetes mention: it's the gitlab
runner registration page. Screenshot attached; it seems to be easy to
install a gitlab-runner in one.

--
Paulo Matos
2019-04-11-165629_517x377_scrot.png

Jack Rosenthal

Apr 12, 2019, 1:45:38 AM
to Paulo Matos, Jack Firth, Racket Developers
On Thu, 11 Apr 2019 at 15:32 +0200, 'Paulo Matos' via Racket Developers wrote:
> No, currently I have a central cache for the gitlab machines locally.
> However, I have in place an S3 bucket (Google would also be possible
> but would need to check) I will sponsor to hold the cache between the
> machines.

I can get you $500 GCP credit if you'd like. Let me know if you want
this... even if you just want it to mess around and see how it'd work
out.

--
Jack M. Rosenthal
http://jack.rosenth.al

Software efficiency halves every 18 months, compensating Moore's law.


Paulo Matos

Apr 12, 2019, 1:58:11 AM
to Jack Rosenthal, Jack Firth, Racket Developers


On 12/04/2019 07:45, Jack Rosenthal wrote:
> On Thu, 11 Apr 2019 at 15:32 +0200, 'Paulo Matos' via Racket
> Developers wrote:
>> No, currently I have a central cache for the gitlab machines
>> locally. However, I have in place an S3 bucket (Google would also
>> be possible but would need to check) I will sponsor to hold the
>> cache between the machines.
>
> I can get you $500 GCP credit if you'd like. Let me know if you
> want this... even if you just want it to mess around and see how
> it'd work out.
>

Wow, that would be great; I would certainly take it. I can confirm
that Gitlab CI has integration with Google Cloud, so it should be a
breeze to set up.

That should definitely cover Racket cache-wise for quite a bit of time.

Thanks,

--
Paulo Matos