How do you set up Bazel in a monorepo with Jenkins or CI?


Matthieu Poncin

Nov 2, 2017, 11:51:57 AM
to bazel-discuss
Hi Bazel,

We are trying to adopt Bazel at our company. I've seen a few people asking on the mailing list how to set up Bazel with Jenkins, but I didn't really see any answers showing how it's done, so I wanted to share a bit of our experience: how we did it and the issues we encountered. Hopefully we can have a discussion about how everyone else is doing it.

Our system is not yet working due to the following issue I reported a few days ago: https://github.com/bazelbuild/bazel/issues/3978
Since this issue has now been marked as a feature request rather than a bug, I am guessing people are doing things differently.


I'll try to explain our use case and what I have been trying to achieve:

Context:
We have a Git monorepo with multiple apps and packages/libraries (primarily Python, but C++ projects might follow later).

Imagine we have something like this:
/apps/app1
/apps/app2
/packages/lib1
 
- app1 and app2 depend on lib1
- each app and package has its own set of tests; typically packages have unit tests and apps have integration tests
- some builds/tests take a long time to run, e.g. 1h

We want to run tests for every pull request opened and for every pull request merged to master.
However, with 50 devs we have many pull requests opened and closed on a daily basis, so it's important for us to be able to run those tests from different branches in parallel, and to run them only when needed.

Jenkins implementation:
We wanted to build a system that is as automated as possible, so using Jenkins pipelines (https://jenkins.io/doc/book/pipeline/) we made one main pipeline as the entry point. It is triggered for every pull request and every commit on master.
The main pipeline works as follows:
1) Check out the code into a workspace per branch and then move into that workspace
2) Check if packages/lib1 has been modified, using the --check_tests_up_to_date flag
   If YES -> run a downstream pipeline which runs the tests for packages/lib1
3) Check if apps/app1 has been modified, using the --check_tests_up_to_date flag
   If YES -> run a downstream pipeline for apps/app1
4) Check if apps/app2 has been modified, using the --check_tests_up_to_date flag
   If YES -> run a downstream pipeline for apps/app2
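
For reference, each "has it been modified" check is roughly the following command (the target paths just follow our example layout above; per the Bazel docs the command fails if any of the tests would actually need to be built or run):

  bazel test --check_tests_up_to_date //packages/lib1/...
  # exit code 0  -> test results are up to date (cached), skip the downstream pipeline
  # exit code != 0 -> something changed, trigger the downstream pipeline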

So each of the packages and apps has a downstream job. These are app-specific. Here is an example for an integration test:
1) deploy to testing environment
2) run tests
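
In shell terms, such a downstream job is roughly just the following (deploy_to_testing.sh is only a hypothetical placeholder for whatever app-specific deployment script is used):

  ./deploy_to_testing.sh apps/app1
  bazel test //apps/app1/tests/...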
 
Such a system with downstream jobs lets us see whenever the code has changed and tests have run for a single app; we can then link those automatically to GitHub pull requests and provide a link to the logs where the tests ran. We can also automatically deploy our apps when necessary; depending on the app this can be to staging (for example for QA) or directly to prod.
This is very flexible: each app can define how it is tested and deployed in a single pipeline.
It is also very transparent for devs: they just open a PR, the tests run automatically, and they only have to wait for a review and for the tests to show green.

This now works really well for a single master branch with the Bazel local cache. However, the local cache no longer helps when opening a new PR, since we want to run tests in parallel and from a different location.
Unfortunately, when using the remote cache, the --check_tests_up_to_date flag doesn't work. This means that on every pull request we are forced to run each downstream job even if our code change doesn't affect anything. The builds and tests are actually cached by the remote cache, but we can't detect it, so we keep redeploying the same build multiple times and we can't trust the Jenkins jobs to track when a code change was deployed.
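
To illustrate, this is roughly what the check looks like once a remote cache is configured (the exact remote-cache flag name depends on the Bazel version, and the cache URL here is just a placeholder):

  bazel test --remote_cache=http://bazel-cache.internal:8080 \
    --check_tests_up_to_date //apps/app1/...
  # With a fresh local output base this still reports the tests as not up to date,
  # even when their results are already in the remote cache.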
For now we are trying to come up with workarounds to make this work when opening a new PR, without relying on --check_tests_up_to_date.


What are your experiences with setting up Bazel with a monorepo on Jenkins or similar CI systems? At what scale?

Cheers,
Matthieu

Marcel Hlopko

Nov 14, 2017, 6:31:51 PM
to Matthieu Poncin, dmar...@google.com, bazel-discuss


Damien Martin-Guillerez

Nov 20, 2017, 8:14:37 AM
to Marcel Hlopko, Matthieu Poncin, bazel-discuss
Sorry for the delay, travelling.

So we have not set up Jenkins for a monorepo, but we do maintain the Bazel cache dir to get more caching than the remote cache provides; that does not work well across branches, though (where the remote cache works better). One thing that Google does on its repo is to use a query that determines all the affected tests, so that unaffected tests are not re-run. There is a version of that query in the Bazel repository. I would encourage people to improve that script :)

Unless your repository is so big that Bazel cannot load the graph in memory, I would recommend the query approach over splitting the job, as that avoids double computation. You can then shard the job depending on the tests (that is roughly what we do to test Bazel on presubmit internally).
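
In rough shell form the query approach looks something like this (a sketch, not the actual script in the Bazel repository; the base branch and the path-to-label conversion are simplified assumptions):

  # Files changed relative to the merge base with master
  files=$(git diff --name-only origin/master...HEAD)

  # Turn the file paths into Bazel labels, then find every test target that
  # transitively depends on any of them
  labels=$(for f in $files; do bazel query "$f"; done | tr '\n' ' ')
  bazel query "kind(test, rdeps(//..., set($labels)))" > affected_tests.txt

  # Run only the affected tests (skip the bazel test call if the list is empty)
  bazel test $(cat affected_tests.txt)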

Matthieu Poncin

Nov 22, 2017, 10:55:36 PM
to bazel-discuss
Hi Damien,
Thanks a lot for the answer! Looking at the CI script, it is exactly what we would have needed from the get-go!

We have already managed to get our setup running nicely by using a very ugly hack to get around --check_tests_up_to_date not working. But I think querying the graph like you do in this CI script would have worked very well. Glad to hear there is a much less hackish alternative! :D

One small issue I can think of with this CI script is that if we were to spread the master tests across multiple machines, we wouldn't be able to use the local cache, which would force us to rebuild everything for every merge to master. But that can probably be easily mitigated.


Here is an explanation of our hack, just for reference, in case anyone else really needs --check_tests_up_to_date to work with remote caching:
The idea was to wrap all the tests that depend on our binaries with a second Bazel test, call it is_test_up_to_date_hack. So we end up with something like this:
/apps/app1
/apps/app1/tests
/apps/app1/tests/is_test_up_to_date_hack

That extra test then passes or fails depending on a filesystem lock (trying to create a directory). Here is the workflow, abstracted, for each app (see the sketch after this list):
First check whether is_test_up_to_date_hack is locally cached, using --check_tests_up_to_date:
- if the local cache exists:
  - nothing to do, the tests are cached
- else, if the local cache doesn't exist:
  - we need to check the remote cache
  - create the lock
  - run the is_test_up_to_date_hack test
  - if the test passes:
    - nothing to do; it means the test didn't actually run, so the underlying tests are cached
  - if the test fails:
    - this means Bazel detected a code change and the cache was invalidated, so we run the downstream tests
    - if the downstream tests succeed:
      - remove the lock
      - run is_test_up_to_date_hack again so it caches the result
    - if the downstream tests fail:
      - do nothing; is_test_up_to_date_hack is cached as failed, so any subsequent commit will make the tests run again
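
A rough shell sketch of that workflow (target names follow our example layout, the lock path is just a placeholder, and the hack test itself is the one that fails when it actually executes and finds the lock directory already present):

  LOCK=/var/lock/app1_up_to_date                   # placeholder lock directory
  HACK=//apps/app1/tests/is_test_up_to_date_hack:all
  TESTS=//apps/app1/tests:all                      # the real downstream tests

  if bazel test --check_tests_up_to_date "$HACK"; then
    echo "locally cached, nothing to do"
  else
    mkdir -p "$LOCK"                               # create the lock
    if bazel test "$HACK"; then
      echo "remote cache hit, the underlying tests are cached"
    else
      # Bazel detected a code change, so run the downstream tests
      if bazel test "$TESTS"; then
        rmdir "$LOCK"                              # remove the lock
        bazel test "$HACK"                         # re-run so the pass gets cached
      fi
      # on failure, do nothing: the hack test stays failed and the next commit re-runs
    fi
  fi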

We tried for quite some time with the filesystem lock, but unfortunately we found that for it to work we had to disable sandboxing, and doing so also disabled remote caching. So instead we used a sleep and a record of how long the test takes to run, plus a second hack fiddling with environment variables to make the test fail and invalidate the cache when the downstream tests have failed.

Our solution now works well, but it's certainly very hacky and it obviously breaks the rule of reproducible builds.
We'll very likely get back to it later and try the approach of querying the Bazel graph :)
Sorry for the horrifying hack! :D

Damien Martin-Guillerez

Nov 23, 2017, 7:21:45 AM
to Matthieu Poncin, bazel-discuss
On Thu, Nov 23, 2017 at 4:55 AM Matthieu Poncin <matt...@yousician.com> wrote:
One small issue I can think of with this CI script is that if we were to spread the master tests across multiple machines, we wouldn't be able to use the local cache, which would force us to rebuild everything for every merge to master. But that can probably be easily mitigated.

If you use a distributed cache and split the tests in a deterministic fashion between the workers, then you should get pretty good caching :)
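
One simple way to make that split deterministic (a sketch; SHARD_INDEX and SHARD_COUNT would come from the CI job configuration, and the remote-cache flags are left out):

  bazel query 'tests(//...)' | sort \
    | awk -v n="$SHARD_COUNT" -v i="$SHARD_INDEX" 'NR % n == i' \
    > shard_targets.txt
  bazel test $(cat shard_targets.txt)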