Hi Damien,
Thanks a lot for the answer! Looking at the ci script, it's exactly what we would have needed from the get-go!
We had already managed to get our setup running nicely by building a very ugly hack to get around the option --check_tests_up_to_date not working with remote caching. But I think querying the graph the way you do in this ci script would have worked very well. Glad to hear there is a much less hackish alternative! :D
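For reference, that kind of graph query would look something like this (just a sketch: //apps/app1:main.cc is a placeholder for a changed file, but rdeps, kind and set are standard bazel query functions):

    # Find every test affected by a set of changed source files and run it.
    AFFECTED=$(bazel query "kind(test, rdeps(//..., set(//apps/app1:main.cc)))")
    [ -n "$AFFECTED" ] && bazel test $AFFECTED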
One small issue I can think of with this ci script: if we spread the master tests across multiple machines, we wouldn't be able to use the local cache, which would force everything to be rebuilt for every merge to master. But that can probably be mitigated easily, for example as sketched below.
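For example (a sketch; the endpoint and path are placeholders we made up), pointing every machine at a shared cache should avoid the per-machine rebuild:

    # All machines share build/test results through one cache backend.
    bazel test --remote_cache=grpc://cache.example.com:9092 //...
    # Or, if the machines share a filesystem:
    bazel test --disk_cache=/shared/bazel-cache //...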
Here is an explanation of our hack, just for reference, in case anyone else really needs the option check_tests_up_to_date to work with remote caching. The idea was to wrap all tests depending on our binaries with a second bazel test, call it is_test_up_to_date_hack. So we end up with something like this:
/apps/app1
/apps/app1/tests
/apps/app1/tests/is_test_up_to_date_hack
That extra test then fails or passes depending on a filesystem lock (it tries to create a directory). Here is the workflow, abstracted, for each app (a shell sketch follows the list):
First, check whether is_test_up_to_date_hack is locally cached, using --check_tests_up_to_date:
- if the local cache exists:
  - nothing to do, the tests are cached
- else, if the local cache doesn't exist, we need to test for the remote cache:
  - create the lock
  - run the is_test_up_to_date_hack test
  - if the test passes:
    - nothing to do; it means the test didn't actually run, so the underlying tests are cached
  - if the test fails:
    - bazel detected a code change and invalidated the cache, so we run the downstream tests
    - if the downstream tests succeed:
      - remove the lock
      - run is_test_up_to_date_hack again so the passing result gets cached
    - if the downstream tests fail:
      - do nothing; is_test_up_to_date_hack is cached as failed, so any subsequent commit will make the tests run again
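Put into shell, the per-app logic looks roughly like this (only a sketch: the lock path and target labels are placeholders, and as explained below the lock variant didn't survive sandboxing):

    # Per-app CI step; paths and labels are made up for illustration.
    TARGET=//apps/app1/tests/is_test_up_to_date_hack
    LOCK=/tmp/app1.lock

    if bazel test --check_tests_up_to_date "$TARGET"; then
        : # local cache hit, nothing to do
    else
        mkdir -p "$LOCK"          # take the lock: the wrapper fails while it exists
        if bazel test "$TARGET"; then
            : # wrapper came back from the remote cache, underlying tests are cached
        elif bazel test -- //apps/app1/tests/... -"$TARGET"; then
            # downstream tests (excluding the wrapper itself) passed
            rmdir "$LOCK"         # release the lock
            bazel test "$TARGET"  # rerun so the passing result gets cached
        fi                        # on failure: wrapper stays cached as failed
    fi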
We tried for quite some time with the filesystem lock, but unfortunately we found that for it to work we had to disable sandboxing, and doing so also disabled remote caching. So instead we used a sleep and recorded how much time the test takes to run, plus a second hack fiddling with environment variables to make the test fail and invalidate the cache when the downstream tests have failed.
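Roughly, the sleep variant looks like this (a sketch under our assumptions: the 30-second duration and the BUST_CACHE variable are made up; --test_env is the real bazel flag, and changing a test's environment invalidates its cached result):

    #!/bin/sh
    # is_test_up_to_date_hack.sh: the wrapper test itself (names are placeholders).
    if [ -n "${BUST_CACHE:-}" ]; then
        exit 1   # forced failure: see the CI side below
    fi
    sleep 30     # a cached run returns instantly, a real one takes ~30s

and on the CI side:

    # Time the wrapper instead of using a lock.
    start=$(date +%s)
    bazel test //apps/app1/tests/is_test_up_to_date_hack
    elapsed=$(( $(date +%s) - start ))

    if [ "$elapsed" -ge 30 ]; then
        # The wrapper really ran, so the cache was invalidated: run downstream tests.
        if ! bazel test -- //apps/app1/tests/... -//apps/app1/tests/is_test_up_to_date_hack; then
            # Downstream tests failed: change the env var so the wrapper's cached
            # pass is invalidated and it fails until the next code change.
            bazel test --test_env=BUST_CACHE="$(date +%s)" //apps/app1/tests/is_test_up_to_date_hack
        fi
    fi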
Our solution now works well, but it's certainly very hacky and it obviously breaks the rule of reproducible builds.
We'll very likely get back to it later on and try the approach of querying the bazel graph :)
Sorry for that horrifying hack! :D