Selecting bazel targets to run in CI - possible approaches

ut...@dropbox.com

Apr 2, 2018, 12:44:23 PM
to bazel-discuss
I wanted to discuss two different approaches for deciding which test targets to run in CI at each revision. 

Goals:
  • Run as few tests as possible, while still being correct.
  • Not spend too much time deciding which targets to run.
Context:

Historically, we've been using an approach that essentially involves "bazel query rdeps" at the current and previous revision. Since bazel query is not aware of BUILD/skylark files, we've added a bit of functionality that handles edge cases like adding/deleting BUILD files by running all the targets that could be affected by the parent BUILD file, and resorted to running all the targets when a skylark file is modified.

We have ~10k targets and it takes ~4 minutes to decide which targets to run, which is awfully slow. We've been advised to use warm bazel processes instead of starting and shutting down bazel every time we want to decide targets, but that's an extra operational burden (a bazel service) that we're not sure that we want to take up right now.


Approaches:

Approach 1. Using rbuildfiles

To solve the BUILD/skylark file problem, we can use skyquery (rbuildfiles). And we've learnt that, by comparing the proto representation of a target from bazel query across revisions, we can determine whether a BUILD file modification actually changes the target, and therefore whether it needs to run again.

For example:

bazel query //foo/bar:baz  --output proto > a
echo '# hi' >> foo/bar/BUILD
bazel query //foo/bar:baz  --output proto > b
# there will be no difference in the proto representation of the target
diff a b # exit 0

# But if you modify the implementation of the target's rule, or the target itself,
# the proto representation changes. For example (an illustrative edit):
sed -i 's|deps = \[\]|deps = ["//foo:extra"]|' foo/bar/BUILD
bazel query //foo/bar:baz  --output proto > c

diff a c
Binary files a and c differ

Algorithm

1. let B = empty set of BUILD files.
2. For every skylark file modified at this revision, add all BUILD files that could be affected to B (using rbuildfiles).
3. For every BUILD file modified at this revision, add it to B.
4. For every BUILD file added or deleted at this revision, add its parent BUILD file to B.
5. let T1 = set of targets defined in BUILD files in B.
6. let R1 = rdeps of all modified non-BUILD/skylark files at this revision

7. check out the previous revision.

8. let T2 = result of steps 1 to 5 at this revision
9. let R2 = rdeps of all modified non-BUILD/skylark files at this revision
10. let R = R1 union R2
# subtract out targets that we don't need to compare, since we know that we're going
# to be running them anyway.
11. let M1 = T1 - R
12. let M2 = T2 - R
13. let F = R union (M1 - M2) # new targets that are affected by BUILD/skylark files
14. let P = set of all targets in the intersection of M1 and M2 whose proto representation differs between the two revisions.
15. return F union allrdeps(P)
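The set algebra in steps 10 to 15 can be sketched in Python. This is only an illustrative sketch: `allrdeps_of` stands in for a real `bazel query 'allrdeps(...)'` invocation, and the proto representations are passed in as plain mappings.

```python
def affected_targets(T1, T2, R1, R2, proto_repr_1, proto_repr_2, allrdeps_of):
    """Steps 10-15: combine rdeps results with BUILD/skylark-affected targets.

    T1/T2: targets defined in affected BUILD files at each revision.
    R1/R2: rdeps of modified non-BUILD/skylark files at each revision.
    proto_repr_1/2: target name -> proto representation at each revision.
    allrdeps_of: maps a set of targets to their reverse transitive closure
                 (in a real setup, a `bazel query allrdeps(...)` call).
    """
    R = R1 | R2                        # step 10
    M1 = T1 - R                        # step 11: skip targets we'll run anyway
    M2 = T2 - R                        # step 12
    F = R | (M1 - M2)                  # step 13: targets newly affected by BUILD/skylark edits
    # Step 14: targets present at both revisions whose proto representation changed.
    P = {t for t in (M1 & M2) if proto_repr_1[t] != proto_repr_2[t]}
    return F | allrdeps_of(P)          # step 15
```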

This approach is beneficial since we rely on bazel query to do the heavy lifting.

Approach 2. Figuring out dependencies without relying too much on bazel query

Dump the complete Bazel graph at the current and previous revisions
  • Run bazel query deps(//...) at both revisions, which expands globs and wildcards
For each graph, compute a "target hash" for each target: a Merkle hash of all non-label attributes of the target and of the target hashes of all label attributes of the target (like deps)
  • Ideally, we'd include a hash of all labels that are source files, like those in srcs or data. However, this might be extremely wasteful/expensive, so we can get away with using only the file name for any files that haven't changed between the 2 revisions. For any files that were modified, file name + content hash will be used to compute the target hash.
  • Compute differences between target hashes at the current and previous revisions. Only targets with different hashes ("modified targets") need to be executed.
Since the query output doesn't depend on source file contents, we can cache the bazel query deps(//...) output for each revision in a simple blob store. That way, the query at each revision can be fast without needing to operate a bazel query service. The downside is that we're not bazel query experts, so we don't know whether this approach will work (and always work) for all corner cases, like WORKSPACE file changes, etc.
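A minimal sketch of the target-hash computation over a toy graph. The graph shape and attribute names here are invented for illustration; a real implementation would walk bazel query's proto output instead of a hand-written dict.

```python
import hashlib

# Toy target graph: each target has plain attributes, label deps, and srcs.
# (Illustrative only; a real implementation walks bazel query's proto output.)
TARGETS = {
    '//foo:lib':  {'attrs': {'copts': '-O2'}, 'deps': [], 'srcs': ['foo/lib.cc']},
    '//foo:test': {'attrs': {'size': 'small'}, 'deps': ['//foo:lib'], 'srcs': ['foo/test.cc']},
}

def target_hash(label, changed_file_digests, memo=None):
    """Merkle hash: non-label attrs + hashes of label deps + source file names.

    changed_file_digests maps a modified source path to its content digest;
    unmodified sources contribute only their file name, as described above.
    """
    memo = {} if memo is None else memo
    if label in memo:
        return memo[label]
    t = TARGETS[label]
    h = hashlib.sha256()
    for k in sorted(t['attrs']):
        h.update(f'{k}={t["attrs"][k]}'.encode())
    for dep in t['deps']:
        # Recurse: a dep's hash change propagates to every dependent.
        h.update(target_hash(dep, changed_file_digests, memo).encode())
    for src in t['srcs']:
        h.update(src.encode())
        if src in changed_file_digests:
            h.update(changed_file_digests[src].encode())
    memo[label] = h.hexdigest()
    return memo[label]
```

Changing a source file's content digest propagates up through the dep edges, so dependents of a modified file get a new hash even though their own attributes are unchanged.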


What do you think of these? Which one would you go with? It would be great to hear of stories from other companies, and what they do for target selectivity.


ut...@dropbox.com

Apr 20, 2018, 1:23:19 AM
to bazel-discuss
We ended up choosing the latter approach.

The algorithm includes every attribute except location from every target in bazel query --order_output=no --target_universe=//... deps(//...) --output=proto, and recurses on label attributes to create a hash for each target at both revisions; we then trigger all targets with modified hashes. It's pretty quick to construct the target -> hash mapping (< 5 seconds). We also have the list of changed files through git diff-tree, and mix their content hashes into the corresponding source file targets to ensure that their dependents run.

Is this a reasonable approach, or is anyone aware of potential problems with the choice of using this?

Janak Ramakrishnan

Apr 20, 2018, 6:17:18 PM
to ut...@dropbox.com, bazel-discuss
The Merkle hash is a nice approach. It sounds like you're doing the queries at each revision twice, in order to include just the source files that changed? You may have scaling issues when deps(//...) gets too big: rbuildfiles(...) is O(UTC(modified files)), whereas deps(//...) is O(full graph).

--
You received this message because you are subscribed to the Google Groups "bazel-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bazel-discus...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bazel-discuss/8e01308d-eb1b-42ac-920a-7f6ea24d8843%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ut...@dropbox.com

Apr 20, 2018, 7:01:12 PM
to bazel-discuss
I tried looking at memory usage of this approach. We seem to be around 2gb RSS for ~15k targets (the 10k mentioned earlier was an outdated number). And the target growth rate is not that high. I suspect it will be a while before we have memory issues. Are there other scaling limits we need to worry about?

Oh no, we query at each revision only once.

Janak Ramakrishnan

Apr 20, 2018, 9:46:48 PM
to ut...@dropbox.com, bazel-discuss
Ah, you query once but compute the hashes twice, one for each consecutive pair of commits?

ut...@dropbox.com

Apr 21, 2018, 1:46:17 AM
to bazel-discuss
Here's some pseudocode to better explain it:

changed_files = subprocess.check_output(
    ['git', 'diff-tree', '-r', '--name-only', 'HEAD']).split()  # 'HEAD' assumed; the original pseudocode omitted the revision

q1 = subprocess.check_output(['bazel', 'query', 'deps(//...)', '--output=proto'])
h1 = construct_target_hashes(q1, changed_files)

subprocess.check_call(['git', 'checkout', 'HEAD^'])  # check out the parent

q2 = subprocess.check_output(['bazel', 'query', 'deps(//...)', '--output=proto'])
h2 = construct_target_hashes(q2, changed_files)

affected_targets = targets_with_different_hashes(h2, h1)

def construct_target_hashes(query_output, changed_files):
  # returns a mapping of target name -> merkle hash
  target_hashes = {}  # type: Dict[str, str]
  for target in query_output.get_targets():
    .....
    hasher = hashlib.sha256()
    if target.get_type() == bazel_proto.SourceFile:
      path = source_file_target_to_path(target.get_source_file())
      if path in changed_files and is_not_BUILD_or_skylark(path):
        with open(path, 'rb') as f:
          hasher.update(f.read())
    ....
      

Basically, we query at the current revision, then we construct hashes for each target. When constructing the hashes, if the target is a source file target, and the source file is one of the changed files and it's not a BUILD/skylark file, we include its content in the hash as well, instead of just the attributes.

Then we check out the parent, query deps(//...) again, and construct target hashes again. Then we use the two mappings to find the differences from the parent.
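The `targets_with_different_hashes` helper referenced in the pseudocode isn't shown; a plausible implementation (my sketch, not necessarily Dropbox's exact code) is just a dictionary diff:

```python
def targets_with_different_hashes(parent_hashes, current_hashes):
    """Targets whose Merkle hash differs between the parent and current revision.

    Includes targets that exist only at the current revision (newly added).
    Targets present only at the parent revision are ignored: they no longer
    exist at the current revision, so there is nothing to run for them.
    """
    return {
        target
        for target, digest in current_hashes.items()
        if parent_hashes.get(target) != digest
    }
```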

Janak Ramakrishnan

Apr 21, 2018, 2:57:00 PM
to ut...@dropbox.com, bazel-discuss
So in a linear stream of commits, each commit is processed twice, once as the current commit, and once as the parent? If you wrote the bazel query output to a file, seems like you could save one query per commit.

I don't think you'll run into scaling limits in terms of data. So long as doing this query at each commit can keep up with your commit rate, seems fine.

ut...@dropbox.com

Apr 21, 2018, 3:52:14 PM
to bazel-discuss
Yes, once we implement this, we could cache each query and save it.

Do you see any possible correctness issues with this approach? We're mostly worried about underselection -> we don't select a test that we should.

Janak Ramakrishnan

Apr 21, 2018, 8:02:50 PM
to ut...@dropbox.com, bazel-discuss
I don't see any correctness issues, no!

ittai zeidman

Apr 22, 2018, 12:33:48 AM
to bazel-discuss
Side-note: why don’t you run “bazel test //...”? Because you don’t have a remote cache or because you found the interaction with the remote cache takes too much time?

pa...@lucidchart.com

Apr 22, 2018, 1:31:34 PM
to bazel-discuss
> Side-note: why don’t you run “bazel test //...”? Because you don’t have a remote cache or because you found the interaction with the remote cache takes too much time?

If I understood correctly, it's the time, though IDK if it's the remote cache or loading.

ittai zeidman

Apr 22, 2018, 2:20:15 PM
to bazel-discuss
AFAIU the OP said they: “Historically, we've been using an approach that essentially involves "bazel query rdeps" at the current and previous revision.”
I didn’t understand them to be using plain “bazel test //...”. This relates to the remote cache because, if you have ephemeral workers on CI without a remote cache, then you can’t afford to run “//...”, since you’ll run everything every time.

ut...@dropbox.com

Apr 22, 2018, 2:38:52 PM
to bazel-discuss
We don't do bazel test //... because:

1. We have different projects that test different configurations + requirements. For example, we don't run size=enormous tests before letting developers land code, and we move flaky tests to their own project. For that, it's nice to have a single layer that does test selection.
2. Even with a remote cache, there's a lot of overhead in fetching from the cache (which is not free). We save a lot of time by skipping building those tests, running them, and processing their results.

ut...@dropbox.com

Nov 28, 2018, 2:38:49 PM
to bazel-discuss
We deployed a version that uses deps(//...) a few months ago, and things seemed good.

We ran into an issue last week where the URL of an http_archive changed. Since we were getting the changed files from git diff, files changed in external repos weren't included in those, so we didn't hash their contents, and skipped a test which we shouldn't have. It's worth noting that if a file had been added or removed, our merkle hash would have changed, and we would have been safe.

It's an interesting corner case, and I think the only safe resolution today is to hash every external repo source file every time we're deciding what to test, which seems suboptimal.

Janak, any thoughts on this issue?

Janak Ramakrishnan

Nov 28, 2018, 2:43:51 PM
to ut...@dropbox.com, bazel-...@googlegroups.com
If the http_archive url changed, didn't that change a WORKSPACE file in your repo?


Utsav Shah

Nov 28, 2018, 2:50:00 PM
to jan...@google.com, bazel-...@googlegroups.com
It did, but WORKSPACE files don't show up in the bazel query graph, right? My understanding is that BUILD, starlark, and WORKSPACE files don't appear in the graph; they just shape it.

Janak Ramakrishnan

Nov 28, 2018, 3:17:36 PM
to ut...@dropbox.com, bazel-...@googlegroups.com
I think we've discussed that you definitely need to issue queries (using rbuildfiles, perhaps?) to capture all modified BUILD, .bzl, and WORKSPACE files. Otherwise won't you also fail to detect any number of other modifications, like added/removed dependencies, changing options to targets, etc.?

Utsav Shah

Nov 28, 2018, 5:33:09 PM
to Janak Ramakrishnan, bazel-...@googlegroups.com
Our approach to capture changes to these special files was to find the differences in deps(//...) across the two revisions. If a dependency was added/removed, or attributes were changed, the proto output of the target would change. And since we construct a merkle hash for each target, it gives us all the affected targets.

With this approach, I don't think we need rbuildfiles.

Janak Ramakrishnan

Nov 29, 2018, 1:22:51 PM
to ut...@dropbox.com, bazel-...@googlegroups.com
Ah, thanks for the reminder. So then you could potentially special-case the WORKSPACE file? (Treat all WORKSPACE changes as world-affecting?) Might be faster in the common case of no changes?

Utsav Shah

Nov 29, 2018, 1:33:06 PM
to Janak Ramakrishnan, bazel-...@googlegroups.com
Yeah, that's a little unfortunate, since we make a lot of modifications to the WORKSPACE file. Since the WORKSPACE file also pulls in a bunch of starlark files, we'd have to find all the files that it depends on, and run the world when they're modified as well, presumably by parsing the load statements; it doesn't seem like there's any other way of doing it.

Janak Ramakrishnan

Nov 29, 2018, 1:36:51 PM
to ut...@dropbox.com, bazel-...@googlegroups.com
bazel query 'buildfiles(WORKSPACE)' gives me a list of .bzl/BUILD files that it transitively depends on in some sense. So I don't think you have to do the parsing yourself. Another option would be a FR to have the definition of the remote repository as an attribute in the proto output of a rule that came from a remote repository. I think I'd be receptive to a patch for that.

Utsav Shah

Nov 29, 2018, 5:10:43 PM
to Janak Ramakrishnan, bazel-...@googlegroups.com
Ok, do you mean something like a repository_rule_definition_hash, that could be embedded inside every target that originates from an external repo?

Janak Ramakrishnan

Nov 29, 2018, 7:39:01 PM
to Utsav Shah, bazel-...@googlegroups.com
Yes, exactly.

Janak Ramakrishnan

Nov 29, 2018, 11:46:26 PM
to Utsav Shah, bazel-...@googlegroups.com
+list, since I think people may benefit from the discussion.

I think you need to add the repository hash to each non-main package. You might be able to retrieve it here, by doing something like this and exposing the RepositoryDirectoryValue's digest. I'm not sure if that's always present (I don't work with external repositories), so you might need to find something else unique about it, or make sure the digest is always present. Then once you have the digest as a field in the package, you can add it in to the proto output.

Another alternative, that maybe I like better in theory, would be to have every target in an external repository depend on the rule that defined that repository, since it does really depend on it. Then you'd get that rule in the proto output for free. I'm not sure how to add that edge, though. And it might clutter the graph a bit.

markus....@ecosia.org

Dec 17, 2018, 6:50:48 PM
to bazel-discuss
Just running into the same issue and trying to figure out what has changed in a WORKSPACE file, so as to only trigger a rebuild/retest of rdeps of changed dependencies. I tried the bazel query --output=proto 'deps(//...)' approach, and if I also add the flag --order_output=full, then I only see a difference if there was a change in the dependency graph, not if a source file had a simple code change. I do notice that if I use --order_output=deps (the default), the order of the output changes after a dependency update, and then I can actually detect a difference (using a sha) for my test case. But how exactly is the output ordered in this mode? I can't find much documentation on it, and I'm not exactly sure why there is this difference.

It would be nice to have a "blessed" way to easily figure out what external dependencies have changed between two different commits.

markus....@ecosia.org

Dec 17, 2018, 6:56:02 PM
to bazel-discuss
Yeah, using --order_output=deps does not seem to have stable ordering. If I update a dep, then roll it back and update it again, the order is different. Only --order_output=full seems to be stable.

Janak Ramakrishnan

Dec 17, 2018, 6:58:33 PM
to markus....@ecosia.org, bazel-discuss
--order_output=deps means that the list is ordered by deps (a depends on b means that a is output before b). See https://docs.bazel.build/versions/master/query.html#output-formats. It's not a total ordering, so it may not be stable across invocations.

I'm not sure I understand your question about detecting changes. If you want to see affected targets, you'll have to issue something like bazel query 'rdeps(//path/to/source:file, //...)'. query doesn't examine most source files' contents, just BUILD and bzl files.

markus....@ecosia.org

Dec 18, 2018, 3:35:27 AM
to bazel-discuss

As far as I can tell, it is basically the same issue that Utsav is talking about. If I update one of my third party dependencies, I want to only build/test affected targets. Once I know which third party dependency has changed, that is easy (using rdeps as you mentioned), but figuring out which third party dependency has changed is not easy, as only the `WORKSPACE` file has changed, or only a `package.json` or `requirements.txt`.

If every external repository target exposed a repository_rule_definition_hash in the proto output, that would be easier. Actually, just thinking about it, the new WORKSPACE.resolved file would probably help here also.


Ha, I actually just gave it a try, and in WORKSPACE.resolved there is an output_tree_hash for each external repository, and the content does seem to be valid json, so it would be quite straightforward to parse it, compare it with the previous version, and then we know which external deps have changed. So I guess once this feature becomes more mature (I am getting errors currently, will file a bug report), things should be in a better state.

Utsav Shah

Dec 20, 2018, 10:32:30 PM
to markus....@ecosia.org, bazel-discuss
The content alongside the output_tree_hash isn't valid JSON, unfortunately. It has "Label(//...)", but that seems like the only invalid JSON attribute.

A "repository_rule_definition_hash" would really be useful.


mish...@gmail.com

Jan 14, 2019, 2:57:33 PM
to bazel-discuss
I was looking at the pseudocode from Utsav, but I still don't see how the hashes are created for rules with deps and srcs.

markus....@gmail.com

Feb 18, 2019, 3:42:48 PM
to bazel-discuss
Utsav, have you been able to find a scalable solution to this problem yet? We have been ignoring workspace changes for now, but our monorepo is starting to get to a size where we can't anymore, so this is coming onto my radar again.

Utsav Shah

Feb 18, 2019, 4:00:19 PM
to markus....@gmail.com, bazel-discuss
We came up with an ugly hack that mitigates the problem. We use the experimental workspace log file (https://github.com/bazelbuild/bazel/blob/master/site/docs/workspace-log.md) to find out when a remote repository hash was invalidated, and then we mark any dependencies of such rules as modified. This isn't robust against new developments like "patch_cmd", which sucks.

I'd really like to see a systemic solution for this within bazel.


On Mon, Feb 18, 2019 at 12:42 PM <markus....@gmail.com> wrote:
> Utsav, have you been able to find a scalable solution to this problem yet? We have been ignoring workspace changes for now, but our monorepo is starting to get to a size where we can't anymore, so this is coming onto my radar again.


markus....@gmail.com

Feb 18, 2019, 4:36:42 PM
to bazel-discuss
Hm, interesting. How exactly are you making use of this? Any code snippets, pseudocode, or hints you could share would be highly appreciated.

But yes, agreed, bazel really should provide a good story for this itself.

Utsav Shah

Feb 19, 2019, 9:22:50 PM
to Markus Padourek, bazel-discuss
Here's some example Go code for reading workspace log files. We construct a mapping from archives to sha256 hashes at both revisions, and then mark any that differ as invalidated.

package ci // package name is illustrative

import (
	"io"

	"github.com/matttproud/golang_protobuf_extensions/pbutil"
	"github.com/pkg/errors"

	// Generated from Bazel's workspace_log.proto; this import path is illustrative.
	workspace_log "yourrepo/bazel/workspace_log"
)

type Log struct {
	entries map[string]string
}

// Parse reads length-delimited WorkspaceEvent protos and records the
// sha256 of each download/download-and-extract event, keyed by rule name.
func Parse(r io.Reader) (*Log, error) {
	log := &Log{
		entries: map[string]string{},
	}

	for {
		var w workspace_log.WorkspaceEvent
		_, err := pbutil.ReadDelimited(r, &w)

		if err == io.EOF {
			return log, nil
		}
		if err != nil {
			return nil, errors.Wrap(err, "unexpected error while reading workspace log")
		}

		if w.GetDownloadAndExtractEvent() == nil && w.GetDownloadEvent() == nil {
			continue
		}

		name := w.GetRule()

		var sha string
		if w.GetDownloadEvent() != nil {
			sha = w.GetDownloadEvent().GetSha256()
		} else if w.GetDownloadAndExtractEvent() != nil {
			sha = w.GetDownloadAndExtractEvent().GetSha256()
		}

		log.entries[name] = sha
	}
}

// ModifiedWorkspaces identifies workspaces that have been modified or
// removed; newly added ones aren't a risk.
func ModifiedWorkspaces(newer, older *Log) []string {
	presentInNewer := make(map[string]bool)

	var modified []string

	for workspace, hash := range newer.entries {
		presentInNewer[workspace] = true
		if oldHash, ok := older.entries[workspace]; ok && oldHash != hash {
			modified = append(modified, workspace)
		}
	}

	for workspace := range older.entries {
		if _, ok := presentInNewer[workspace]; !ok {
			modified = append(modified, workspace)
		}
	}

	return modified
}


On Mon, Feb 18, 2019 at 1:36 PM <markus....@gmail.com> wrote:
> Hm interesting, how exactly are you making use of this? Any code snippet, pseudo code or hints you could share? Would be highly appreciated.
>
> But yes agreed bazel really should provide a good story for this itself.


markus....@ecosia.org

Feb 26, 2019, 3:21:11 AM
to bazel-discuss
Thanks for sharing that. To get the log file, are you simply running something like:

bazel query --universe_scope="//..." --order_output=no --experimental_workspace_rules_log_file=workspace.log 'deps(//...)'

Utsav Shah

Feb 26, 2019, 11:09:37 AM
to markus....@ecosia.org, bazel-discuss

markus....@ecosia.org

Mar 27, 2019, 9:42:59 AM
to bazel-discuss
Just to have it mentioned here as well: another solution that came up, but that I have not had the chance to try yet, is to write a repository rule which keeps track of all the commit/sha/etc. attributes of all existing rules and writes them to a JSON file which can be inspected. In theory we would run this for the previous and the current commit and then just diff the JSON. The disadvantage: it is very important for this rule to be placed at the very end of the WORKSPACE file, and it only tracks specified attributes, so it might not be so suitable for python or npm dependencies.

Here is the code though: https://gist.github.com/aaliddell/7eef8077098f0d9a5f5dec60fe3b3b4a

v...@uber.com

Apr 14, 2019, 10:31:31 PM
to bazel-discuss
Another alternative solution would be computing a rule key (hash) for each build rule in the graph on the Bazel side and printing it in the query output. This would allow the user to simply run queries at two revisions and compare the checksums for each pair of rules.

Here is the corresponding feature request in Bazel:
https://github.com/bazelbuild/bazel/issues/7962

I'm not sure, though, whether there are limitations on the Bazel side that would make this feature difficult to implement.

I've also described the solution I've prototyped so far, using existing Bazel features, in the above issue.

markus....@gmail.com

May 31, 2019, 2:08:20 PM
to bazel-discuss
Seems there is a chance to get some improvements in bazel core: https://github.com/bazelbuild/bazel/issues/8521