Input discovery in Bazel


Lukács T. Berki

Jan 29, 2019, 7:40:30 AM
to bazel-dev, Ittai Zeidman, Daniel Jasper, Liam Miller-Cushon, Emmanuel Pellereau, David Morgan, Nathan Harmata, Christopher Parsons, Philipp Wollermann, Laurent Le Brun
NOTE TO PEOPLE AT GOOGLE: This thread is readable by everyone, including non-Googlers. 

Hey there,

Bazel apparently has a few use cases where it would be useful for actions to do input discovery, i.e. depend on artifacts that are only known in the execution phase:
  • C++ include scanning (i.e. only put those headers on the inputs of the action that are actually included)
  • Java classpath shortening (one usually does not need the .jar files in the whole transitive closure to compile a Java library, but one needs to invoke the compiler to be sure)
  • Scala classpath shortening (the same but with Scala, and AFAIU it's a bit more complicated)
  • Dart
This suggests that maybe there should be a common implementation: we currently do C++ and Java input discovery using bespoke Java code and we don't support the same for Scala and Dart.

The Bazel people don't have the capacity to work on such a thing at the moment in light of our upcoming 1.0 version, but if we can come to an agreement about what should be done, we'd be happy to accept contributions.

Emmanuel kindly implemented a prototype that adds two arguments to ctx.actions.run():
  1. discoverable_inputs_pool= that gives a superset of all possible inputs
  2. discoverable_inputs_list= that references a text file wherein each line must contain the path to a file in discoverable_inputs_pool= and corresponds to the set of inputs actually needed
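For illustration, a fragment of a rule implementation using the prototype might look roughly like this (a sketch only: discoverable_inputs_pool= and discoverable_inputs_list= are the prototype's names and not an existing Bazel API, the scanner tool and its flags are made up, and src/obj/pool stand in for the rule's own values):

# Action 1: discovery. Writes the paths of the actually-needed files from the
# pool into 'needed', one path per line.
needed = ctx.actions.declare_file(ctx.label.name + ".needed")
ctx.actions.run(
    executable = ctx.executable._scanner,  # hypothetical discovery tool
    inputs = depset([src], transitive = [pool]),
    outputs = [needed],
    arguments = [src.path, needed.path],
)

# Action 2: the actual compilation, whose effective inputs are pruned to the
# files named in 'needed'.
ctx.actions.run(
    executable = ctx.executable._compiler,
    inputs = [src],
    outputs = [obj],
    arguments = ["-c", src.path, "-o", obj.path],
    discoverable_inputs_pool = pool,    # prototype argument: the superset
    discoverable_inputs_list = needed,  # prototype argument: the text file
)

Note that this split is exactly where the two-action and write-then-read costs in the points below come from.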
This is one possible model; however, it would result in inefficiencies for at least Java and C++, on which we can't afford to regress:
  1. Compiler invocations would need to be stored in two Actions (the one doing the discovery and the one doing the actual compilation)
  2. These actions would both need to have a large list of inputs and it takes RAM to store that
  3. The input lists themselves would need to be written to a file then read back again, which is slow
An alternative would be to do the input discovery in a single action, just like Java and C++ do today. That's currently implemented in Java code that cannot easily be translated to Starlark. 

Here is the alternative implementation (the discoverable_inputs_pool= "superset" argument omitted for brevity):

# Calls a shell command to do the input discovery
ctx.actions.run(input_discovery=ctx.actions.shell_input_discovery(
    ["/usr/bin/cpp -E foo/bar.cc > discovered_inputs.txt"]))

# Invokes the include scanner implemented in Java
ctx.actions.run(input_discovery=cc_common.invoke_cc_include_scanner())

Where whatever is referenced by the input_discovery= argument would get invoked by Action#discoverInputs() as usual, with the shell command possibly sandboxed to enforce that it only reads the files in the superset argument.

This would have the advantage of meshing much better with Skyframe's gradual input discovery model and not incurring the cost of additional action graph edges to Java and C++.

WDYT?

As a pie-in-the-sky goal, if everything goes well, we could also figure something out to get rid of the irksome ActionExecutionValue.discoveredModules field that so beautifully special-cases C++ in our core... unfortunately, that would help only problem (3) with Emmanuel's proposal, but hey, one can dream!

-- 
Lukács T. Berki | Software Engineer | lbe...@google.com | 

Google Germany GmbH | Erika-Mann-Str. 33  | 80636 München | Germany | Geschäftsführer: Paul Manicle, Halimah DeLaine Prado | Registergericht und -nummer: Hamburg, HRB 86891

Ittai Zeidman

Jan 29, 2019, 7:55:12 AM
to Lukács T. Berki, Christopher Parsons, Daniel Jasper, David Morgan, Emmanuel Pellereau, Laurent Le Brun, Liam Miller-Cushon, Nathan Harmata, P. Oscar Boykin, Philipp Wollermann, bazel-dev
+Oscar
This sounds related to, but still different from, Oscar's proposal of dynamically discovering targets, right? The above can help keep targets coarse-grained while achieving fine-grained actions.

László Csomor

Jan 29, 2019, 8:09:19 AM
to Ittai Zeidman, Lukács T. Berki, Christopher Parsons, Daniel Jasper, David Morgan, Emmanuel Pellereau, Laurent Le Brun, Liam Miller-Cushon, Nathan Harmata, P. Oscar Boykin, Philipp Wollermann, bazel-dev
Feels like we're on the path to reinvent Skyframe for action execution. The goal is a similar evaluation model: gradual input (= dependency) discovery where earlier-requested (= built) inputs determine later-requested inputs. Does this mapping of the problem to Skyframe make sense?

On a more concrete level, consider Windows (and Unixes without Bash) when talking about "shell commands" -- i.e. don't assume Bash, or any Sh-compatible shell, is always available.

--
László Csomor | Software Engineer | laszlo...@google.com


Google Germany GmbH | Erika-Mann-Str. 33 | 80636 München | Germany
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Paul Manicle, Halimah DeLaine Prado

David "Morgan" Morgan

Jan 29, 2019, 9:53:38 AM
to László Csomor, Ittai Zeidman, Lukács T. Berki, Christopher Parsons, Daniel Jasper, Emmanuel Pellereau, Laurent Le Brun, Liam Miller-Cushon, Nathan Harmata, P. Oscar Boykin, Philipp Wollermann, bazel-dev
Thanks everyone. A few points from the Dart side:

 - In the case we're looking at right now, the input discovery directly uses the output of previous compiler actions; so there are no additional "discover inputs" actions. Obviously, a more flexible approach would work for us too; and we might well discover a need for it.

 - Unlike the Java and C++ cases, but like the Scala case, we are currently doing without this optimization; there is no native support for Dart in Bazel. There is a direct and quantifiable cost to many engineers. So we have a very strong interest in landing this in some form, and can certainly spend engineering effort on it. Hopefully, all we need is a little guidance :)

Thanks!

Morgan

Lukács T. Berki

Jan 29, 2019, 10:24:08 AM
to László Csomor, Ittai Zeidman, Christopher Parsons, Daniel Jasper, David Morgan, Emmanuel Pellereau, Laurent Le Brun, Liam Miller-Cushon, Nathan Harmata, P. Oscar Boykin, Philipp Wollermann, bazel-dev
@Oscar, @Ittai: which proposal was that? I can only find this one (Post-analysis actions in Skylark), which doesn't list you as the author.

@David: we'll definitely need this for our long-term plan to migrate all rules to Skylark, and that's why I'm willing to accept a suitable patch. However, in order to keep the complexity of our code under control, it's important that there be one implementation that covers all the use cases. That's why I'm pulling in all these people. I *think* my proposal might just work, but I'm not confident enough yet to assert that it will be fine.


On Tue, Jan 29, 2019 at 2:09 PM László Csomor <laszlo...@google.com> wrote:
Feels like we're on the path to reinvent Skyframe for action execution. The goal is a similar evaluation model: gradual input (= dependency) discovery where earlier-requested (= built) inputs determine later-requested inputs. Does this mapping of the problem to Skyframe make sense?
Yes, my long-term plan is to fold Action#discoverInputs() into Action#execute() using Skyframe restarts. The latter is already called within a Skyframe function, so it's not inconceivable, but it's not going to happen in the foreseeable future.
 
On a more concrete level, consider Windows (and Unixes without Bash) when talking about "shell commands" -- i.e. don't assume Bash, or any Sh-compatible shell, is always available.
Of course. I should have said "whatever arguments ctx.actions.run() takes" in place of the argument of shell_input_discovery(). 

P. Oscar Boykin

Jan 29, 2019, 12:27:38 PM
to Lukács T. Berki, László Csomor, Ittai Zeidman, Christopher Parsons, Daniel Jasper, David Morgan, Emmanuel Pellereau, Laurent Le Brun, Liam Miller-Cushon, Nathan Harmata, Philipp Wollermann, bazel-dev
Here is a lightning talk I gave at the first bazel conference:


The second proposal is about allowing "bounded dynamic rules": rules that have static inputs and static outputs, but two phases. The first phase looks at all the inputs and determines an action graph to run to produce the needed outputs.

In this way, Scala (or C++, Haskell, or Rust) users could use the pattern of one module per directory (glob(["*.scala"])) and the compiler could look at the inputs and the outputs to determine which to build. So, again, I'm mostly concerned with the dependency graph between *files* on a particular build.

Right now, some Scala users use a rule set that does this outside of Bazel using workers. That's an indication that there is demand to solve this problem. I don't want to be outside of Bazel, hiding state elsewhere, because I want remote execution and caching to be safe and correct.

Lukács T. Berki

Jan 29, 2019, 1:00:56 PM
to P. Oscar Boykin, László Csomor, Ittai Zeidman, Christopher Parsons, Daniel Jasper, David Morgan, Emmanuel Pellereau, Laurent Le Brun, Liam Miller-Cushon, Nathan Harmata, Philipp Wollermann, bazel-dev
Oh, that one!

That one seems to be a much more difficult problem, so I think it's better left out of scope for now. We were discussing an (eventual) approach where Bazel would parse input files in some way to generate dependency edges and maybe even targets, but it's a much more intrusive thing, since it would require the dynamic generation of actions (and maybe even targets) and not only the set of inputs for an action.

The closest thing to it is called Gazelle (generates BUILD files for Go), but it's a tool outside of Bazel and not part of it.

David "Morgan" Morgan

Feb 1, 2019, 5:01:34 AM
to Lukács T. Berki, P. Oscar Boykin, László Csomor, Ittai Zeidman, Christopher Parsons, Daniel Jasper, Emmanuel Pellereau, Laurent Le Brun, Liam Miller-Cushon, Nathan Harmata, Philipp Wollermann, bazel-dev
Thanks Lukacs.

Indeed we should aim for something that covers the known/expected use cases.

We'd like to start work as soon as possible--the potential benefits for us are huge--so if those who have not yet commented would like to do so, we'd appreciate your input, please :)

Andrew Suffield

Feb 2, 2019, 7:10:00 AM
to Lukács T. Berki, bazel-dev, Ittai Zeidman, Daniel Jasper, Liam Miller-Cushon, Emmanuel Pellereau, David Morgan, Nathan Harmata, Christopher Parsons, Philipp Wollermann, Laurent Le Brun
I feel like I haven't quite understood what this feature is trying to accomplish. I see the value in removing the special cases, but not why they exist in the first place. A few probing questions to try to scope out the problem here:

How is this better than simply using all possible inputs to the compiler and letting it sort itself out?

Is this expected to improve caching? It's not clear to me if there's a way to avoid having the entire superset of inputs in the cache key.

How can this be sandboxed, without creating the same "action with large list of inputs" problem? Don't we always have to deal with that?


David "Morgan" Morgan

Feb 4, 2019, 4:13:36 AM
to Andrew Suffield, Lukács T. Berki, bazel-dev, Ittai Zeidman, Daniel Jasper, Liam Miller-Cushon, Emmanuel Pellereau, Nathan Harmata, Christopher Parsons, Philipp Wollermann, Laurent Le Brun
Thanks Andrew. Let me explain for Dart.

The primary reason it's better is for incremental rebuilds.

We see cases with the Dart compile where we currently give all transitive compile outputs to the compiler, i.e., the output of all previous compile steps where there is any dependency link, direct or indirect, to the current compile step. We have to do this because, without examining the source files, there's no way to know which exact subset is needed.

It's usually a much smaller subset that is needed.

Currently this makes incremental rebuild cost O(n) where n is the length of the dependency chain from the file being modified to the entrypoint of the program. The entire dependency chain needs rebuilding, in series, because every compile step along the chain has a direct dependency on the thing that changed, and a dependency on the step before it.

This change would allow us to break most of those dependencies, preventing most of those rebuilds and making the rebuild cost more like O(1). (In practice, a slightly higher number than 1, because there still might be a chain of required dependencies longer than 1.)

At Google the saving is relatively modest because we have massively parallel build infrastructure; we're talking about a rebuild that used to take three minutes now taking <10s. This is still huge from the developer's point of view. For everyone else, though, the saving is potentially vastly greater: we could be talking about turning a half-hour or hour-long rebuild into a ten-second rebuild.

Lukács T. Berki

Feb 4, 2019, 5:06:20 AM
to Andrew Suffield, bazel-dev, Ittai Zeidman, Daniel Jasper, Liam Miller-Cushon, Emmanuel Pellereau, David Morgan, Nathan Harmata, Christopher Parsons, Philipp Wollermann, Laurent Le Brun
On Sat, Feb 2, 2019 at 1:10 PM Andrew Suffield <asuf...@gmail.com> wrote:
I feel like I haven't quite understood what this feature is trying to accomplish. I see the value in removing the special cases, but not why they exist in the first place. A few probing questions to try to scope out the problem here:

How is this better than simply using all possible inputs to the compiler and letting it sort itself out?  
 

Is this expected to improve caching? It's not clear to me if there's a way to avoid having the entire superset of inputs in the cache key.  
 

How can this be sandboxed, without creating the same "action with large list of inputs" problem? Don't we always have to deal with that?
This has a few advantages:
  1. Without input pruning, even if the compiler doesn't read all of its input files, Bazel doesn't know that and will re-execute the action if an unused input changes; with pruning, it won't. 
  2. It's beneficial to ship fewer files to remote executors: aside from the increased cache hit rate, it decreases the overhead of shuffling the necessary files to where the action actually runs.
Unfortunately, to determine which files are actually needed, it's usually necessary to read some input files of the action, which Bazel is not allowed to do in the analysis phase (and if those files are not source files, they may not even exist yet).

It's worth mentioning that Bazel actually has two mechanisms for pruning action inputs like this:
  1. Input discovery, currently special-cased for C++ and available only in our internal version (traces of it can be seen in CppCompileAction, though). This runs before the action actually executes and discovers the set of required inputs based on running a very limited version of the C++ preprocessor on the file to be compiled and the headers it includes.
  2. .d file scanning: after the compiler finishes, it emits a file containing the headers actually read (or, in case of MSVC, it prints it on its console output). Bazel processes this data and retroactively removes the input files that the compiler did not need. 
This thread is about implementing (1) in a way that works for all existing use cases. (2) is interesting, too, but since the correctness of Bazel relies on doing it right (you could in theory remove every input of the action), I'm more wary of offering it to Starlark.

Andrew Suffield

Feb 4, 2019, 5:31:28 PM
to Lukács T. Berki, bazel-dev, Ittai Zeidman, Daniel Jasper, Liam Miller-Cushon, Emmanuel Pellereau, David Morgan, Nathan Harmata, Christopher Parsons, Philipp Wollermann, Laurent Le Brun
This is a fascinating problem space.

I have an incomplete idea that I've been toying with all day, and it's not quite coherent yet, but I'll write it up and see if anybody can finish it off. Everything I've learned about working with bazel has taught me to decompose this sort of problem into tiny, highly-cacheable graphs which do as much processing as possible near the start of the graph, and defer aggregation of results as late as possible. With that in mind, what if we break the problem up into "extraction" and "combination" steps? Extraction runs on exactly one source file, and produces some intermediate proto; combination takes all the protos and walks the graph of dependencies to produce the list of probably-required files.

For C++, extraction could approximate "a list of all #include lines in the input" and combination is a simple graph walk. For Dart, I think the extraction step is already handled by the compiler, and it just needs a combination step? I can't immediately tell whether we'd need a per-language combiner, or if there's a generic form which could be implemented once.

Advantages:
 - each input file is loaded only once in a single, highly parallel action
 - really good cache survival rates, because changes to source files which don't affect includes will invalidate nothing, and if that fails then we get another chance at a cache hit when combining results
Flaws:
 - no solution for C++ preprocessor gymnastics or python abuses (doctrinal conflict integrate, anybody?)
 - Java transitive dependencies are kind of interesting; I'm not entirely sure what to do there, although it sounds solvable

I'm not sure how it can be made to work, but it feels like it should be possible.
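A minimal Starlark sketch of that shape, assuming hypothetical extract_includes and combine_includes tools (rule fragment only, not a worked-out design):

# One tiny, highly cacheable action per source file: emit its #include lines.
extracted = []
for src in ctx.files.srcs:
    includes = ctx.actions.declare_file(src.basename + ".includes")
    ctx.actions.run(
        executable = ctx.executable._extract_includes,  # hypothetical tool
        inputs = [src],
        outputs = [includes],
        arguments = [src.path, includes.path],
    )
    extracted.append(includes)

# One combination action: walk the dependency graph over the per-file results
# to produce the list of probably-required files.
required = ctx.actions.declare_file(ctx.label.name + ".required")
ctx.actions.run(
    executable = ctx.executable._combine_includes,  # hypothetical tool
    inputs = extracted,
    outputs = [required],
    arguments = [required.path] + [f.path for f in extracted],
)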

Ulf Adams

Feb 4, 2019, 5:56:12 PM
to Andrew Suffield, Lukács T. Berki, bazel-dev, Ittai Zeidman, Daniel Jasper, Liam Miller-Cushon, Emmanuel Pellereau, David Morgan, Nathan Harmata, Christopher Parsons, Philipp Wollermann, Laurent Le Brun
As background:

C++ include scanning consists of two steps:
1. extract includes from the source files (approximately; we ignore #ifdef and friends, and do not support macro includes, i.e., #define MACRO "file.h" #include MACRO)
2. walk the include graph in combination with the invocation-specific include path to find all included files

(We have talked about making step 1 return a preprocessor AST rather than a list of includes, to fix the approximation problem. It's not clear that the costs are worth the benefits, but it would be an option.)

Note that we perform step 1 on-demand, as we're doing 2, rather than eagerly.

We did at one point have an implementation of this process in Skyframe, but deleted it due to performance issues. We did not seriously attempt to fix the performance issues.

At this time, this process happens at action execution time, using look-aside caches to avoid duplicate work. Having look-aside caches is rather undesirable, but this code precedes the existence of Skyframe.


The Java implementation also uses a look-aside cache for parsed files, which is cleared before every build (IIRC).

David "Morgan" Morgan

Feb 6, 2019, 9:40:54 AM
to Ulf Adams, Andrew Suffield, Lukács T. Berki, bazel-dev, Ittai Zeidman, Daniel Jasper, Liam Miller-Cushon, Emmanuel Pellereau, Nathan Harmata, Christopher Parsons, Philipp Wollermann, Laurent Le Brun
Great that there's interest in this :)

Are there any obvious next steps to making this happen? The main constraint seems to be that the people who will have time to work on it are not the same as the people with the expertise, so it seems like there will be some coordination/negotiation needed to make progress :)

Thanks!

Morgan

David "Morgan" Morgan

Feb 7, 2019, 10:12:22 AM
to Ulf Adams, Andrew Suffield, Lukács T. Berki, bazel-dev, Ittai Zeidman, Daniel Jasper, Liam Miller-Cushon, Emmanuel Pellereau, Nathan Harmata, Christopher Parsons, Philipp Wollermann, Laurent Le Brun
Friendly ping. I'll be away all of next week (11th-15th)--but hopefully discussion can continue while I'm away :)

pa...@rivethealth.com

Mar 11, 2019, 3:38:21 PM
to bazel-dev
Yay, some discussion about input discovery! 

I authored the Post-analysis actions in Skylark proposal, based on my own findings, Bazel documentation, Oscar's "subgraph" idea which I cited at the bottom, and my understanding of Skyframe.

Feels like we're on the path to reinvent Skyframe for action execution. The goal is a similar evaluation model: gradual input (= dependency) discovery where earlier-requested (= built) inputs determine later-requested inputs. Does this mapping of the problem to Skyframe make sense?

Yes, absolutely!

That one seems to be a much more difficult problem, so I think it's better left out of scope for now. We were discussing an (eventual) approach where Bazel would parse input files in some way to generate dependency edges and maybe even targets, but it's a much more intrusive thing, since it would require the dynamic generation of actions (and maybe even targets) and not only the set of inputs for an action.

I achieve dependency discovery today by: (1) first using an action to generate an archive of selected inputs, (2) giving that archive to the next step, and (3) relying on Bazel short-circuiting the second step if the archive is the same. I'd be thrilled for that process to use less disk space and be less awkward.

More generally, IMO there are three types of needs for more dynamism in actions (also given in my proposal earlier).
  1. Pruning inputs to achieve better cache hits. C++ rules do this. This would be for anything with cheap static dependency analysis: Go, TypeScript, JavaScript, Sass.
  2. Using a fork-join pattern to parallelize work. The equivalent of native Bazel's "template actions". This would be useful for anything "shardable": compressed archives, tests, Android dexing.
  3. Plan efficient build graphs of arbitrary shape. This is the generalization of 1 and 2. For example, given a source tree of 500 TypeScript files, examine the dependencies and generate ~50 interdependent build actions with an intelligent level of granularity. (Ditto for Java or anything else with a similar compilation model.)
These can be done today, but only through tedious manual specification, or increasingly complex code-generation build tools layered on top of Bazel like Gazelle. It's reminiscent of configure/cmake/make madness to address build tool inefficiencies.

This suggestion would fix that for #1, which, again, is very exciting.

But with the thought "we're on the path to reinvent Skyframe for action execution", could we more generally solve #1, #2, and #3, by more fully leveraging the Skyframe model in Starlark?

Lukács T. Berki

Mar 18, 2019, 5:45:38 AM
to pa...@rivethealth.com, bazel-dev, David Morgan, Laurent Le Brun, Marcel Hlopko
On Mon, Mar 11, 2019 at 8:38 PM <pa...@rivethealth.com> wrote:
Yay, some discussion about input discovery! 

I authored the Post-analysis actions in Skylark proposal, based on my own findings, Bazel documentation, Oscar's "subgraph" idea which I cited at the bottom, and my understanding of Skyframe.

Feels like we're on the path to reinvent Skyframe for action execution. The goal is a similar evaluation model: gradual input (= dependency) discovery where earlier-requested (= built) inputs determine later-requested inputs. Does this mapping of the problem to Skyframe make sense?

Yes, absolutely!

That one seems to be a much more difficult problem, so I think it's better left out of scope for now. We were discussing an (eventual) approach where Bazel would parse input files in some way to generate dependency edges and maybe even targets, but it's a much more intrusive thing, since it would require the dynamic generation of actions (and maybe even targets) and not only the set of inputs for an action.

I achieve dependency discovery today by: (1) first using an action to generate an archive of selected inputs, (2) giving that archive to the next step, and (3) relying on Bazel short-circuiting the second step if the archive is the same. I'd be thrilled for that process to use less disk space and be less awkward.

More generally, IMO there are three types of needs for more dynamism in actions (also given in my proposal earlier).
  1. Pruning inputs to achieve better cache hits. C++ rules do this. This would be for anything with cheap static dependency analysis: Go, TypeScript, JavaScript, Sass.
Coincidentally, this is exactly what David wanted to have. It seems that this is a somewhat common request: in addition to Paul and David, our own C++ rules will also require this in order to be rewritten in Starlark. I'd be willing to entertain a design doc that would allow us to do so, along the lines of adding an extra "this is the list of files that were required" argument to the Starlark functions that create actions, e.g.:

ctx.actions.run(dotd_file=<artifact>)
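Spelled out a bit more, the idea might look like this (a sketch: dotd_file= is just the placeholder name from this email, not an existing parameter; src/obj/transitive_headers stand in for the rule's own values; -MD/-MF are the usual GCC/Clang flags for emitting a dependency file):

# The compiler writes the list of files it actually read into 'dotd'; Bazel
# would then use that list to prune the action's inputs after the fact.
dotd = ctx.actions.declare_file(ctx.label.name + ".d")
ctx.actions.run(
    executable = ctx.executable._compiler,
    inputs = depset([src], transitive = [transitive_headers]),
    outputs = [obj, dotd],
    arguments = ["-c", src.path, "-o", obj.path, "-MD", "-MF", dotd.path],
    dotd_file = dotd,  # proposed argument, does not exist today
)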

Laurent, Marcel, WDYT?
 
  2. Using a fork-join pattern to parallelize work. The equivalent of native Bazel's "template actions". This would be useful for anything "shardable": compressed archives, tests, Android dexing.
Yep. However, I'm not very happy with template actions in particular because it looks like a somewhat clumsy abstraction, so I'm somewhat reluctant to set them in stone by exposing them to Starlark.

  3. Plan efficient build graphs of arbitrary shape. This is the generalization of 1 and 2. For example, given a source tree of 500 TypeScript files, examine the dependencies and generate ~50 interdependent build actions with an intelligent level of granularity. (Ditto for Java or anything else with a similar compilation model.)
Let's not bite off more than we can chew here. This would be quite a bit of rope, so I'd much rather not tackle this at the moment.

 


Laurent Le Brun

Mar 18, 2019, 5:55:30 AM
to Lukács T. Berki, pa...@rivethealth.com, bazel-dev, David Morgan, Marcel Hlopko
I'd be happy to review such a document.
--
Laurent

Janak Ramakrishnan

Mar 18, 2019, 8:00:44 AM
to Lukács T. Berki, pa...@rivethealth.com, bazel-dev, David Morgan, Laurent Le Brun, Marcel Hlopko

  2. Using a fork-join pattern to parallelize work. The equivalent of native Bazel's "template actions". This would be useful for anything "shardable": compressed archives, tests, Android dexing.
Yep. However, I'm not very happy with template actions in particular because it looks like a somewhat clumsy abstraction, so I'm somewhat reluctant to set them in stone by exposing them to Starlark.
Can you say what you would want to change about template actions? Keeping them around causes a fair amount of ongoing pain. I've been acting under the assumption that they're a useful thing that is the future. If they're not, we should probably change that sooner rather than later. Do you want to change tree artifacts as well?

Lukács T. Berki

Mar 18, 2019, 8:27:54 AM
to Janak Ramakrishnan, pa...@rivethealth.com, bazel-dev, David Morgan, Laurent Le Brun, Marcel Hlopko
Tree artifacts are definitely here to stay.

As for template actions, they are very useful, but I don't have enough experience with them to be able to confidently say that we won't regret giving them a Starlark API, thus my reluctance.
 

Lukács T. Berki

Apr 9, 2019, 4:55:10 AM
to Janak Ramakrishnan, pa...@rivethealth.com, bazel-dev, David Morgan, Laurent Le Brun, Marcel Hlopko, Emmanuel Pellereau
News!

There is now a design doc by Emmanuel. In short, the plan is to add a parameter to ctx.actions.run() that points to an Artifact. It's an output artifact of the action, and after the action finishes executing, Bazel would look at the file and consider all the artifacts listed in that file not to be inputs of the action anymore.

In general, I like the approach, with a few minor caveats and one large one:
  1. I suppose the file would contain exec paths to the now-unneeded artifacts?
  2. We'd need to handle the case where the file contains files that are not inputs of the action. The safe bet is to error out in that case.
  3. I don't particularly like the idea of listing *unused* inputs instead of used ones, but I do sympathize with the perception that listing all the inputs is cumbersome. Not even C++ actions do that. So I guess it will work.
And the big caveat: as it is, this is not usable for C++ rules because C++ compilers emit a list of *used* headers and not unused ones, which is unfortunate. This is exacerbated by the fact that MSVC reports the set of used headers in a completely different way.

I see two ways around this issue:
  1. Use this protocol anyway and make C++ rules wrap the C++ compiler and massage its output to the right format.
  2. Do something really complicated, like passing in a Starlark function that would take the contents of what is now the unused-input-list artifact, plus the stdout/stderr of the compiler, and return a list of Artifacts.
(2) would be complicated, would probably encounter some headwind from the Starlark folks, and if ever implemented, migrating unused_inputs_list to this one would not be very hard. So I vote for (1). 

Laurent, Marcel, what do you think?

Laurent Le Brun

Apr 9, 2019, 5:02:03 AM
to Lukács T. Berki, Janak Ramakrishnan, pa...@rivethealth.com, bazel-dev, David Morgan, Marcel Hlopko, Emmanuel Pellereau
> bazel can stop watching for any change in those files, as they cannot affect the outcome of the action.

This means that any bug in the action can lead to incremental incorrectness in Bazel?
--
Laurent

Lukács T. Berki

Apr 9, 2019, 5:13:59 AM
to Laurent Le Brun, Janak Ramakrishnan, pa...@rivethealth.com, bazel-dev, David Morgan, Marcel Hlopko, Emmanuel Pellereau
On Tue, Apr 9, 2019 at 11:02 AM Laurent Le Brun <laur...@google.com> wrote:
> bazel can stop watching for any change in those files, as they cannot affect the outcome of the action.

This means that any bug in the action can lead to incremental incorrectness in Bazel?
Correct, this would mean relying on action implementations to be correct. I'm not very happy with this, but that's what C++ needs...

Ulf Adams

Apr 9, 2019, 5:17:43 AM
to Lukács T. Berki, Laurent Le Brun, Janak Ramakrishnan, pa...@rivethealth.com, bazel-dev, David Morgan, Marcel Hlopko, Emmanuel Pellereau
How does an action know which files are unused?


Jakob Buchgraber

Apr 9, 2019, 5:19:20 AM
to Laurent Le Brun, Lukács T. Berki, Janak Ramakrishnan, pa...@rivethealth.com, bazel-dev, David Morgan, Marcel Hlopko, Emmanuel Pellereau
On Tue, Apr 9, 2019 at 11:02 AM 'Laurent Le Brun' via bazel-dev <baze...@googlegroups.com> wrote:
> bazel can stop watching for any change in those files, as they cannot affect the outcome of the action.

This means that any bug in the action can lead to incremental incorrectness in Bazel?

What's different from the current state of things? I'd assume that any sane implementation of this would only provide the subset of inputs to the action on reruns?

David "Morgan" Morgan

Apr 9, 2019, 6:52:27 AM
to Ulf Adams, Lukács T. Berki, Laurent Le Brun, Janak Ramakrishnan, pa...@rivethealth.com, bazel-dev, Marcel Hlopko, Emmanuel Pellereau
On Tue, Apr 9, 2019 at 11:17 AM Ulf Adams <ulf...@google.com> wrote:
How does an action know which files are unused?

 That's entirely up to the action. For Dart we add an option to the compiler to track which srcs it ends up using and output the list of unused ones.

Lukács T. Berki

Apr 9, 2019, 6:55:59 AM
to David Morgan Morgan, Ulf Adams, Lukács T. Berki, Laurent Le Brun, Janak Ramakrishnan, pa...@rivethealth.com, bazel-dev, Marcel Hlopko, Emmanuel Pellereau
...which is the same thing as the C++ compiler does.

David "Morgan" Morgan

Apr 9, 2019, 6:58:02 AM
to Jakob Buchgraber, Laurent Le Brun, Lukács T. Berki, Janak Ramakrishnan, pa...@rivethealth.com, bazel-dev, Marcel Hlopko, Emmanuel Pellereau
No, on subsequent reruns all the inputs are provided again. Because the set of inputs needed is determined from the inputs, and the inputs must have changed to trigger a rerun, we don't know if the set of inputs needed stayed the same.

The gain is that if one of the unused inputs changes, no rerun is triggered. "Unused" is very strict: it really means the action didn't read the file at all. If it didn't read the file at all, there's no way a change to it can require a rerun.

Incorrectness could happen if the action reports inputs as unused when they were actually used. Then, the action will not be rerun when these inputs changed, and its output will be stale.

Ulf Adams

Apr 9, 2019, 7:18:24 AM
to David Morgan Morgan, Lukács T. Berki, Laurent Le Brun, Janak Ramakrishnan, pa...@rivethealth.com, bazel-dev, Marcel Hlopko, Emmanuel Pellereau
How do you know which files are available then?

David "Morgan" Morgan

Apr 9, 2019, 7:27:29 AM
to Ulf Adams, Lukács T. Berki, Laurent Le Brun, Janak Ramakrishnan, pa...@rivethealth.com, bazel-dev, Marcel Hlopko, Emmanuel Pellereau
On Tue, Apr 9, 2019 at 1:18 PM Ulf Adams <ulf...@google.com> wrote:
How do you know which files are available then?

The set of available files doesn't change--all the inputs are provided to every run.

The only thing that changes is the action is rerun less often.

Effectively, the action promises that it will produce the same output if only files it declares unused changed. Bazel trusts it and does not rerun for such cases.
 

Austin Schuh

Apr 9, 2019, 1:52:47 PM
to David Morgan Morgan, Ulf Adams, Lukács T. Berki, Laurent Le Brun, Janak Ramakrishnan, pa...@rivethealth.com, bazel-dev, Marcel Hlopko, Emmanuel Pellereau
On Tue, Apr 9, 2019 at 4:27 AM 'David "Morgan" Morgan' via bazel-dev <baze...@googlegroups.com> wrote:


On Tue, Apr 9, 2019 at 1:18 PM Ulf Adams <ulf...@google.com> wrote:
How do you know which files are available then?

The set of available files doesn't change--all the inputs are provided to every run.

The only thing that changes is the action is rerun less often.

Effectively, the action promises that it will produce the same output if only files it declares unused changed. Bazel trusts it and does not rerun for such cases.

It seems like there might be value in re-running the action without the excluded inputs available and confirming that the result is the same. That's also expensive, but it could be hidden behind a flag for rule developers.  Enable it in the bazel CI as well.  It won't catch every bug, but building a reasonably large code base for a couple of weeks will pretty quickly tell you if it broke.

Austin

Jakob Buchgraber

Apr 9, 2019, 4:27:00 PM
to David Morgan Morgan, Ulf Adams, Lukács T. Berki, Laurent Le Brun, Janak Ramakrishnan, pa...@rivethealth.com, bazel-dev, Marcel Hlopko, Emmanuel Pellereau
On Tue, Apr 9, 2019 at 1:27 PM 'David "Morgan" Morgan' via bazel-dev <baze...@googlegroups.com> wrote:
The set of available files doesn't change--all the inputs are provided to every run.

Why would it be hard (expensive?) to provide only the filtered inputs after the first run?

Paul Draper

Apr 9, 2019, 4:30:31 PM
to Austin Schuh, David Morgan Morgan, Ulf Adams, Lukács T. Berki, Laurent Le Brun, Janak Ramakrishnan, pa...@rivethealth.com, bazel-dev, Marcel Hlopko, Emmanuel Pellereau
It won't catch every bug, but building a reasonably large code base for a couple of weeks will pretty quickly tell you if it broke.

And in the meantime, you'll have poisoned your cache.

Why can't this be done in two steps? One for `cc -M`, the other for `cc`.
The `cc -M` step is run when any of the input files change. The result is a list of inputs which is provided to the second step.

This requires that `cc -M` be sufficiently fast, but to my knowledge that is at least true for C/C++.

---

If it really is a requirement that this be done all in one action, there is an interesting idea at the bottom of http://make.mad-scientist.net/papers/advanced-auto-dependency-generation/ where LD_PRELOAD is used to track which files are opened. Still, it seems that a separate `cc -M`-type step is better.
--
Paul Draper
Rivet, CTO

David "Morgan" Morgan

Apr 10, 2019, 4:01:52 AM
to Jakob Buchgraber, Ulf Adams, Lukács T. Berki, Laurent Le Brun, Janak Ramakrishnan, pa...@rivethealth.com, bazel-dev, Marcel Hlopko, Emmanuel Pellereau
Because

 - if none of the used files changed, you don't need to rerun at all;
 - if one of the used files changed, causing a rerun, the needed file set may have changed; so you need to provide them all.

Remember that the used file set is computed from the inputs, so it can change whenever any of the inputs changes.

So it is simply incorrect to filter the inputs for reruns.

David "Morgan" Morgan

Apr 10, 2019, 4:07:47 AM
to Paul Draper, Austin Schuh, Ulf Adams, Lukács T. Berki, Laurent Le Brun, Janak Ramakrishnan, bazel-dev, Marcel Hlopko, Emmanuel Pellereau
On Tue, Apr 9, 2019 at 10:30 PM Paul Draper <pa...@rivethealth.com> wrote:
It won't catch every bug, but building a reasonably large code base for a couple of weeks will pretty quickly tell you if it broke.

And in the meantime, you'll have poisoned your cache.

Why can't this be done in two steps? One for `cc -M`, the other for `cc`.
The `cc -M` is run when any of the input files change. The result is a list of input which are provided for the second step.

The requires that `cc -M` be sufficiently fast, but to my knowledge that is at least true for C/C++.

Yes, the problem is efficiency.

At some point during the action you have enough understanding of the source files to know which inputs are not needed.

This can be significantly more than pattern matching. For some Dart actions it goes as far as type inference, which we definitely want to do only once.

Lukács T. Berki

Apr 10, 2019, 6:43:17 AM
to Paul Draper, Austin Schuh, David Morgan Morgan, Ulf Adams, Lukács T. Berki, Laurent Le Brun, Janak Ramakrishnan, bazel-dev, Marcel Hlopko, Emmanuel Pellereau
On Tue, Apr 9, 2019 at 4:30 PM Paul Draper <pa...@rivethealth.com> wrote:
It won't catch every bug, but building a reasonably large code base for a couple of weeks will pretty quickly tell you if it broke.

And in the meantime, you'll have poisoned your cache.
Yeah, that's the tradeoff and that's the reason for my reluctance. The counterargument is that we already do that for C++ and it seems to work well enough. In the end, I think we must come up with something like this so that the C++ rules can be fully Starlarkified.

I kinda like both of these ideas for verifying whether an action worked well, but both of them have caveats: Austin's is an ex-post-facto one that needs to run in a continuous build and, as Paul said, opens up the possibility of cache poisoning; Paul's doesn't work on Windows and is leaky in any case (what about subprocesses doing creative things with the environment?)

Marcel floated the idea of doing what our C++ rules do: instead of having a list of unused files, have a list of *used* files (which is presumably easier for any compiler to compute) and a list of mandatory files that are always included, whether they are in the list of used files or not (this is for things like the compiler binary itself).

The most generic way, of course, would be an (stdout, stderr, output files) -> (kept files) lambda, but that's a bit too complicated for my liking and arguably if you want to run Starlark lambdas during an action, you should just put them in the action itself.

Austin Schuh

Apr 10, 2019, 2:07:34 PM
to Lukács T. Berki, Paul Draper, Austin Schuh, David Morgan Morgan, Ulf Adams, Laurent Le Brun, Janak Ramakrishnan, bazel-dev, Marcel Hlopko, Emmanuel Pellereau
On Wed, Apr 10, 2019 at 3:43 AM Lukács T. Berki <lbe...@google.com> wrote:

On Tue, Apr 9, 2019 at 4:30 PM Paul Draper <pa...@rivethealth.com> wrote:
It won't catch every bug, but building a reasonably large code base for a couple of weeks will pretty quickly tell you if it broke.

And in the meantime, you'll have poisoned your cache.
Yeah, that's the tradeoff and that's the reason for my reluctance. The counterargument is that we already do that for C++ and it seems to work well enough. In the end, I think we must come up with something like this so that the C++ rules can be fully Starlarkified.

I kinda like both of these ideas for verifying whether an action worked well, but both of them have caveats: Austin's is an ex-post-facto one that needs to run in a continuous build and as Paul said opens up the possibility of cache poisoning and Paul's doesn't work on Windows and it's leaky in any case (what a about subprocesses doing creative things with the environment?)

Some of this reluctance is inherent in the concept of input discovery. Bugs in the rule can now cause big incremental correctness problems. All we can do is provide tools to help catch those bugs faster. We will never eliminate them. Imagine a compiler which randomly picks a number between 0-10 and opens the folder with that number. Without prior knowledge about its behavior, we will never automatically deduce what files it actually needs.

I think it's pretty safe to say that bugs in rules with input discovery will cause buggy output.  Cache poisoning is a real part of that.  By the way, you get cache poisoning with non reproducible rules too today.  So, don't write buggy rules?

The cache poisoning is easy to significantly mitigate, and is pretty much unavoidable with any design.  Don't push the result to the cache unless the outputs match the second limited build.  I was assuming that the actual action being run would implement the run twice behavior.  Essentially, gcc $(args) -> gcc ${args}; rm ${extra_files}; gcc ${args}; diff ${outputs}.  And then you only push to the cache when the diff succeeds.

Austin

David "Morgan" Morgan

Apr 11, 2019, 4:55:06 AM
to Austin Schuh, Lukács T. Berki, Paul Draper, Ulf Adams, Laurent Le Brun, Janak Ramakrishnan, bazel-dev, Marcel Hlopko, Emmanuel Pellereau
Risk of incorrectness is important--but must be traded off against performance.

With this change we're expecting to regularly see build machine use reductions of x100 for incremental Dart builds, and wall time reductions of x10.

For the largest (but still not unusual) incremental rebuilds we'll see build machine use fall by a factor of x1000.

These are very big numbers. I think I could reasonably argue for taking on a bit of extra risk for a x2 improvement. Given the numbers, this is overwhelmingly worth a bit of extra risk for us.

Paul Draper

Apr 11, 2019, 3:00:39 PM
to David Morgan Morgan, Austin Schuh, Lukács T. Berki, Paul Draper, Ulf Adams, Laurent Le Brun, Janak Ramakrishnan, bazel-dev, Marcel Hlopko, Emmanuel Pellereau
This was pushed hard 3 years ago by the Scala community.
https://groups.google.com/d/topic/bazel-discuss/3iUy5jxS3S0/discussion
Stateful compilers, modifying the cache key post-compilation, etc.

Interesting to see an Approved Google Language finally make some traction with it.
Previously, the conclusion by the Bazel team was this was a fundamentally unsafe way for Bazel to operate.

---

We will never eliminate them.
 
This is similar to my comment in 2015: "there's no magic force that can prevent a rule-writer from writing non-deterministic code." 
Ulf's response:

Determinism != Correctness. I'd define correctness roughly like this:
 
A build system is correct if the output of an incremental build is consistent with the output of a clean build from the same source, where consistency means that you could come up with a hypothetical sequence / timing of running the actions pertaining to the build to obtain that output.
 
The idea of that definition being that we can define correctness of a build system independently of the correctness or determinism of the individual tools. This is useful for two reasons: (1) there are scenarios in which build systems fail to produce correct output even if the tools are all fully correct and deterministic, and (2) otherwise you depend on all tools being correct and deterministic, which is obviously not the case. (Note that this precludes concurrent modifications to the file system, though we're also trying to ensure that Bazel always picks up changes made while the build is running in the next build at the latest.) 

With the exception of a small number of uncontrolled inputs like /dev/random, time(), and intra-action concurrency, Bazel today guarantees that incremental builds are correct.

---

In light of the previous discussions, higherkindness/rules_scala has opt-in incremental compilation via Zinc (an incremental compiler for Scala), using a non-stateful worker that stores files with dependency information locally. The effects of any correctness errors are limited by using it only for local dev builds.

With this change we're expecting to regularly see build machine use reductions of x100 for incremental Dart builds

Where are the Dart rules?

Paul Draper

Apr 11, 2019, 3:03:14 PM
to David Morgan Morgan, Austin Schuh, Lukács T. Berki, Paul Draper, Ulf Adams, Laurent Le Brun, Janak Ramakrishnan, bazel-dev, Marcel Hlopko, Emmanuel Pellereau
non-stateful worker

*stateful worker 

Ulf Adams

Apr 11, 2019, 3:53:24 PM
to Paul Draper, David Morgan Morgan, Austin Schuh, Lukács T. Berki, Laurent Le Brun, Janak Ramakrishnan, bazel-dev, Marcel Hlopko, Emmanuel Pellereau
Let's collectively decide that it's ok for Bazel to offer a generic facility to reduce the list of input files of an action after the fact using an additional output from the action describing either the used or the unused files. The performance wins are significant enough that people are willing to make that trade off, which seems reasonable.

There are already loopholes in Bazel, witness C++ and persistent workers (Java, JS, TS, Android, etc.). AFAICT, the compiler engineers are careful enough that problems are _very rare_. While it's probably a good idea to require opt-in in the beginning, I expect that actively maintained tools using a new mechanism will quickly stabilize to a comparable level of correctness.

David "Morgan" Morgan

Apr 12, 2019, 4:48:44 AM
to Paul Draper, Austin Schuh, Lukács T. Berki, Ulf Adams, Laurent Le Brun, Janak Ramakrishnan, bazel-dev, Marcel Hlopko, Emmanuel Pellereau


With this change we're expecting to regularly see build machine use reductions of x100 for incremental Dart builds

Where are the Dart rules?

They're not published (yet).

David "Morgan" Morgan

Apr 15, 2019, 9:48:22 AM
to Ulf Adams, Paul Draper, Austin Schuh, Lukács T. Berki, Laurent Le Brun, Janak Ramakrishnan, bazel-dev, Marcel Hlopko, Emmanuel Pellereau
That sounds good to me.

Given that we want to go ahead with something--does anyone have specific feedback on the proposal, please? Paul, does it look like this will cover what's needed for Scala?

Thanks

Morgan

Paul Draper

Apr 15, 2019, 11:07:53 AM
to David Morgan Morgan, Ulf Adams, Austin Schuh, Lukács T. Berki, Laurent Le Brun, Janak Ramakrishnan, bazel-dev, Marcel Hlopko, Emmanuel Pellereau
The answer is same for Scala as it is for Java.

And if I understand correctly, Java requires an unused list, not a used list.
(If I add Example.java, everything must be recompiled.)

Ulf Adams

Apr 15, 2019, 11:19:01 AM
to Paul Draper, David Morgan Morgan, Austin Schuh, Lukács T. Berki, Laurent Le Brun, Janak Ramakrishnan, bazel-dev, Marcel Hlopko, Emmanuel Pellereau
Bazel has a list of all files that are visible in the sandbox, so it can 'trivially' (famous last words!) compute the unused list from the used and vice versa. However, I'm not sure how an external tool would get the list of all inputs *visible in the sandbox* without doing a recursive ls, and even that is not fully reliable due to symlinks / runfiles trees.

David "Morgan" Morgan

Apr 15, 2019, 12:00:51 PM
to Ulf Adams, Paul Draper, Austin Schuh, Lukács T. Berki, Laurent Le Brun, Janak Ramakrishnan, bazel-dev, Marcel Hlopko, Emmanuel Pellereau
Yes.

Re: Paul's comment, note that we are not discussing different semantics for used list vs unused list--the two are equivalent, as blaze can compute the rest and must do so immediately if it needs it--the question is which is more convenient for the tool to calculate.

David "Morgan" Morgan

Apr 18, 2019, 7:31:52 AM
to Ulf Adams, Paul Draper, Austin Schuh, Lukács T. Berki, Laurent Le Brun, Janak Ramakrishnan, bazel-dev, Marcel Hlopko, Emmanuel Pellereau
Hi everyone,

FYI, Emmanuel and I will both be on vacation for a couple of weeks. I will be back on Monday May 6th.

As it doesn't sound like there are any major points of contention left about the proposal, I'm hoping we can resolve this reasonably quickly when we're back :)

Thanks

Morgan

david...@google.com

May 6, 2019, 9:58:49 AM
to bazel-dev
Hi everyone,

Easter vacation is over, we're now back and eager to get this moving in any way possible.

Thanks :)

Morgan

Laurent Le Brun

Jun 24, 2019, 7:13:56 AM
to David Morgan, bazel-dev
Update: the proposal has been implemented. It is released with Bazel 0.27, but you need to pass the flag --experimental_starlark_unused_inputs_list.

Please give it a try.
We're going to remove the flag and make the feature available for everyone soon.
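
For anyone who wants to try it, here is a minimal sketch of a rule using the new parameter (the compiler tool and its --report-unused-inputs flag are invented for the example; the real requirements are that the file passed as unused_inputs_list is also declared among the outputs, that it contains one unused input path per line, and that Bazel runs with --experimental_starlark_unused_inputs_list):

def _compile_impl(ctx):
    out = ctx.actions.declare_file(ctx.label.name + ".out")
    unused = ctx.actions.declare_file(ctx.label.name + ".unused")
    ctx.actions.run(
        executable = ctx.executable._compiler,       # hypothetical compiler
        inputs = ctx.files.srcs + ctx.files.deps,
        outputs = [out, unused],
        arguments = [
            "--output", out.path,
            "--report-unused-inputs", unused.path,    # hypothetical compiler flag
        ] + [f.path for f in ctx.files.srcs],
        unused_inputs_list = unused,                  # the new parameter behind the flag
    )
    return [DefaultInfo(files = depset([out]))]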



--
Laurent

pa...@rivethealth.com

Jun 28, 2019, 4:49:46 PM
to bazel-dev
Excellent!

To make sure I'm not missing anything....there's no way for this to work with Java/Scala/Kotlin, etc. right?

I am compiling A.java.
class A { A() { new mypackage.MyClass(); } }
with a classpath of lib1.jar and lib2.jar.

Suppose in run #1, lib2.jar has mypackage.MyClass with a no-args constructor and lib1.jar is marked unused.

Now suppose in run #2, lib1.jar is changed to have mypackage.MyClass without a no-arg constructor.
No observed inputs have changed and build succeeds, but it should have failed.

I'm not missing something, right?

pa...@rivethealth.com

Jun 28, 2019, 5:10:00 PM
to bazel-dev
I figured it out....for Java et al there needs to be a "check" action as a dummy input to the compilation that looks for duplicate classes (a very fast operation) and fails if one exists.
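
A rough sketch of that idea (the checker tool is hypothetical, and classpath_jars stands in for the rule's own classpath; the stamp file is a dummy output whose only purpose is to be an input of the compile action):

# Fail fast if two jars on the classpath define the same class.
stamp = ctx.actions.declare_file(ctx.label.name + ".classpath_checked")
ctx.actions.run(
    executable = ctx.executable._duplicate_class_checker,  # hypothetical tool
    inputs = classpath_jars,
    outputs = [stamp],
    arguments = [stamp.path] + [j.path for j in classpath_jars],
)
# Then add 'stamp' to the inputs of the compile action and never report it as
# unused: the check depends on every classpath jar and is never pruned, so a
# newly introduced duplicate class fails the build even if the pruned compile
# action itself would not have been re-run.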

P. Oscar Boykin

Aug 3, 2019, 7:04:07 PM
to bazel-dev
How does this feature interact with remote caching?

I don't fully understand how the action cache keys are computed, but I assumed they depend on the hash of the inputs to the rule that produces the output.

In the case that you are going to ignore an input, is there some mechanism to add a new cache entry for each changed-but-ignored input without bothering to recompute?

Or, perhaps this feature interacts badly with caching and harms cache hit rates.

Thanks!

Lukács T. Berki

Aug 12, 2019, 4:02:20 AM
to P. Oscar Boykin, bazel-dev, Jakob Buchgraber
Adding Jakob as the authority for the "remote caching" part.

My understanding is that this feature is neutral to remote caching: the remote cache works as it did before (action checksum = checksum of all inputs sent + checksum of command line). All that changes is the local action cache: if Bazel notices that a file is changed but that file was removed from the set of inputs by the "unused inputs" facility, it's still an action cache hit. Were Bazel to re-execute the action, it would be a remote cache miss since RBE doesn't know which inputs went unused.

This is implemented by computing the local action cache entry by taking the unused file list into account. The process is:
  1. Determine the list of all inputs
  2. Run the action
  3. Parse the unused input list (or .d file), determine the list of actual inputs
  4. Compute the action key based on the set of inputs from (3)
The remote cache operates on the set of inputs from (1), the local cache operates on the set of inputs from (3). Thus, if an unused input changes, it will be a local cache hit, but it would be a remote cache miss if it got that far.


P. Oscar Boykin

Aug 12, 2019, 3:20:16 PM
to Lukács T. Berki, bazel-dev, Jakob Buchgraber
Thanks.

So, that means that using unused files heavily will decrease our cache hit rate, if I understand correctly. This is because rules will accumulate many unused deps, which go into the cache key computation but are not actually useful. That seems like it will be a regression relative to erroring on unused dependencies. Erroring is a bad user experience, but a cache hit reduction is probably too high a price to pay for slow compilers.

Lukács T. Berki

Aug 13, 2019, 4:12:27 AM
to P. Oscar Boykin, bazel-dev, Jakob Buchgraber
On Mon, Aug 12, 2019 at 9:20 PM P. Oscar Boykin <oscar....@gmail.com> wrote:
Thanks.

So, that means that using unused files heavily will decrease our cache hit rate, I think IIUC. This is because rules will accumulate many unused deps, which are used in cache key computation, but actually not useful. That seems like it will be a regression relative to erroring on unused dependencies. Erroring is a bad user experience, but cache hit reduction is probably too high a price to pay for slow compilers.
Oh yes, if you *can* error out on unused dependencies, please do so. Keeping your source code in good shape is strictly superior to Bazel trying to paper over its shortcomings: the compiler invocation accumulates fewer files, resulting in faster action setup times and, depending on your compiler, maybe also faster compiler invocations, and Bazel has to load and analyze fewer targets.

David "Morgan" Morgan

Aug 13, 2019, 6:22:59 AM
to Lukács T. Berki, P. Oscar Boykin, bazel-dev, Jakob Buchgraber
Definitely, if there's a way to update the BUILD files to reduce the number of deps/inputs then that's always preferred over deps pruning.

A couple of things you might want to do that are not deps pruning, but help the user experience when you fail on unused deps:

1. Print a buildozer command to remove the unused dep. I have no idea if bazel users outside google typically have buildozer available, if they do, printing "run this to fix" is very handy.
2. Have a mode hidden behind a define that prints the unused deps warning but does not fail the build. Then people can ignore unused deps warnings while they hack.


P. Oscar Boykin

Aug 14, 2019, 3:07:16 PM
to David Morgan Morgan, Lukács T. Berki, bazel-dev, Jakob Buchgraber
We do all these things, but the UX is still bad, especially compared with other tools. If we could modify the pruning to not reduce cache efficiency (e.g. by computing the cache key after computing the list of unused files, which would require recording the list of unused files for each target so we could fetch that before computing the cache key on read), it could really improve usability.

It's a tough sell for bazel that you have to give up so much UX compared to other tools in order to get the reproducibility and caching. It would be great to get those AND a great UX.

Lukács T. Berki

Aug 16, 2019, 3:26:35 AM
to P. Oscar Boykin, David Morgan Morgan, bazel-dev, Jakob Buchgraber
Why does pruning result in a worse user experience? AFAICT it's a strict improvement: it does not affect remote cache hit rates and positively affects local cache hit rates.

What it does do is make it possible to get away with having unused dependencies for longer (since it papers over them for a while), but that's a strange argument for not doing it...

P. Oscar Boykin

Aug 17, 2019, 2:33:58 PM
to Lukács T. Berki, David Morgan Morgan, bazel-dev, Jakob Buchgraber
If it doesn't affect remote cache, I don't understand something.

Imagine the following case: you have two rules A, and B. B depends on A, but it is unused.

When developer 1 goes to build A and B they see that B's srcs have changed, so they need to rebuild. They write to the action cache as though B depends on A since writing to the remote cache is not aware of pruning.

Developer 2 goes to build B. In fact, their srcs to B are the same as developer 1, but A is different from developer 1. Since the remote cache wrongly depends on A and B, developer 2 finds no entry in the cache and now rebuilds B even though the output from developer 1's build would have been sufficient.

What have I misunderstood in this scenario? If the above is accurate, building up unused dependencies lowers your cache efficiency.

Last note: I have proposed in the past a "build efficiency metric". Each time we rebuild a target, we compare the outputs, and if the output is bit-for-bit identical but was not fetched from the cache, we call this a wasted build. Efficiency is (total_time - wasted_time)/total_time. Wasted time comes from overly sensitive builds that claim to depend on things that don't change the output. It seems, if the above is correct, that pruning reduces this metric, since it increases the number of things going into the action cache key that don't change the output.

Lukács T. Berki

Aug 19, 2019, 4:08:30 AM
to P. Oscar Boykin, David Morgan Morgan, bazel-dev, Jakob Buchgraber
On Sat, Aug 17, 2019 at 8:33 PM P. Oscar Boykin <oscar....@gmail.com> wrote:
If it doesn't affect remote cache, I don't understand something.

Imagine the following case: you have two rules A, and B. B depends on A, but it is unused.

When developer 1 goes to build A and B they see that B's srcs have changed, so they need to rebuild. They write to the action cache as though B depends on A since writing to the remote cache is not aware of pruning.

Developer 2 goes to build B. In fact, their srcs to B are the same as developer 1, but A is different from developer 1. Since the remote cache wrongly depends on A and B, developer 2 finds no entry in the cache and now rebuilds B even though the output from developer 1's build would have been sufficient.

What have I misunderstood in this scenario? If the above is accurate, building up unused dependencies lowers your cache efficiency.
Of course! I think we are in violent agreement :)

What I said was that pruning unused input files from the local action cache does not make this *worse*. Of course, even better would be to prune unused inputs from the remote cache, but it's complicated (would need support from RBE) and there is no scenario where you get a better cache hit rate without unused input file pruning in the local action cache than with it.