controlling the hash-key used to determine recompilation


P. Oscar Boykin

Mar 3, 2017, 8:52:16 PM
to bazel-discuss
As far as I know, right now skylark rules have no way to control hashing of files used as inputs.

What if I want to say: to determine the cache key for this file, pass it through this function first (which will do some normalization)? My specific use-case is the scala compiler. Scala does not have the fastest compiler (putting it mildly), so we don't want to recompile any more than needed. But using ijar has some issues, because scala embeds a scala signature inside the jar, and that signature changes even when only private aspects of the API change. Re-engineering ijar to normalize that is a challenge that requires knowing fine details of the compiler and the language spec, so we'd like to avoid it.

That said, zinc, scala's existing incremental compiler, can extract an API representation from code that is suitable for hashing. We can't use this directly today, though, because it is only good as a hash key: the compiler cannot actually compile against this extracted API, it still needs the compiled jar.

So, if we could emit outputs like `(a, b)`, meaning "downstream actions re-read `b` only if `a` has changed", then we could use this approach with bazel.

I assume this is not possible without a bazel modification, but would a PR adding this feature be something you could consider? Specifically, a skylark rule could explicitly emit a hash key for an output rather than having it default to the md5 hash of the content.
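
To make that concrete, here is a very rough sketch of the kind of skylark API I have in mind. It is purely hypothetical: nothing like `input_hashers` exists today, the tool labels are made up, and the attribute declarations are omitted; the point is only the shape of the feature.

```python
# Hypothetical sketch only; `input_hashers` is not a real ctx.action argument.
# The idea: before bazel hashes dep_jar into this action's cache key, it first
# runs it through a normalizer (e.g. zinc's API extraction), so upstream
# changes that the normalizer strips out do not trigger a recompile.
def _scala_library_impl(ctx):
    dep_jar = ctx.files.deps[0]                      # full jar we compile against
    out_jar = ctx.new_file(ctx.label.name + ".jar")
    ctx.action(
        inputs = ctx.files.srcs + [dep_jar],
        outputs = [out_jar],
        executable = ctx.executable._scalac,
        arguments = ["-classpath", dep_jar.path, "-d", out_jar.path] +
                    [f.path for f in ctx.files.srcs],
        # Made-up field: map an input to the tool whose output is hashed in
        # place of the raw file content.
        input_hashers = {dep_jar: ctx.executable._zinc_api_extract},
    )
    return struct(files = depset([out_jar]))
```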

Damien Martin-guillerez

Mar 6, 2017, 4:07:47 AM
to P. Oscar Boykin, bazel-discuss
Hi,

I am not sure I fully understand your use case, but it is certainly something that would need a proper design doc. Let's figure out whether we want to do it first.

Can't you just do that logic inside the rule (i.e. you call the action, but actually run the compile only if the hash has changed)? Also, what is wrong with the signature, does scala actually need it (i.e. can't we just strip that signature with ijar)?



P. Oscar Boykin

Mar 6, 2017, 11:24:52 AM
to Damien Martin-guillerez, bazel-discuss
Thanks for the reply Damien.

In reverse order: scala needs the scala signature since the scala type system is larger than Java's. The extra aspects of the type system live in the signature, so you can't accurately compile scala without it.

Next: could we implement what we want today by using the existing sbt incremental compiler's code to produce a key? I don't see how. Rules currently can't keep any state about the previous run, so even if we compute the exact API, we have no way of knowing whether we should run. For the branch inside the rule that would skip the compile action, I don't see what we could compare against: if (apiSignature != what?)...

Is there some trick I'm missing?

Damien Martin-guillerez

Mar 7, 2017, 7:40:29 AM
to P. Oscar Boykin, bazel-discuss
On Mon, Mar 6, 2017 at 5:24 PM P. Oscar Boykin <oscar....@gmail.com> wrote:
> Thanks for the reply Damien.

> In reverse order: scala needs the scala signature since the scala type system is larger than Java's. The extra aspects of the type system live in the signature, so you can't accurately compile scala without it.

Hmm, how does that prevent getting the signature into the ijar? Or do you mean the signature includes too much?

> Next: could we implement what we want today by using the existing sbt incremental compiler's code to produce a key? I don't see how. Rules currently can't keep any state about the previous run, so even if we compute the exact API, we have no way of knowing whether we should run. For the branch inside the rule that would skip the compile action, I don't see what we could compare against: if (apiSignature != what?)...

Well, with a worker that is technically possible, but probably not desirable: having an action that depends on the state of the compiler is terrible.

I don't really like the approach of providing a hash back to bazel, because it is going to mess so much with Bazel's assumptions. Unfortunately I don't have a better idea for how to support that use case :(

P. Oscar Boykin

Mar 7, 2017, 4:38:42 PM
to Damien Martin-guillerez, bazel-discuss
On Tue, Mar 7, 2017 at 2:40 AM Damien Martin-guillerez <dmar...@google.com> wrote:
> On Mon, Mar 6, 2017 at 5:24 PM P. Oscar Boykin <oscar....@gmail.com> wrote:
>> Thanks for the reply Damien.

>> In reverse order: scala needs the scala signature since the scala type system is larger than Java's. The extra aspects of the type system live in the signature, so you can't accurately compile scala without it.

> Hmm, how does that prevent getting the signature into the ijar? Or do you mean the signature includes too much?

Yes, it includes too much. It includes everything about the scala types, including private things that are not important to external modules.

The signature was just not designed to hide anything at that phase: the scala compiler blocks access to private members, but it does not hide their API types from itself.
 
 

>> Next: could we implement what we want today by using the existing sbt incremental compiler's code to produce a key? I don't see how. Rules currently can't keep any state about the previous run, so even if we compute the exact API, we have no way of knowing whether we should run. For the branch inside the rule that would skip the compile action, I don't see what we could compare against: if (apiSignature != what?)...

> Well, with a worker that is technically possible, but probably not desirable: having an action that depends on the state of the compiler is terrible.

If we did this with a worker (which I agree is bad), the workers are not stateless at that point: it matters which worker you get sent to when deciding whether to re-run, unless the workers actually communicate between themselves, which is even more state and even scarier.
 

> I don't really like the approach of providing a hash back to bazel, because it is going to mess so much with Bazel's assumptions. Unfortunately I don't have a better idea for how to support that use case :(

I can understand that, but we could also imagine some kind of "variant" input: an input that we don't use for determining invalidation. In this way, we could make the jar a "variant" input but the extracted API a normal input, and then bazel's normal approach would work: when the API does not change, the hash key stays fixed, and bazel ignores the fact that the "variant" input has changed.

Without something like this, it becomes a pretty big chore to make sane build rules for a lot of systems: we have to funnel whatever invalidation mechanism they have through the bottleneck of a single file, such that they need to re-run if and only if that single file changes by even one bit.
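
Roughly what I am imagining, as a sketch (none of this exists: the `variant_inputs` field and the `dep.scala.*` provider fields are made up for the example):

```python
# Hypothetical sketch: `variant_inputs` does not exist, and the dep provider
# fields (dep.scala.full_jar / dep.scala.api_file) are made up.
def _compile(ctx, dep, out_jar):
    ctx.action(
        # Hashed as usual: our sources plus the dep's extracted API.
        inputs = ctx.files.srcs + [dep.scala.api_file],
        # Made up: available to the action, but ignored when computing the
        # cache key, so a dep rebuild that leaves api_file byte-identical
        # does not invalidate this action.
        variant_inputs = [dep.scala.full_jar],
        outputs = [out_jar],
        executable = ctx.executable._scalac,
        arguments = ["-classpath", dep.scala.full_jar.path, "-d", out_jar.path] +
                    [f.path for f in ctx.files.srcs],
    )
```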

Damien Martin-Guillerez

Mar 8, 2017, 5:12:00 AM
to P. Oscar Boykin, bazel-discuss, ulf...@google.com, dsl...@google.com, laur...@google.com
OK, you convinced me that we might need something along those lines. +Ulf Adams for his opinion; and since it would change the Skylark API, +Dmitry Lomov +Laurent Le Brun.

I believe a deeper design will be needed if the people in copy do not object to moving forward.

Damien Martin-Guillerez

Mar 17, 2017, 5:53:35 AM
to P. Oscar Boykin, bazel-discuss, ulf...@google.com, dsl...@google.com, laur...@google.com
Ulf, Dmitry: friendly ping?

Ulf Adams

Mar 20, 2017, 11:44:15 AM
to Damien Martin-Guillerez, P. Oscar Boykin, bazel-discuss, Dmitry Lomov, Laurent Le Brun
Can we first build a prototype, maybe based on the persistent tool API we have? I'd like to see some more data on the expected performance benefit.

> Ulf, Dmitry: friendly ping?

P. Oscar Boykin

Mar 20, 2017, 1:49:02 PM
to Ulf Adams, Damien Martin-Guillerez, bazel-discuss, Dmitry Lomov, Laurent Le Brun
I can see at least two approaches:

1) "phantom inputs": an input is a phantom input if it is ignored for cache invalidation purposes. This should be use VERY carefully by rule authors to ensure they do not violate idempotency of the rules. The main use case is for programs that do not have the notion of "compile APIs" against which a compile could run. In this case, the full compiled artifact of any dependencies might be phantom inputs, which the inputs, which do invalidate the cache if they change, are some API extraction (against which the compiler presumably cannot compile against). Here we would add a new field (phantom_inputs) to the skylark APIs which build actions and those inputs are ignored in the cache invalidation phase.

2) "keyed outputs" (kind of the opposite of the above), we introduce `keyed_outputs` to ctx.action, which is a list a structs with two fields: cache_key (a Path), output (a Path). The current case can be thought of the output being its own cache_key. In this model, the cache_key is private to the rule. Any downstream dependencies will depend on `output` but bazel will transparently and bazel will manage hiding any differences unless the cache key has changed.

To me, the second one seems more useful, since consumers don't need to know anything about the cache key; but the first is also workable in most cases, since cross-rule dependencies are generally not that common and usually much "simpler" in terms of what structure they assume about the outputs.
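
For concreteness, here is a rough sketch of option 2 from the producing rule's side (again hypothetical: `keyed_outputs` and the wrapper tool are made up). The nice property is that a consuming rule just lists the jar among its inputs, exactly as it does today, and needs no changes:

```python
# Hypothetical sketch of option 2; `keyed_outputs` does not exist today.
def _scala_library_impl(ctx):
    out_jar = ctx.new_file(ctx.label.name + ".jar")
    out_api = ctx.new_file(ctx.label.name + ".api")
    ctx.action(
        inputs = ctx.files.srcs + ctx.files.deps,
        outputs = [out_jar, out_api],
        executable = ctx.executable._scalac_wrapper,
        arguments = ["--out", out_jar.path, "--api-out", out_api.path] +
                    [f.path for f in ctx.files.srcs],
        # Made up: the digest of cache_key stands in for output when bazel
        # decides whether actions that consume out_jar must re-run; out_api
        # stays private to this rule.
        keyed_outputs = [struct(output = out_jar, cache_key = out_api)],
    )
    return struct(files = depset([out_jar]))
```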

> Ulf, Dmitry: friendly ping?


pauld...@gmail.com

Mar 25, 2017, 6:04:13 PM
to bazel-discuss, ulf...@google.com, dmar...@google.com, dsl...@google.com, laur...@google.com
It would be nice to take zinc's Scala API hashing logic and, instead of hashing each part of the API info, extract each part of the API, thus creating a good Scala equivalent of ijar.

If you can correctly hash an output's interface (in the general sense of the word), it shouldn't be that much more difficult to canonicalize the output. I think this applies very generally.

Admittedly, if someone else has built the hashing, it may be a pain to maintain a similar canonicalizer.

P. Oscar Boykin

Mar 25, 2017, 6:33:46 PM
to pauld...@gmail.com, bazel-discuss, ulf...@google.com, dmar...@google.com, dsl...@google.com, laur...@google.com
The thing about an ijar is that you can compile against it. Making zinc able to compile without the jar but with just the interface file it extracts would be cool but I've been told by one sbt hacker that it would be fairly non-trivial.

pauld...@gmail.com

Mar 27, 2017, 12:37:40 AM
to bazel-discuss, pauld...@gmail.com, ulf...@google.com, dmar...@google.com, dsl...@google.com, laur...@google.com
> The thing about an ijar is that you can compile against it. Making zinc able to compile without the jar but with just the interface file it extracts would be cool but I've been told by one sbt hacker that it would be fairly non-trivial.

I'm still thinking you'd create a valid jar to compile against.

My question is: how hard is it to create a Scala ijar? I.e., a tool that produces a jar that keeps enough information for scalac but removes the rest.

If we already have a hashing algorithm that tells us only what is relevant, could we use that to create a canonicalizing algorithm that keeps only what is relevant?

Zinc hash algorithm: https://github.com/sbt/zinc/blob/5df5062c555d3acd9f9fb925f8206d606e51badd/internal/zinc-apiinfo/src/main/scala/xsbt/api/HashAPI.scala

If it were feasible, it would have two advantages:

(1) Works with Bazel as it exists (see the sketch after this list).

(2) If the algorithm were ever wrong, it would fail deterministically, i.e. regardless of cache state.
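
To make the "works with Bazel as it exists" point concrete, here is roughly how such a tool would slot in, with no Bazel changes at all. This is only a sketch: `scala_ijar` is the hypothetical canonicalizer being discussed here, and the `scala.iface_jar` provider field is made up.

```python
# Sketch under the assumption that a "scala_ijar" canonicalizer exists, i.e. a
# tool that rewrites a jar (including its scala signature) to keep only what
# scalac needs from a dependency. No new bazel features are required.
def _scala_library_impl(ctx):
    full_jar = ctx.new_file(ctx.label.name + ".jar")
    iface_jar = ctx.new_file(ctx.label.name + "-interface.jar")

    # Compile against the deps' *interface* jars. If a dep rebuild leaves its
    # visible API unchanged, its interface jar is byte-identical, so this
    # action's cache key is unchanged and bazel skips the recompile.
    dep_ifaces = [dep.scala.iface_jar for dep in ctx.attr.deps]  # made-up provider field
    ctx.action(
        inputs = ctx.files.srcs + dep_ifaces,
        outputs = [full_jar],
        executable = ctx.executable._scalac,
        arguments = ["-classpath", ":".join([j.path for j in dep_ifaces]),
                     "-d", full_jar.path] + [f.path for f in ctx.files.srcs],
    )
    # Canonicalize our own jar for downstream consumers.
    ctx.action(
        inputs = [full_jar],
        outputs = [iface_jar],
        executable = ctx.executable._scala_ijar,
        arguments = [full_jar.path, iface_jar.path],
    )
    return struct(scala = struct(iface_jar = iface_jar), files = depset([full_jar]))

scala_library = rule(
    implementation = _scala_library_impl,
    attrs = {
        "srcs": attr.label_list(allow_files = [".scala"]),
        "deps": attr.label_list(),
        "_scalac": attr.label(executable = True, cfg = "host",
                              default = Label("//tools:scalac")),
        "_scala_ijar": attr.label(executable = True, cfg = "host",
                                  default = Label("//tools:scala_ijar")),
    },
)
```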

> Well using worker that is technically possible but probably not desirable, especially having action that depends on the state of the compiler is terrible.

FYI, I like this as a solution to fast compile times ( https://groups.google.com/forum/#!topic/bazel-discuss/3iUy5jxS3S0 ). But only as a dev solution, where seconds make a big difference.

Ulf Adams

Mar 27, 2017, 10:06:50 AM
to Paul Draper, bazel-discuss, Damien Martin-guillerez, Dmitry Lomov, Laurent Le Brun
On Mon, Mar 27, 2017 at 6:37 AM, <pauld...@gmail.com> wrote:
>> The thing about an ijar is that you can compile against it. Making zinc able to compile without the jar but with just the interface file it extracts would be cool but I've been told by one sbt hacker that it would be fairly non-trivial.

> I'm still thinking you'd create a valid jar to compile against.

> My question is: how hard is it to create a Scala ijar? I.e., a tool that produces a jar that keeps enough information for scalac but removes the rest.

> If we already have a hashing algorithm that tells us only what is relevant, could we use that to create a canonicalizing algorithm that keeps only what is relevant?

> Zinc hash algorithm: https://github.com/sbt/zinc/blob/5df5062c555d3acd9f9fb925f8206d606e51badd/internal/zinc-apiinfo/src/main/scala/xsbt/api/HashAPI.scala

> If it were feasible, it would have two advantages:

> (1) Works with Bazel as it exists.

> (2) If the algorithm were ever wrong, it would fail deterministically, i.e. regardless of cache state.

That would be much preferable from our POV. It certainly would seem odd if you could create a hash of something, but not write out the corresponding information that you hashed over.

P. Oscar Boykin

Mar 27, 2017, 2:50:25 PM
to Ulf Adams, Paul Draper, bazel-discuss, Damien Martin-guillerez, Dmitry Lomov, Laurent Le Brun
I look forward to your implementation Paul! :)

It seems in principle not super hard, but a former lightbend/typesafe engineer has tried to convince me there are serious dragons there. Note that the latest sbt release still has bugs around interface hashing. They think zinc 1.0 (not yet used in sbt) has fixed them, but this is YEARS into sbt being the main scala build tool and really trying to fix these issues. The problem is that scala, the language, is a little bit hostile to the kind of separate compilation we are trying to do here. For instance, private names can change implicit resolution in public APIs. I didn't know that, but it is because name resolution and accessibility are two different concerns, so a private name can shadow even though it can't be used.

Anyway, it is fairly complex, from what I understand, and the win is a bit speculative: we are trying to avoid recompilations that are unneeded, but if most changes to upstream jars actually do change the public API, there is no win. It would be great to get an idea of what kind of win we could get. I suggested instrumenting bazel so we could measure how much time is spent on builds that produce the same output as they did previously (wasted builds). This change can only improve that efficiency, but right now, for all we know, the efficiency is already quite high.
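
As a very crude stand-in for real instrumentation, something like this could approximate the measurement from outside bazel. It is only a sketch of my own (not an existing tool): snapshot digests under bazel-bin, run the next build, then count outputs that were rewritten with byte-identical contents, which is a rough proxy for wasted action time.

```python
#!/usr/bin/env python
# Rough proxy for "wasted builds": outputs whose contents did not change but
# whose mtime did, i.e. an action re-ran and produced the same bytes.
import hashlib, json, os, sys

def snapshot(root):
    out = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            out[os.path.relpath(path, root)] = [digest, os.stat(path).st_mtime]
    return out

if __name__ == "__main__":
    # Usage: wasted.py bazel-bin state.json          (before the build)
    #        wasted.py bazel-bin state.json compare  (after the build)
    root, state = sys.argv[1], sys.argv[2]
    if len(sys.argv) > 3 and sys.argv[3] == "compare":
        with open(state) as f:
            before = json.load(f)
        after = snapshot(root)
        wasted = sum(1 for p, (digest, mtime) in after.items()
                     if p in before and before[p][0] == digest and before[p][1] != mtime)
        print("outputs rewritten with identical contents: %d of %d" % (wasted, len(after)))
    else:
        with open(state, "w") as f:
            json.dump(snapshot(root), f)
```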

On that note, it is interesting to think that bazel could even make a prediction, speculatively assuming a target won't really change, in order to increase parallelism. That would be an interesting project since it could potentially help all builds.
