Discussion "Remote Repository Cache" proposal


Jakob Buchgraber

Apr 29, 2019, 5:48:34 AM
to bazel-discuss, John Millikin
Hey,

this is a thread to discuss John's proposal for a remote repository proxy in Bazel [1]. I think it's easier
to discuss this via e-mail than in the commenting section, and I also see this as sharing the document
with the wider community so that everyone can voice their opinion.

First of all, thanks for putting this proposal forward! As I understand it, the proposal pursues two
major* goals:
  1. Bazel should support serving remote repositories via a caching proxy/mirror for both HTTP and
    HTTPS, in order to remove a network bottleneck.
  2. Bazel should be able to enforce that only remote repositories with an attached checksum are fetched.
Does that roughly sum it up, John?

Both these goals sound absolutely reasonable to me and I am all for having this in Bazel. 

Ad 1.)
My main objection is that I am not convinced this should be implemented via a gRPC
service (saying this as a former member of the gRPC team). I'd rather argue that when a
--remote_repository_cache is specified, Bazel should support proxying HTTP and HTTPS fetches via GET. The main
advantage I see is that HTTP proxying is a dead simple mechanism that already exists; popular
open source projects like nginx support it out of the box [2]. While nginx might not fit Stripe's use case, I
am sure it fits a lot of other people's use cases, e.g. Bazel's own CI.
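
To make the idea concrete, here is a minimal sketch of the kind of plain-GET proxying I have in mind; the cache endpoint and the query-parameter convention are hypothetical, just for illustration:

import hashlib
import urllib.parse
import urllib.request

CACHE_PREFIX = "http://repo-cache.example.com/fetch"  # hypothetical proxy endpoint

def fetch_via_cache(upstream_url, expected_sha256=None):
    # The proxy receives the original URL as a query parameter; it either serves
    # a cached copy or fetches the file upstream, stores it, and responds.
    proxied = CACHE_PREFIX + "?" + urllib.parse.urlencode({"url": upstream_url})
    with urllib.request.urlopen(proxied) as resp:
        data = resp.read()
    # The client would still verify the declared checksum locally, as Bazel does today.
    if expected_sha256 is not None:
        actual = hashlib.sha256(data).hexdigest()
        if actual != expected_sha256:
            raise ValueError("checksum mismatch for " + upstream_url)
    return data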

Ad 2.)
As John pointed out, there is already a feature request for this (one that seems to have been accepted but
never implemented) [3]. It seems to me that we'd want this feature in Bazel independently of
whether a --remote_repository_cache is specified or not. So I'd vote for having it in core Bazel as
opposed to leaving it to the --remote_repository_cache implementation.

Best,
Jakob

* I believe there are also minor goals, like supporting additional URL schemes beyond http/https (e.g. s3://)
as well as a variety of hash functions besides SHA-256. I think these would best be written up as two
separate proposals, as they require more fundamental discussion as to whether we should support more
hash functions (e.g., as you suggest, via the subresource integrity format) and whether we'd actually want to
have protocols besides HTTP for fetching remote repositories in Bazel.




John Millikin

Apr 29, 2019, 2:44:06 PM
to Jakob Buchgraber, bazel-discuss
Thanks for starting a thread, Jakob. I agree that email will probably be more useful than Doc comments here.

We have three core use cases for the proposed remote repository cache:

(1) More customized caching policies

We want to replace Bazel's hardcoded "sha256 means cacheable" with custom logic. Some of our dependencies come
from sources where calculating a fixed checksum is difficult, but we assert the URL contents should be treated as if they
were deterministic. This might include an LRU, so for example a daily snapshot of some internal package might be
cached for 15 minutes as a build-time performance optimization that still allows regeneration during an emergency.
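
As a rough sketch of what such a cache-side policy might look like (the 15-minute TTL and the data structures here are illustrative assumptions, not part of the proposal):

import time

TTL_SECONDS = 15 * 60  # assumed policy for checksum-less internal snapshots

class CachedEntry:
    def __init__(self, data):
        self.data = data
        self.fetched_at = time.time()

def lookup(cache, url, fetch_upstream):
    entry = cache.get(url)
    if entry is None or time.time() - entry.fetched_at > TTL_SECONDS:
        # Stale or missing: re-fetch so the snapshot can be regenerated quickly
        # in an emergency, while staying cheap for ordinary builds.
        entry = CachedEntry(fetch_upstream(url))
        cache[url] = entry
    return entry.data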

We've also had issues in the past where an official Bazel package depended on GitHub auto-generated archives, which
do not have stable checksums. See https://github.com/protocolbuffers/protobuf/issues/3894 for discussion. One of the
cache policies we plan to implement in our remote build cluster is "a URL has multiple valid checksums", and the cache
would serve the one that the client asked for.

(2) Restricting which URLs may be depended on

Related to the issue linked above, we've had devs accidentally depend on the GitHub archive link at a non-stable
identifier (like "master"). This works fine until the next commit; after that the cache papers over the failure, until some
future point when the cache is pruned and the build starts failing.

Thus we want to forbid URLs like https://github.com/protocolbuffers/protobuf/archive/master.zip from being depended on,
even if the dev specified checksums.

Additionally, for certain builds, we want to restrict dependencies to a set that has been manually reviewed by a member
of the security team. We want to prevent the caching layer from accidentally bypassing this condition.
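
A rough sketch of the kind of cache-side URL policy we have in mind (the patterns and the reviewed-set mechanism are illustrative assumptions):

import re

# Reject auto-generated archives at non-stable identifiers.
NON_STABLE_ARCHIVE = re.compile(r"https://github\.com/.+/archive/(master|main)\.zip$")

def url_allowed(url, reviewed_urls=None):
    if NON_STABLE_ARCHIVE.match(url):
        return False  # contents will change on the next commit
    if reviewed_urls is not None and url not in reviewed_urls:
        return False  # restricted build: URL not yet reviewed by the security team
    return True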

(3) Supporting more checksum formats

Some language specific package managers, including NPM, use SHA-1 for some old packages. We don't want to add
SHA-1 support to Bazel itself, but do want to use a trusted cache as an intermediate layer. The cache would retrieve the
remote file, verify it against recorded checksums with a stronger algorithm, and only then allow Bazel to download it. This
would be restricted to a set of grandfathered URLs, so that an existing package can't have its checksum downgraded.
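
A rough sketch of that intermediate layer (the grandfathered-URL table and the names here are illustrative assumptions):

import hashlib

# Hypothetical table: grandfathered URL -> SHA-256 recorded by the cache operators.
GRANDFATHERED = {
    "https://registry.npmjs.org/example/-/example-1.0.0.tgz": "<sha256 recorded at review time>",
}

def serve(url, fetch_upstream):
    trusted_sha256 = GRANDFATHERED.get(url)
    if trusted_sha256 is None:
        raise PermissionError(url + " is not in the grandfathered set")
    data = fetch_upstream(url)
    if hashlib.sha256(data).hexdigest() != trusted_sha256:
        raise ValueError("upstream content for " + url + " no longer matches the trusted checksum")
    return data  # only now is Bazel allowed to download the file from the cache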

---

Regarding HTTP vs gRPC, I feel very strongly that gRPC is the correct protocol for implementing this cache API. I do not
want to deal with GET parameters[0], encoding[1], header sanitization, or any of the other issues caused by constructing
an ad-hoc RPC protocol on top of raw HTTP.

If there are any members of the Bazel community who do want to use an HTTP proxy in that way, I think they should submit
a separate proposal. Bazel already supports separate gRPC and HTTP paths for the remote CAS, so there is precedent.

In the interest of keeping this proposal moving, I'm going to make the doc clearer about being part of the existing gRPC API
surface.

---

[0] Are they part of the URL, or instructions to the cache?


Oscar Bonilla

Apr 29, 2019, 6:44:48 PM
to John Millikin, Jakob Buchgraber, bazel-discuss
In our case, there is a fourth use case. We keep an internal mirror of external dependencies whose licenses have already been vetted by legal. Our CI machines can't connect to the internet, so this has been a bit of a pain to deal with.

I think an HTTP cache would be easier to maintain/deploy than a gRPC one in the general case, even with all the weirdness that John pointed out. But I'll take a gRPC one over no cache ;-)

Cheers,

-Oscar



Klaus Aehlig

May 2, 2019, 4:27:41 AM
to bazel-...@googlegroups.com, John Millikin

Hi,

I'd like to start a related discussion on the local repository
cache, which basically contains downloaded archives indexed by
hash. As a request to artificially ensure cache misses has been
brought up in several Bazel issues, I've written up a document
describing what that would look like and what consequences it
would have. Please read

https://github.com/bazelbuild/proposals/blob/master/designs/2019-04-29-cache.md

and provide your feedback via email.

Thanks,
Klaus


Austin Schuh

May 2, 2019, 12:54:47 PM
to Klaus Aehlig, bazel-discuss, John Millikin
What about actually using the URL as the canonical identifier?  And tracking URLs you've verified.

Essentially:
  1) look in my verified URL list.  Is this URL in it?
    a) if not, fetch and hash.
      i) confirm that the hash matches.
      ii) check to see if we have this hash in our cache.  If not, add it.
  2) Grab the file from the cache.

That would require no changes from the user's point of view, be reasonably efficient, and catch the error you are looking for.
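
Roughly, as a sketch (hypothetical helper names and storage; not actual Bazel code):

import hashlib

def fetch_with_verified_urls(url, expected_sha256, verified, cache, download):
    # 1) Is this URL already in the verified list?
    if url not in verified:
        # a) If not, fetch and hash.
        data = download(url)
        actual = hashlib.sha256(data).hexdigest()
        # i) Confirm that the hash matches what the user declared.
        if actual != expected_sha256:
            raise ValueError(url + ": got " + actual + ", expected " + expected_sha256)
        # ii) Add the content to the cache if it isn't there yet.
        cache.setdefault(actual, data)
        verified[url] = actual
    # 2) Grab the file from the cache by its recorded hash.
    return cache[verified[url]]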

Austin


Klaus Aehlig

May 3, 2019, 4:44:18 AM
to Austin Schuh, bazel-discuss, John Millikin
> What about actually using the URL as the canonical identifier?

What do you mean by "the URL" in the current interface, where a user
specifies a list of alternative URLs to download the file from? Should we
then trust all of them if we downloaded from one, or should we download from all of them?

ishi...@google.com

May 3, 2019, 8:29:21 AM
to bazel-discuss
> I think an HTTP cache would be easier to maintain/deploy than a gRPC one in the general case, even with all the weirdness that John pointed out. But I'll take a gRPC one over no cache ;-)

What if we don't have to choose between gRPC and HTTP(S)? I think we don't, as the whole thing could be done in a protocol-independent way.
Clearly, the process itself should be discussed and agreed on, but the transport could be anything. I believe it could be implemented by placing the code responsible for communication behind a set of 'adapters' that can easily be extended.

This does not mean all possible protocols should be supported, but it does mean any protocol can be supported if there is a need.
Let's say gRPC is the most needed at the moment; it could then be the first supported protocol, and whenever someone needs HTTP(S) it could be added as well.

WDYT?

BR,
Ira

Jakob Buchgraber

May 3, 2019, 9:08:19 AM
to Ira Shikhman, bazel-discuss
I think it's clear to everyone that it's possible to implement both a gRPC and an HTTP client. What I'd like to get out of this discussion is a better understanding of which protocol makes sense and why.
Ideally we end up with just one protocol that's well thought out, simple, and easy to deploy a backend for. Additional protocols increase the API and support surface and add complexity to Bazel.

Ian O'Connell

May 3, 2019, 9:53:34 AM
to Klaus Aehlig, Austin Schuh, bazel-discuss, John Millikin
Is there any real downside to just fixing it at the single URL level? The user interface wouldn't need to change; if we get a cache hit, only the first URL in the list of URLs will matter anyway?


Klaus Aehlig

May 3, 2019, 10:36:24 AM
to Ian O'Connell, Austin Schuh, bazel-discuss, John Millikin
> Is there any real downside to just fixing it at the single URL level? The
> user interface wouldn't need to change; if we get a cache hit, only the
> first URL in the list of URLs will matter anyway?

There are enough cases where the first URL is highly non-canonical, e.g.
the closest mirror or something. As users are free to add the first URL as
the "canonical_entry", it is still possible to opt into this behaviour.

Also, there *are* use cases where really only the content matters, e.g.,
source archives mirrored by different organisations.

Ian O'Connell

May 3, 2019, 10:51:57 AM
to Klaus Aehlig, Austin Schuh, bazel-discuss, John Millikin
So concretely the concern here would be: if the first URLs are "optimal" but frequently 404, then even if the content is stored in the cache under its hash, we could issue a bunch of HTTP requests that 404 before hitting our cached entry each time?

Jakob Buchgraber

May 3, 2019, 12:27:45 PM
to John Millikin, bazel-discuss
> (1) More customized caching policies
>
> We want to replace Bazel's hardcoded "sha256 means cacheable" with custom logic. Some of our dependencies come
> from sources where calculating a fixed checksum is difficult, but we assert the URL contents should be treated as if they
> were deterministic. This might include an LRU, so for example a daily snapshot of some internal package might be
> cached for 15 minutes as a build-time performance optimization that still allows regeneration during an emergency.

So for these kinds of dependencies you wouldn't specify a hash in your Bazel project at all? It'd be up to your proxy to
cache certain dependencies that technically don't have a deterministic hash, and caching them effectively makes them
deterministic for a while. Correct?


> We've also had issues in the past where an official Bazel package depended on GitHub auto-generated archives, which
> do not have stable checksums. See https://github.com/protocolbuffers/protobuf/issues/3894 for discussion. One of the
> cache policies we plan to implement in our remote build cluster is "a URL has multiple valid checksums", and the cache
> would serve the one that the client asked for.

Makes sense. Do I understand correctly that this wouldn't need protocol support either?
 
> (2) Restricting which URLs may be depended on

Sounds reasonable. Do I understand correctly that you envision the caching layer to be able to do this transparently without
requiring protocol support?

> (3) Supporting more checksum formats

Makes sense. I am confused, though, as this paragraph seems to contradict the protocol described in your design
document, which indicates that the ResolveRequest messages sent by Bazel will contain hashes in subresource
integrity format because you expect Bazel to support a whole variety of hash functions.

message ResolveRequest {
  string instance_name = 1;
  repeated string urls = 2;
  repeated string integrity = 3;
}

Why have this field in the protocol then?

Best,
Jakob

John Millikin

May 3, 2019, 1:27:45 PM
to Jakob Buchgraber, bazel-discuss
On Fri, May 3, 2019 at 9:27 AM Jakob Buchgraber <buc...@google.com> wrote:
> (1) More customized caching policies
>
> So for these kinds of dependencies you wouldn't specify a hash in your Bazel project at all? It'd be up to your proxy to
> cache certain dependencies that technically don't have a deterministic hash, and caching them effectively makes them
> deterministic for a while. Correct?

That's correct. Our ctx.download() call would have a URL and no checksum, and we'd be relying on the remote cache to
be a trusted source of content for that file.
 
> We've also had issues in the past where an official Bazel package depended on GitHub auto-generated archives, which
> do not have stable checksums. See https://github.com/protocolbuffers/protobuf/issues/3894 for discussion. One of the
> cache policies we plan to implement in our remote build cluster is "a URL has multiple valid checksums", and the cache
> would serve the one that the client asked for.
>
> Makes sense. Do I understand correctly that this wouldn't need protocol support either?

I'm not sure what you mean by "protocol support". Bazel would need to send the checksum to the remote cache, so there
is some protocol needed for how that checksum gets sent. This is the proposed `ResolveRequest::integrity` field.

Bazel would not need to be aware of how the cache is implementing its lookup.

 
> (2) Restricting which URLs may be depended on
>
> Sounds reasonable. Do I understand correctly that you envision the caching layer to be able to do this transparently without
> requiring protocol support?

Correct. Restricting the set of permitted URLs would not require any protocol support, or client-side changes in Bazel.

 
> (3) Supporting more checksum formats
>
> Makes sense. I am confused, though, as this paragraph seems to contradict the protocol described in your design
> document, which indicates that the ResolveRequest messages sent by Bazel will contain hashes in subresource
> integrity format because you expect Bazel to support a whole variety of hash functions.
>
> message ResolveRequest {
>   string instance_name = 1;
>   repeated string urls = 2;
>   repeated string integrity = 3;
> }
>
> Why have this field in the protocol then?

The field is required to implement goal (1), where resolving a URL to a content digest requires the cache to know which
checksum the client expects to receive.

The initial implementation of multi-checksum integrity (https://github.com/bazelbuild/bazel/pull/7208) adds support for
SHA-384 and SHA-512. Bazel would validate content against the strongest supported checksum. I expect the set of
checksum algorithms supported by Bazel to remain small, and it will never include deprecated algorithms such
as SHA-1.
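
As a sketch of what "validate against the strongest supported checksum" could look like, given a list of subresource integrity strings (the preference order and helper names are assumptions, not the actual implementation):

import base64
import hashlib

STRENGTH = ["sha512", "sha384", "sha256"]  # assumed preference order, strongest first

def verify_strongest(data, integrity_values):
    # integrity_values look like "sha384-<base64 digest>", per the SRI format.
    parsed = dict(value.split("-", 1) for value in integrity_values)
    for algo in STRENGTH:
        if algo in parsed:
            expected = base64.b64decode(parsed[algo])
            if hashlib.new(algo, data).digest() != expected:
                raise ValueError(algo + " mismatch")
            return algo  # validated against the strongest checksum present
    raise ValueError("no supported checksum algorithm in the integrity list")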

In the future, I would like to add a "checksum passthrough" flag where Bazel doesn't inspect or verify download checksums.
This is entirely a client-side change and needs to be done carefully, so I'd like it to be separate from the remote cache protocol.

Austin Schuh

May 6, 2019, 12:15:11 AM
to Ian O'Connell, Klaus Aehlig, bazel-discuss, John Millikin
I think there are 3 different stages of the lifecycle that we care about.

1) A fresh URL or list of URLs that we've never seen before.
2) A URL (or list) we have cached but need to fetch.
3) The URL list changes.

For 1)
Use the same logic as today to find a good URL.  Add just the (URL, hash) that was used to the cache/verified list.

For 2)
Check the cache to verify that all URLs either aren't in the cache, or are in the cache and match.  This enforces consistency.
Then, use the data out of the cache.

For 3)
Treat it like 2).

Without consuming a bunch of network bandwidth, we can't verify that the full list is going to be valid.  Is there a "Deep check" concept in Bazel that checking every URL could be added to?  I don't know of one.

I think this has as many of the properties that we want as we can reasonably get.

* If a user forgets to update the primary URL but changes the hash, they will get an error.
* If a user forgets to update a secondary URL but changes the hash, and we know that it's wrong, they will get an error.
* If the user updates the primary URL but doesn't change the hash, they will get an error.
* If the user copies a secondary URL from somewhere else that we know is wrong, they will get an error.


It only fails to detect secondary URLs that have the wrong hash but haven't been checked yet, and it doesn't trust those either.
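
A quick sketch of the consistency check in 2), reusing the verified-list idea from my earlier sketch (the data structures are assumptions):

def use_cached(urls, expected_hash, verified, cache):
    # Every URL in the list must either be unknown to the verified list or
    # agree with the hash the user declared; only then do we trust the cache.
    for url in urls:
        recorded = verified.get(url)
        if recorded is not None and recorded != expected_hash:
            raise ValueError(url + " was previously verified with a different hash")
    if expected_hash not in cache:
        return None  # not cached yet; fall back to stage 1 (fetch and verify)
    return cache[expected_hash]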