Sharing basic design ideas and asking for opinions


Yun Peng

Sep 18, 2020, 4:02:28 AM
to extern...@bazel.build, Xudong Yang, Philipp Wollermann
Hi everyone,

While Xudong, Philipp and I are still working on the design doc, I want to share the basic ideas of our solution for improving the external dependencies management experience in Bazel and ask for your opinions on some important design decisions.

In short, we want to introduce a new way to declare external dependencies and move the responsibility of managing dependencies from Bazel to a new dependency management tool. The design has the following parts:
  • Bazel Module and MODULE.bazel: A Bazel module is a collection of available versions of a Bazel project. A project declares its dependencies on other Bazel modules in a MODULE.bazel file. Unlike the WORKSPACE file, users only need to declare their direct dependencies.
  • The Bazel dependency management tool: You can use this tool to add, remove, upgrade, and query your dependencies. It resolves your dependencies transitively by reading the MODULE.bazel files and makes the required external sources available for the Bazel build.
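To make this concrete, here is a rough sketch of what a MODULE.bazel file might look like under this proposal; the function names, attributes, and module names are illustrative guesses at the format, not a finalized API:

```python
# Hypothetical MODULE.bazel -- names and signatures are a sketch of the
# proposed format, not a finalized API.
module(name = "my_project", version = "1.0")

# Only direct dependencies are declared; the dependency management tool
# resolves the transitive closure by reading each dep's own MODULE.bazel.
bazel_dep(name = "rules_cc", version = "0.1.0")
bazel_dep(name = "my_lib", version = "2.3")
```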
There are a lot of design details; we'll share the doc as soon as it's in shape.

However, the obvious question here is how a user publishes their project as a Bazel module. In our design, we have the concept of a Bazel registry: basically an index of Bazel modules hosted somewhere on the internet, in a form that the dependency management tool understands.
We have the following ideas for how Bazel registries could look in the new world, but we think the community's opinions are very important for making this decision.
  • Bazel Central Registry 
Like Maven Central or crates.io, we create a central registry for hosting Bazel modules. This is where all users should publish their projects in order to make them available to others. While this would be the main source of Bazel external dependencies, third-party Bazel registries would also be supported for use cases where a project cannot be in the central registry (e.g. publishing internal libraries inside a company). But in most cases, users just have to specify the module name and version of their direct dependencies, and our tool will know how to pull them from the central registry.

Pros
    • It's easy for users to find and declare a dependency: module name + version, that's it.
    • In the central registry, we can store patch files that cannot be upstreamed for some reason (e.g. adding BUILD files to a non-Bazel project), and these can be shared with all Bazel users.
    • Bazel modules are reviewed before being checked into the registry, which helps ensure license validity and security.
    • It's possible to calculate the dependents of a module, so compatibility checking is easier when a new version comes out.
    • No module name conflicts, because the same module name can only appear once in the registry.
    • The transitive dependency closure of any given module can be precomputed, saving a lot of HTTP downloads at dependency resolution time.
Cons
    • Users probably have to figure out how to get their dependencies into the central registry in the first place, especially in the initial phase.
    • Very likely a huge maintenance cost that's nearly impossible for a three-person team to handle. Mitigation: the community can join in and help with governance.
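The dependents and precomputed-closure pros above can be illustrated with a small sketch (the module names and index format are hypothetical): once the registry indexes every module's direct dependencies in one place, both the transitive closure and the reverse-dependents query become simple graph walks, with no per-module HTTP fetch at resolution time.

```python
# Hypothetical registry index: module name -> direct deps, as a central
# registry could precompute it from every MODULE.bazel it hosts.
DEPS = {
    "my_app": ["rules_cc", "lib_a"],
    "lib_a": ["rules_cc"],
    "rules_cc": [],
}

def transitive_closure(module):
    """All modules reachable from `module`, including itself."""
    seen, stack = set(), [module]
    while stack:
        m = stack.pop()
        if m not in seen:
            seen.add(m)
            stack.extend(DEPS[m])
    return seen

def dependents(module):
    """Reverse lookup: every module whose closure contains `module`."""
    return {m for m in DEPS if module in transitive_closure(m) and m != module}

print(sorted(transitive_closure("my_app")))  # ['lib_a', 'my_app', 'rules_cc']
print(sorted(dependents("rules_cc")))        # ['lib_a', 'my_app']
```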
  • Bazel Official Registry + Third Party Registries
The Bazel team will host a registry for official Bazel rules, Starlark libraries, and other important Bazel-related projects (kind of like the Bazel Federation). Other interest groups can host their own Bazel registries. For example, the Bazel C++ community could host a third-party registry for releasing C++ projects as Bazel modules. Note that a Bazel module in one registry may have to depend on a module in another registry; for example, a library in the C++ Bazel registry may have to depend on rules_cc in the official Bazel registry. With this approach, users have to specify not only the module name and version of their direct dependencies, but also a list of registries that provide all the Bazel modules in their transitive dependencies. For some use cases, we could support a git repo as a mini registry that contains only one module.

Pros
    • The first three pros of the Bazel Central Registry solution.
    • Maintenance cost is spread across the community.
    • Each interest group can have full control of their registry.
Cons
    • The first con of the Bazel Central Registry solution.
    • The same module name might be used in multiple registries, which could cause a conflict. Mitigation: we can require users to use a reversed internet domain as the module name (this is already recommended for repo names).
    • When adding a new dependency, users have to make sure they also add its required registries. This list can grow as the number of registries in the ecosystem grows.
    • It's not clear which registry some multi-language projects should go into.
  • Decentralized
In a decentralized world, we think the best way is to distribute Bazel modules as git repositories with version tags. We can still have Bazel registries, but they will not be the main sources for pulling Bazel dependencies. When users declare dependencies, the source (a git repo or a Bazel registry) of a module should also be specified along with the module name and version.

Pros
    • Low maintenance cost for the Bazel registry: even if it exists, it should be very small.
    • Easier for users to "publish" their projects: just push a new version tag.
Cons
    • If a git repo changes, it could transitively break many downstream projects. Mitigation: we can use a mirror to ensure that what was available stays available and unchanged.
    • We have a much higher chance of module name conflicts, e.g. 1) different projects accidentally use the same module name, or 2) the same module is hosted in different git repos (due to a clone, perhaps). In the first case, we can distinguish modules by URL and use repo_remapping to mitigate; in the second case, there could still be conflicts at link time.
    • For projects not already using Bazel, this means the corresponding Bazel module (with Bazel BUILD files) has to be created and hosted by a third party.
    • Compared to the registry-based solutions, this approach offers fewer security guarantees.
As you can see, each solution has its pros and cons. Please tell us which approach you think is best. You can of course reply to this thread directly, or provide us with more detailed information about your use case and opinions by filling out this form.

Cheers,
Yun Peng

Tony Aiuto

Sep 18, 2020, 10:08:55 AM
to Yun Peng, external-deps, Xudong Yang, Philipp Wollermann
On Fri, Sep 18, 2020 at 4:02 AM 'Yun Peng' via external-deps <extern...@bazel.build> wrote:
Hi everyone,

While Xudong, Philipp and I are still working on the design doc, I want to share the basic ideas of our solution for improving the external dependencies management experience in Bazel and ask for your opinions on some important design decisions.

In short, we want to introduce a new way to declare external dependencies and move the responsibility of managing dependencies from Bazel to a new dependency management tool. The design has the following parts:
  • Bazel Module and MODULE.bazel: A Bazel module is a collection of available versions of a Bazel project. A project declares its dependencies on other Bazel modules in a MODULE.bazel file. Unlike the WORKSPACE file, users only need to declare their direct dependencies.
  • The Bazel dependency management tool: You can use this tool to add, remove, upgrade, and query your dependencies. It resolves your dependencies transitively by reading the MODULE.bazel files and makes the required external sources available for the Bazel build.
There are a lot of design details; we'll share the doc as soon as it's in shape.

However, the obvious question here is how a user publishes their project as a Bazel module. In our design, we have the concept of a Bazel registry: basically an index of Bazel modules hosted somewhere on the internet, in a form that the dependency management tool understands.
We have the following ideas for how Bazel registries could look in the new world, but we think the community's opinions are very important for making this decision.
  • Bazel Central Registry 
Like Maven Central or crates.io, we create a central registry for hosting Bazel modules. This is where all users should publish their projects in order to make them available to others. While this would be the main source of Bazel external dependencies, third-party Bazel registries would also be supported for use cases where a project cannot be in the central registry (e.g. publishing internal libraries inside a company). But in most cases, users just have to specify the module name and version of their direct dependencies, and our tool will know how to pull them from the central registry.

Pros
    • It's easy for users to find and declare a dependency: module name + version, that's it.
    • In the central registry, we can store patch files that cannot be upstreamed for some reason (e.g. adding BUILD files to a non-Bazel project), and these can be shared with all Bazel users.
    • Bazel modules are reviewed before being checked into the registry, which helps ensure license validity and security.
    • It's possible to calculate the dependents of a module, so compatibility checking is easier when a new version comes out.
    • No module name conflicts, because the same module name can only appear once in the registry.
    • The transitive dependency closure of any given module can be precomputed, saving a lot of HTTP downloads at dependency resolution time.
Cons
    • Users probably have to figure out how to get their dependencies into the central registry in the first place, especially in the initial phase.
    • Very likely a huge maintenance cost that's nearly impossible for a three-person team to handle. Mitigation: the community can join in and help with governance.
A central registry must have reliability and security guarantees, which would certainly require special launch effort and ongoing serving costs. This requires a plan to permanently fund it.

  • Bazel Official Registry + Third Party Registries
The Bazel team will host a registry for official Bazel rules, Starlark libraries, and other important Bazel-related projects (kind of like the Bazel Federation). Other interest groups can host their own Bazel registries. For example, the Bazel C++ community could host a third-party registry for releasing C++ projects as Bazel modules. Note that a Bazel module in one registry may have to depend on a module in another registry; for example, a library in the C++ Bazel registry may have to depend on rules_cc in the official Bazel registry. With this approach, users have to specify not only the module name and version of their direct dependencies, but also a list of registries that provide all the Bazel modules in their transitive dependencies. For some use cases, we could support a git repo as a mini registry that contains only one module.

Pros
    • The first three pros of the Bazel Central Registry solution.
    • Maintenance cost is spread across the community.
    • Each interest group can have full control of their registry.
Cons
    • The first con of the Bazel Central Registry solution.
    • The same module name might be used in multiple registries, which could cause a conflict. Mitigation: we can require users to use a reversed internet domain as the module name (this is already recommended for repo names).
    • When adding a new dependency, users have to make sure they also add its required registries. This list can grow as the number of registries in the ecosystem grows.
    • It's not clear which registry some multi-language projects should go into.
  • Decentralized
In a decentralized world, we think the best way is to distribute Bazel modules as git repositories with version tags. We can still have Bazel registries, but they will not be the main sources for pulling Bazel dependencies. When users declare dependencies, the source (a git repo or a Bazel registry) of a module should also be specified along with the module name and version.

Pros
    • Low maintenance cost for the Bazel registry: even if it exists, it should be very small.
    • Easier for users to "publish" their projects: just push a new version tag.
Cons
    • If a git repo changes, it could transitively break many downstream projects. Mitigation: we can use a mirror to ensure that what was available stays available and unchanged.
Why would downstream break? If everyone depends on versioned artifacts, a repo changing at head will not impact anyone.
    • We have a much higher chance of module name conflicts, e.g. 1) different projects accidentally use the same module name, or 2) the same module is hosted in different git repos (due to a clone, perhaps). In the first case, we can distinguish modules by URL and use repo_remapping to mitigate; in the second case, there could still be conflicts at link time.
    • For projects not already using Bazel, this means the corresponding Bazel module (with Bazel BUILD files) has to be created and hosted by a third party.
Since we are moving to reconfigure and expand the role of the Bazel Federation, we could host patches and BUILD files there.
    • Compared to the registry-based solutions, this approach offers fewer security guarantees.
Only if you presume the registries are promising centralized security. This comes at a setup and support cost. In the central model we pay it all in one place; in the Bazel + third-party model, many registries each have to incur that cost. My hunch is that the third-party ones will simply be GitHub-published, so that degenerates into the same situation as the decentralized model.

As you can see, each solution has its pros and cons. Please tell us which approach you think is best. You can of course reply to this thread directly, or provide us with more detailed information about your use case and opinions by filling out this form.

Cheers,
Yun Peng


James Sharpe

Sep 18, 2020, 10:32:43 AM
to Yun Peng, extern...@bazel.build, Xudong Yang, Philipp Wollermann
On Fri, 18 Sep 2020 at 09:02, 'Yun Peng' via external-deps <extern...@bazel.build> wrote:
Hi everyone,

While Xudong, Philipp and I are still working on the design doc, I want to share the basic ideas of our solution for improving the external dependencies management experience in Bazel and ask for your opinions on some important design decisions.

In short, we want to introduce a new way to declare external dependencies and move the responsibility of managing dependencies from Bazel to a new dependency management tool. The design has the following parts:
  • Bazel Module and MODULE.bazel: A Bazel module is a collection of available versions of a Bazel project. A project declares its dependencies on other Bazel modules in a MODULE.bazel file. Unlike the WORKSPACE file, users only need to declare their direct dependencies.
Unfortunately this is a bit oversimplistic. There are projects which have optional dependencies whose use can depend on external factors such as licensing restrictions. There is also the notion, used by package managers such as yum/apt, of 'virtual' packages, i.e. dependencies that can be swapped out behind a given standardised interface, e.g. BLAS or MPI, to name a couple of examples.
  • The Bazel dependency management tool: You can use this tool to add, remove, upgrade, and query your dependencies. It resolves your dependencies transitively by reading the MODULE.bazel files and makes the required external sources available for the Bazel build.
I like how the renovate tool currently works in this regard - automatically updates the WORKSPACE when it can deduce that there is a new version of a repository available. I'd hope that this tool works like this but also atomically updates the transitive dependencies when this occurs.
IMO the tool will also need to be aware of licensing constraints; optional dependencies may require compliance with given licensing terms (commercial or otherwise), so the user needs to be able to specify these restrictions before the dependency solver can do its thing.
Spack, as a dependency manager, does this using a constraint solver. See the recent FOSDEM talk for their approach: https://www.youtube.com/watch?v=xBhpfW5cZ-w

There are a lot of design details; we'll share the doc as soon as it's in shape.

However, the obvious question here is how a user publishes their project as a Bazel module. In our design, we have the concept of a Bazel registry: basically an index of Bazel modules hosted somewhere on the internet, in a form that the dependency management tool understands.
We have the following ideas for how Bazel registries could look in the new world, but we think the community's opinions are very important for making this decision.
  • Bazel Central Registry 
Like Maven Central or crates.io, we create a central registry for hosting Bazel modules. This is where all users should publish their projects in order to make them available to others. While this would be the main source of Bazel external dependencies, third-party Bazel registries would also be supported for use cases where a project cannot be in the central registry (e.g. publishing internal libraries inside a company). But in most cases, users just have to specify the module name and version of their direct dependencies, and our tool will know how to pull them from the central registry.

For languages that have existing registries, such as PyPI or npm to name a few additional ones, are you proposing that the Bazel registry would have to mirror the dependencies contained in these repositories to be usable? Where binary components are required at install time, Bazel will potentially be responsible for ensuring these are available to the language-specific package manager; for example, a BLAS package needs to be available before installing SciPy in Python. I have yet to see a set of rules that provides a way of injecting built binary deps into the step that installs from a language-specific repository. I guess this is partly due to Bazel's design phases, as the language package manager doesn't necessarily have a way of separating fetching of the dependencies from building/installing them.

Pros
    • It's easy for users to find and declare a dependency: module name + version, that's it.
    • In the central registry, we can store patch files that cannot be upstreamed for some reason (e.g. adding BUILD files to a non-Bazel project), and these can be shared with all Bazel users.
    • Bazel modules are reviewed before being checked into the registry, which helps ensure license validity and security.
    • It's possible to calculate the dependents of a module, so compatibility checking is easier when a new version comes out.
    • No module name conflicts, because the same module name can only appear once in the registry.
    • The transitive dependency closure of any given module can be precomputed, saving a lot of HTTP downloads at dependency resolution time.
Cons
    • Users probably have to figure out how to get their dependencies into the central registry in the first place, especially in the initial phase.
    • Very likely a huge maintenance cost that's nearly impossible for a three-person team to handle. Mitigation: the community can join in and help with governance.
This is probably a con for all the registry options, or it simply encourages people to run forks of the central registry, which fragments the community and creates a decentralized registry.
Con: The user of a package may want different visibility for the build targets from what's in the BUILD file in the registry.

So I guess the other option is centralized definition but decentralised hosting?
Also, I would expect an official registry to be run by much more than a three-person team; in fact I'd likely want to say that it should definitely be formed by a team that includes people from outside of Google.
  • Bazel Official Registry + Third Party Registries
The Bazel team will host a registry for official Bazel rules, Starlark libraries, and other important Bazel-related projects (kind of like the Bazel Federation). Other interest groups can host their own Bazel registries. For example, the Bazel C++ community could host a third-party registry for releasing C++ projects as Bazel modules. Note that a Bazel module in one registry may have to depend on a module in another registry; for example, a library in the C++ Bazel registry may have to depend on rules_cc in the official Bazel registry. With this approach, users have to specify not only the module name and version of their direct dependencies, but also a list of registries that provide all the Bazel modules in their transitive dependencies. For some use cases, we could support a git repo as a mini registry that contains only one module.

Pros
    • The first three pros of the Bazel Central Registry solution.
    • Maintenance cost is spread across the community.
    • Each interest group can have full control of their registry.
Cons
    • The first con of the Bazel Central Registry solution.
    • The same module name might be used in multiple registries, which could cause a conflict. Mitigation: we can require users to use a reversed internet domain as the module name (this is already recommended for repo names).
    • When adding a new dependency, users have to make sure they also add its required registries. This list can grow as the number of registries in the ecosystem grows.
    • It's not clear which registry some multi-language projects should go into.
  • Decentralized
In a decentralized world, we think the best way is to distribute Bazel modules as git repositories with version tags. We can still have Bazel registries, but they will not be the main sources for pulling Bazel dependencies. When users declare dependencies, the source (a git repo or a Bazel registry) of a module should also be specified along with the module name and version.

Pros
    • Low maintenance cost for the Bazel registry: even if it exists, it should be very small.
    • Easier for users to "publish" their projects: just push a new version tag.
Cons
    • If a git repo changes, it could transitively break many downstream projects. Mitigation: we can use a mirror to ensure that what was available stays available and unchanged.
    • We have a much higher chance of module name conflicts, e.g. 1) different projects accidentally use the same module name, or 2) the same module is hosted in different git repos (due to a clone, perhaps). In the first case, we can distinguish modules by URL and use repo_remapping to mitigate; in the second case, there could still be conflicts at link time.
    • For projects not already using Bazel, this means the corresponding Bazel module (with Bazel BUILD files) has to be created and hosted by a third party.
    • Compared to the registry-based solutions, this approach offers fewer security guarantees.
As you can see, each solution has its pros and cons. Please tell us which approach you think is best. You can of course reply to this thread directly, or provide us with more detailed information about your use case and opinions by filling out this form.

Cheers,
Yun Peng

Yun Peng

Sep 18, 2020, 11:56:47 AM
to James Sharpe, Tony Aiuto, extern...@bazel.build, Xudong Yang, Philipp Wollermann
> A central registry must have reliability and security guarantees which would certainly require special launch effort and ongoing serving costs. This requires a plan to permanently fund it.

The current idea is to implement the central registry as a GitHub repo storing metadata (name, version, dependencies, URLs of source blobs, etc.), plus a service for mirroring the source blobs (like the Bazel mirror, this can simply be a GCS bucket or something). I think it will be simpler than hosting a running HTTP service, but there will still be a lot of maintenance work.
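As a sketch of that idea (all file paths, field names, and URLs here are invented for illustration, not a real format): the registry repo would hold one small metadata file per module version, and the tool would prefer the mirror when fetching, falling back to the original source URL.

```python
import json

# Hypothetical contents of modules/rules_cc/1.0/metadata.json in the
# registry's GitHub repo (field names are invented for illustration).
metadata_json = """
{
  "name": "rules_cc",
  "version": "1.0",
  "deps": {"platforms": "0.0.1"},
  "url": "https://github.com/bazelbuild/rules_cc/archive/1.0.tar.gz"
}
"""

MIRROR = "https://mirror.example.com"  # stand-in for a GCS bucket

def fetch_urls(metadata):
    """URLs to try in order: the mirror first, the original as a fallback."""
    mirrored = "{}/{}/{}.tar.gz".format(MIRROR, metadata["name"], metadata["version"])
    return [mirrored, metadata["url"]]

meta = json.loads(metadata_json)
print(fetch_urls(meta))
```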

> Why would downstream break? If everyone depends on versioned artifacts, a repo changing at head will not impact anyone.

I meant cases where a git repo suddenly becomes unavailable, is moved somewhere else, or a version tag is modified.

> Since we are moving to reconfigure and expand the role of the Bazel Federation, we could host patches and BUILD files there.

I don't quite understand why the Bazel Federation should host patches for unrelated projects. Maybe we should sync a bit on your plan for the Bazel Federation.

> Only if you presume the registries are promising centralized security. This comes at a setup and support cost. In the central model we pay it all in one place, in the Bazel+third_party model then many registries have to incur that cost. My hunch is that the third_party ones will simply be github published, so that degenerates to the same as decentralized.

Yes, that's definitely a valid concern.


James,

Thanks for the feedback!

> Unfortunately this is a bit over simplistic. There are projects which have optional dependencies whose use can depend on external factors such as licensing restrictions. There is also the notion that package managers such as yum / apt use of 'virtual' packages i.e. dependencies that can be swapped out to a given standardised interface e.g. BLAS or MPI to name a couple of examples.

This is very interesting; to be honest, we haven't thought about it. Does this have to be at the package manager level instead of the BUILD file level?

> I like how the renovate tool currently works in this regard - automatically updates the WORKSPACE when it can deduce that there is a new version of a repository available. I'd hope that this tool works like this but also atomically updates the transitive dependencies when this occurs.
IMO the tool will also need to be aware of licensing constraints; optional dependencies may require compliance to given licensing terms (commercial or otherwise) and so the user needs to be able to specify these restrictions before the dependency solver can do its thing.
SPACK as a dependency manager does this using a constraint solver. See the recent FOSDEM talk for their approach: https://www.youtube.com/watch?v=xBhpfW5cZ-w

Yes, we intend to make the tool able to upgrade your transitive dependencies. For version resolution, the current plan is to use MVS (Minimal Version Selection).
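For reference, a minimal sketch of how MVS works (module names, versions, and the registry index format here are hypothetical): each module declares the minimum version it needs, and the resolver simply takes, for every module reached in the graph, the highest version that any requester asks for, with no backtracking solver involved.

```python
# Hypothetical registry index: module -> version -> that version's direct deps.
REGISTRY = {
    "rules_cc": {"1.0": {}, "1.1": {}},
    "lib_a": {"2.0": {"rules_cc": "1.0"}},
    "lib_b": {"3.0": {"rules_cc": "1.1"}},
}

def mvs_resolve(direct_deps):
    """Return {module: version} via Minimal Version Selection."""
    selected = {}
    todo = list(direct_deps.items())
    while todo:
        name, version = todo.pop()
        # Keep the highest version requested anywhere in the graph.
        # (Real version ordering is more involved; float() is a toy stand-in.)
        if name in selected and float(selected[name]) >= float(version):
            continue
        selected[name] = version
        todo.extend(REGISTRY[name][version].items())
    return selected

# lib_a needs rules_cc 1.0 and lib_b needs 1.1; MVS picks the higher 1.1.
print(mvs_resolve({"lib_a": "2.0", "lib_b": "3.0"}))
```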

> For languages that have existing registries such as pypi, npm to name a few additional ones are you proposing that the bazel registry would have to mirror the dependencies contained in these repositories to be usable? Where binary components are required at install time then bazel will potentially be responsible for ensuring these are available to the language specific package manager; for example a BLAS package needs to be available before installing scipy in Python. I have yet to see a set of rules that provides a way of injecting built binary deps into the step that installs from a language specific repository. I guess this is partly due to the bazel design phases as the language package manager doesn't necessarily have a way of separating fetching of the dependencies from building/installing them. 

We thought about mirroring existing third-party registries (Maven, PyPI, Cargo, etc.), but decided not to. We have another solution for integrating with them instead of mirroring; we'll share it once the design doc is ready. I know this is another big topic, but in this thread I want to focus on discussing the role of Bazel registries.

> This is probably actually a con for all the registry options - or it simply encourages people to run forks of the central registry which fragments the community and creates a decentralized registry.  
Con: The user of the package may want different visibility of the build targets to those that are in the BUILD file in the registry 

Do you mean a project may want to limit who can depend on it? I don't know how that could be achieved in the open-source world.

> So I guess the other option is centralized definition but decentralised hosting?

Does the plan mentioned in the reply to Tony fit your description? (Git repo for module metadata + a mirror of source blobs)

> Also I would expect an official registry to be run by much more than a three person team; in fact I'd likely want to say that it should definitely be formed by a team that includes people from outside of Google.

Yes, if there is a central Bazel registry, we would really like to work with people outside of Google.



Daniel Wagner-Hall

Sep 21, 2020, 9:25:47 AM
to Yun Peng, James Sharpe, Tony Aiuto, 'Tony Aiuto' via external-deps, Xudong Yang, Philipp Wollermann

On 18 Sep 2020, at 16:56, 'Yun Peng' via external-deps <extern...@bazel.build> wrote:

> A central registry must have reliability and security guarantees which would certainly require special launch effort and ongoing serving costs. This requires a plan to permanently fund it.

The current idea is to implement the central registry as a GitHub repo storing metadata (name, version, dependencies, URLs of source blobs, etc.), plus a service for mirroring the source blobs (like the Bazel mirror, this can simply be a GCS bucket or something). I think it will be simpler than hosting a running HTTP service, but there will still be a lot of maintenance work.

> Why would downstream break? If everyone depends on versioned artifacts, a repo changing at head will not impact anyone.

I meant cases where a git repo suddenly becomes unavailable, is moved somewhere else, or a version tag is modified.

> Since we are moving to reconfigure and expand the role of the Bazel Federation, we could host patches and BUILD files there.

I don't quite understand why the Bazel Federation should host patches for unrelated projects. Maybe we should sync a bit on your plan for the Bazel Federation.

> Only if you presume the registries are promising centralized security. This comes at a setup and support cost. In the central model we pay it all in one place, in the Bazel+third_party model then many registries have to incur that cost. My hunch is that the third_party ones will simply be github published, so that degenerates to the same as decentralized.

Yes, that's definitely a valid concern.


James,

Thanks for the feedback!

> Unfortunately this is a bit over simplistic. There are projects which have optional dependencies whose use can depend on external factors such as licensing restrictions. There is also the notion that package managers such as yum / apt use of 'virtual' packages i.e. dependencies that can be swapped out to a given standardised interface e.g. BLAS or MPI to name a couple of examples.

This is very interesting; to be honest, we haven't thought about it. Does this have to be at the package manager level instead of the BUILD file level?

> I like how the renovate tool currently works in this regard - automatically updates the WORKSPACE when it can deduce that there is a new version of a repository available. I'd hope that this tool works like this but also atomically updates the transitive dependencies when this occurs.
IMO the tool will also need to be aware of licensing constraints; optional dependencies may require compliance to given licensing terms (commercial or otherwise) and so the user needs to be able to specify these restrictions before the dependency solver can do its thing.
SPACK as a dependency manager does this using a constraint solver. See the recent FOSDEM talk for their approach: https://www.youtube.com/watch?v=xBhpfW5cZ-w

Yes, we intend for the tool to be able to upgrade your transitive dependencies. For version resolution, the current plan is to use Minimal Version Selection (MVS).
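For readers unfamiliar with MVS (the algorithm used by Go modules): each module declares a minimum version for each direct dependency, and resolution simply takes, for every module reachable from the root, the highest minimum that anyone asked for. A toy sketch, with invented module names and integer versions for illustration:

```python
# Toy sketch of Minimal Version Selection (MVS). All module names and
# versions here are invented for illustration only.

# requirements[(module, version)] -> list of (dep, minimum_version) pairs,
# standing in for what each module's MODULE.bazel would declare.
requirements = {
    ("app", 1): [("libA", 1), ("libB", 1)],
    ("libA", 1): [("libC", 2)],
    ("libB", 1): [("libC", 3)],  # asks for a newer libC than libA does
    ("libC", 2): [],
    ("libC", 3): [],
}

def mvs(root):
    """Return {module: version}: for each module reachable from the root,
    the highest *minimum* version any dependent requested. MVS never
    selects a version newer than one that some module explicitly asked for."""
    selected = {root[0]: root[1]}
    to_visit = [root]
    while to_visit:
        mod = to_visit.pop()
        for dep, min_ver in requirements[mod]:
            if min_ver > selected.get(dep, 0):
                selected[dep] = min_ver
                to_visit.append((dep, min_ver))
    return selected

print(mvs(("app", 1)))  # {'app': 1, 'libA': 1, 'libB': 1, 'libC': 3}
```

Note that libC resolves to version 3 (the larger of the two declared minimums), and nothing is ever upgraded beyond a version somebody declared, which keeps resolution deterministic without lock files.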

> For languages that have existing registries, such as pypi or npm to name a few, are you proposing that the Bazel registry would have to mirror the dependencies contained in these repositories to be usable? Where binary components are required at install time, Bazel will potentially be responsible for ensuring these are available to the language-specific package manager; for example, a BLAS package needs to be available before installing scipy in Python. I have yet to see a set of rules that provides a way of injecting built binary deps into the step that installs from a language-specific repository. I guess this is partly due to Bazel's design phases, as the language package manager doesn't necessarily have a way of separating the fetching of dependencies from building/installing them.

We thought about mirroring existing third party registries (maven, pypi, cargo, etc.), but decided not to. We have another solution for integrating with them instead of mirroring, which we will share once the design doc is ready. I know this is another big topic, but in this thread I want to focus the discussion on the role of Bazel registries.

I think it’s very hard to reason about/evaluate this design without knowing its plans for these integrations. Do you have an estimate for when you’ll have more of an idea of the shape of the integrations?

Xudong Yang

unread,
Sep 21, 2020, 10:48:43 PM9/21/20
to Daniel Wagner-Hall, Tony Aiuto, Yun Peng, Philipp Wollermann, James Sharpe, Xudong Yang, 'Tony Aiuto' via external-deps
> I think it’s very hard to reason about/evaluate this design without knowing its plans for these integrations. Do you have an estimate for when you’ll have more of an idea of the shape of the integrations?

The basic idea is that something like today's custom repo rules (like rules_jvm_external) will be supported by the new tool to pull dependencies from non-Bazel registries such as Maven. This part of the design is almost ready and we just need to think some details through.
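To make the split concrete, a MODULE.bazel file in this scheme might look roughly like the sketch below. The exact syntax, the `bazel_dep` / `maven.install` names, and the extension mechanism are all hypothetical placeholders pending the design doc:

```python
# Hypothetical MODULE.bazel sketch -- names and syntax are illustrative
# only, not the final design.
module(name = "my_project", version = "1.0")

# Direct dependencies, resolved against a Bazel registry. Transitive
# dependencies are resolved by the tool and are NOT declared here.
bazel_dep(name = "rules_cc", version = "0.0.1")
bazel_dep(name = "protobuf", version = "3.13.0")

# Something like today's custom repo rules (cf. rules_jvm_external)
# would handle non-Bazel registries such as Maven:
maven.install(artifacts = ["com.google.guava:guava:29.0-jre"])
```

The first half of the file is what this thread is about (where `rules_cc` and `protobuf` are hosted); the last line is the orthogonal "non-Bazel registries" problem.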

The purpose of this email, though, is to gather feedback on the options regarding where to host Bazel modules -- that is, projects already using Bazel build files and what not. It's largely orthogonal to the problem of "how do we pull dependencies from non-Bazel registries". Perhaps this orthogonality was not made sufficiently clear in the original email -- sorry about that! -- but hopefully the short description above is enough to amend that.

And of course, if you still disagree that these problems are orthogonal at all, we'd always like to hear your thoughts.

James Sharpe

unread,
Sep 22, 2020, 11:51:37 AM9/22/20
to Yun Peng, Tony Aiuto, extern...@bazel.build, Xudong Yang, Philipp Wollermann
On Fri, 18 Sep 2020 at 16:56, Yun Peng <pcl...@google.com> wrote:

James,

Thanks for the feedback!

> Unfortunately this is a bit over simplistic. There are projects which have optional dependencies whose use can depend on external factors such as licensing restrictions. There is also the notion of 'virtual' packages used by package managers such as yum / apt, i.e. dependencies that can be swapped out for a given standardised interface, e.g. BLAS or MPI to name a couple of examples.

This is very interesting; to be honest, we haven't thought about it. Does this have to be at the package manager level instead of at the BUILD file level?

It could be at the BUILD file level, but this causes the dependency choice to propagate up to the user and requires duplicating the select condition that picks the option.
Take for example this dependency tree:
 Application -> Library -> Boost MPI -> MPI 
Let's assume Library onwards is in a Bazel registry. There are then two options. The first is to expose the combinatorial explosion of the configuration space for Library as separate build targets and let the application pick the one it needs, duplicating any select logic needed to do so; in the case of MPI this could depend on a number of factors (platform, user choice, application requirements). The alternative is to provide a single configuration option on the Boost MPI -> MPI edge and use transitions to change the configuration at the Application level.
Neither solution feels that clean, as the select logic is non-trivial and creates technical debt for the end user as it changes in the central registry.

Next consider the case where the application has an integration test of 2 applications, one GPU accelerated and the other not; in this case you might wish to build the GPU-accelerated one with OpenMPI for the CUDA awareness and the other with Intel MPI. Again, transitions are likely needed to support this use case.

I've attempted to use transitions and haven't found them easy to use, due to the lack of documentation and concrete examples. I suspect that a package manager that effectively generates the used subset of the configuration space as separate named build targets (which could even use generated transitions under the hood if necessary) would be a very useful and user-friendly way to solve this problem.
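For readers who haven't used transitions: the per-edge approach described above could look roughly like the Starlark sketch below, where a rule attaches a transition to its dependency edge to flip a (hypothetical) `//:mpi_impl` string flag. The flag, rule names, and attribute names are invented for illustration; the flag itself would be defined elsewhere as a `string_flag` build setting:

```python
# Sketch: a Starlark transition selecting an MPI implementation per target.
# //:mpi_impl is assumed to be a string_flag build setting defined elsewhere.

def _mpi_transition_impl(settings, attr):
    # Set the flag for the dependency subgraph to this target's choice.
    return {"//:mpi_impl": attr.mpi_impl}

mpi_transition = transition(
    implementation = _mpi_transition_impl,
    inputs = [],
    outputs = ["//:mpi_impl"],
)

def _mpi_app_impl(ctx):
    # With a transition on the attribute, ctx.attr.binary is a list
    # (one element per configuration); forward its files.
    return [DefaultInfo(files = ctx.attr.binary[0][DefaultInfo].files)]

mpi_app = rule(
    implementation = _mpi_app_impl,
    attrs = {
        "binary": attr.label(cfg = mpi_transition),
        "mpi_impl": attr.string(values = ["openmpi", "intel-mpi"]),
        # Bazel requires rules using user-defined transitions to be
        # on an allowlist.
        "_allowlist_function_transition": attr.label(
            default = "@bazel_tools//tools/allowlists/function_transition_allowlist",
        ),
    },
)
```

With something like this, the integration test above could declare one `mpi_app` with `mpi_impl = "openmpi"` and another with `mpi_impl = "intel-mpi"` in the same build, without the select logic leaking into every consumer.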


> For languages that have existing registries, such as pypi or npm to name a few, are you proposing that the Bazel registry would have to mirror the dependencies contained in these repositories to be usable? Where binary components are required at install time, Bazel will potentially be responsible for ensuring these are available to the language-specific package manager; for example, a BLAS package needs to be available before installing scipy in Python. I have yet to see a set of rules that provides a way of injecting built binary deps into the step that installs from a language-specific repository. I guess this is partly due to Bazel's design phases, as the language package manager doesn't necessarily have a way of separating the fetching of dependencies from building/installing them.

We thought about mirroring existing third party registries (maven, pypi, cargo, etc.), but decided not to. We have another solution for integrating with them instead of mirroring, which we will share once the design doc is ready. I know this is another big topic, but in this thread I want to focus the discussion on the role of Bazel registries.

I would say that how third party registries are included is very relevant to the role, as the implications could affect a distributed registry design. A distributed design could end up with different integration styles, making it harder for a user to assess whether two given registries are compatible. Indeed, we already have this problem at the rules level, which is what the Federation was attempting to address, and even in that more controlled set of rulesets it has proven difficult to do correctly.

Also consider the fact that these registries tend to serve dynamic languages that can call out to binary components; libraries such as scipy / numpy / mpi4py have all the same issues I described earlier, with the added problem of being wrapped up in a third party registry which may have already made one of those decisions for you. Within the Python ecosystem this is why we end up with things like conda distributions, where these decisions have been taken in a particular direction, e.g. Intel provides a conda distribution of Python where everything is built with the Intel toolchain, from compilers to MPI.
 

> This is probably actually a con for all the registry options - or it simply encourages people to run forks of the central registry which fragments the community and creates a decentralized registry.  
Con: The user of the package may want different visibility of the build targets to those that are in the BUILD file in the registry 

Do you mean a project may want to limit who can depend on it? I don't know how this could be achieved in the open source world.

I mean that a given set of BUILD files will have certain visibility statements in the rules defined in the package, and the maintainer of the package in the central registry decides which specific rules within the package are made visible externally. For example, I've had to override the Dropbox dbx_build_tools build rules to make the pyconfig.h header available outside the Python build repository, so that I can build extensions outside of their ruleset (and indeed, in our ecosystem, outside of Bazel, as we are only partially building in Bazel at the moment).
I guess what I mean is that with a central repo there is potential for forks to appear wherever the package maintainer in the central registry disagrees with a given end user about which specific rules for a given package should be exposed via visibility.


> So I guess the other option is centralized definition but decentralised hosting?

Does the plan mentioned in the reply to Tony fit your description? (Git repo for module metadata + a mirror of source blobs)
Yes
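For concreteness, the "Git repo for module metadata + a mirror of source blobs" plan could be laid out something like the sketch below. The file names and layout are purely illustrative, not a committed design:

```
registry/                      # the Git repo holding module metadata
  modules/
    rules_cc/
      0.0.1/
        MODULE.bazel           # the module's dependency declarations
        source.json            # URL(s) + hash of the source archive,
                               # pointing at the decentralised mirror(s)
```

The Git repo stays small and auditable, while the (large) source archives can be hosted anywhere, verified by hash.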
 

> Also, I would expect an official registry to be run by much more than a three-person team; in fact I'd go as far as to say that it should definitely be formed by a team that includes people from outside of Google.

Yes, if there is a central Bazel registry, we would really like to work with people outside of Google.

On the flip side of this: how much of the build rules from google3 would be open-sourced? Would there be any commitment from Google that the maintainers of the internal mirrors of packages would also make these visible in the central registry? To me that seems like an easy way of expanding that three-person team, although it obviously presents an additional initial burden on Google staff. I'd argue, however, that in the longer term, if a central registry worked, Google would be able to leverage the maintenance of third party packages via the community as well as through their internal effort.

Andreas Bergmeier

unread,
Sep 22, 2020, 4:05:48 PM9/22/20
to external-deps, Yun Peng, Xudong Yang, Philipp Wollermann
One of the missing aspects, IMO, is how one would go about having private modules. By private I mean environments where source code must only be accessible from a certain IP subnet, where internet connectivity is slow, where authentication is required, and where SSL is broken.
In such an environment you have a few requirements:
1. If there is a central registry, you must have a way of hosting your own instance.
2. If there are additional services (analyzers, code browsers) connected to registries, it would be good to be able to host your own instances of those as well.
3. You might need an on-site cache to mitigate poor internet connectivity.
4. There needs to be a way of limiting access.
5. The whole process needs to work around partially or completely broken SSL support.

Most module systems I have worked with make at least one of these hard or impossible.
Not supporting these scenarios at all might even be a better approach than having an overly complex solution.