Representing external dependencies in the brave new world of WORKSPACE files


Lukács T. Berki

Feb 28, 2018, 7:27:37 AM
to Klaus Aehlig, Dmitry Lomov, bazel-si...@googlegroups.com
Hey there,

We've been pondering how best to express dependencies on PyPI (Python package manager) packages and I'd like to make sure that we don't contradict your plans too much. 

The brief description of the landscape is that pip processes a file called requirements.txt, which contains a set of packages with version constraints (which version gets installed may depend on the Python version and the OS) and some metadata (e.g. URLs where the packages can be found). It then does transitive dependency resolution, fetches the required packages, possibly compiles native code, and installs the Python code plus the native code somewhere.

Question is, how best to integrate this in the brave new world of WORKSPACE files? In particular, your plans seem to hinge upon separating dependency resolution (non-hermetic) and actually fetching and building them (hermetic), which is difficult, because pip does both of these things. pip does have provisions for repeatability, but not for separating out the dependency resolution part.

My best plan is that we would have a WORKSPACE file like this:

pip_dependency_set(name="mydeps", requirements="my_requirements.txt")

and the "dependency resolution" of this repository would entail running pip and creating an "installation bundle". Then "fetching and building" would be just unpacking it and adding a convenient BUILD file, e.g. with a target per installed library (i.e. @mydeps//:ladle would be the library called "ladle" fetched according to instructions in my_requirements.txt)

Of course, this would preclude cross-compilation, abuse the concept of "dependency resolution", depend on a version of pip installed on the host system and would make it possible to have multiple versions of the same package in the same workspace, or even the same version of the same package multiple times (from different requirements.txt files)

On the flip side, we wouldn't have to re-implement anything (e.g. version resolution or compiling native code) from pip, which is a very welcome development.

WDYT?



John Field

Feb 28, 2018, 9:31:39 AM
to Lukács T. Berki, ca...@google.com, dan...@google.com, Klaus Aehlig, Dmitry Lomov, bazel-si...@googlegroups.com


Dmitry Lomov

Feb 28, 2018, 9:52:29 AM
to John Field, Lukács T. Berki, Carmi Grushko, Danna Kelmer, Klaus Aehlig, bazel-si...@googlegroups.com
On Wed, Feb 28, 2018 at 3:31 PM John Field <jfi...@google.com> wrote:

On Wed, Feb 28, 2018 at 7:27 AM 'Lukács T. Berki' via Bazel/Python Special Interest Group <bazel-si...@googlegroups.com> wrote:
Hey there,

We've been pondering how best to express dependencies on PyPI (Python package manager) packages and I'd like to make sure that we don't contradict your plans too much. 

The brief description of the landscape is that pip processes a file called requirements.txt, which contains a set of packages with version constraints (which version gets installed may depend on the Python version and the OS) and some metadata (e.g. URLs where the packages can be found). It then does transitive dependency resolution, fetches the required packages, possibly compiles native code, and installs the Python code plus the native code somewhere.

Question is, how best to integrate this in the brave new world of WORKSPACE files? In particular, your plans seem to hinge upon separating dependency resolution (non-hermetic) and actually fetching and building them (hermetic), which is difficult, because pip does both of these things. pip does have provisions for repeatability, but not for separating out the dependency resolution part.

My best plan is that we would have a WORKSPACE file like this:

pip_dependency_set(name="mydeps", requirements="my_requirements.txt")

and the "dependency resolution" of this repository would entail running pip and creating an "installation bundle". Then "fetching and building" would be just unpacking it and adding a convenient BUILD file, e.g. with a target per installed library (i.e. @mydeps//:ladle would be the library called "ladle" fetched according to instructions in my_requirements.txt)

Looks like "WORKSPACE.resolved" version of this rule should use "repeatable pip" (e.g. https://pip.pypa.io/en/stable/user_guide/#hash-checking-mode).

Let's say you have some requirements.txt. Is there a mode in pip to run dependency resolution and then get the specific version numbers to pinpoint them? (maybe you can parse the installation bundle to get them?)
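(For reference, the hash-checking mode linked above consumes fully pinned entries along these lines; the version and hash below are placeholders, not real values:

  Django==1.11.10 --hash=sha256:<hash of the wheel or sdist>

so whatever produces the pinned file needs to emit both the exact version and the artifact hash.)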
 

Of course, this would preclude cross-compilation,

I don't have a great idea on how to solve this :(
 
abuse the concept of "dependency resolution",
Hmm, I don't really think so: if the result of a "sync" of the WORKSPACE is a WORKSPACE.resolved with pinpointed hashes/versions, this seems good enough. 
 
depend on a version of pip installed on the host system

Yes. I do not think this matters too much.
 
and would make it possible to have multiple versions of the same package in the same workspace, or even the same version of the same package multiple times (from different requirements.txt files)

Is that problematic? (especially given we will be able to split diamond dependencies soon: https://bazel-review.googlesource.com/c/bazel/+/42172)
 

On the flip side, we wouldn't have to re-implement anything (e.g. version resolution or compiling native code) from pip, which is a very welcome development.

\o/
 

WDYT?





Lukács T. Berki

Feb 28, 2018, 9:58:00 AM
to Dmitry Lomov, John Field, Carmi Grushko, Danna Kelmer, Klaus Aehlig, bazel-si...@googlegroups.com
On Wed, 28 Feb 2018 at 15:52, Dmitry Lomov <dsl...@google.com> wrote:

On Wed, Feb 28, 2018 at 3:31 PM John Field <jfi...@google.com> wrote:

On Wed, Feb 28, 2018 at 7:27 AM 'Lukács T. Berki' via Bazel/Python Special Interest Group <bazel-si...@googlegroups.com> wrote:
Hey there,

We've been pondering how best to express dependencies on PyPI (Python package manager) packages and I'd like to make sure that we don't contradict your plans too much. 

The brief description of the landscape is that pip processes a file called requirements.txt, which contains a set of packages with version constraints (which version gets installed may depend on the Python version and the OS) and some metadata (e.g. URLs where the packages can be found). It then does transitive dependency resolution, fetches the required packages, possibly compiles native code, and installs the Python code plus the native code somewhere.

Question is, how best to integrate this in the brave new world of WORKSPACE files? In particular, your plans seem to hinge upon separating dependency resolution (non-hermetic) and actually fetching and building them (hermetic), which is difficult, because pip does both of these things. pip does have provisions for repeatability, but not for separating out the dependency resolution part.

My best plan is that we would have a WORKSPACE file like this:

pip_dependency_set(name="mydeps", requirements="my_requirements.txt")

and the "dependency resolution" of this repository would entail running pip and creating an "installation bundle". Then "fetching and building" would be just unpacking it and adding a convenient BUILD file, e.g. with a target per installed library (i.e. @mydeps//:ladle would be the library called "ladle" fetched according to instructions in my_requirements.txt)

Looks like "WORKSPACE.resolved" version of this rule should use "repeatable pip" (e.g. https://pip.pypa.io/en/stable/user_guide/#hash-checking-mode).

Let's say you have some requirements.txt. Is there a mode in pip to run dependency resolution and then get the specific version numbers to pinpoint them? (maybe you can parse the installation bundle to get them?)
Hopefully someone more experienced with Python can tell :) The reason why I brought up the installation bundle is that it contains a superset of the information that hash-checking mode yields.
 
 

Of course, this would preclude cross-compilation,

I don't have a great idea on how to solve this :(
 
abuse the concept of "dependency resolution",
Hmm, I don't really think so: if the result of a "sync" of the WORKSPACE is a WORKSPACE.resolved with pinpointed hashes/versions, this seems good enough. 
 
depend on a version of pip installed on the host system

Yes. I do not think this matters too much.
Are you sure? I don't know how different pip implementations are. It's a source of non-hermeticity for sure that we can't easily get rid of because we can't fetch or build pip before fetching repositories. And we are stuck with e.g. whatever C++ compiler pip decides to use in any case.

 
and would make it possible to have multiple versions of the same package in the same workspace, or even the same version of the same package multiple times (from different requirements.txt files)

Is that problematic? (especially given we will be able to split diamond dependencies soon: https://bazel-review.googlesource.com/c/bazel/+/42172)
You tell me! I'm fine with it, but I don't know how well it meshes with your future plans.
 
 

On the flip side, we wouldn't have to re-implement anything (e.g. version resolution or compiling native code) from pip, which is a very welcome development.

\o/
 

WDYT?





Doug Greiman

Feb 28, 2018, 5:54:34 PM
to Lukács T. Berki, Dmitry Lomov, John Field, Carmi Grushko, Danna Kelmer, Klaus Aehlig, bazel-si...@googlegroups.com
Question 1: Dependency resolution hermetic or not?

Sample requirements.txt:
  django

Sample pinned requirements.txt for Python 2, os == 'linux' (e.g. output of "pip install -r requirements.txt; pip freeze > requirements.txt.lock")
  Django==1.11.10
  futures==3.1.1
  pytz==2017.4

Sample pinned requirements.txt for Python 3, os == 'mac'
  Django==2.0.2
  pytz==2018.3

Doing this pinning requires figuring out the right version of django, then the right wheel or sdist for that version, fetching it, installing it, reading the metadata for dependencies, then recursing on those dependencies.  You do this all in a virtualenv (probably one of several).  You could certainly throw away the packages you fetch, and fetch them again in the "hermetic" part if you wanted.
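A minimal sketch of that pinning step, assuming a POSIX shell and that virtualenv/pip are already on the PATH:

  virtualenv /tmp/pin-env && . /tmp/pin-env/bin/activate
  pip install -r requirements.txt        # resolve and install the transitive closure
  pip freeze > requirements.txt.lock     # record the exact versions that were chosen
  deactivate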

Question 2: One "pip" or many?

Scenario 2.1: You have Python 2 and 3 targets in a Bazel repository.  You use Django.  Django 2.0 only works under Python 3.  Your Python 2 code still uses Django 1.x.

You have a requirements.txt like this: 
  django < 2.0; python_version == "2.7"
  django >= 2.0; python_version >= "3.4.3"

Django 1.x and 2.x have different sub-dependencies.  The only way to find these sub-dependencies is to actually fetch the django package metadata and recursively evaluate it.

Unfortunately, when you run "pip", you can't tell it to evaluate requirements for Python 2 or Python 3.  "pip" uses whatever interpreter version it's running under.  So Bazel needs to have host versions of pip like "pip2.7" and "pip3.4" and "pip3.5" etc. for every target version of interest.  You can also do "python3.4 -m pip", except on platforms like Debian where they break it :(
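So concretely, the resolution would have to be repeated once per interpreter, something like this (a sketch; assumes both interpreters exist on the host, and in practice each run would go into its own virtualenv):

  python2.7 -m pip install -r requirements.txt && python2.7 -m pip freeze > requirements-py27.lock
  python3.4 -m pip install -r requirements.txt && python3.4 -m pip freeze > requirements-py34.lock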

Scenario 2.2: Maybe you also have a host system which is MacOS (x64) but the target system is iOS (ARM).  I.e. everyone doing iOS development.  But the foozle 2.x package hasn't been updated to work on MacOS.

So you have requirements.txt:
  foozle < 2; os == 'mac'
  foozle >= 2 ; os != 'mac'

For this type of thing, you're a bit stuck.  "pip", when run on a Mac, always evaluates the "os" as "mac".  So maybe you hack around it, like not using "foozle", or declaring a different set of targets for "MacOS" and "iOS" that have different requirements.txt files.

We'll either need to reimplement at least some of "pip" in Bazel (bleah), or else make some improvements to "pip" (ok), or just live with hacks and limitations (hmm).  We can handle scenario 2.1 with some annoyance, scenario 2.2 is harder.



Carmi Grushko

Feb 28, 2018, 6:05:05 PM
to dgre...@google.com, Lukács T. Berki, Dmitry Lomov, John Field, Danna Kelmer, Klaus Aehlig, bazel-si...@googlegroups.com
Reiterating the ideas from dsl...@google.com and jfi...@google.com (and maybe others?) -
Suppose we have separate Docker containers for Python 2 and Python 3, each with their own Python and pip versions,
and Bazel "builds" the third-party deps by running pip inside the containers and pulling out the resulting artifacts.

It probably doesn't help with cross-compilation, for which we need hardware virtualization if pip can't cross-compile.




joost.v...@gmail.com

Mar 1, 2018, 4:05:01 AM
to Bazel/Python Special Interest Group
Instead of using pip it may be useful to borrow some of the functionality of pipenv: https://github.com/pypa/pipenv

Lukács T. Berki

Mar 1, 2018, 4:05:58 AM
to Doug Greiman, Dmitry Lomov, John Field, Carmi Grushko, Danna Kelmer, Klaus Aehlig, bazel-si...@googlegroups.com
On Wed, 28 Feb 2018 at 23:54, Doug Greiman <dgre...@google.com> wrote:
Question 1: Dependency resolution hermetic or not?

Sample requirements.txt:
  django

Sample pinned requirements.txt for Python 2, os == 'linux' (e.g. output of "pip install -r requirements.txt; pip freeze > requirements.txt.lock")
  Django==1.11.10
  futures==3.1.1
  pytz==2017.4

Sample pinned requirements.txt for Python 3, os == 'mac'
  Django==2.0.2
  pytz==2018.3

Doing this pinning requires figuring out the right version of django, then the right wheel or sdist for that version, fetching it, installing it, reading the metadata for dependencies, then recursing on those dependencies.  You do this all in a virtualenv (probably one of several).  You could certainly throw away the packages you fetch, and fetch them again in the "hermetic" part if you wanted.
Does getting the metadata of a dependency require fetching and installing it? 
 

Question 2: One "pip" or many?

Scenario 2.1: You have Python 2 and 3 targets in a Bazel repository.  You use Django.  Django 2.0 only works under Python 3.  Your Python 2 code still uses Django 1.x.

You have a requirements.txt like this: 
  django < 2.0; python_version == "2.7"
  django >= 2.0; python_version >= "3.4.3"

Django 1.x and 2.x have different sub-dependencies.  The only way to find these sub-dependencies is to actually fetch the django package metadata and recursively evaluate it.

Unfortunately, when you run "pip", you can't tell it to evaluate requirements for Python 2 or Python 3.  "pip" uses whatever interpreter version it's running under.  So Bazel needs to have host versions of pip like "pip2.7" and "pip3.4" and "pip3.5" etc. for every target version of interest.  You can also do "python3.4 -m pip", except on platforms like Debian where they break it :(
It's actually even worse: when evaluating the WORKSPACE file, Bazel does not have access to the configuration, so it can't tell if the target wants to run on Python 2 or Python 3. Furthermore, the target OS is also part of the configuration, so you don't even have that. So you must assume that target OS == host OS and pick a Python version in the WORKSPACE file.


Lukács T. Berki

Mar 1, 2018, 5:17:51 AM
to Doug Greiman, Dmitry Lomov, John Field, Carmi Grushko, Danna Kelmer, Klaus Aehlig, bazel-si...@googlegroups.com
After a bit more thinking, I have the following arguments for making the dependency resolution AND building the PIP packages part of "bazel fetch", and eventually part of the non-hermetic dependency resolution instead of the hermetic fetching:
  • Both dependency resolution and building are at the moment non-hermetic: they depend on the host Bazel runs on, require pip, and the easiest thing is to use whatever pip is installed
  • We can't easily make them hermetic at this point in time: WORKSPACE rules don't have access to BuildConfiguration, which is where the target architecture is specified and which is a natural place for things like the desired Python version and whether cross-compilation is desired. They also can't run actions anywhere other than the host system, and there is no official way to tell them where cross-compilers / other pip or Python binaries are.
  • It yields a very simple and intuitive result: a repository with a target for each dependency desired.
  • When (and if) we have the ability to run dependency resolution / fetching in multiple configurations, on remote hosts, and so on, it won't require changes to BUILD files (since all they say is deps=["@pyrepo//:pypackage"]) and only minimal changes to WORKSPACE files, if any: if we stick with the "use whatever pip / C++ compiler / Python version is available" approach, they would eventually take these from the BuildConfiguration, and if we add flags for these, they are easy to remove
The only contentious issue I see is whether BuildConfiguration is the right place for things that can influence package resolution. I can't think of a better one, but maybe Dmitry/Klaus/Danna/Carmi have a different ingenious plan.

Dmitry Lomov

Mar 1, 2018, 8:00:07 AM
to Lukács T. Berki, Doug Greiman, John Field, Carmi Grushko, Danna Kelmer, Klaus Aehlig, bazel-si...@googlegroups.com
On Thu, Mar 1, 2018 at 11:17 AM Lukács T. Berki <lbe...@google.com> wrote:
After a bit more thinking, I have the following arguments for making the dependency resolution AND building the PIP packages part of "bazel fetch", and eventually part of the non-hermetic dependency resolution instead of the hermetic fetching:

The problem with that is: we do not have any place to put the results of non-hermetic build of PIP packages if it happens during "non-hermetic dependency resolution" (I think you mean non-deterministic, really).
The current thinking is:
 * "non-deterministic dependency resolution" aka `bazel sync` produces WORKSPACE.resolved
* "determenistic fetching" aka `bazel fetch` fetches predictable artifacts based on what is in WORKSPACE.resolved 

There are no other artifacts planned besides WORKSPACE.resolved that pass from `bazel sync` to `bazel fetch`. 
So what to do? I see several ways out:
a) output enough information into WORKSPACE.resolved to make the pip run deterministic
b) have support of additional artifacts that accompany WORKSPACE.resolved and that bazel sync would generate. The users will need to check them in and update on bazel sync.
c) accept that depending on pip packages is inherently non-deterministic and the users have to make their builds reproducible in other ways (either by checking in prebuilt bundles, or by dockerizing)
d) something else?...

(c) would be similar to how we approach the C++ toolchain today: we came to accept that it is inherently dependent on the execution environment (although there are ways to hermeticize it)


Dmitry

Lukács T. Berki

Mar 1, 2018, 8:12:35 AM
to Dmitry Lomov, Doug Greiman, John Field, Carmi Grushko, Danna Kelmer, Klaus Aehlig, bazel-si...@googlegroups.com
On Thu, 1 Mar 2018 at 14:00, Dmitry Lomov <dsl...@google.com> wrote:



On Thu, Mar 1, 2018 at 11:17 AM Lukács T. Berki <lbe...@google.com> wrote:
After a bit more thinking, I have the following arguments for making the dependency resolution AND building the PIP packages part of "bazel fetch", and eventually part of the non-hermetic dependency resolution instead of the hermetic fetching:

The problem with that is: we do not have any place to put the results of non-hermetic build of PIP packages if it happens during "non-hermetic dependency resolution" (I think you mean non-deterministic, really).
The current thinking is:
 * "non-deterministic dependency resolution" aka `bazel sync` produces WORKSPACE.resolved
* "determenistic fetching" aka `bazel fetch` fetches predictable artifacts based on what is in WORKSPACE.resolved 

There are no other artifacts planned besides WORKSPACE.resolved that pass from `bazel sync` to `bazel fetch`. 
So what to do? I see several ways out:
a) output enough information into WORKSPACE.resolved to make the pip run deterministic
We can't do that, because the output of the pip run (including compiling native code) depends on the Python version / OS / C++ compiler that is installed.

b) have support of additional artifacts that accompany WORKSPACE.resolved and that bazel sync would generate. The users will need to check them in and update on bazel sync.
This would work, I guess, but I don't think anyone would be thrilled at the prospect of Bazel essentially forcing them to check in binary blobs.
 
c) accept that depending on pip packages is inherently non-deterministic and the users have to make their builds reproducible in other ways (either by checking in prebuilt bundles, or by dockerizing)
...but then why monkey around with "fetch" and "sync"? The whole point of having two things is that one is deterministic and the other is not. If you put the boundary between "fetch" and "sync" at "accessing the network" then putting pip package checksums into WORKSPACE.resolved makes sense, but not if the boundary is that one is deterministic and the other isn't. I think the least bad approach is putting package checksums into WORKSPACE.resolved. Then "bazel fetch" would not be deterministic, but at least the result wouldn't be radically different. And some of the behavior that's dependent on the system (package choice based on Python version + OS) would be in "bazel sync".


d) something else?...

(c) would be similar to how we approach the C++ toolchain today: we came to accept that it is inherently dependent on the execution environment (although there are ways to hermeticize it)
Except that those are not WORKSPACE rules.

Lukács T. Berki

Mar 1, 2018, 8:28:49 AM
to Dmitry Lomov, Doug Greiman, John Field, Carmi Grushko, Danna Kelmer, Klaus Aehlig, bazel-si...@googlegroups.com, John Cater
On a related note: do you already have plans for what should happen if the set of things fetched over the network depends on the architecture you want to build for?

Currently the plan seems to be to just ignore that problem and go with whatever the host system needs, which is fine for the time being as long as it's compatible with whatever you have in mind for the future. The official location for the target platform is currently the BuildConfiguration, but that's not available during "bazel fetch" or "bazel sync". So we either make that available (somehow), require people to hard-code choices in their WORKSPACE files, or?

Dmitry Lomov

Mar 1, 2018, 8:46:47 AM
to Lukács T. Berki, Doug Greiman, John Field, Carmi Grushko, Danna Kelmer, Klaus Aehlig, bazel-si...@googlegroups.com
On Thu, Mar 1, 2018 at 2:12 PM Lukács T. Berki <lbe...@google.com> wrote:



On Thu, 1 Mar 2018 at 14:00, Dmitry Lomov <dsl...@google.com> wrote:



On Thu, Mar 1, 2018 at 11:17 AM Lukács T. Berki <lbe...@google.com> wrote:
After a bit more thinking, I have the following arguments for making the dependency resolution AND building the PIP packages part of "bazel fetch", and eventually part of the non-hermetic dependency resolution instead of the hermetic fetching:

The problem with that is: we do not have any place to put the results of non-hermetic build of PIP packages if it happens during "non-hermetic dependency resolution" (I think you mean non-deterministic, really).
The current thinking is:
 * "non-deterministic dependency resolution" aka `bazel sync` produces WORKSPACE.resolved
* "determenistic fetching" aka `bazel fetch` fetches predictable artifacts based on what is in WORKSPACE.resolved 

There are no other artifacts planned besides WORKSPACE.resolved that pass from `bazel sync` to `bazel fetch`. 
So what to do? I see several ways out:
a) output enough information into WORKSPACE.resolved to make the pip run deterministic
We can't do that, because the output of the pip run (including compiling native code) depends on the Python version / OS / C++ compiler that is installed.
 
Well, all of that can be part of the information in WORKSPACE.resolved - but it is tricky.

 

b) have support of additional artifacts that accompany WORKSPACE.resolved and that bazel sync would generate. The users will need to check them in and update on bazel sync.
This would work, I guess, but I don't think anyone would be thrilled at the prospect of Bazel essentially forcing them to check in binary blobs.

Well, if pip cannot be made reproducible, and people want their builds to be reproducible, then they might just accept that. This cannot be the default mode, though. 

 
 
c) accept that depending on pip packages is inherently non-deterministic and the users have to make their builds reproducible in other ways (either by checking in prebuilt bundles, or by dockerizing)
...but then why monkey around with "fetch" and "sync"? The whole point of having two things is that one is deterministic and the other is not. If you put the boundary between "fetch" and "sync" at "accessing the network" then putting pip package checksums into WORKSPACE.resolved makes sense, but not if the boundary is that one is deterministic and the other isn't. I think the least bad approach is putting package checksums into WORKSPACE.resolved. Then "bazel fetch" would not be deterministic, but at least the result wouldn't be radically different. And some of the behavior that's dependent on the system (package choice based on Python version + OS) would be in "bazel sync".

I am not quite sure what you are trying to say here :) On one hand you are saying "The whole point of having two things is that one is deterministic and the other is not.", on the other hand "putting pip package checksums into WORKSPACE.resolved ... [does not make sense] if the boundary is that one is deterministic and the other isn't", and then you suggest putting hashes into WORKSPACE.resolved. :)

Maybe we can accept that WORKSPACE.resolved is "only mostly deterministic" (modulo pip version/C++ compiler version). The key issue I see with that is multi-platform: how can we have a single repo serving multiple platforms?
Which brings me to the following: can we make building installation bundles from frozen requirements.txt a part of the build? At build time, we know the Python toolchain, the C++ compiler, the execution platform, etc.
The big issue with that, of course, is whether we can pre-fetch everything that is required ahead of the build, if we know all the versions (I assume not, so we need to think more :()




d) something else?...

(c) would be similar to how we approach the C++ toolchain today: we came to accept that it is inherently dependent on the execution environment (although there are ways to hermeticize it)
Except that those are not WORKSPACE rules.

Autoconfiguring the crosstool is.

Dmitry Lomov

Mar 1, 2018, 8:48:40 AM
to Lukács T. Berki, Doug Greiman, John Field, Carmi Grushko, Danna Kelmer, Klaus Aehlig, bazel-si...@googlegroups.com, John Cater
On Thu, Mar 1, 2018 at 2:28 PM Lukács T. Berki <lbe...@google.com> wrote:



On a related note: do you already have plans for what should happen if the set of things fetched over the network depends on the architecture you want to build for?

Currently the plan seems to be to just ignore that problem and go with whatever the host system needs, which is fine for the time being as long as it's compatible with whatever you have in mind for the future. The official location for the target platform is currently the BuildConfiguration, but that's not available during "bazel fetch" or "bazel sync". So we either make that available (somehow), require people to hard-code choices in their WORKSPACE files, or?

The current plan is that you have to predeclare everything for every architecture you care about in the WORKSPACE file (just like we predeclare toolchains).

Klaus Aehlig

Mar 1, 2018, 9:15:06 AM
to Lukács T. Berki, Dmitry Lomov, Doug Greiman, John Field, Carmi Grushko, Danna Kelmer, bazel-si...@googlegroups.com, John Cater
> On a related note: do you already have plans what should happen if the set
> of things fetched over the network depends on the architecture you want to
> build for?

So far, I didn't find the time to write up a design document, but I always had
the idea to separate
- fetch-like WORKSPACE rules from
- configure-like WORKSPACE rules.
My basic idea is that each rule would declare which category it belongs to. Then,
fetch-like rules would behave as Dmitry described, i.e., resolve dependencies and return
the versions found. This resolved information, together with the hashes of the output
would be stored in WORKSPACE.resolved. In particular, those rules would run only
once, regardless of the number of execution platforms you have. Also, once you have
a WORKSPACE.resolved committed, every build from that source tree will either fail
(if the upstream archive is gone), or produce bit-wise identical output.

Configure-like rules, on the other hand, would run once per execution platform and
store their resolution information (e.g., versions chosen) in WORKSPACE.configured
(or WORKSPACE.configured.<platform_name> if the execution platform was not the host
machine).
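If that distinction were expressed at rule-definition time, it might look roughly like this (purely hypothetical syntax; the "configure" attribute is made up here just to illustrate the categorization):

  pip_dependency_set = repository_rule(
      implementation = _pip_dependency_set_impl,
      attrs = {"requirements": attr.label(allow_single_file = True)},
      # made-up flag marking this rule as configure-like, i.e. re-run per execution platform
      configure = True,
  )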

Maybe we should take this idea more seriously. On the other hand, the idea of something
configure-like was that it should be somewhat cheap, which network access by definition
is not... So, no, I'm not aware of any convincing idea here.

John Cater

Mar 1, 2018, 10:01:17 AM
to Klaus Aehlig, Lukács T. Berki, Dmitry Lomov, Doug Greiman, John Field, Carmi Grushko, Danna Kelmer, bazel-si...@googlegroups.com
The way existing toolchain-enabled rules work is exactly like this: the remote repository for, say, rules_go defines several toolchains for several different execution and target environments, and then individual targets depend on whichever toolchain is selected. That selected toolchain then becomes a dependency of the target, and whatever actions are needed to create it happen at that point. This is how the go rules only download the go sdks actually in use, not all ~200 that are defined.

The important part is separating out the repository-level step of "defining the toolchain" from the target-level step of actually configuring and downloading artifacts.
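For reference, the per-platform predeclaration described above looks roughly like this in toolchain terms (a sketch; the toolchain_type and labels are invented for the pip case):

  # Declared up front; nothing behind "@pypi_django_2_0_2" is fetched unless
  # toolchain resolution actually selects this entry for some target.
  toolchain(
      name = "django_linux_py3",
      target_compatible_with = ["@bazel_tools//platforms:linux"],
      toolchain = "@pypi_django_2_0_2//:django",
      toolchain_type = "//python:pypi_library_type",
  )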

Lukács T. Berki

Mar 2, 2018, 8:44:08 AM
to John Cater, Klaus Aehlig, Dmitry Lomov, Doug Greiman, John Field, Carmi Grushko, Danna Kelmer, bazel-si...@googlegroups.com
Hm, I like the idea of representing individual platform-specific libraries as toolchains. Well, then calling them a "toolchain" would be a misnomer, but we can live with that for the time being. How about this then: we have a WORKSPACE rule that references a requirements.txt and optionally a platform (the latter not supported yet):

py_dependency_set(name="set", requirements="requirements.txt")

This, when "configured", calls PIP, and gets the transitive set of packages and their hashes out of it. From that, repositories arise for each such package, e.g.

py_pypi_repository(name="pypi_lib_a0bcd5") 

whose name is a function of the PyPI package name and the checksum and which contains rules like

py_library(name="pypi_lib_a0bcd5", deps=["pypi_base_10bb3f//:pypi_base_10bb3f"]

This repository, when fetched, would download the Python package and, if necessary, compile its native dependencies. Then the repository called "set" would contain references like this:

py_platform_dependent_library(name="lib")

and the toolchain resolution mechanism would ensure that @set//:lib finds @pypi_lib_a0bcd5//:pypi_lib_a0bcd5 by matching the set of constraints. Note that this is similar to how genrules should be able to depend on toolchains for the current platform -- John, do we support that? My memory is betraying me...
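As a quick interim shape (before real toolchain resolution is wired up), @set//BUILD could even just alias to the single host-platform repository; this is only a sketch reusing the made-up labels above:

  alias(
      name = "lib",
      actual = "@pypi_lib_a0bcd5//:pypi_lib_a0bcd5",
  )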

This system seems to provide most of the desirable features:
  • Can work in a multi-platform build and for cross-compilation (if py_dependency_set can be resolved in a cross-platform way)
  • After py_dependency_set is resolved, only the repositories for libraries that are actually needed have to be fetched
  • When the same library is used twice, it's fetched only once
  • There is kind of a separation between configuring (py_dependency_set) and fetching (the single-package repositories)
  • We can implement something similar pretty quickly by skipping the toolchain resolution part; when later added, neither WORKSPACE nor BUILD files would need to change
With the following disadvantages:
  • A single py_dependency_set rule would need to be able to give rise to multiple single-package repositories
  • The same single-package repository (pypi_lib_a0bcd5) can arise from multiple py_dependency_set rules and I don't know if we can support this easily
  • Naively implemented, a package would be fetched twice: during the resolution of the py_dependency_set rule and during the fetching of the single-package repository
WDYT?

Dmitry Lomov

Mar 7, 2018, 9:32:28 AM
to Lukács T. Berki, John Cater, Klaus Aehlig, Doug Greiman, John Field, Carmi Grushko, Danna Kelmer, bazel-si...@googlegroups.com
I like the direction of this.
I am not quite sure how the "configuration" step would work. What are the inputs to that step?
Is the toolchain resolution mechanism easily extensible to allow this?

Lukács T. Berki

Mar 7, 2018, 9:43:55 AM
to Dmitry Lomov, John Cater, Klaus Aehlig, Doug Greiman, John Field, Carmi Grushko, Danna Kelmer, bazel-si...@googlegroups.com
On Wed, 7 Mar 2018 at 15:32, Dmitry Lomov <dsl...@google.com> wrote:
I like the direction of this.
I am not quite sure how the "configuration" step would work. What are the inputs to that step?
Is the toolchain resolution mechanism easily extensible to allow this?
It's a "configure-like WORKSPACE rule". It runs locally and has access to the local system and uses whatever pip is installed. When we figure out how to run the "configuration" part for systems other than the one Bazel is running on, we can give this a second look. The great thing about it is that we can upgrade transparently: all the remote pip invocations would do is to add new toolchains that could fulflll the "py_platform_dependent_library(name="lib")" part if the toolchain resolution wants, so no changes to BUILD files would be needed.