Story for hermetic Python builds?


Andrew Chronister

Jul 1, 2019, 8:03:31 PM
to Bazel/Python Special Interest Group
Python build hermeticity seems to break down when using a system Python interpreter. In particular, even though Bazel orders your dependencies first in PYTHONPATH, that is not a sufficient guarantee of hermeticity in the face of e.g. namespace packages, which as far as I can tell can be declared before Bazel's changes even come into play. If one wants to manage all dependencies with Bazel, including Python dependencies, and never allow importing environment packages, then leaving dist-specific import directories on the path is not acceptable.

As a concrete example, in my team's codebase, we have added several of the "google.cloud" packages (such as google-cloud-storage) to our WORKSPACE exposing py_library targets for use in other scripts. However, if any of us have installed one of these packages in our system Python through pip, such that it declared a namespace package for 'google' and/or 'google.cloud', we get errors:

    from google.cloud import pubsub_v1
ImportError: cannot import name 'pubsub_v1'

It's unclear where the 'google.cloud' module was declared (running Python with -vvv shows the 'google' and 'google.cloud' namespaces being added before the first line of our script is run), but printing out "google.cloud.__path__" clearly shows it found a dist-packages path:

_NamespacePath(['/usr/local/lib/python3.5/dist-packages/google/cloud'])
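One way to spot this kind of leak is to scan `sys.modules` for packages whose `__path__` points outside the directories you expect. The sketch below is a diagnostic under assumptions, not something Bazel provides: `foreign_package_paths` and `allowed_roots` are hypothetical names, and which directories to allow (the runfiles tree, plus the standard library directory if you do not want stdlib packages reported) depends on your setup.

```python
import os
import sys


def _is_under(path, root):
    """True if path is root itself or lies somewhere below it."""
    path = os.path.realpath(path)
    root = os.path.realpath(root)
    return path == root or path.startswith(root.rstrip(os.sep) + os.sep)


def foreign_package_paths(allowed_roots, modules=None):
    """Map package name -> __path__ entries that escape allowed_roots.

    `modules` defaults to sys.modules; passing a dict makes this testable.
    """
    modules = sys.modules if modules is None else modules
    offenders = {}
    for name, module in list(modules.items()):
        try:
            # Plain modules have no __path__; only packages do.
            entries = [p for p in getattr(module, "__path__", []) if isinstance(p, str)]
        except Exception:
            continue  # Some lazy module objects misbehave on attribute access.
        bad = [p for p in entries
               if not any(_is_under(p, root) for root in allowed_roots)]
        if bad:
            offenders[name] = bad
    return offenders
```

Calling `foreign_package_paths([runfiles_dir])` near the top of a failing script should then list every package that was picked up from somewhere else, including any dist-packages entries like the one above.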

The py_runtime docs do say that "a platform runtime is by its nature non-hermetic". But this seems like a different problem than the one described there (that a particular system path must exist). And there is at least one solution (B below) I can think of that makes things more hermetic while still using the system interpreter path.

Here are some options I've thought of for addressing this for my team:

A: Use a virtualenv instead of a direct system interpreter path. However, this just dodges the problem, since non-hermetic packages can still be installed into the virtualenv undetected by Bazel. Also, it burdens all of our developers with an additional setup task when checking out our code, and a number of hacks to get our resulting scripts to work properly in other environments (for example in a Docker container). It might be the most realistic solution at the present time, though.

B: Call Python with the "-S" flag in the stub script. This prevents it from running the code in the system "site.py" that initializes the python path with distribution-specific paths such as /usr/local/lib/python3.5/dist-packages. I tested this in my team's codebase using a wrapper script passed as the py_runtime "interpreter" argument and it fixes our problems outlined above. It would also prohibit any new python targets from depending on packages that we haven't explicitly added to Bazel, which could give non-hermetic output. However, it's difficult to integrate without upstream changes (the wrapper script hack I mentioned gives us trouble in docker and PAR files). Simply adding the -S flag to all python binary invocations probably isn't suitable for everyone as a default, since it greatly increases the barrier to using Python in Bazel. But if it could be an opt-in setting, it might work.
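To see what `-S` changes on a given machine, you can compare the `sys.path` of a child interpreter started with and without the flag. This is only an illustrative sketch: `child_sys_path` and `site_entries` are hypothetical helper names, and matching on the substrings "site-packages" and "dist-packages" is an assumption about how your distribution names those directories.

```python
import json
import subprocess
import sys


def child_sys_path(*interpreter_flags):
    """Start a fresh interpreter with the given flags and return its sys.path."""
    code = "import json, sys; print(json.dumps(sys.path))"
    result = subprocess.run(
        [sys.executable] + list(interpreter_flags) + ["-c", code],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return json.loads(result.stdout)


def site_entries(paths):
    """Keep only entries that look like site-packages/dist-packages directories."""
    return [p for p in paths if "site-packages" in p or "dist-packages" in p]


if __name__ == "__main__":
    print("default:", site_entries(child_sys_path()))
    print("with -S:", site_entries(child_sys_path("-S")))
```

Entries that come from the PYTHONPATH environment variable are still present with `-S`; the flag only skips the `site` module's additions.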

C: Build the Python interpreter itself with Bazel, rather than rely on a system Python interpreter. However, this is very heavy-handed and requires us to maintain a parallel build system for the whole Python codebase, which is a pretty big burden. A slightly weaker version of this would be to prepackage a set of Python binaries and system libraries and include them as a repository in our WORKSPACE, but then we have to do that for each new build host we decide to support. It also doesn't necessarily fix it if we intend to run the resulting code on another machine.

In a build system like Bazel, where hermeticity is a strong goal, this seems like a glaring omission. Am I missing a Bazel feature that fixes this problem? If not, is there a story for getting to a better place with this? And is there an existing issue I can track for my team's particular issue? (if not, should I create one?)

Thanks,
 -- Andrew

Philipp Schrader

Jul 1, 2019, 8:46:59 PM
to Andrew Chronister, Bazel/Python Special Interest Group
Hi Andrew,

I don't have a full answer, but I can tell you how we overcame the majority of the hermeticity problems.
Note that we're not currently using the latest bazel version. We're currently at 0.23 so I apologize if something doesn't apply.

The basic idea is that we have a tarball that contains a Python3.7 installation set as the "files" for the py_runtime rule.
Apologies for the wall of text, but I hope it gets the idea across. Now that I think about it, might have been better as a Github gist. Oh well.

Essentially, it lets us write sandboxed Python that ignores whatever the host has installed. This includes pip as well.
The downside is that importing pip packages into the bazel sandbox is kind of a manual/tedious process.
rules_python works, but as far as I can tell has trouble (i.e. doesn't work) with Python3. Last time I tried to get it to work was a year ago.

- Phil

//:.bazelrc
build --python_top=//:python3

//:WORKSPACE:
http_archive(
    name = "python",
    build_file = "//:python.BUILD",
    sha256 = "1fccf2d002e853f8f5995777c3e2ec84c6d45d72c40e8147ff420a2b21b43ae3",
    url = "http://build-mirror:8000/python-amd64_3.7.2-2_v5.tar.gz",
)

//:python.BUILD:
filegroup(
    name = "python3",
    srcs = glob([
        "usr/lib/python3.7/**/*.py",
        "usr/lib/**/*.so",
        "usr/lib/**/*.so.*",
        "lib/**/*.so",
        "lib/**/*.so.*",
        "usr/bin/xz*",
    ]) + [
        "usr/bin/python3",
        "usr/bin/python3.7",
        "usr/lib/python3.7/lib2to3/Grammar.txt",
        "usr/lib/python3.7/lib2to3/PatternGrammar.txt",
    ],
    visibility = ["//visibility:public"],
)

//:BUILD
py_runtime(
    name = "python3",
    files = [
        "//tools:python3_binary",
        "@python//:python3",
    ],
    interpreter = "//tools:python3_binary",
    visibility = ["//visibility:public"],
)

//tools:BUILD
filegroup(
    name = "python3_binary",
    srcs = ["python3_binary.sh"],
)

//tools:python3_binary.sh
#!/bin/bash
set -e
set -u
set -o pipefail
export PYTHONDONTWRITEBYTECODE=1
BASE_PATH=""
for path in ${PYTHONPATH//:/ }; do
  if [[ "$path" == *.runfiles/python ]]; then
    BASE_PATH="$path"
    export LD_LIBRARY_PATH="$path"/lib/x86_64-linux-gnu:"$path"/usr/lib:"$path"/usr/lib/x86_64-linux-gnu${LD_LIBRARY_PATH+:${LD_LIBRARY_PATH}}
    break
  fi
done
if [[ -z "$BASE_PATH" ]]; then
  echo "Could not find Python base path." >&2
  exit 1
fi
# There are a few utilities (e.g. xz) for which a sandboxed Python application
# shouldn't use the host's versions. Instead, the application should use the
# bundled version.
export PATH="${BASE_PATH}/usr/bin${PATH+:${PATH}}"
# Python really likes to escape the sandbox by periodically dereferencing
# symlinks when importing modules. This breaks some of the RPATH entries that
# bazel adds to its cc_binary targets. To work around this, we make sure that
# all the shared libraries are accessible all the time.
# Here we find the runfiles directory (parent folder of Python's base path) and
# then look for all the solib folders in the child folders. One of the child
# folders could be "com_peloton_tech" or "com_google_protobuf".
shopt -s nullglob
for solib_folder in "${BASE_PATH%/*}"/*/_solib_*; do
  for subfolder in "$solib_folder"/*/; do
    export LD_LIBRARY_PATH="$(readlink -f "$subfolder"):${LD_LIBRARY_PATH}"
  done
done
shopt -u nullglob
# Force everyone to get tk from @python3_tk_repo//
export TCL_LIBRARY="${BASE_PATH}/../python3_tk_repo/usr/share/tcltk/tcl8.6"
export TK_LIBRARY="${BASE_PATH}/../python3_tk_repo/usr/share/tcltk/tk8.6"
# Prevent adding the user's site-packages to the python path. For example, with
# the -s option you should no longer see
# /home/foo/.local/lib/python3.4/site-packages being added to the path.
exec "$BASE_PATH"/usr/bin/python3 -s "$@"
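For readers who find the shell harder to follow, the first part of the wrapper (locating the bundled interpreter from PYTHONPATH and building LD_LIBRARY_PATH) can be restated in Python. This is just a sketch of the same logic for illustration, not part of the setup above; the function names `find_python_base` and `bundled_library_path` are made up here.

```python
def find_python_base(pythonpath, suffix=".runfiles/python"):
    """Return the first PYTHONPATH entry that ends with `suffix`, else None.

    Mirrors the `[[ "$path" == *.runfiles/python ]]` test in the wrapper.
    """
    for entry in pythonpath.split(":"):
        if entry.endswith(suffix):
            return entry
    return None


def bundled_library_path(base, existing=None):
    """Build the LD_LIBRARY_PATH value that the wrapper exports.

    `existing` is the caller's current LD_LIBRARY_PATH, or None if unset.
    """
    dirs = [
        base + "/lib/x86_64-linux-gnu",
        base + "/usr/lib",
        base + "/usr/lib/x86_64-linux-gnu",
    ]
    if existing is not None:
        dirs.append(existing)
    return ":".join(dirs)
```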

--
You received this message because you are subscribed to the Google Groups "Bazel/Python Special Interest Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bazel-sig-pyth...@googlegroups.com.
To post to this group, send email to bazel-si...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bazel-sig-python/d66684bd-8cff-4360-b28c-73802550d26e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Kornelijus Survila

Jul 2, 2019, 6:18:17 PM
to Bazel/Python Special Interest Group, Andrew Chronister
I agree that this is a huge problem, especially as you start migrating to Python 3 and minor versions become a headache.

The only viable solution in my mind is option C, and it's what we do for other languages (C++, Go, Rust). It’s easier to build the interpreters for the platforms you need, and I don’t see why that is any weaker in practice. rules_go and rules_rust already work this way, since those languages distribute compiled versions. I assume rules_python also needs toolchain support if you use that.

Compiling the Python interpreter itself through Bazel is tricky: you’ll have to patch the code to look for the system libraries in the right place, as PYTHONHOME/PYTHONPATH detection no longer works correctly with runfiles.

If anyone else has gone down this path, it’d be nice to hear any experiences.

Kornelijus

gytis.ra...@gmail.com

Jul 3, 2019, 6:27:41 AM
to Kornelijus Survila, Bazel/Python Special Interest Group, Andrew Chronister
There's an interesting take from Mozilla on packaging hermetic builds at https://pyoxidizer.readthedocs.io/en/latest/comparisons.html. It is based on the pre-built, statically linked Python distributions from https://github.com/indygreg/python-build-standalone.
It's obviously at a very early stage, but the direction is there.

Gytis



Jon Brandvein

Jul 8, 2019, 3:12:58 PM
to Bazel/Python Special Interest Group
I think there are two distinct problems discussed in this thread:

    1) building a deployable artifact with minimal reliance on a target platform's system interpreter

    2) providing better guarantees about the imports that are available to a Python target

For the first problem, one limitation we currently face is that we depend on the system interpreter to run the stub script, which does the bootstrapping to launch the user's Python code (manipulating PYTHONPATH, locating the second-stage Python interpreter, and optionally extracting the runfiles zip). We design the stub script so that it's happy with either Python 2 or 3, so in theory the stub doesn't care too much about the system interpreter so long as it actually exists. Eliminating this dependency is #8446.

But this thread is more about imports, and how they're inherited from the second-stage interpreter, which is usually also the system interpreter. Disabling user site packages is #4939, and resolving the priority between system modules and user modules is #5899.

Regarding Andrew's options:

    A. Controlling the system environment outside of Bazel, e.g. with docker or virtualenv: This has some clear advantages given the status quo, but ideally we shouldn't need to do this in the common case. Relying on separate workspace rules for package installation (e.g. rules_python), or possibly vendoring in a specific interpreter, should be preferable to maintaining a custom Python installation on the target platform. Note that for the specific case of virtualenv, Bazel is sensitive to what env was activated at the time you started the server (#4080).

    B. Adding -S (or -s?) to the Python command line: I think this is feasible as an opt-in, and could be done as a boolean attribute on the toolchain, or more generally a list of args. This doesn't solve all import-related problems, but it helps.

    C. Vendoring in the Python interpreter: Though this may be heavy-handed, it surely gives better hermeticity. Note that you don't have to compile it from source if you're willing to check in binaries for your target platforms. #4286 tracks usability improvements for the case of a compiled-from-source interpreter.

But aside from these solutions and from issues #5899 and #4939 mentioned above, we have other problems with imports:

    - Should relative imports be allowed? (#808)

    - Should each repository form its own module search path (#7067 and others)? Spoiler alert: No, but we can't disable this just yet because the alternative, qualifying imports with the repo name, is brittle in the face of repo renaming.

    - How can we avoid Python's insistence on adding undeclared dependencies to the module search path (#7091)? Note that manipulating PYTHONPATH isn't enough since this is unconditionally done upon interpreter startup.
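One familiar instance of the interpreter extending the search path on its own is that CPython puts the directory of the script being run at `sys.path[0]`, regardless of what PYTHONPATH says. Whether that is exactly the behavior #7091 tracks is an assumption here. The sketch below (with a hypothetical helper name) runs a throwaway script in a child interpreter to show it.

```python
import os
import subprocess
import sys
import tempfile


def first_sys_path_entry_for_script(pythonpath):
    """Run a temp script and return (its sys.path[0], the script's directory)."""
    with tempfile.TemporaryDirectory() as script_dir:
        script = os.path.join(script_dir, "main.py")
        with open(script, "w") as f:
            f.write("import sys\nprint(sys.path[0])\n")
        env = dict(os.environ, PYTHONPATH=pythonpath)
        # Newer interpreters can opt out via PYTHONSAFEPATH; drop it so the
        # default behavior is what gets demonstrated.
        env.pop("PYTHONSAFEPATH", None)
        result = subprocess.run(
            [sys.executable, script],
            env=env,
            check=True,
            stdout=subprocess.PIPE,
            universal_newlines=True,
        )
        return os.path.realpath(result.stdout.strip()), os.path.realpath(script_dir)


if __name__ == "__main__":
    first_entry, script_dir = first_sys_path_entry_for_script("/declared/dep")
    print(first_entry == script_dir)
```

Anything that happens to sit next to the entry-point script therefore becomes importable even though it was never declared as a dependency.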

My thinking is that most or all of these import issues could be addressed by the combination of 1) being stricter about site packages / site.py (whether or not that's the default behavior) and 2) injecting some Bazel-generated __init__.py boilerplate that customizes how module loading is done in the user's Python process. The latter is particularly exciting because it would give us a means of controlling import paths without having to create a specific module layout on the filesystem (which is a problem on Windows where you can't rely on symlinks).
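As a rough illustration of the second idea, a custom finder on `sys.meta_path` can resolve declared top-level names from explicit directories, without those directories being on `sys.path` or laid out in any particular way. This is a minimal sketch under assumptions, not Bazel's design: `DeclaredImportFinder`, `install`, and the name-to-directory mapping are all hypothetical, and generated boilerplate would presumably build the mapping from a target's declared deps.

```python
import importlib.abc
import importlib.machinery
import sys


class DeclaredImportFinder(importlib.abc.MetaPathFinder):
    """Resolve declared top-level imports from explicitly listed directories."""

    def __init__(self, declared_roots):
        # Maps a top-level module name to the directory that contains it.
        self._declared_roots = dict(declared_roots)

    def find_spec(self, fullname, path=None, target=None):
        if path is not None:
            # Submodule import: the parent package's __path__ already points
            # at the right place, so let the default machinery handle it.
            return None
        root = self._declared_roots.get(fullname)
        if root is None:
            return None  # Not declared; fall through to the other finders.
        return importlib.machinery.PathFinder.find_spec(fullname, [root])


def install(declared_roots):
    """Put the finder ahead of the default path-based finder."""
    finder = DeclaredImportFinder(declared_roots)
    sys.meta_path.insert(0, finder)
    return finder
```

Note that this only redirects names it knows about. Actually rejecting undeclared imports would need more than this, and a name already present in `sys.modules` is never passed to any finder, so this alone would not help with packages that were registered before the user's code starts.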
