The other day we discussed collisions between libraries defined as part of the Bazel build and libraries that come from the system or environment. Now I want to discuss a related but distinct kind of collision, between two or more libraries that are both defined as part of the Bazel build. This kind of collision is witnessed in issues #6886 and #7051. This mail is the starting point for what will at some point become a design doc.
The behavior of PYTHONPATH in Bazel is fairly confusing: both the runfiles root and its immediate children (the roots for each repository) are placed on PYTHONPATH. So if you have a source file @somerepo//pkg/subpkg:foo.py in your runfiles, you could import it as either `import somerepo.pkg.subpkg.foo` or `import pkg.subpkg.foo`. This means that for a given Python target, the names of all its transitively depended-on repos, plus all of those repos’ top-level packages that are in runfiles, compete in the same namespace of top-level importable Python packages/modules. Hence the problem in the above issues, where common names of top-level packages like “third_party”, “tools”, “util”, etc. clash between multiple repos.
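To make the clash concrete, here is a small standalone sketch that fakes a runfiles tree in a temp directory and sets up `sys.path` the way described above. It does not call into Bazel at all; the repo names `repo_a` and `repo_b`, the helper names, and the file contents are made up for illustration.

```python
import importlib
import os
import sys
import tempfile


def write_module(root, relpath, body):
    """Write a module under root, marking each parent directory as a package."""
    path = os.path.join(root, relpath)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    d = os.path.dirname(path)
    while os.path.abspath(d) != os.path.abspath(root):
        open(os.path.join(d, "__init__.py"), "a").close()
        d = os.path.dirname(d)
    with open(path, "w") as f:
        f.write(body)


def simulate_runfiles_imports():
    # Hypothetical runfiles layout: three repos, two of which ship "util".
    runfiles = tempfile.mkdtemp()
    write_module(runfiles, "somerepo/pkg/subpkg/foo.py", "NAME = 'foo'\n")
    write_module(runfiles, "repo_a/util/__init__.py", "OWNER = 'repo_a'\n")
    write_module(runfiles, "repo_b/util/__init__.py", "OWNER = 'repo_b'\n")

    # Runfiles root first, then each repo root, mirroring the behavior above.
    repos = ("somerepo", "repo_a", "repo_b")
    entries = [runfiles] + [os.path.join(runfiles, r) for r in repos]
    sys.path[:0] = entries
    try:
        importlib.invalidate_caches()
        qualified = importlib.import_module("somerepo.pkg.subpkg.foo")
        short = importlib.import_module("pkg.subpkg.foo")
        util = importlib.import_module("util")
        return qualified.__file__, short.__file__, util.OWNER
    finally:
        # Undo the sys.path and sys.modules changes made by this demo.
        del sys.path[:len(entries)]
        for name in list(sys.modules):
            if name.split(".")[0] in ("somerepo", "pkg", "util"):
                del sys.modules[name]


if __name__ == "__main__":
    print(simulate_runfiles_imports())
```

Both import spellings resolve to the same file on disk, while `import util` silently picks whichever repo root comes first on the path (`repo_a` here), so `repo_b`'s `util` is unreachable under that name.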
One idea is to not place repo directories in PYTHONPATH at all, and force all imports to be fully qualified with the repo name. This style is justified by the fact that names like “third_party” are common in monorepos but clearly not intended for external consumption. Indeed, Google internally uses the monorepo name “google3” as a prefix for all Bazel-defined Python packages.
From the code history, it looks like the repo-on-the-PYTHONPATH behavior was originally added in f1ac099 to get a use case working ad hoc. The plan was then to turn it down via --[no]experimental_python_import_all_repositories, added in ed77952. But that never materialized, and I don’t think anyone’s using that flag today. In the meantime, nothing but naming conflicts has stopped people from importing using either style.
There are some cases where you do need the contents of a repo to be importable at the top-level. An example is if you have an existing Python project, say numpy, that is not natively built with Bazel. You import it as a Bazel repo “@numpy” using a repository rule, but you wouldn’t want to reference it in your own project as `import numpy.numpy` since that’s pretty non-standard. Nor would you want the repo rule to rewrite the source tree structure to lift the package contents to the root of the repo, just for the sake of working around Bazel’s behavior. We should be able to support this use case on an opt-in basis by having the appropriate py_library add the path of `@numpy//` to its `imports` attribute. That way packages that want to export their top-level packages as top-level Python-importables can do so, but it’s not the default.
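As a rough sketch of what the opt-in would compute: my understanding is that each `imports` entry is interpreted relative to the target's package and ends up as a runfiles-relative directory under the repo. The function below models only that path arithmetic; its name and the error handling are my own assumptions, not Bazel code.

```python
import posixpath


def import_path_entry(repo_name, package, import_dir):
    """Return the runfiles-relative directory an `imports` entry would add.

    Models a hypothetical declaration in the root BUILD file of @numpy:
        py_library(name = "numpy", srcs = [...], imports = ["."])
    """
    if posixpath.isabs(import_dir):
        raise ValueError("imports entries must be relative: %r" % import_dir)
    entry = posixpath.normpath(posixpath.join(repo_name, package, import_dir))
    # Reject entries that climb above the runfiles root.
    if entry == ".." or entry.startswith("../"):
        raise ValueError("imports entry escapes the runfiles root: %r" % import_dir)
    return entry


if __name__ == "__main__":
    # The repo root itself becomes importable at top level.
    print(import_path_entry("numpy", "", "."))
    # A nested package exporting a subdirectory (names are made up).
    print(import_path_entry("myrepo", "third_party/py", "src"))
```

With the `@numpy` root on the path this way, `import numpy` finds the `numpy/` package directory inside the repo, and no other repo's top-level packages are exposed unless they opt in as well.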
The other problem is what to do about repository renaming. Repo renaming is a Bazel feature for splitting diamond dependencies: When I import your repo, I can declare that all your references to repo `@foo` should actually be interpreted as references to repo `@bar`. This renaming affects all of Bazel’s BUILD/bzl processing, but it doesn’t rewrite your source files. So if the runfiles tree would’ve contained a `foo/` directory, it’ll instead contain `bar/`, but your source file will still try to import `foo.[...]`, which would fail at runtime.
We believe that repo renaming should only be in the business of rewriting Bazel packages and labels, not mangling imports in source files. If a user really needs to rename a Python library to import two different versions simultaneously, they can do so but they’re on their own (e.g. they may need to create a dummy forwarding package). This suggests that a py_library target should embed knowledge of its canonical import path regardless of any repo renamings. Then when we construct the runfiles tree, instead of putting the runfiles root on PYTHONPATH, we can make a special directory somewhere underneath it that has the precise symlinks from the canonical names to their actual names.
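A minimal sketch of the symlink-directory idea, assuming a POSIX-style filesystem where symlinks are available. The directory name `_py_imports`, the function names, and the `foo` to `bar` mapping are placeholders of mine; nothing here reflects an actual Bazel implementation.

```python
import importlib
import os
import sys
import tempfile


def build_import_dir(runfiles_root, canonical_to_actual, dirname="_py_imports"):
    """Create a directory of symlinks mapping canonical import names to repo dirs."""
    import_dir = os.path.join(runfiles_root, dirname)
    os.makedirs(import_dir, exist_ok=True)
    for canonical, actual in canonical_to_actual.items():
        target = os.path.join(runfiles_root, actual)
        link = os.path.join(import_dir, canonical)
        os.symlink(target, link, target_is_directory=True)
    return import_dir


def demo():
    # The repo was renamed from "foo" to "bar", so runfiles contains bar/.
    runfiles = tempfile.mkdtemp()
    os.makedirs(os.path.join(runfiles, "bar", "pkg"))
    for d in ("bar", os.path.join("bar", "pkg")):
        open(os.path.join(runfiles, d, "__init__.py"), "w").close()
    with open(os.path.join(runfiles, "bar", "pkg", "mod.py"), "w") as f:
        f.write("VALUE = 42\n")

    # Only the symlink directory goes on the path, not the runfiles root.
    import_dir = build_import_dir(runfiles, {"foo": "bar"})
    sys.path.insert(0, import_dir)
    try:
        importlib.invalidate_caches()
        # Source files keep importing the canonical name.
        return importlib.import_module("foo.pkg.mod").VALUE
    finally:
        sys.path.remove(import_dir)
        for name in list(sys.modules):
            if name == "foo" or name.startswith("foo."):
                del sys.modules[name]


if __name__ == "__main__":
    print(demo())
```

The point of the indirection is that the source keeps saying `import foo.pkg.mod` even though the directory on disk is `bar/`, and only names that were explicitly mapped become importable at the top level.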
There may be some useful precedent in the design of the C++ or Go rules. Both of those languages, like Python and unlike Java, have flat namespaces rather than unambiguous fully-qualified package names. We'll look into this around the time a design doc is written.