Hi all,

Goldie recently reached out to someone in the EasyBuild community regarding problems building TensorFlow from source. I wanted to come back and offer an information exchange, as I spent a (very) considerable amount of time over the last couple of months getting the TensorFlow 2.x releases to build on our HPC system.

To give you an idea of how this works (on our HPC cluster at TU Dresden) and the constraints we have, here is a quick overview:
- No docker allowed (for security)
- All software is installed as modules loaded via Lmod (think: each piece of software lives in its own directory, and "loading a module" means setting environment variables like PATH so it is found/used)
- Some software (e.g. CUDA drivers) is installed on the system itself and often cannot be changed (easily)
- Within a toolchain generation (i.e. a combination of compiler, OpenMPI, BLAS, ...) only a single version of each piece of software is installed, similar to system packages (I've seen contributions from Gentoo folks, so some of you will know what I mean)
- EasyBuild is used to build the software and install the modules. This is basically a framework for downloading, patching, configuring, installing and testing software using recipes. Similar approaches are used in Spack, HPCCM or even in Gentoo.
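For those who don't know EasyBuild: a recipe ("easyconfig") is essentially a small Python file. A minimal, purely illustrative sketch (names, versions and the checksum are placeholders, not our production config) looks roughly like this:

    # Minimal, purely illustrative easyconfig (versions and checksum are placeholders)
    easyblock = 'PythonPackage'

    name = 'TensorFlow'
    version = '2.3.1'

    homepage = 'https://www.tensorflow.org/'
    description = "An open-source machine learning framework"

    # the toolchain fixes compiler, MPI, BLAS, ... for the whole software stack
    toolchain = {'name': 'foss', 'version': '2020a'}

    # sources are downloaded up front and verified against checksums
    source_urls = ['https://github.com/tensorflow/tensorflow/archive/']
    sources = ['v%(version)s.tar.gz']
    checksums = ['<sha256 of the source tarball>']

    builddependencies = [('Bazel', '3.4.1')]
    dependencies = [
        ('Python', '3.8.2'),
        ('SciPy-bundle', '2020.03'),
    ]

    moduleclass = 'lib'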
Challenges we face (especially with TF):
- For reproducible builds all downloading should happen before configuring and downloads should be checksummed
- Narrow version requirements (e.g. currently "scipy==1.4.1") cause major headaches: some other software might already use a different version, and upgrading (or downgrading) is almost impossible, as that means reinstalling the whole affected stack at HPC centers worldwide
- Not using already-installed software increases compilation time and introduces bugs, ODR violations, ... Example: TF builds its own curl but side-steps curl's configure step, resulting in binaries that cannot find the system certificates
- Bazel, the build system itself, is "unusual". Not many people are familiar with it, and making it pick up installed software is challenging because it resets the environment by default. See e.g. https://github.com/tensorflow/tensorflow/issues/37861
- Compilers and binutils are installed into custom locations (i.e. not "/usr"), which requires patching the TF sources (bzl files) so they don't report unexpected includes and do use the correct linker (e.g. there was a hardcoded "/usr/bin"); see the sketch after this list
- Additionally, software (e.g. compilers) might be accessed through symlinks (there are good reasons for that), which breaks Bazel because it does not always resolve symlinks and hence reports unexpected includes
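To make the last two points more concrete, here is a simplified sketch of the kind of patching this involves (file globs and regexes are illustrative, not the exact code we use):

    # Simplified sketch: rewrite hardcoded /usr/bin tool paths in TF's Bazel files
    # so the binutils from the loaded module are picked up (globs/regexes illustrative).
    import re
    import shutil
    from pathlib import Path

    binutils_bin = Path(shutil.which('ar')).parent  # assumes the binutils module is loaded

    for bzl in Path('third_party').rglob('*.bzl'):
        text = bzl.read_text()
        patched = text.replace('-B/usr/bin', '-B%s' % binutils_bin)
        patched = re.sub(r'"/usr/bin/(ar|ld|nm|objcopy|strip)"',
                         r'"%s/\1"' % binutils_bin, patched)
        if patched != text:
            bzl.write_text(patched)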
I'm interested in improving the support for such environments (as mentioned, there are big similarities between e.g. system packages and HPC modules) and have already contributed various issues and PRs. But as we lack expertise in Bazel, it is hard to come up with fixes in TF or workarounds for Bazel's shortcomings. So it would be great to have someone from the TF and/or Bazel team who is interested in supporting our use case. I hope the lists above help in getting an overview of what that involves.
Thanks so far,
Alexander Grund
--
> I'm wondering which bzl file you're talking about here - for cuda_configure.bzl you can set the host compiler prefix to something else.
Yes, in recent versions there are GCC_HOST_COMPILER_PATH and GCC_HOST_COMPILER_PREFIX (the name of the latter is misleading, it is really the binutils prefix), but that hasn't always been the case.
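For reference, on our side this boils down to roughly the following before ./configure runs (the install prefixes below are made-up examples; in practice they come from the loaded modules):

    # Rough sketch: point TF's CUDA toolchain at the module-provided GCC/binutils
    # before ./configure runs (install prefixes below are made-up examples).
    import os

    os.environ['GCC_HOST_COMPILER_PATH'] = '/sw/installed/GCCcore/9.3.0/bin/gcc'
    # despite the name, this is effectively where Bazel looks for binutils
    os.environ['GCC_HOST_COMPILER_PREFIX'] = '/sw/installed/binutils/2.34/bin'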
But we also have code to add/replace "cxx_builtin_include_directory" entries and to check for and replace other "-B/usr/bin/" occurrences as well as paths like "/usr/bin/ar" and the like (there still are some according to our logs, but we stopped checking whether this is still required).
Re "root causes"
- Sure. But it also causes problems with e.g. Intel compilers, which require an environment variable pointing to a license file to be set. We used to create wrapper scripts which set that internally and then forward to the real compiler.
- Yes, that seems to be becoming more popular these days. But as mentioned, for software stacks where mixing different versions is not possible, installing those makes the admins' lives hard. It is easy to see that if every piece of software says "I need the version that is latest right now and won't work with any other version", then installing 2 independent (or, even worse, transitively dependent) libraries/applications becomes impossible, forcing you to use things like Docker that allow running duplicate software stacks next to each other.
- I understand that. I just wanted to include this. And e.g. for scipy there is an open PR with a simple approach that allows extending the range of accepted scipy versions. And as 1.4.1 was chosen to avoid a bug in 1.4.0, one could instead exclude that single version rather than requiring one very specific version.
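Just to illustrate what I mean (this is not the actual setup.py content, and the version numbers are made up), the exact pin could be turned into a range that only excludes the broken release:

    # Illustrative only (not TF's actual setup.py): widen an exact pin so that
    # only the known-broken release is excluded instead of requiring one version.
    REQUIRED_PACKAGES = [
        # 'scipy == 1.4.1',           # exact pin: forces one version onto the whole stack
        'scipy >= 1.2.2, != 1.4.0',   # range: only rules out the release with the bug
    ]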
The Docker cross-build unfortunately won't work for us. There are dynamic dependencies (shared libraries) which need to be used from the cluster (e.g. MPI), and it is safer to have a software stack where the versions of all dependencies are known and fixed, to avoid ODR violations due to e.g. MPI using hwloc 1.11 and TensorFlow using hwloc 2.1.
The second reason this won't work is that we have e.g. Power and ARM nodes, and cross-building to a different architecture is way too much work, if it is possible at all.
On Fri, Sep 4, 2020 at 4:30 PM Alexander Grund <alexand...@tu-dresden.de> wrote:
> Yes, in recent versions there are GCC_HOST_COMPILER_PATH and GCC_HOST_COMPILER_PREFIX (the name of the latter is misleading, it is really the binutils prefix), but that hasn't always been the case.
> But we also have code to add/replace "cxx_builtin_include_directory" entries and to check for and replace other "-B/usr/bin/" occurrences as well as paths like "/usr/bin/ar" and the like (there still are some according to our logs, but we stopped checking whether this is still required).
The solution to cxx_builtin_include_directory is to run the detection of those with the right flags (cuda_configure.bzl, line 277). That will be interesting to design.

The longer-term solution is that we're (slowly :() working towards a unification of the cc_configure and GPU toolchains we have. Once that is done, the next steps will be to make a lot of this behave much more like you'd expect and give users the right set of knobs.
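(Roughly speaking, that detection just asks the compiler itself for its builtin search directories; a standalone Python sketch of the idea, not the actual Starlark code:)

    # Standalone sketch of the idea behind the include-dir detection (not the
    # Starlark implementation): ask the compiler itself for its search path.
    import subprocess

    def builtin_include_dirs(compiler='g++', extra_flags=()):
        """Parse the '#include <...> search starts here:' block from the compiler output."""
        cmd = [compiler, *extra_flags, '-E', '-x', 'c++', '-', '-v']
        stderr = subprocess.run(cmd, input='', capture_output=True, text=True).stderr
        lines = stderr.splitlines()
        start = lines.index('#include <...> search starts here:') + 1
        end = lines.index('End of search list.')
        return [line.strip() for line in lines[start:end]]

    # e.g. builtin_include_dirs('g++', ('--sysroot=/some/sysroot',))
    print(builtin_include_dirs())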
> Yes, that seems to be becoming more popular these days. But as mentioned, for software stacks where mixing different versions is not possible, installing those makes the admins' lives hard. It is easy to see that if every piece of software says "I need the version that is latest right now and won't work with any other version", then installing 2 independent (or, even worse, transitively dependent) libraries/applications becomes impossible, forcing you to use things like Docker that allow running duplicate software stacks next to each other.
Yep. The problem for a company is that not doing that comes at a very large opportunity cost, so we're trying to figure out ways to make this work for all users.
Does the TF build check for that specific version? I think we do work with later versions (in fact, I've updated the install script for our Docker image, removing lots of the "single version" restrictions, and everything builds / tests fine O.O)
Re: libraries: for cross-compilation you'd want to use a sysroot; you can put the right library versions into that sysroot.

Re: archs: that is one of the things Bazel shines at - it is a cross-compiler at heart, so cross-compilation setups are relatively straightforward.
For example, our release is a cross-compile setup that builds on a recent Ubuntu version and cross-compiles down to manylinux2010 (basically CentOS 6 library versions). We do this by downloading the ancient versions from the Ubuntu archive into a sysroot inside a Docker container and passing that sysroot to the TF build. Given that Bazel supports different host/target configurations out of the box, you also generally don't have problems with generators.