Building TensorFlow from source in EasyBuild on HPC systems


Alexander Grund

Sep 4, 2020, 8:33:33 AM
to SIG Build

Hi all,
Goldie recently reached out to someone within the EasyBuild community regarding problems building TensorFlow from source. I wanted to follow up and offer an information exchange, as I have spent a (very) considerable amount of time over the last couple of months getting the TensorFlow 2.x releases to build on our HPC system.

To give you an idea of how this works (on our HPC cluster at TU Dresden) and of the constraints involved, here is a quick overview:
  • No Docker allowed (for security reasons)
  • All software is installed as modules loaded via Lmod (think: each piece of software lives in a separate folder, and "loading a module" means setting environment variables like PATH so it is found/used; see the short example after this list)
  • Some software is installed system-wide (e.g. the CUDA drivers) and often cannot (easily) be changed
  • Within a toolchain generation (i.e. a combination of compiler, OpenMPI, BLAS, ...) only a single version of each piece of software is installed, similar to system packages (I've seen contributions by Gentoo folks, so some of you know what I mean)
  • EasyBuild is used to build the software and install the modules. It is basically a framework for downloading, patching, configuring, installing and testing software using recipes. Similar approaches are used in Spack, HPCCM or even Gentoo.
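To make the module mechanism concrete, here is a rough sketch of what this looks like for a user; the module names, versions and install prefix are made up for illustration:

  $ module load GCC/9.3.0 OpenMPI/4.0.3 CUDA/11.0.2    # hypothetical module names/versions
  $ which gcc
  /sw/installed/GCC/9.3.0/bin/gcc                       # PATH now points into the module's prefix
  $ echo $LD_LIBRARY_PATH
  /sw/installed/CUDA/11.0.2/lib64:/sw/installed/OpenMPI/4.0.3/lib:...   # likewise for library and include paths
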
Challenges we face (especially with TF):
  • For reproducible builds, all downloading should happen before configuring, and downloads should be checksummed
  • Narrow version requirements (e.g. currently "scipy==1.4.1") cause major headaches, as some other software might already use a different version, and upgrading (or downgrading) is almost impossible because that would mean reinstalling the whole affected stack at HPC centers worldwide
  • Not using already-installed software increases compilation time and introduces bugs, ODR violations, etc. Example: TF builds curl but side-steps its configuration step, resulting in binaries that cannot find the system certificates
  • Bazel, the build system, is itself "unusual". Not many people are familiar with it, and making it pick up installed software is challenging because it resets the environment by default (a rough sketch of what we end up doing follows after this list). See e.g. https://github.com/tensorflow/tensorflow/issues/37861
  • Compilers and binutils are installed into custom locations (i.e. not "/usr"), which requires patching of TF sources (bzl files) so that they do not report unexpected includes and use the correct linker (e.g. there was a hardcoded "/usr/bin")
  • Additionally, software (e.g. compilers) might be accessed through symlinks (there are good reasons for that), which makes Bazel break because it does not always resolve symlinks and hence reports unexpected includes
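To give a rough idea of the last two points, this is approximately what we end up doing so that the build sees the module environment; the paths are placeholders, the library names have to match TF's third_party/systemlibs list for the given TF version, and the exact flags vary per installation:

  # let TF use preinstalled third-party libraries instead of rebuilding them
  export TF_SYSTEM_LIBS="curl,zlib"
  # pass the module environment through to Bazel's (otherwise scrubbed) actions
  bazel build \
    --action_env=PATH --action_env=LD_LIBRARY_PATH --action_env=CPATH \
    //tensorflow/tools/pip_package:build_pip_package
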
I'm interested in improving the support for such environments (as mentioned, there are big similarities between e.g. system packages and HPC modules) and have already contributed various issues and PRs. But as we are lacking expertise in Bazel, it is hard to come up with fixes in TF or workarounds for shortcomings of Bazel. So it would be great to have someone from the TF and/or Bazel team who is interested in supporting our use case. I hope the lists above help in getting an overview of what that involves.

Thanks so far,
Alexander Grund

Manuel Klimek

Sep 4, 2020, 9:58:32 AM
to Alexander Grund, SIG Build
On Fri, Sep 4, 2020 at 2:33 PM Alexander Grund <alexand...@tu-dresden.de> wrote:

Hi all,
Goldie recently reached out to someone within the EasyBuild community regarding problems building TensorFlow from source. I wanted to follow up and offer an information exchange, as I have spent a (very) considerable amount of time over the last couple of months getting the TensorFlow 2.x releases to build on our HPC system.

To give you an idea of how this works (on our HPC cluster at TU Dresden) and of the constraints involved, here is a quick overview:
  • No Docker allowed (for security reasons)
  • All software is installed as modules loaded via Lmod (think: each piece of software lives in a separate folder, and "loading a module" means setting environment variables like PATH so it is found/used)
  • Some software is installed system-wide (e.g. the CUDA drivers) and often cannot (easily) be changed
  • Within a toolchain generation (i.e. a combination of compiler, OpenMPI, BLAS, ...) only a single version of each piece of software is installed, similar to system packages (I've seen contributions by Gentoo folks, so some of you know what I mean)
  • EasyBuild is used to build the software and install the modules. It is basically a framework for downloading, patching, configuring, installing and testing software using recipes. Similar approaches are used in Spack, HPCCM or even Gentoo.
Challenges we face (especially with TF):
  • For reproducible builds, all downloading should happen before configuring, and downloads should be checksummed
  • Narrow version requirements (e.g. currently "scipy==1.4.1") cause major headaches, as some other software might already use a different version, and upgrading (or downgrading) is almost impossible because that would mean reinstalling the whole affected stack at HPC centers worldwide
  • Not using already-installed software increases compilation time and introduces bugs, ODR violations, etc. Example: TF builds curl but side-steps its configuration step, resulting in binaries that cannot find the system certificates
  • Bazel, the build system, is itself "unusual". Not many people are familiar with it, and making it pick up installed software is challenging because it resets the environment by default. See e.g. https://github.com/tensorflow/tensorflow/issues/37861
  • Compilers and binutils are installed into custom locations (i.e. not "/usr"), which requires patching of TF sources (bzl files) so that they do not report unexpected includes and use the correct linker (e.g. there was a hardcoded "/usr/bin")
I'm wondering which bzl file you're talking about here - for cuda_configure.bzl you can set the host compiler prefix to something else.

  • Additionally, software (e.g. compilers) might be accessed through symlinks (there are good reasons for that), which makes Bazel break because it does not always resolve symlinks and hence reports unexpected includes
I'm interested in improving the support for such environments (as mentioned, there are big similarities between e.g. system packages and HPC modules) and have already contributed various issues and PRs. But as we are lacking expertise in Bazel, it is hard to come up with fixes in TF or workarounds for shortcomings of Bazel. So it would be great to have someone from the TF and/or Bazel team who is interested in supporting our use case. I hope the lists above help in getting an overview of what that involves.

Generally, I feel for you - all of this comes from a couple of root causes:
1. Bazel is not yet really good at supporting "bring your own libraries & compilers in random locations", as it was fundamentally designed as a monorepo build system and is slowly getting better at a broader set of use cases
2. TF follows a strong "live-at-head" development model, which gives developers fast iteration times at the cost of making it harder to support older libraries
3. Compatibility between Python packages can be tricky to track, so while a lot of versions "might work", testing all the different combinations is really hard

Given your no-Docker constraint, one question is whether you could cross-compile from Docker on a local Linux box instead of trying to build on the HPC system itself? (But I don't know what I'm talking about, so if that makes no sense, let me know.)
 

Thanks so far,
Alexander Grund


Alexander Grund

Sep 4, 2020, 10:30:50 AM
to SIG Build, Manuel Klimek, Alexander Grund
> I'm wondering which bzl file you're talking about here - for cuda_configure.bzl you can set the host compiler prefix to something else.

Yes, in recent versions there are GCC_HOST_COMPILER_PATH and GCC_HOST_COMPILER_PREFIX (the name of the latter is misleading; it is really the binutils prefix), but that hasn't always been the case.
But we also have code to add/replace "cxx_builtin_include_directory" entries and to check for and replace other "-B/usr/bin/" occurrences as well as paths like "/usr/bin/ar" and the like (there still are some according to our logs, but we stopped checking whether this is still required).
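For reference, this is roughly what we export before running TF's ./configure these days; the paths are placeholders for our module locations:

  # point TF's CUDA toolchain at the compiler and binutils from our modules
  export GCC_HOST_COMPILER_PATH=/sw/installed/GCCcore/9.3.0/bin/gcc    # hypothetical path
  export GCC_HOST_COMPILER_PREFIX=/sw/installed/binutils/2.34/bin      # despite the name: the binutils bin dir
  ./configure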

Re "root causes"
  1. Sure. But it also causes problems with e.g. Intel compilers, which require an environment variable pointing to a license file to be set. We used to create wrapper scripts which set that internally and then forward to the real compiler (a sketch follows after this list).
  2. Yes, that seems to be becoming more popular these days. But as mentioned, for software stacks where mixing different versions is not possible, installing such software makes the admins' lives hard. I think it is easy to see that if every piece of software says "I need the version that is latest at this point in time and won't work with any other version", then installing 2 independent (or, even worse, transitively dependent) libraries/applications becomes impossible, forcing you to use things like Docker that allow running duplicate software stacks next to each other.
  3. I understand that. I just wanted to include it. E.g. for scipy there is an open PR with a simple approach that allows extending the range of supported scipy versions. And as 1.4.1 was chosen to avoid a bug in 1.4.0, one could instead exclude that single version rather than requiring one very specific version.
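Regarding point 1, the wrapper scripts were roughly along these lines; the paths and the license file location are made up for illustration:

  #!/bin/bash
  # Hypothetical wrapper installed earlier in PATH than the real Intel compiler.
  # It restores the license environment that gets lost in Bazel's scrubbed action
  # environment and then forwards all arguments to the real compiler.
  export INTEL_LICENSE_FILE=/sw/licenses/intel/license.lic
  exec /sw/installed/icc/2019.1.217/bin/icc "$@"
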
The Docker cross-build won't work, unfortunately. There are dynamic dependencies (shared libs) which need to be used from the cluster (e.g. MPI), and it is safer to have a software stack where the versions of all dependencies are known and fixed, to avoid ODR violations due to e.g. MPI using hwloc 1.11 and TensorFlow using hwloc 2.1.
The second reason this won't work is that we have e.g. Power and ARM nodes, and cross-building to a different architecture is way too much work, if it is possible at all.

Manuel Klimek

Sep 4, 2020, 10:54:56 AM
to Alexander Grund, SIG Build
On Fri, Sep 4, 2020 at 4:30 PM Alexander Grund <alexand...@tu-dresden.de> wrote:
> I'm wondering which bzl file you're talking about here - for cuda_configure.bzl you can set the host compiler prefix to something else.

Yes, in recent versions there are GCC_HOST_COMPILER_PATH and GCC_HOST_COMPILER_PREFIX (the name of the latter is misleading; it is really the binutils prefix), but that hasn't always been the case.
But we also have code to add/replace "cxx_builtin_include_directory" entries and to check for and replace other "-B/usr/bin/" occurrences as well as paths like "/usr/bin/ar" and the like (there still are some according to our logs, but we stopped checking whether this is still required).

The solution to cxx_builtin_include_directory is to run the detection of those with the right flags (cuda_configure.bzl, line 277). That will be interesting to design.
The longer-term solution is that we're (slowly :() working towards a unification of the cc_configure and GPU toolchains we have. Once that is done, the next steps will be to make a lot of this behave much more like you'd expect and to give users the right set of knobs.
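(For reference, the detection essentially boils down to asking the host compiler for its search paths with the flags that will actually be used, roughly like the following; this is just an illustration of the mechanism, not the exact code in cuda_configure.bzl:)

  # print the compiler's builtin include directories (they appear on stderr)
  gcc -E -x c++ - -v < /dev/null 2>&1 \
    | sed -n '/#include <...> search starts here:/,/End of search list./p'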
 

Re "root causes"
  1. Sure. But it also causes problems with e.g. Intel compilers, which require an environment variable pointing to a license file to be set. We used to create wrapper scripts which set that internally and then forward to the real compiler.
 Agreed.
  2. Yes, that seems to be becoming more popular these days. But as mentioned, for software stacks where mixing different versions is not possible, installing such software makes the admins' lives hard. I think it is easy to see that if every piece of software says "I need the version that is latest at this point in time and won't work with any other version", then installing 2 independent (or, even worse, transitively dependent) libraries/applications becomes impossible, forcing you to use things like Docker that allow running duplicate software stacks next to each other.
Yep. The problem for a company is that not doing that carries a very large opportunity cost, so we're trying to figure out ways to make this work for all users.
  3. I understand that. I just wanted to include it. E.g. for scipy there is an open PR with a simple approach that allows extending the range of supported scipy versions. And as 1.4.1 was chosen to avoid a bug in 1.4.0, one could instead exclude that single version rather than requiring one very specific version.
Does the TF build check for the specific version? I think we do work with later versions (in fact, I've updated the install script for our docker removing lots of the "single version" restrictions, and everything builds / tests fine O.O)
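As an aside, in requirement terms such a relaxation could look roughly like this; the lower bound is made up, only the exclusion of 1.4.0 comes from the discussion above:

  # instead of pinning one exact version ...
  pip install 'scipy==1.4.1'
  # ... one could exclude only the known-bad release:
  pip install 'scipy>=1.2.2,!=1.4.0'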
 
The Docker cross-build won't work, unfortunately. There are dynamic dependencies (shared libs) which need to be used from the cluster (e.g. MPI), and it is safer to have a software stack where the versions of all dependencies are known and fixed, to avoid ODR violations due to e.g. MPI using hwloc 1.11 and TensorFlow using hwloc 2.1.
The second reason this won't work is that we have e.g. Power and ARM nodes, and cross-building to a different architecture is way too much work, if it is possible at all.

Re: libraries: for cross-compilation you'd want to use a sysroot; you can put the right library versions into that sysroot.
Re: archs: that is one of the things Bazel shines at - it is a cross-compiler at heart, so cross-compilation setups are relatively straightforward

For example, our release is a cross-compile setup that can build on a recent Ubuntu version and cross-compiles down to manylinux2010 (basically CentOS 6 library versions). We do this by downloading the ancient versions from ubuntu-archive into a sysroot in a Docker image and passing the sysroot to the TF build. Given that Bazel supports different host/target configurations out of the box, you also generally don't have problems with generators.
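A very rough sketch of the idea (not the actual release tooling; the sysroot path is a placeholder, and the real setup configures this via the toolchain rather than plain copts):

  # build against old library versions collected in a sysroot
  bazel build \
    --copt=--sysroot=/sysroots/manylinux2010 \
    --linkopt=--sysroot=/sysroots/manylinux2010 \
    //tensorflow/tools/pip_package:build_pip_package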

Alexander Grund

Sep 4, 2020, 11:48:13 AM
to Manuel Klimek, SIG Build


On 04.09.20 at 16:54, Manuel Klimek wrote:
On Fri, Sep 4, 2020 at 4:30 PM Alexander Grund <alexand...@tu-dresden.de> wrote:
Yes, in recent versions there are GCC_HOST_COMPILER_PATH and GCC_HOST_COMPILER_PREFIX (the name of the latter is misleading; it is really the binutils prefix), but that hasn't always been the case.
But we also have code to add/replace "cxx_builtin_include_directory" entries and to check for and replace other "-B/usr/bin/" occurrences as well as paths like "/usr/bin/ar" and the like (there still are some according to our logs, but we stopped checking whether this is still required).

The solution to cxx_builtin_include_directory is to run the detection of those with the right flags (cuda_configure.bzl, line 277). That will be interesting to design.
The longer-term solution is that we're (slowly :() working towards a unification of the cc_configure and GPU toolchains we have. Once that is done, the next steps will be to make a lot of this behave much more like you'd expect and to give users the right set of knobs.
Yes, I have seen that, and it is possible that our patching is no longer required, especially thanks to a patch for that detection which resolves symlinks: https://github.com/easybuilders/easybuild-easyconfigs/blob/63dfa417e2cd73c145e929894af55572bf0ce29f/easybuild/easyconfigs/t/TensorFlow/TensorFlow-2.1.0_fix-cuda-build.patch
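To make the symlink issue concrete (all paths here are made up):

  $ which g++
  /sw/modules/GCCcore/9.3.0/bin/g++        # the module path, reached via a symlinked prefix
  $ readlink -f $(which g++)
  /lustre/ssd/sw/GCCcore/9.3.0/bin/g++     # the resolved path the compiler itself reports
  # Without resolving symlinks, Bazel compares the two spellings of the same directory
  # literally and flags "unexpected include" errors; the linked patch resolves the
  # detected directories before that comparison.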

  2. Yes, that seems to be becoming more popular these days. But as mentioned, for software stacks where mixing different versions is not possible, installing such software makes the admins' lives hard. I think it is easy to see that if every piece of software says "I need the version that is latest at this point in time and won't work with any other version", then installing 2 independent (or, even worse, transitively dependent) libraries/applications becomes impossible, forcing you to use things like Docker that allow running duplicate software stacks next to each other.
Yep. The problem for a company is that not doing that carries a very large opportunity cost, so we're trying to figure out ways to make this work for all users.
Usually this works fine when requiring a minimum version and staying within a major (semantic) version. Not upgrading just for the sake of it also helps. That is the problem with "live-at-head": people often don't consider the cost of raising the minimum version. E.g. one line of slightly simpler code or a saved cast doesn't offset the potential cost of requiring users to upgrade from 2.1 to 2.9 (just an example of course, I hope you get the gist).

Does the TF build check for the specific version? I think we do work with later versions (in fact, I've updated the install script for our docker removing lots of the "single version" restrictions, and everything builds / tests fine O.O)
It did until now. Funny thing: the whole check was unused except for tests and has now been removed: https://github.com/tensorflow/tensorflow/commit/78026d6a66f7f0fc80c69b1a2f8843616f4cd2a7

Re: libraries: for cross-compilation you'd want to use a sysroot; you can put the right library versions into that sysroot.
Re: archs: that is one of the things Bazel shines at - it is a cross-compiler at heart, so cross-compilation setups are relatively straightforward

For example, our release is a cross-compile setup that can build on a recent Ubuntu version and cross-compiles down to manylinux2010 (basically CentOS 6 library versions). We do this by downloading the ancient versions from ubuntu-archive into a sysroot in a Docker image and passing the sysroot to the TF build. Given that Bazel supports different host/target configurations out of the box, you also generally don't have problems with generators.
Good to know, thanks! However, for us, setting up a sysroot with the libraries used on the different nodes and integrating a Docker-based build into the workflow is likely too much work when we have an integrated framework that works for everything else. So I'd rather get the remaining issues fixed so that TF can be built like everything else :)