On Mon, Mar 30, 2020 at 12:03 PM Sanjoy Das <san...@google.com> wrote:
On Mon, Mar 30, 2020 at 11:32 AM Martin Wicke <wi...@google.com> wrote:
On Mon, Mar 30, 2020 at 10:54 AM Günhan Gülsoy <gu...@google.com> wrote:
+cc Artem, Sanjoy, who are already discussing this on go/tf-pip-size

On Mon, Mar 30, 2020 at 10:50 AM Austin Anderson <ange...@google.com> wrote:
Hi all,

In March's SIG Build meeting, we discussed the recurring growth of TensorFlow's GPU pip wheels, which have recently grown to over 500Mb. This growth, and the storage it requires, causes friction between TensorFlow and PyPI as TF pip packages consume more and more of PyPI's storage space. Amit and I investigated a suggestion from someone (I neglected to record who, but thanks!) and found what appears to be new information: each of the six CUDA compute capabilities in TensorFlow's official wheels adds about 50Mb to the wheel, accounting for all 300Mb of _pywrap_tensorflow_internal.so. I estimated this by extracting and measuring CC-specific files from _pywrap_tensorflow_internal.so with cuobjdump (a rough sketch of this measurement follows the links below): https://gist.github.com/angerson/668fc6062dc06a6660f5be373d1b2dd1

TensorFlow's official wheels are built with compute capabilities 3.5, 3.7, 5.2, 6.0, 6.1, and 7.0. I'm not sure of the original rationale for this list; the earliest code I can find is from October 2017, when wheels were built with 3.0, 3.5, 3.7, 5.2, 6.0, and 6.1. For comparison, PyTorch's 1.5.0 nightly dev wheel is 850Mb; measured by the same method, their seven CCs (3.5, 3.7, 5.0, 6.0, 6.1, 7.0, 7.5) account for 615Mb. PyTorch isn't on PyPI, so they don't share our size problems. In TF 1.5 (2018), each CC was ~25Mb, and every slight CUDA size increase since then has been multiplied by 6.

Here's more information about compute capabilities:
- Definition: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability
- Features: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities
- All Nvidia Products and supported capabilities (no release dates / usage data): https://developer.nvidia.com/cuda-gpus
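As a rough sketch of the per-CC measurement mentioned above (an illustration of the approach, not the exact script from the gist; the .so path is an example and cuobjdump is assumed to be on PATH from the CUDA toolkit):

import glob
import os
import re
import subprocess
import tempfile
from collections import defaultdict

def per_cc_sizes(so_path):
    """Sum the sizes of the embedded cubins in so_path, grouped by sm_XX."""
    sizes = defaultdict(int)
    with tempfile.TemporaryDirectory() as tmp:
        # "cuobjdump -xelf all <file>" extracts every embedded ELF (cubin)
        # into the current working directory.
        subprocess.run(["cuobjdump", "-xelf", "all", os.path.abspath(so_path)],
                       cwd=tmp, check=True)
        for cubin in glob.glob(os.path.join(tmp, "*.cubin")):
            match = re.search(r"sm_(\d+)", os.path.basename(cubin))
            arch = f"sm_{match.group(1)}" if match else "unknown"
            sizes[arch] += os.path.getsize(cubin)
    return dict(sizes)

if __name__ == "__main__":
    for arch, size in sorted(per_cc_sizes("_pywrap_tensorflow_internal.so").items()):
        print(f"{arch}: {size / 2**20:.1f} MiB")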
My questions from this are:
- How can we evaluate which CCs are valuable? IIRC, CCs are backwards-compatible, so as long as 3.5 is there, GPU support will not change. I don't know how to measure the benefits of newer capabilities.
Here are the major architectural jumps I can think of:
- sm_35 -- provides ldg/ldu instructions. Those are fairly commonly used (even if they may no longer be necessary on newer GPUs).
- sm_6x -- provides fp16 support. Older GPUs need to promote fp16 ops to fp32. Noticeable performance bump for fp16. (Nitpick, and not relevant for the nightly PyPI builds: sm_53 (Jetson, Drive PX) introduced fp16.)
- sm_70 -- provides tensor cores -- major performance bump for fp16.
- sm_75 -- improves tensor core performance. Will run sm_70 binaries at about 50% of peak fp16.
- sm_80 -- TBD.

sm_35 is there to support cloud, which still has a lot of those GPUs. sm_6x is the previous generation of consumer cards, widely used outside of Google. sm_70 is widely present in cloud. sm_75 is the current generation of consumer cards (AKA what many external TF users are likely to buy and use these days). I think we'll need to keep these four around.

Starting with CUDA 11, sm_35 (up from sm_30) is the lowest supported target, and sm_52 is the lowest non-deprecated target.

Given all this, here is my recommendation:
- sm_35 (assume we will move to CUDA 11 soon; this is the minimum supported version anyway)
- sm_50 (otherwise we would need to ship ptx_50, and users will complain about a few minutes of single-threaded JITing on first startup)
- sm_60 (for sm_6x consumer and workstation products)
- sm_70 (also covers sm_75, assuming all relevant tensor core code is in cuDNN/cuBLAS)

Notes:
- sm_4x does not exist.
- sm_xy is the SASS (binary) target that is compatible with sm_xz (z >= y).
- Do not ship PTX: JITing is slow, and is only really good for forward compatibility.
- No ptx_7x either: JITing is too slow, and nightly builds don't need to be long-term forward compatible.
When we upgrade to CUDA 11, we will add sm_80.
--Artem
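For illustration, a minimal sketch of what that recommendation could look like as nvcc-style -gencode flags; the helper name and target list are assumptions for this sketch, not the actual TF build configuration (which drives this through TF_CUDA_COMPUTE_CAPABILITIES and Bazel):

# Hypothetical helper mapping the proposed target list to nvcc -gencode flags:
# SASS only (code=sm_XX), no PTX (code=compute_XX), so nothing is left for
# the driver to JIT at startup.
SASS_TARGETS = ["35", "50", "60", "70"]
PTX_TARGETS = []  # deliberately empty, per the "do not ship PTX" note above

def gencode_flags(sass=SASS_TARGETS, ptx=PTX_TARGETS):
    flags = []
    for cc in sass:
        flags.append(f"-gencode=arch=compute_{cc},code=sm_{cc}")
    for cc in ptx:
        flags.append(f"-gencode=arch=compute_{cc},code=compute_{cc}")
    return flags

print(" ".join(gencode_flags()))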
- Could TF publish tf-nightly with fewer capabilities to reduce tf-nightly's space usage? This would improve our standing with PyPI but could slightly lower the usefulness of tf-nightly.
- Would it be feasible and beneficial to host wheels with all capabilities outside of PyPI?
- If valuable but not able to be done officially, could SIG Build do this?
- Are there any other ways we can make large cuts to the wheel size?
Right now I believe we include both PTX *and* SASS for all the listed compute capabilities, but we technically don't need to include PTX for all CCs, since PTX is forward compatible. +Christian Sigg is helping us figure out whether we can be smarter and only include, say, PTX for 3.5 and 7.0. Including PTX for 3.5 should be sufficient for completeness, while including PTX for 7.0 ensures that we are fast on the newest GPUs we don't have SASS for.

Also, Artem has separately suggested that we could drop 5.2 if we really wanted to, but that we should keep SASS for the other CCs and possibly add 7.5.

-- Sanjoy

What are your thoughts on this? I'm wary that removing a capability could hurt TensorFlow's performance, since (as far as I know) we don't have good data on the effects.

Thanks!
Austin
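To double-check which SASS and PTX targets a given wheel actually embeds (rather than relying on "right now I believe"), something like the following cuobjdump-based sketch should work. The flags are cuobjdump's --list-elf and --list-ptx; the output parsing is a guess at its file-naming convention and may need adjusting:

import re
import subprocess
from collections import Counter

def embedded_targets(so_path):
    """Count embedded cubin (SASS) and PTX entries per sm_XX/compute_XX target."""
    found = {}
    for kind, flag in (("sass", "--list-elf"), ("ptx", "--list-ptx")):
        # check=False: a library with no PTX entries shouldn't crash the script.
        out = subprocess.run(["cuobjdump", flag, so_path], check=False,
                             capture_output=True, text=True).stdout
        found[kind] = Counter(re.findall(r"(?:sm|compute)_\d+", out))
    return found

print(embedded_targets("_pywrap_tensorflow_internal.so"))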
But the releases should support newer GPUs, and I'd prefer having the tf-nightly builds be the same as the release builds (within reason).
-- Sanjoy
On Tue, Mar 31, 2020 at 2:03 PM Jason Zaman <ja...@perfinion.com> wrote:
Hey,

A few random questions about all this; can someone clarify or fill in any gaps in my understanding? First off, I found these definitions:
- PTX (Parallel Thread eXecution) is a forward-compatible, human-readable intermediate representation. It defines a RISC-like instruction set architecture. The CUDA runtime compiles it to machine-specific SASS.
- SASS (Shader ASSembler) is the native, architecture-specific instruction set for NVIDIA GPUs. It is usually generated from PTX by ptxas.
- PTX is forward-compatible to all architectures. SASS is only forward-compatible within the same major family (i.e., within Fermi, within Kepler or within Maxwell).
SASS is forward-compatible in the sense that it will execute the code. It makes no promises about performance. E.g., sm_70 SASS will only get you ~50% of the peak fp16 performance on sm_75. sm_61 will execute fp16 ops, but only at 1/128th of the rate of sm_60; you'd be much better off recompiling the code and promoting all fp16 ops to fp32 if you know it will be running on sm_61. sm_30 had much weaker fp64 support than sm_35; running fp64 code on sm_30 would normally come with either reduced precision or a slowdown to emulate fp64. Running the same PTX on sm_35 will work but, again, will be far from optimal.

A lot of these arguments apply to PTX too. There's a bit of wiggle room for ptxas to optimize, but we still typically leave a lot of performance on the table. Presumably performance is something TF users do care about.

Bottom line: if we are serious about claiming that we support a particular GPU variant, we should compile for that particular GPU variant. If we only deliver "sort of running, but no promises about performance", we may as well just give users a CPU-only variant and call it a day.
If I only have a 1080Ti, do I only need SASS 6.1?
This is necessary and sufficient. Having PTX for sm_61 or an older architecture will give you SASS via the in-driver JIT, with varying impact on runtime.
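If it helps, one quick (unofficial) way to confirm which compute capability your local GPU reports, using TF's internal device_lib module -- a convenience hack rather than a supported API; on a 1080 Ti the description should include "compute capability: 6.1":

# Print each local GPU's description string, which includes its compute
# capability. device_lib is internal to TF, so treat this as best-effort.
from tensorflow.python.client import device_lib

for dev in device_lib.list_local_devices():
    if dev.device_type == "GPU":
        print(dev.name, "->", dev.physical_device_desc)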
or do I need both PTX and SASS? (I realize PTX would be generated first to output SASS but are both required in the final runtime binary?)
No, SASS is what gets executed on the GPU. If your executable has it, you're good.
If I build for a newer card, will the PTX be different than for an older card? E.g., will the PTX just not have any fp16 instructions at all, and will fp16 just be completely unused if I use that binary on Turing?
Yes. If you compile for PTX targeting sm_61 and run it on sm_70/sm_75, tensorcore instructions will not be used.
Is SASS sort of like gcc -march=? E.g., a -march=haswell binary will run on Skylake, but -march=skylake might use instructions that don't exist on Haswell?
Worse. It's like targeting ARM versus x86, if you specify a wrong target, your binary will not work at all. Different generations have completely different binary instruction sets.
and then PTX is like -mtune=? Or am I completely wrong to think of GPU arches like CPU arches?
PTX itself would be closer to -march as it controls a sort-of hierarchically expanding instruction set where new architectures tend to be a superset of older ones.
What does exporting TF_CUDA_COMPUTE_CAPABILITIES=6.1 do? Does it produce both PTX and SASS for Pascal?
I may be wrong here. I believe for compilations done with clang you'll get both PTX and SASS for sm_61. Compilation with nvcc may be different.
Or do I end up with a binary with only SASS for the 1080?
That's what we do internally. We had enough accidental trouble with unintentional in-driver JIT so we explicitly disabled PTX generation to have control over what we run on our GPUs.
It sounds like PTX is different for different capabilities too, and we'd want different PTX and SASS capability lists. Do we have a way of controlling that separately?
I'm not sure I follow you. Whether a particular PTX instruction is available is predicated on the PTX version and the targeted GPU's variant. We do have a way to control GPU variant. PTX version is usually controlled by the CUDA version we're using (e.g. ptxas from older CUDA will refuse PTX 6.5 from CUDA-10.2).
How do other compute projects handle this? E.g., non-ML projects like protein folding? Does everyone just have absolutely massive binaries because they have to build every possible version?
No idea.
You also mentioned cuDNN/cuBLAS. Those are pre-built by NVIDIA, so does all of the above only actually matter for the CUDA kernels in TF itself? Will kernels that come entirely from cuDNN use tensor cores no matter what TF was built with?
Correct.
Or is that such a small percent of all ops to be irrelevant?
We do have a substantial number of CUDA sources we compile ourselves. It's quite a bit smaller than the size of the precompiled CUDA libraries, but the absolute size is not trivial.
Is there an nvcc equivalent of -Os instead of -O3? Can we optimize the older 3.x and 5.x capabilities for size instead of speed?
That's not going to buy us all that much. We still have N different GPU architectures to target, and newer architectures tend to have *much* larger binaries. I believe sm_7x is 16 bytes per instruction, while sm_6x was only 8. It's not exactly an apples-to-apples comparison, but the bloat factor is definitely higher than 1.0.

Plus, again, performance is heavily dependent on optimal use of GPU resources. We can generate small code, but it will likely run an order of magnitude slower than a completely unrolled humongous version, which is what GPU code typically looks like in the end. Failing to unroll hot loops is a common performance regression I see when CUDA code is compiled with clang, which is less aggressive about unrolling than nvcc.
--Artem
Can we just drop all the PTX's and save 50% of the space easily?
We can, and probably should,
though the size saving will not be *that* dramatic. I believe that's what NVIDIA does with their precompiled libraries -- they tend to carry SASS for all currently supported GPUs (or a functional equivalent; e.g., they used to ship only sm_50 and not sm_52). PTX is text, which is highly compressible, and it is embedded in compressed form, so its overall impact on the size is smaller than that of SASS. SASS binaries, on the other hand, are not compressed by default for optimized builds.
We can save some space by compressing the binaries at the cost of some impact on startup performance and, maybe, on memory use.
That sounds easier/better than removing capabilities. (Obviously we should use the recommended list from earlier in the thread, not just our current capabilities.) To do this, we'd probably need to split the TF_CUDA_COMPUTE_CAPABILITIES variable into separate lists for PTX and SASS?
I think it's an 'all of the above' type of scenario -- settle on the set of GPUs we want to support, remove unnecessary PTX, and consider enabling SASS compression based on the observed runtime impact.
--Artem
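For reference, one way the compression knob might be expressed -- an assumption about nvcc/fatbinary options, not something the TF build currently sets: nvcc can forward --compress-all to fatbinary via -Xfatbin, compressing the embedded SASS/PTX images in exchange for some load-time decompression cost.

# Hypothetical extra CUDA compile options enabling fatbin compression.
COMPRESS_FATBIN_COPTS = ["-Xfatbin", "-compress-all"]

def cuda_copts(base_flags, compress=True):
    """Append fatbin compression flags to an existing list of CUDA copts."""
    return list(base_flags) + (COMPRESS_FATBIN_COPTS if compress else [])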
There's a nice and easy (well, relatively) intermediate step: put each flavor into a separate pip package, and make TensorFlow depend on all of them. Then, if people want, they can restrict which ones they install (and we circumvent the package size limit, although it doesn't solve PyPI's bandwidth problem).
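A minimal packaging sketch of that idea, with invented package names (nothing that exists today): the umbrella GPU package would simply declare one dependency per compute-capability payload.

# Hypothetical setup.py for the umbrella package; each tf-cuda-kernels-smXX
# dependency would carry only the kernels built for that architecture.
from setuptools import setup

setup(
    name="tensorflow-gpu",
    version="2.2.0",
    packages=[],  # the real package contents are omitted in this sketch
    install_requires=[
        "tf-cuda-kernels-sm35==2.2.0",
        "tf-cuda-kernels-sm60==2.2.0",
        "tf-cuda-kernels-sm70==2.2.0",
    ],
)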
I'm not thinking about relying on Python; this would only use pip for delivery. I'd like these to be .so files that the depended-upon pip package deposits into a plugins/cuda folder, from where they'll be dlopened (the mechanism for that exists).
On Wed, Apr 1, 2020, 00:49 'Manuel Klimek' via SIG Build <bu...@tensorflow.org> wrote:
And then have a C API to load the content?

Yes, that would be the idea. The issue might be that the content isn't particularly stable, and we probably need to strongly tie versions of this together anyway (i.e., release exactly one version per version of TF, down to maybe even patch releases). If so, ABI also doesn't matter much.
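To make the delivery side concrete, a minimal sketch -- not TF's actual plugin loader; the directory name and loading policy are assumptions -- of dlopening whatever shared objects the per-CC packages dropped into a plugins/cuda folder:

# Pick up every .so deposited by the depended-upon pip packages and load it
# with global symbol visibility so later-loaded code can resolve its symbols.
import ctypes
import glob
import os

def load_cuda_kernel_plugins(plugin_dir):
    handles = []
    for so_path in sorted(glob.glob(os.path.join(plugin_dir, "*.so"))):
        handles.append(ctypes.CDLL(so_path, mode=ctypes.RTLD_GLOBAL))
    return handles

plugins = load_cuda_kernel_plugins("plugins/cuda")
print("loaded %d kernel plugin(s)" % len(plugins))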
From: Jason Zaman <ja...@perfinion.com>
Date: Wednesday, April 1, 2020 at 1:22 PM
To: Mihai Maruseac <mihaim...@google.com>
Cc: Gabriel de Marmiesse <gabrielde...@gmail.com>, Manuel Klimek <kli...@google.com>, Martin Wicke <wi...@google.com>, Sanjoy Das <san...@google.com>, Artem Belevich <t...@google.com>, Christian Sigg <cs...@google.com>, Günhan Gülsoy <gu...@google.com>,
Austin Anderson <ange...@google.com>, SIG Build <bu...@tensorflow.org>, tensorflow-devinfra-team <tensorflow-d...@google.com>, John Kline <jkl...@google.com>
Subject: Re: TF's 500Mb wheels dedicate 300Mb to 6 CUDA compute capabilities