On Mon, Mar 30, 2020 at 12:03 PM Sanjoy Das <san...@google.com> wrote:
On Mon, Mar 30, 2020 at 11:32 AM Martin Wicke <wi...@google.com> wrote:
On Mon, Mar 30, 2020 at 10:54 AM Günhan Gülsoy <gu...@google.com> wrote:
+cc Artem, Sanjoy, who are already discussing this on go/tf-pip-size

On Mon, Mar 30, 2020 at 10:50 AM Austin Anderson <ange...@google.com> wrote:
Hi all,

In March's SIG Build meeting, we discussed the recurring growth of TensorFlow's GPU pip wheels, which have recently grown to over 500Mb. This growth, and the storage it requires, causes friction between TensorFlow and PyPI as TF pip packages consume more and more of PyPI's storage space. Amit and I investigated a suggestion from someone (I neglected to record who, but thanks!) and found what appears to be new information: each of the six CUDA compute capabilities in TensorFlow's official wheels adds about 50Mb to the wheel, accounting for all 300Mb of _pywrap_tensorflow_internal.so. I estimated this by extracting and measuring CC-specific files from _pywrap_tensorflow_internal.so with cuobjdump (a rough sketch of this measurement follows the links below): https://gist.github.com/angerson/668fc6062dc06a6660f5be373d1b2dd1

TensorFlow's official wheels are built with compute capabilities 3.5, 3.7, 5.2, 6.0, 6.1, and 7.0. I'm not sure of the original rationale for this list; the earliest code I can find is from October 2017, when wheels were built with 3.0, 3.5, 3.7, 5.2, 6.0, and 6.1. For comparison, PyTorch's 1.5.0 nightly dev wheel is 850Mb; measured by the same method, their seven CCs (3.5, 3.7, 5.0, 6.0, 6.1, 7.0, 7.5) account for 615Mb. PyTorch isn't on PyPI, so they don't share our size problems. In TF 1.5 (2018), each CC was ~25Mb, and every slight CUDA size increase since then has been multiplied by 6.

Here's more information about compute capabilities:
- Definition: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability
- Features: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities
- All Nvidia Products and supported capabilities (no release dates / usage data): https://developer.nvidia.com/cuda-gpus
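As a rough sketch of the per-CC measurement mentioned above (an illustration of the approach, not the exact script from the gist; the .so path is an example and cuobjdump is assumed to be on PATH from the CUDA toolkit):

import glob
import os
import re
import subprocess
import tempfile
from collections import defaultdict

def per_cc_sizes(so_path):
    """Sum the sizes of the embedded cubins in so_path, grouped by sm_XX."""
    sizes = defaultdict(int)
    with tempfile.TemporaryDirectory() as tmp:
        # "cuobjdump -xelf all <file>" extracts every embedded ELF (cubin)
        # into the current working directory.
        subprocess.run(["cuobjdump", "-xelf", "all", os.path.abspath(so_path)],
                       cwd=tmp, check=True)
        for cubin in glob.glob(os.path.join(tmp, "*.cubin")):
            match = re.search(r"sm_(\d+)", os.path.basename(cubin))
            arch = f"sm_{match.group(1)}" if match else "unknown"
            sizes[arch] += os.path.getsize(cubin)
    return dict(sizes)

if __name__ == "__main__":
    for arch, size in sorted(per_cc_sizes("_pywrap_tensorflow_internal.so").items()):
        print(f"{arch}: {size / 2**20:.1f} MiB")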
My questions from this are:
- How can we evaluate which CCs are valuable? IIRC, CCs are backwards-compatible, so as long as 3.5 is there, GPU support will not change. I don't know how to measure the benefits of newer capabilities.
Here are the major architectural jumps I can think of:
- sm_35 -- provides ldg/ldu instructions. Those are fairly commonly used (even if they may no longer be necessary on newer GPUs).
- sm_6x -- provides fp16 support. Older GPUs need to promote fp16 ops to fp32. Noticeable performance bump for fp16. (Nitpick, and not relevant for the nightly PyPI builds: sm_53 (Jetson, Drive PX) introduced fp16.)
- sm_70 -- provides tensor cores -- major performance bump for fp16.
- sm_75 -- improves tensor core performance. Will run sm_70 binaries at about 50% of peak fp16.
- sm_80 -- TBD.

sm_35 is there to support cloud, which still has a lot of those GPUs. sm_6x is the previous generation of consumer cards, widely used outside of Google. sm_70 is widely present in cloud. sm_75 is the current generation of consumer cards (AKA what many external TF users are likely to buy and use these days). I think we'll need to keep these four around.

Starting with CUDA 11, sm_35 (up from sm_30) is the lowest supported target, and sm_52 is the lowest non-deprecated target.

Given all this, here is my recommendation:
- sm_35 (assume we will move to CUDA 11 soon; this is the minimum supported version anyway)
- sm_50 (otherwise we would need to ship ptx_50, and users will complain about a few minutes of single-threaded JITing on first startup)
- sm_60 (for sm_6x consumer and workstation products)
- sm_70 (also covers sm_75, assuming all relevant tensor core code is in cuDNN/cuBLAS)

Notes:
- sm_4x does not exist.
- sm_xy is the SASS (binary) target that is compatible with sm_xz (z >= y).
- Do not ship PTX: JITing is slow, and is only really good for forward compatibility.
- No ptx_7x either: JITing is too slow, and nightly builds don't need to be long-term forward compatible.
When we upgrade to CUDA 11, we will add sm_80.
--Artem
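For illustration, a minimal sketch of what that recommendation could look like as nvcc-style -gencode flags; the helper name and target list are assumptions for this sketch, not the actual TF build configuration (which drives this through TF_CUDA_COMPUTE_CAPABILITIES and Bazel):

# Hypothetical helper mapping the proposed target list to nvcc -gencode flags:
# SASS only (code=sm_XX), no PTX (code=compute_XX), so nothing is left for
# the driver to JIT at startup.
SASS_TARGETS = ["35", "50", "60", "70"]
PTX_TARGETS = []  # deliberately empty, per the "do not ship PTX" note above

def gencode_flags(sass=SASS_TARGETS, ptx=PTX_TARGETS):
    flags = []
    for cc in sass:
        flags.append(f"-gencode=arch=compute_{cc},code=sm_{cc}")
    for cc in ptx:
        flags.append(f"-gencode=arch=compute_{cc},code=compute_{cc}")
    return flags

print(" ".join(gencode_flags()))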
- Could TF publish tf-nightly with fewer capabilities to reduce tf-nightly's space usage? This would improve our standing with PyPI but could slightly lower the usefulness of tf-nightly.
- Would it be feasible and beneficial to host wheels with all capabilities outside of PyPI?
- If valuable but not able to be done officially, could SIG Build do this?
- Are there any other ways we can make large cuts to the wheel size?
Right now I believe we include both PTX *and* SASS for all the listed compute capabilities, but we technically don't need to include PTX for all CCs, since PTX is forward compatible. +Christian Sigg is helping us figure out whether we can be smarter and only include, say, PTX for 3.5 and 7.0. Including PTX for 3.5 should be sufficient for completeness, while including PTX for 7.0 ensures that we are fast on the newest GPUs we don't have SASS for.

Also, Artem has separately suggested that we could drop 5.2 if we really wanted to, but that we should keep SASS for the other CCs and possibly add 7.5.

-- Sanjoy

What are your thoughts on this? I'm wary that removing a capability could hurt TensorFlow's performance, since (as far as I know) we don't have good data on the effects.

Thanks!
Austin
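To double-check which SASS and PTX targets a given wheel actually embeds (rather than relying on "right now I believe"), something like the following cuobjdump-based sketch should work. The flags are cuobjdump's --list-elf and --list-ptx; the output parsing is a guess at its file-naming convention and may need adjusting:

import re
import subprocess
from collections import Counter

def embedded_targets(so_path):
    """Count embedded cubin (SASS) and PTX entries per sm_XX/compute_XX target."""
    found = {}
    for kind, flag in (("sass", "--list-elf"), ("ptx", "--list-ptx")):
        # check=False: a library with no PTX entries shouldn't crash the script.
        out = subprocess.run(["cuobjdump", flag, so_path], check=False,
                             capture_output=True, text=True).stdout
        found[kind] = Counter(re.findall(r"(?:sm|compute)_\d+", out))
    return found

print(embedded_targets("_pywrap_tensorflow_internal.so"))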
But the releases should support newer GPUs, and I'd prefer having the tf-nightly builds be the same as the release builds (within reason).
-- Sanjoy
On Tue, Mar 31, 2020 at 2:03 PM Jason Zaman <ja...@perfinion.com> wrote:
Hey,

A few random questions about all this; can someone clarify or fill in any gaps in my understanding? First off, I found these definitions:
- PTX (Parallel Thread eXecution) is a forward-compatible, human-readable intermediate representation. It defines a RISC-like instruction set architecture. The CUDA runtime compiles it to machine-specific SASS.
- SASS (Shader ASSembler) is the native, architecture-specific instruction set for NVIDIA GPUs. It is usually generated from PTX by ptxas.
- PTX is forward-compatible to all architectures. SASS is only forward-compatible within the same major family (i.e., within Fermi, within Kepler or within Maxwell).
SASS is forward-compatible in the sense that it will execute the code. It makes no promises about performance. E.g., sm_70 SASS will only get you ~50% of the peak fp16 performance on sm_75. sm_61 will execute fp16 ops, but only at 1/128th of the rate of sm_60; you'd be much better off recompiling the code and promoting all fp16 ops to fp32 if you know it will be running on sm_61. sm_30 had much weaker fp64 support than sm_35; running fp64 code on sm_30 would normally come with either reduced precision or a slowdown to emulate fp64. Running the same PTX on sm_35 will work but, again, will be far from optimal.

A lot of these arguments apply to PTX too. There's a bit of wiggle room for ptxas to optimize, but we still typically leave a lot of performance on the table. Presumably performance is something TF users do care about.

Bottom line: if we are serious about claiming that we support a particular GPU variant, we should compile for that particular GPU variant. If we only deliver "sort of running, but no promises about performance", we may as well just give users a CPU-only variant and call it a day.
If I only have a 1080Ti, do I only need SASS 6.1?
This is necessary and sufficient. Having PTX for sm_61 or an older architecture will give you SASS via the in-driver JIT, with varying impact on runtime.
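If it helps, one quick (unofficial) way to confirm which compute capability your local GPU reports, using TF's internal device_lib module -- a convenience hack rather than a supported API; on a 1080 Ti the description should include "compute capability: 6.1":

# Print each local GPU's description string, which includes its compute
# capability. device_lib is internal to TF, so treat this as best-effort.
from tensorflow.python.client import device_lib

for dev in device_lib.list_local_devices():
    if dev.device_type == "GPU":
        print(dev.name, "->", dev.physical_device_desc)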
or do I need both PTX and SASS? (I realize PTX would be generated first to output SASS but are both required in the final runtime binary?)
No, SASS is what gets executed on the GPU. If your executable has it, you're good.
If I build for a newer card, will the PTX be different than for an older card? E.g., will the PTX just not have any fp16 instructions at all, and will fp16 just be completely unused if I use that binary on Turing?
Yes. If you compile for PTX targeting sm_61 and run it on sm_70/sm_75, tensorcore instructions will not be used.
Is SASS sort of like gcc -march=? E.g., a -march=haswell binary will run on Skylake, but -march=skylake might use instructions that don't exist on Haswell?
Worse. It's like targeting ARM versus x86, if you specify a wrong target, your binary will not work at all. Different generations have completely different binary instruction sets.
and then PTX is like -mtune=? Or am I completely wrong to think of GPU arches like CPU arches?
PTX itself would be closer to -march as it controls a sort-of hierarchically expanding instruction set where new architectures tend to be a superset of older ones.
What does exporting TF_CUDA_COMPUTE_CAPABILITIES=6.1 do? Does it produce both PTX and SASS for Pascal?
I may be wrong here. I believe for compilations done with clang you'll get both PTX and SASS for sm_61. Compilation with nvcc may be different.
Or do I end up with a binary with only SASS for the 1080?
That's what we do internally. We had enough accidental trouble with unintentional in-driver JIT so we explicitly disabled PTX generation to have control over what we run on our GPUs.
It sounds like PTX is different for different capabilities too, and we'd want different PTX and SASS capability lists. Do we have a way of controlling that separately?
I'm not sure I follow you. Whether a particular PTX instruction is available is predicated on the PTX version and the targeted GPU's variant. We do have a way to control GPU variant. PTX version is usually controlled by the CUDA version we're using (e.g. ptxas from older CUDA will refuse PTX 6.5 from CUDA-10.2).
How do other compute projects handle this? E.g., non-ML projects like protein folding? Does everyone just have absolutely massive binaries because they have to build every possible version?
No idea.
You also mentioned cuDNN/cuBLAS. Those are pre-built by NVIDIA, so does all of the above only actually matter for the CUDA kernels in TF itself? Will kernels that come entirely from cuDNN use tensor cores no matter what TF was built with?
Correct.
Or is that such a small percent of all ops to be irrelevant?
We do have a substantial number of CUDA sources we compile ourselves. It's quite a bit smaller than the size of the precompiled CUDA libraries, but the absolute size is not trivial.
Is there an nvcc equivalent of -Os instead of -O3? Can we optimize the older 3.x and 5.x capabilities for size instead of speed?
That's not going to buy us all that much. We still have N different GPU architectures to target, and newer architectures tend to have *much* larger binaries. I believe sm_7x is 16 bytes per instruction, while sm_6x was only 8. It's not exactly an apples-to-apples comparison, but the bloat factor is definitely higher than 1.0.

Plus, again, performance is heavily dependent on optimal use of GPU resources. We can generate small code, but it will likely run an order of magnitude slower than a completely unrolled humongous version, which is what GPU code typically looks like in the end. Failing to unroll hot loops is a common performance regression I see when CUDA code is compiled with clang, which is less aggressive about unrolling than nvcc.
--Artem
Can we just drop all the PTX's and save 50% of the space easily?
We can, and probably should,
though the size saving will not be *that* dramatic. I believe that's what NVIDIA does with their precompiled libraries -- they tend to carry SASS for all currently supported GPUs (or a functional equivalent; e.g., they used to ship only sm_50 and not sm_52). PTX is text, which is highly compressible, and it is embedded in compressed form, so its overall impact on the size is smaller than that of SASS. SASS binaries, on the other hand, are not compressed by default for optimized builds.
We can save some space by compressing the binaries at the cost of some impact on startup performance and, maybe, on memory use.
That sounds easier/better than removing capabilities. (Obviously we should use the recommended list from earlier in the thread, not just our current capabilities.) To do this, we'd probably need to split the TF_CUDA_COMPUTE_CAPABILITIES variable into separate lists for PTX and SASS?
I think it's an 'all of the above' type of scenario -- settle on the set of GPUs we want to support, remove unnecessary PTX, and consider enabling SASS compression based on the observed runtime impact.
--Artem
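For reference, one way the compression knob might be expressed -- an assumption about nvcc/fatbinary options, not something the TF build currently sets: nvcc can forward --compress-all to fatbinary via -Xfatbin, compressing the embedded SASS/PTX images in exchange for some load-time decompression cost.

# Hypothetical extra CUDA compile options enabling fatbin compression.
COMPRESS_FATBIN_COPTS = ["-Xfatbin", "-compress-all"]

def cuda_copts(base_flags, compress=True):
    """Append fatbin compression flags to an existing list of CUDA copts."""
    return list(base_flags) + (COMPRESS_FATBIN_COPTS if compress else [])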
There's a nice and easy (well, relatively) intermediate step: put each flavor into a separate pip package, and make TensorFlow depend on all of them. Then, if people want, they can restrict which ones they install (and we circumvent the package size limit, although it doesn't solve PyPI's bandwidth problem).
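A minimal packaging sketch of that idea, with invented package names (nothing that exists today): the umbrella GPU package would simply declare one dependency per compute-capability payload.

# Hypothetical setup.py for the umbrella package; each tf-cuda-kernels-smXX
# dependency would carry only the kernels built for that architecture.
from setuptools import setup

setup(
    name="tensorflow-gpu",
    version="2.2.0",
    packages=[],  # the real package contents are omitted in this sketch
    install_requires=[
        "tf-cuda-kernels-sm35==2.2.0",
        "tf-cuda-kernels-sm60==2.2.0",
        "tf-cuda-kernels-sm70==2.2.0",
    ],
)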
I'm not thinking about relying on Python; this would only use pip for delivery. I'd like these to be .so files that the depended-upon pip package deposits into a plugins/cuda folder, from where they'll be dlopened (the mechanism for that exists).
On Wed, Apr 1, 2020, 00:49 'Manuel Klimek' via SIG Build <bu...@tensorflow.org> wrote:
And then have a C API to load the content?

Yes, that would be the idea. The issue might be that the content isn't particularly stable, and we probably need to strongly tie versions of this together anyway (i.e., release exactly one version per version of TF, down to maybe even patch releases). If so, ABI also doesn't matter much.
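To make the delivery side concrete, a minimal sketch -- not TF's actual plugin loader; the directory name and loading policy are assumptions -- of dlopening whatever shared objects the per-CC packages dropped into a plugins/cuda folder:

# Pick up every .so deposited by the depended-upon pip packages and load it
# with global symbol visibility so later-loaded code can resolve its symbols.
import ctypes
import glob
import os

def load_cuda_kernel_plugins(plugin_dir):
    handles = []
    for so_path in sorted(glob.glob(os.path.join(plugin_dir, "*.so"))):
        handles.append(ctypes.CDLL(so_path, mode=ctypes.RTLD_GLOBAL))
    return handles

plugins = load_cuda_kernel_plugins("plugins/cuda")
print("loaded %d kernel plugin(s)" % len(plugins))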
From: Jason Zaman <ja...@perfinion.com>
Date: Wednesday, April 1, 2020 at 1:22 PM
To: Mihai Maruseac <mihaim...@google.com>
Cc: Gabriel de Marmiesse <gabrielde...@gmail.com>, Manuel Klimek <kli...@google.com>, Martin Wicke <wi...@google.com>, Sanjoy Das <san...@google.com>, Artem Belevich <t...@google.com>, Christian Sigg <cs...@google.com>, Günhan Gülsoy <gu...@google.com>,
Austin Anderson <ange...@google.com>, SIG Build <bu...@tensorflow.org>, tensorflow-devinfra-team <tensorflow-d...@google.com>, John Kline <jkl...@google.com>
Subject: Re: TF's 500Mb wheels dedicate 300Mb to 6 CUDA compute capabilities