StreamExecutors on multiple GPUs


art....@gmail.com

Apr 29, 2021, 8:06:55 AM
to XLA development
Hello everyone,

TensorFlow has an amazing feature: it distributes the workload of independent branches in a computational graph among the set of available GPU devices. I always assumed this feature was also available in XLA.

I recently discovered some interesting notes in the docs, and I don't fully understand their implications; check out [1] and, in particular, this commit [2]. Does this mean that automatic execution on multiple GPUs will be removed from XLA (or not implemented by MLIR)? If that's the case, it would be extremely bad, as there are numerous examples where this feature benefits many algorithms (not only in ML). Here are some examples: geometric deep learning [3, 4] (equivariant NNs, graph NNs), all kernel-based methods, optimal transport, kNN, probabilistic models (Gaussian mixture models, Gaussian processes) [5, 6]. In these examples there are independent computations, often memory- and compute-expensive, that can be assigned to multiple devices (GPUs) for a speed-up or to break a memory limitation. I would advocate for further development and improvement of the GPU auto-assignment feature, not for deleting it.

I'd appreciate any help and clarification on what's going on with this feature in XLA development, and I encourage discussion within the community.

[5] C. Bishop, Pattern recognition and machine learning, 2006
[6] K. Murphy, Machine learning: a probabilistic perspective, 2012

Bairen YI

Apr 29, 2021, 10:37:17 AM
to art....@gmail.com, XLA development
Hi,

What you mentioned is support for multiple streams on a single GPU. AFAIK computations on different GPUs are not concerned here; they sync using NCCL and it's working perfectly fine.

Best,
Bairen


Sanjoy Das

Apr 29, 2021, 11:45:51 AM
to Bairen YI, Tim Shen, art....@gmail.com, XLA development

art....@gmail.com

Apr 29, 2021, 3:27:49 PM
to XLA development
re: Adrian
> The multi-stream support in XLA was limited to Gemm and RNG, and it was not enabled by default. As far as I know, there is work planned to implement better multi-stream support in XLA.

Does this mean that the support for Gemm and RNG was dropped? I'm curious about the future development; is there any official information, or someone here who can shed some light on it?

re: Bairen

> What you mentioned are support for multi stream on a single GPU. AFAIK computation on different GPUs are not concerned here; they sync using NCCL and it’s working perfectly fine.

I'm not sure I fully understand what you mean. NCCL would be helpful for data parallelism, but NCCL doesn't know how to parallelise sophisticated interactions between operations. My question is about independent branches of the computational graph, which can run on multiple GPUs when those are available. TensorFlow supports this by default (though it's better to ask someone from the TF team; profiling shows that independent instructions run on different GPUs).
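The distinction being drawn here can be made concrete with a small numpy sketch (numpy arrays stand in for per-device computation; the device boundaries are only simulated): data parallelism runs the *same* op on shards of a batch and reduces across replicas, while branch parallelism runs *different* independent subgraphs side by side.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))   # a batch of 8 examples
w = rng.standard_normal((4, 4))

# Data parallelism: the SAME op (x @ w) on batch shards, one shard per
# "device"; gradients would be all-reduced by NCCL in a real setup.
shards = np.split(x, 2)
data_parallel = np.concatenate([s @ w for s in shards])

# Branch parallelism: DIFFERENT independent subgraphs, one per "device",
# joined only at the end. NCCL has nothing to say about this placement.
branch_0 = np.tanh(x @ w)             # would run on GPU:0
branch_1 = np.maximum(x @ w.T, 0.0)   # would run on GPU:1
result = branch_0 + branch_1          # join point after both finish
```

The data-parallel result is numerically identical to the unsharded computation; the branch-parallel case is where an executor would need graph-level knowledge of independence.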

tre...@nvidia.com

Apr 29, 2021, 5:19:42 PM
to XLA development

Hi Art,

Multistream and multi-GPU are two orthogonal concepts. Multistream runs kernels on parallel streams on "the same" GPU, while multi-GPU means that you run kernels on multiple GPU devices.

I wonder, in your TF example, how you achieve multi-GPU support. Is it through manual assignment of graph nodes to different GPU ids? If so, I think the XLA JIT should respect those assigned GPU ids and run the clusters on different GPUs according to the assigned devices. This is not related to the multistream support in XLA, though.
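The orthogonality Trent describes can be sketched in plain Python (all kernel, device, and stream names here are hypothetical): a stream is an independent work queue on one device, so multistream varies the stream on a fixed device while multi-GPU varies the device, and the two compose freely.

```python
from collections import defaultdict

# A toy execution plan: each kernel is tagged with (device, stream).
plan = [
    ("gemm_0", "GPU:0", "stream_0"),
    ("rng_0",  "GPU:0", "stream_1"),   # multistream: same GPU, parallel stream
    ("gemm_1", "GPU:1", "stream_0"),   # multi-GPU: a different device entirely
]

# Group kernels into their work queues; queues with the same device but
# different streams may overlap in time on that one GPU.
queues = defaultdict(list)
for kernel, device, stream in plan:
    queues[(device, stream)].append(kernel)

# Three distinct queues: two streams on GPU:0 plus one stream on GPU:1.
```

Changing only the stream tag exercises multistream; changing only the device tag exercises multi-GPU placement, which is why the two features can evolve independently in a runtime.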

Trent

art....@gmail.com

Apr 29, 2021, 7:01:09 PM
to XLA development
re: Trent

Hi. Apologies for mixing up terminology in my previous posts. I was talking about streams. However, I thought the executor could exploit multiple GPUs, assign them to different streams, and run an instruction (a kernel) on a GPU from the pool of available GPUs. E.g. consider a sum of functions `f_i: R^{N \times N} -> R^N` applied to matrix products: F = f_0(A_0 @ B_0) + f_1(A_1 @ B_1) + ... + f_n(A_n @ B_n), where each i-th matrix product (and function evaluation f_i) is independent of the others. Let's also assume that the product of A_i and B_i is expensive, such that it would not make sense to run the expression F on a single GPU (the user does not assign GPUs explicitly, as `n` is unknown in advance and may vary, as may the number of GPUs). TensorFlow (XLA) builds the computational graph for F and can detect independent branches (sequences of kernel calls). So TF (XLA) passes this information about independent branches to the executor, which in turn has access to the GPUs and their configurations. The executor should be able to parallelise the expression using its knowledge of the independent branches. This is my intuition about what should happen when there are obviously parallelisable parts in the data-flow graph.
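The expression F above can be sketched with numpy and a thread pool, where worker threads stand in for per-GPU executors and a concrete `f_i` (a row-sum after a tanh, chosen purely for illustration) replaces the abstract functions:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

rng = np.random.default_rng(0)
n_terms, N = 4, 64
A = [rng.standard_normal((N, N)) for _ in range(n_terms)]
B = [rng.standard_normal((N, N)) for _ in range(n_terms)]

def f(i):
    # A stand-in for f_i: R^{N x N} -> R^N. The only property that
    # matters is that each term depends on nothing but A[i] and B[i].
    return np.tanh(A[i] @ B[i]).sum(axis=0)

# Each independent term could be dispatched to its own GPU; here a
# thread pool plays the role of the device pool.
with ThreadPoolExecutor(max_workers=n_terms) as pool:
    F = sum(pool.map(f, range(n_terms)))
```

Because the terms share no inputs, the runtime is free to evaluate them concurrently; the join happens only at the final sum, which matches the sequential result.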

> I wonder, in your TF example, how you achieve multi-gpu support?

I will have to re-run the profiling once I have access to multiple GPUs again. I did not manually assign graph nodes to different GPUs.

art....@gmail.com

Apr 30, 2021, 1:57:39 PM
to XLA development
re: Trent. Thanks a lot for clarifying things!

I ran the profiler on a toy example, and I cannot see different GPUs being assigned to different streams. It does, obviously, work with explicit assignment via `tf.device`.
I attached the *.dot file and a profiling screenshot.

I'm still curious why the executor does not schedule independent computational branches on different GPUs.
I would appreciate it if someone could elaborate on that. Thanks!

Best,
Artem

1619794138162745.module_0000.before_optimizations.pdf
Screenshot 2021-04-30 at 18.29.45.png