On 29 Apr 2021, at 20:07, art....@gmail.com <art....@gmail.com> wrote:
Hello everyone,
--
You received this message because you are subscribed to the Google Groups "XLA development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to xla-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/xla-dev/f8c2f220-11c2-4c29-82c4-7a7aa82f3411n%40googlegroups.com.
Hi Art,
Multistream and multi-GPU are two orthogonal concepts. Multistream runs kernels on parallel streams on the *same* GPU, while multi-GPU means running kernels on multiple GPU devices.
I wonder how you achieve multi-GPU support in your TF example. Is it through manual assignment of graph nodes to different GPU ids? If so, I think the XLA JIT should respect these assigned GPU ids and run the clusters on different GPUs according to the assigned devices. This is not related to the multistream support in XLA, though.
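To make the placement idea concrete, here is a toy sketch (plain Python, not XLA's actual scheduler; `place_clusters` and the cluster names are hypothetical) of a JIT executor that honours user-assigned GPU ids for compiled clusters and falls back to round-robin for unassigned ones:

```python
# Hypothetical sketch, not XLA's real placement logic: respect explicit
# GPU ids where the user assigned them, round-robin the rest.
from itertools import cycle

def place_clusters(clusters, assigned, num_gpus):
    """clusters: ordered list of cluster names; assigned: {name: gpu_id}."""
    rr = cycle(range(num_gpus))
    placement = {}
    for name in clusters:
        if name in assigned:
            placement[name] = assigned[name]  # honour the explicit id
        else:
            placement[name] = next(rr)        # executor's own choice
    return placement

print(place_clusters(["c0", "c1", "c2"], {"c1": 3}, 2))
# -> {'c0': 0, 'c1': 3, 'c2': 1}
```

The point is only that explicit assignments take priority over whatever scheduling policy the executor applies to the remaining clusters.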
Trent
Hi. Apologies for mixing terminology in previous posts; I was indeed talking about streams. However, I thought the executor could exploit multiple GPUs: assign them to different streams and run an instruction (a kernel) on a GPU drawn from a pool of available GPUs.

For example, consider a sum of functions `f_i: R^{N \times N} -> R^N` applied to matrix products:

F = f_0(A_0 @ B_0) + f_1(A_1 @ B_1) + ... + f_n(A_n @ B_n),

where each i-th matrix product (and function evaluation f_i) is independent of the others. Let's also assume each product A_i @ B_i is expensive, such that it would not make sense to evaluate F on a single GPU. The user does not assign GPUs explicitly, since `n` is not known in advance and may vary, as may the number of available GPUs.

TensorFlow (XLA) builds the computational graph for F and can detect independent branches (sequences of kernel calls). So TF (XLA) passes this information about independent branches to the executor, which in turn has access to the GPUs and their configuration. Knowing the independent branches, the executor should be able to parallelise the evaluation of F. This is my intuition about what should happen when there are obviously parallelisable parts in the data-flow graph.
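The scheduling idea above can be modelled with a toy sketch (plain Python with a thread pool standing in for a pool of GPUs; `matmul`, `f`, and `evaluate_F` are stand-ins, not XLA APIs): each term f_i(A_i @ B_i) is an independent branch, so an executor that knows the branch structure can dispatch terms to workers and then reduce the partial results.

```python
# Toy model of dispatching independent branches to a device pool.
# Each (A_i, B_i) pair is one branch: compute A_i @ B_i, apply f_i,
# then sum the per-branch result vectors element-wise.
from concurrent.futures import ThreadPoolExecutor

def matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def f(M):
    # stand-in for f_i: R^{N x N} -> R^N, here a simple row-sum
    return [sum(row) for row in M]

def evaluate_F(pairs, num_devices=2):
    # branches are independent, so they can run on parallel workers
    with ThreadPoolExecutor(max_workers=num_devices) as pool:
        partials = list(pool.map(lambda p: f(matmul(*p)), pairs))
    # reduce: element-wise sum of the partial vectors
    return [sum(col) for col in zip(*partials)]

I = [[1, 0], [0, 1]]
print(evaluate_F([(I, I), (I, I)]))  # -> [2, 2]
```

Of course, real GPU streams and device placement add transfer costs and synchronisation that this sketch ignores; it only illustrates that the branch structure alone is enough information to parallelise F.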
> I wonder, in your TF example, how you achieve multi-gpu support?
I will have to re-run profiling once I have access to multiple GPUs again. I did not manually assign graph nodes to different GPUs.