Thanks Paul. We are trying to address some of the issues raised here:
1. Desire to eliminate /contrib in favor of an official API with 3rd party plugins. Is that something other than freezing and blessing some part of the currently used virtual interfaces?
Current usage:
(verbs, mpi, gdr) BaseRemoteRendezvous::RecvFromRemoteAsync for RecvTensor stub
(gdr) GrpcWorker::GrpcRecvTensorAsync for RecvTensor service
Usage (as we have tried):
CollectiveRemoteAccessDistributed::RecvFromPeer for RecvBuf stub
GrpcWorker::RecvBufAsync for RecvBuf service
Some common dependencies for current plugins:
//tensorflow/core:core_cpu_internal
//tensorflow/core:lib_internal
//tensorflow/core:lib
//tensorflow/core:framework
//tensorflow/core:gpu_runtime
We do feel the currently used virtual interfaces are too low-level and too exposed to changes in TensorFlow internal APIs. We hope to change this, which is needed if we want to move plugins to a separate repo and, if possible (though of course harder), to avoid recompiling everything with bazel.
a. Simplify end-user configuration experience
b. Presubmit testing of changes is hard
Testing and CI infrastructure is needed. It could be set up the same way as for MKL and Power (community sponsored). It would be better if this were coordinated by the TF team.
2. Reduce the need to recompile everything with bazel. Would this be worth the trouble?
This is necessary if someone tries to integrate proprietary plugins with third-party TF forks (where access to the fork's core source is limited). It favors proprietary solutions, which is arguably not good for the long-term development of the SIG, but it may be important for some stakeholders.
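As a rough illustration of what a narrower plugin boundary could look like, here is a minimal sketch. It is not an existing TensorFlow API: NetPluginV1, NetPluginRegistry, RecvVia and the loopback "transport" are all hypothetical names made up for this example. The idea shown is that a plugin only sees a small, versioned table of plain function pointers instead of TensorFlow internal classes, so it could in principle be built separately from the core.

```cpp
// Hypothetical sketch only: none of these names exist in TensorFlow.
#include <cassert>
#include <cstddef>
#include <cstring>
#include <map>
#include <string>
#include <vector>

// Plain struct of function pointers: the only thing a plugin would depend on.
struct NetPluginV1 {
  int abi_version;  // bumped on any incompatible change to this struct
  // Copies the buffer published under `key` into `dst`; returns 0 on success.
  int (*recv_buf)(const char* key, void* dst, size_t nbytes);
};

// Host-side registry mapping a protocol name to a plugin table.
class NetPluginRegistry {
 public:
  static constexpr int kAbiVersion = 1;

  // Rejects incomplete tables, ABI mismatches and duplicate protocol names.
  bool Register(const std::string& protocol, const NetPluginV1* plugin) {
    if (plugin == nullptr || plugin->recv_buf == nullptr) return false;
    if (plugin->abi_version != kAbiVersion) return false;
    return plugins_.emplace(protocol, plugin).second;
  }

  const NetPluginV1* Find(const std::string& protocol) const {
    auto it = plugins_.find(protocol);
    return it == plugins_.end() ? nullptr : it->second;
  }

 private:
  std::map<std::string, const NetPluginV1*> plugins_;
};

// Host-side helper: looks up the plugin for `protocol` and receives through it.
int RecvVia(const NetPluginRegistry& registry, const std::string& protocol,
            const char* key, void* dst, size_t nbytes) {
  const NetPluginV1* plugin = registry.Find(protocol);
  if (plugin == nullptr) return -1;
  assert(plugin->recv_buf != nullptr);  // guaranteed by Register()
  return plugin->recv_buf(key, dst, nbytes);
}

// Toy in-process store standing in for a real transport (verbs, MPI, GDR).
static std::map<std::string, std::vector<char>>& LoopbackStore() {
  static std::map<std::string, std::vector<char>> store;
  return store;
}

// Fails with -1 if the key is unknown or the sizes do not match.
static int LoopbackRecvBuf(const char* key, void* dst, size_t nbytes) {
  auto it = LoopbackStore().find(key);
  if (it == LoopbackStore().end() || it->second.size() != nbytes) return -1;
  std::memcpy(dst, it->second.data(), nbytes);
  return 0;
}

static const NetPluginV1 kLoopbackPlugin = {1, &LoopbackRecvBuf};
```

In this sketch the host would register kLoopbackPlugin under a protocol name such as "loopback" and call RecvVia with that name. The version field is there so that a plugin built against an older table is rejected at registration time rather than failing later.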
3. Some kind of (semi-)official MPI or ib/verbs support. What would that look like?
An RFC is probably needed for this. But we think the official support should come in the form of a networking plugin, otherwise the SIG will become irrelevant.
4. How does NCCL or a similar non-NVIDIA interconnect related utility fit in?
Beyond its current usage, i.e. intra-node multi-GPU interconnect, we frankly do not see a fit. First, NCCL does not fit well with async/PS style, which we must support. Second, it is difficult for a networking plugin to schedule GPU kernels (which NCCL does). Third, performance-wise, a recent paper, Exascale Deep Learning for Climate Analytics (https://arxiv.org/abs/1810.01993), has demonstrated good scalability using NCCL 2.0 only for intra-node communication and MPI for inter-node communication.
We could discuss some of the issues today if we have enough time.
Best,
Bairen
On Wednesday, October 31, 2018 at 3:20:41 AM UTC+8, Paul Tucker wrote: