TF Networking SIG meeting

Paul Tucker

Oct 30, 2018, 3:20:41 PM
to SIG Networking
To save time at tomorrow's meeting, I've put together a brief outline of my view of the current state of TensorFlow networking, the motivation for forming the SIG, and some issues we may want to address. This is just to help start discussion. I think the main goal of the meeting is to reach some common awareness of each other's interests.

Paul Tucker
TensorFlow Networking Status 2018.10.31.pdf

Anthony Dmitriev

Oct 30, 2018, 5:46:19 PM
to SIG Networking
Hi, folks!

I see that the gRPC server is within the scope of your interests, so could we please discuss the following PR:
I'm trying to implement a `Stop` method for the gRPC server, but I've run into several issues. Any help with it would be great.

Best regards,
Anton Dmitriev.

On Tuesday, October 30, 2018 at 22:20:41 UTC+3, Paul Tucker wrote:

Bairen Yi

Oct 31, 2018, 12:00:09 PM
to SIG Networking
Thanks, Paul. We are trying to address some of the issues raised here:

1. Desire to eliminate /contrib in favor of an official API with 3rd party plugins. Is that something other than freezing and blessing some part of the currently used virtual interfaces? 

Current usage:
(verbs, mpi, gdr) BaseRemoteRendezvous::RecvFromRemoteAsync for RecvTensor stub
(gdr) GrpcWorker::GrpcRecvTensorAsync for RecvTensor service

Usage (as we have tried):
CollectiveRemoteAccessDistributed::RecvFromPeer for RecvBuf stub
GrpcWorker::RecvBufAsync for RecvBuf service

Some common dependencies for current plugins:

We do feel the currently used virtual interfaces are too low-level and too dependent on TensorFlow internal APIs. We hope to change this if we want to move plugins to a separate repo and, if possible (though of course harder), avoid recompiling everything with bazel.
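To make the idea of a higher-level plugin boundary concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not TensorFlow's API: `TransportPlugin`, `register_transport`, `create_transport`, and `LoopbackTransport` are hypothetical names, and the single `recv_buf` method only loosely mirrors the RecvBuf stub mentioned above.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Tuple


class TransportPlugin(ABC):
    """Hypothetical narrow interface a third-party transport would implement."""

    @abstractmethod
    def recv_buf(self, peer: str, key: str) -> bytes:
        """Fetch the bytes published under `key` by `peer`."""


# Hypothetical registry: the core looks transports up by name only,
# so plugins never touch core-internal types.
_REGISTRY: Dict[str, Callable[[], TransportPlugin]] = {}


def register_transport(name: str, factory: Callable[[], TransportPlugin]) -> None:
    if name in _REGISTRY:
        raise ValueError(f"transport {name!r} already registered")
    _REGISTRY[name] = factory


def create_transport(name: str) -> TransportPlugin:
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unknown transport {name!r}") from None


class LoopbackTransport(TransportPlugin):
    """Toy in-process transport standing in for a verbs/MPI/GDR plugin."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], bytes] = {}

    def publish(self, peer: str, key: str, data: bytes) -> None:
        self._store[(peer, key)] = data

    def recv_buf(self, peer: str, key: str) -> bytes:
        return self._store[(peer, key)]


register_transport("loopback", LoopbackTransport)
```

The point of the sketch is the shape of the boundary: a plugin depends only on a small, stable interface and a name-based registry, which is what would let it live in a separate repo.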

a. Simplify end-user configuration experience 
b. Presubmit testing of changes is hard

Testing and CI infrastructure is needed. It could be set up the same way as for MKL and Power (community-sponsored), and would be better coordinated by the TF team.

2. Reduce the need to recompile everything with bazel. Would this be worth the trouble?

This is necessary if someone tries to integrate proprietary plugins with third-party TF forks (where access to the fork's core source is limited). It favors proprietary solutions, though, which is arguably not good for the long-term development of the SIG, but it may be important for some stakeholders.

3. Some kind of (semi-)official MPI or ib/verbs support. What would that look like?

An RFC is probably needed for this. But we think the official support should come in the form of a networking plugin; otherwise the SIG will become irrelevant.

4. How does NCCL or a similar non-NVIDIA interconnect related utility fit in?

Beyond the current usage, i.e. intra-node multi-GPU interconnect, we frankly do not see where it fits. First, NCCL does not fit well with the async/PS style, which we must support. Second, it is difficult for a networking plugin to schedule GPU kernels (which NCCL does). Third, performance-wise, a recent paper, Exascale Deep Learning for Climate Analytics, has demonstrated good scalability with NCCL 2.0 used only for intra-node communication and MPI for inter-node communication.
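The two-level scheme in the third point can be sketched as a toy simulation. This is pure Python with no NCCL or MPI calls; the function name `hierarchical_allreduce` and the nested-list layout are assumptions made for illustration only.

```python
from typing import List


def hierarchical_allreduce(values: List[List[float]]) -> List[List[float]]:
    """Simulate a sum-allreduce where values[n][g] is held by GPU g on node n."""
    # Step 1: intra-node reduction (the role NCCL would play inside one machine).
    node_sums = [sum(gpus) for gpus in values]
    # Step 2: inter-node allreduce among one leader per node (the role of MPI).
    total = sum(node_sums)
    # Step 3: intra-node broadcast of the global result back to every GPU.
    return [[total for _ in gpus] for gpus in values]


# Two nodes with two GPUs each: every GPU ends up holding the global sum.
result = hierarchical_allreduce([[1.0, 2.0], [3.0, 4.0]])
```

Only step 2 crosses the network, which is why the inter-node transport can be chosen independently of the intra-node library.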

We could discuss some of the issues today if we have enough time.


On Wednesday, October 31, 2018 at 3:20:41 AM UTC+8, Paul Tucker wrote:

Edd Wilder-James

Oct 31, 2018, 5:33:18 PM

Edd Wilder-James, Open Source Strategy at Google