TF Networking SIG meeting

Paul Tucker

Oct 30, 2018, 3:20:41 PM

to SIG Networking

To save time at tomorrow's meeting, I've put together a brief outline of my view of the current state of TensorFlow networking, the motivation for forming the SIG, and some issues we may want to address. This is only meant to help start the discussion. I think the main goal of the meeting is to reach a common awareness of each other's interests.

Paul Tucker

TensorFlow Networking Status 2018.10.31.pdf

Anthony Dmitriev

Oct 30, 2018, 5:46:19 PM

to SIG Networking

Hi, folks!

I see that the gRPC server is within the scope of your interests, so could we please discuss the following PR: https://github.com/tensorflow/tensorflow/pull/23190?

I'm trying to implement the `Stop` method of the gRPC server, but I have run into several issues. Any help with it would be greatly appreciated.

Best regards,

Anton Dmitriev.

On Tuesday, October 30, 2018 at 22:20:41 UTC+3, Paul Tucker wrote:

Bairen Yi

Oct 31, 2018, 12:00:09 PM

to SIG Networking

Thanks, Paul. We are trying to address some of the issues raised here:

1. Desire to eliminate /contrib in favor of an official API with 3rd party plugins. Is that something other than freezing and blessing some part of the currently used virtual interfaces?

Current usage:

(verbs, mpi, gdr) BaseRemoteRendezvous::RecvFromRemoteAsync for RecvTensor stub

(gdr) GrpcWorker::GrpcRecvTensorAsync for RecvTensor service

Usage (as we have tried):

CollectiveRemoteAccessDistributed::RecvFromPeer for RecvBuf stub

GrpcWorker::RecvBufAsync for RecvBuf service

Some common dependencies for current plugins:

//tensorflow/core:core_cpu_internal

//tensorflow/core:lib_internal

//tensorflow/core:lib

//tensorflow/core:framework

//tensorflow/core:gpu_runtime

We do feel the currently used virtual interfaces are too low-level and too dependent on TensorFlow internal APIs. We hope to change this, both to make it possible to move plugins to a separate repo and, if possible (though of course harder), to avoid recompiling everything with bazel.

a. Simplify end-user configuration experience

b. Presubmit testing of changes is hard

Testing and CI infrastructure is needed. It could follow the same model as for MKL and Power (community sponsored), and would be better coordinated by the TF team.
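To make the idea of a narrower plugin surface under item 1 a bit more concrete, here is a minimal, language-neutral sketch (written in Python only for brevity). Everything in it is hypothetical: `TransportPlugin`, `register_transport`, `create_transport`, and `LoopbackTransport` are illustrative names, not TensorFlow APIs, and the real interfaces listed above are C++ and considerably richer. It only shows the shape of "one small official interface plus a registry that third-party plugins register into".

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict


class TransportPlugin(ABC):
    """Hypothetical narrow interface a networking plugin would implement."""

    @abstractmethod
    def recv_tensor(self, key: str) -> bytes:
        """Return the serialized tensor stored under a rendezvous-style key."""


# Registry mapping a protocol name to a factory that builds the plugin.
_REGISTRY: Dict[str, Callable[[], TransportPlugin]] = {}


def register_transport(protocol: str, factory: Callable[[], TransportPlugin]) -> None:
    """Register a plugin factory; refuse duplicate protocol names."""
    if protocol in _REGISTRY:
        raise ValueError(f"transport {protocol!r} is already registered")
    _REGISTRY[protocol] = factory


def create_transport(protocol: str) -> TransportPlugin:
    """Instantiate the plugin registered for the given protocol name."""
    try:
        factory = _REGISTRY[protocol]
    except KeyError:
        raise ValueError(f"no transport registered for {protocol!r}") from None
    return factory()


class LoopbackTransport(TransportPlugin):
    """Toy in-process plugin, used only to exercise the registry."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def put(self, key: str, payload: bytes) -> None:
        self._store[key] = payload

    def recv_tensor(self, key: str) -> bytes:
        return self._store[key]


# A third-party plugin would perform this registration from its own package.
register_transport("loopback", LoopbackTransport)
```

With an interface of this size, the core would depend only on the abstract class and the registry, so a plugin living in a separate repo would not need access to internal targets.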

2. Reduce the need to recompile everything with bazel. Would this be worth the trouble?

This is necessary if someone tries to integrate proprietary plugins with third-party TF forks, where access to the fork's source is limited. It favors proprietary solutions, which is arguably not good for the long-term development of the SIG, but it may be important for some stakeholders.

3. Some kind of (semi-)official MPI or ib/verbs support. What would that look like?

An RFC is probably needed for this. But we think the official support should come in the form of a networking plugin; otherwise the SIG will become irrelevant.

4. How does NCCL or a similar non-NVIDIA interconnect related utility fit in?

Beyond its current usage, i.e. intra-node multi-GPU interconnect, we frankly do not see a fit. First, NCCL does not fit well with the async/PS style, which we must support. Second, it is difficult for a networking plugin to schedule GPU kernels (which NCCL does). Third, performance-wise, a recent paper, Exascale Deep Learning for Climate Analytics (https://arxiv.org/abs/1810.01993), has demonstrated good scalability using NCCL 2.0 only for intra-node communication and MPI for inter-node communication.

We could discuss some of the issues today if we have enough time.

Best,

Bairen

On Wednesday, October 31, 2018 at 3:20:41 AM UTC+8, Paul Tucker wrote:

Edd Wilder-James

Oct 31, 2018, 5:33:18 PM

to byr...@clustar.ai, netwo...@tensorflow.org

Do you want to add these items into the agenda?

https://docs.google.com/document/d/1BRZsRUlSH5UlSYKlJa4oPUOguEFHYL031Ziz-AxWYFE/edit#

--

Edd Wilder-James, Open Source Strategy at Google

Github @ewilderj • @tensorflow • @kubeflow
