Coordination TF - INTEL::TBB - Nvidia


timothée ewart

May 21, 2020, 7:41:11 AM
to TensorFlow Developers
Dear TF Developers, this will be a long post.

For my company I am currently designing a TF inference pipeline (CPU and/or GPU). The tools I am using are TF (C backend) and TBB.
I am using the C backend because it is provided by every package manager, it is easily pluggable with CMake (and all my software stack),
and I can skip the dependency on Bazel (which I consider a heavyweight tool). My pipeline works, but I have questions about the threading model and the cohabitation
of two thread-pool managers, and I have an issue with GPU selection.

The overview of the program: a series of nodes (A, B, C, ...), one of which executes the TF inference:

A (IO) -> B (foo) -> C (TF) -> D (fuu) -> E (IO)


This pipeline is designed using TBB::pipeline; in short, every node is a simple C++ struct with a constructor and a functor. Nodes communicate through a message class
using move semantics. Every node looks like this:


struct node {
    node() { /* whatever */ }
    message operator()(message msg) {
        super_algo(msg);       // modify the message
        return std::move(msg); // move the by-value parameter out
    }
};
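
For context, this is roughly how such functors chain into tbb::parallel_pipeline (a sketch against the classic circa-2020 TBB API; done, next_input(), write_output(), and the node_b/node_c instances are hypothetical placeholders):

#include <tbb/pipeline.h>
#include <utility>

// The first filter takes a tbb::flow_control& so it can signal end-of-stream;
// the middle filters have the message-in/message-out shape of the struct above.
tbb::parallel_pipeline(
    /*max_number_of_live_tokens=*/4,
    tbb::make_filter<void, message>(tbb::filter::serial_in_order,
        [&](tbb::flow_control& fc) -> message {
            if (done) { fc.stop(); return message{}; } // A (IO): source
            return next_input();
        }) &
    tbb::make_filter<message, message>(tbb::filter::parallel, node_b) & // B (foo)
    tbb::make_filter<message, message>(tbb::filter::parallel, node_c) & // C (TF)
    tbb::make_filter<message, void>(tbb::filter::serial_in_order,
        [&](message m) { write_output(std::move(m)); })); // E (IO): sink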

For the TF node constructor, I use the same logic. The construction of the nodes is sequential; according to this Googler, that is not an issue. In short, in the constructor of the TF node
I load my saved model; it looks like this:

struct helper_tf {
    explicit helper_tf(const std::string& model)
        : graph_(TF_NewGraph(), TF_DeleteGraph), status_(TF_NewStatus(), TF_DeleteStatus),
          session_opts_(TF_NewSessionOptions(), TF_DeleteSessionOptions), run_opts_(nullptr) {

        const char *tags = "serve";
        int ntags = 1;

        session_ = TF_LoadSessionFromSavedModel(session_opts_.get(),
                       run_opts_, model.c_str(), &tags, ntags, graph_.get(), nullptr, status_.get());

        if (TF_GetCode(status_.get()) != TF_OK)
            throw std::runtime_error(TF_Message(status_.get()));
    }

    ~helper_tf() { TF_DeleteSession(session_, status_.get()); }

    TF_Session *session_;
    std::unique_ptr<TF_Graph, decltype(&TF_DeleteGraph)> graph_;
    std::unique_ptr<TF_Status, decltype(&TF_DeleteStatus)> status_;
    std::unique_ptr<TF_SessionOptions, decltype(&TF_DeleteSessionOptions)> session_opts_;
    TF_Buffer *run_opts_;
};


With this approach I will have one session, graph, status, and session_opts per TBB thread. Moreover, during the inference itself, after setting up the input/output nodes, I run TF_SessionRun().

QUESTION: Should I have one TF_Session, or one TF_Session/TF_Graph/TF_SessionOptions, etc. per TBB thread?

For a pure CPU version this approach works, but I have two thread-pool managers working together (the TBB one and the Eigen one). With the default settings the results are quite good.
However, the TF_SetConfig() function provides options to control the Eigen threads, so I did the following:

   uint8_t inter_op_parallelism_threads = 2;
   uint8_t intra_op_parallelism_threads = 2;
   // Serialized ConfigProto written by hand: 0x10 is the varint tag of field 2
   // (intra_op_parallelism_threads), 0x28 the tag of field 5
   // (inter_op_parallelism_threads); values < 128 fit in one varint byte.
   uint8_t config[] = {0x10, intra_op_parallelism_threads, 0x28, inter_op_parallelism_threads};
   TF_SetConfig(session_opts_.get(), (void *)config, 4, status_.get());

I am using these settings because I have 16 cores on my machine: 4 threads for TBB and 4 threads for every call to TF_SessionRun().

QUESTION: What are the best values for inter_op_parallelism_threads and intra_op_parallelism_threads? Should I choose #cores = #intra * #inter * #tbb_threads?

The second point is the GPU. If I use a TF build compiled for GPU, with the same structure, the TF node should execute on the GPU. On my server I have 4 GPUs. I was hoping TF_SessionRun() would spread the execution across the GPUs; however, all the TBB threads execute their GPU code on GPU 0. I have tried a few solutions:
  1. export CUDA_VISIBLE_DEVICES=0,1,2,3
  2. use the CUDA driver before TF_SessionRun(): I tried cudaSetDevice(GPU_ID) with the correct GPU_ID (the GPU_ID is dynamic, there is no guarantee it is the same every time I call the function)
These two approaches (combined or not) do not work; my GPU code always executes on "GPU:0". I think TF overrides my setting in TF_SessionRun(). From the TF sources and the protobufs, it should be possible to set "/device:GPU:0", "/device:GPU:1", etc. somewhere. Maybe the second argument of TF_SessionRun() (the run options) could do the job, but I have absolutely no idea how to create the needed protobuf, where to plug it in, etc.

QUESTION: How do I execute TF_SessionRun() on a selected GPU in a multi-threaded environment?

The post is a bit long, but I am totally blocked, especially on the GPU part.


timothée ewart

May 21, 2020, 7:59:20 AM
to TensorFlow Developers
This may not be the right place to post: I tried Stack Overflow (no answer), so I am also posting it to this group. If my questions are inappropriate, please delete the message.

Eugene Zhulenev

May 21, 2020, 3:26:59 PM
to TensorFlow Developers, timothe...@gmail.com
Regarding CPU threading:

By default all TF sessions will share the same local device with shared inter-op/intra-op thread pools.

Relevant code: 

So if you have 16 physical threads you can partition them like this:
 - 8 for TBB
 - 8 for TF.

The intra-op thread pool is used to parallelize kernels (e.g. executing blocks of a matrix multiplication), and the inter-op thread pool is used to parallelize concurrent op execution (e.g. running two matmuls in parallel). If your graph does not have much inter-op concurrency, it makes sense to give more threads to the intra-op pool: inter/intra = 2/6. On the other hand, if the ops in the graph do not launch a lot of tasks, you can do 6/2.
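
For example, a minimal sketch of producing those serialized ConfigProto bytes for TF_SetConfig (assuming both counts stay below 128, so each value fits in a single varint byte):

#include <cstdint>
#include <vector>

#include <tensorflow/c/c_api.h>

// Hand-rolled serialized ConfigProto: field 2 (intra_op_parallelism_threads)
// has varint tag 0x10, field 5 (inter_op_parallelism_threads) has tag 0x28.
// Multi-byte varints are not handled, so counts must stay below 128.
std::vector<std::uint8_t> make_thread_config(std::uint8_t intra, std::uint8_t inter) {
    return {0x10, intra, 0x28, inter};
}

void apply_thread_config(TF_SessionOptions* opts, TF_Status* status,
                         std::uint8_t intra, std::uint8_t inter) {
    const std::vector<std::uint8_t> config = make_thread_config(intra, inter);
    TF_SetConfig(opts, config.data(), config.size(), status);
}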

Brian Zhao

May 21, 2020, 3:53:43 PM
to timothée ewart, TensorFlow Developers
Hi Timothée,

Regarding "how to execute a TF_SessionRun() on a selected GPU on a multi thread environment?", I think there's two possible approaches using the Graph C API:

1. You can manually call TF_SetDevice on the individual ops when building a graph from scratch (see the sketch below).
2. You can call TF_ImportGraphDefOptionsSetDefaultDevice to set the default device for a graph when loading from a graphdef. (You can then load multiple graphs, each with a different default GPU device).

Unfortunately, in your case, since TF_Operations are immutable once built from a TF_OperationDescription, you'll have to either:
1. Copy the graph into a graphdef, to reload with new default devices
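
For approach 1, the shape is roughly this (a sketch only; "MatMul", the op name, and the input ops a_op/b_op are placeholders):

// Pin a single op to a device while building the graph by hand.
TF_OperationDescription *desc = TF_NewOperation(graph, "MatMul", "my_matmul");
TF_SetDevice(desc, "/device:GPU:1");    // this op will run on GPU 1
TF_AddInput(desc, TF_Output{a_op, 0});  // a_op, b_op: ops created earlier
TF_AddInput(desc, TF_Output{b_op, 0});
TF_Operation *matmul = TF_FinishOperation(desc, status);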

Thanks!




timothée ewart

May 26, 2020, 2:46:44 AM
to TensorFlow Developers, timothe...@gmail.com
Hello Brian and Eugene,

Thank you very much for the support. I made nice progress on the threading model, but I am still perplexed about the GPU solution.
As you said, the graph is immutable, so I should create a new one:
  1. I have tested your first proposal; however, it is too powerful: all the nodes get pinned to the GPU ("/device:gpu:?"), which is a bit too much. Some nodes are not supported on GPU, so it crashes.
  2. The second proposition is beyond my skill; it means rewriting the full graph in C and pinning every node to its associated device. By hand it would take me a long time ... it sounds like a compiler's job.
Where I am confused: from my understanding, a TensorFlow session manages the execution on CPU, and potentially on a device if the node's implementation is supported.
So if I create n sessions with TF_LoadSessionFromSavedModel(), combined with TF_SetConfig()/TF_NewSessionOptions(), I believed an option would be available in the protobufs to set the device, like inter_op and intra_op.

Maybe my post was not clear enough, but I am just trying to write in C++ (using the C backend) what is straightforward in Python:

def foo(image, model):
    with tf.device("/device:gpu:3"):  # <----- this is the line!
        infer = model.signatures["serving_default"]
        return infer(tf.constant(image))["input"]


So, does an option exist in the protobufs of TF_SetConfig, or am I doomed to write a graph converter? >^_^<

All the best,

++t


t kevin

May 26, 2020, 8:09:44 PM
to timothée ewart, TensorFlow Developers
Hi timothée

This might help with the first "powerful" solution.

// Whether soft placement is allowed. If allow_soft_placement is true,
// an op will be placed on CPU if
// 1. there's no GPU implementation for the OP
// or
// 2. no GPU devices are known or registered
// or
// 3. need to co-locate with reftype input(s) which are from CPU.
bool allow_soft_placement = 7;

details in

https://stackoverflow.com/questions/44873273/what-do-the-options-in-configproto-like-allow-soft-placement-and-log-device-plac
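
If I read the wire format correctly, allow_soft_placement being field 7 (a bool), enabling it through TF_SetConfig should take just two bytes: the varint tag ((7 << 3) | 0 = 0x38) followed by 0x01:

// hand-serialized ConfigProto with allow_soft_placement = true
uint8_t config[] = {0x38, 0x01};
TF_SetConfig(session_opts, (void *)config, sizeof(config), status);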

BR
Kevin


timothée ewart

May 27, 2020, 6:22:17 AM
to TensorFlow Developers, timothe...@gmail.com
Hello Kevin,

Indeed, it helps, but I fear I have reached a dead end. Combining your proposition with the previous one, I have been able to generate new sessions with new "graphs",
as illustrated in the following code:

        const char *tags = "serve";
        int ntags = 1;

        // get the session from a saved model
        session_ = TF_LoadSessionFromSavedModel(session_opts_.get(), run_opts_, model.c_str(), &tags, ntags,
                                                graph_.get(), meta_buffer_.get(), status_.get());
        check_status(status_.get());

        std::unique_ptr<TF_Buffer, decltype(&TF_DeleteBuffer)> buffer(TF_NewBuffer(), TF_DeleteBuffer);
        // serialize the graph to a GraphDef
        TF_GraphToGraphDef(graph_.get(), buffer.get(), status_.get());
        check_status(status_.get());

        int ngpu = 0;
        // get the number of GPUs
        cuDeviceGetCount(&ngpu);

        for (int g = 0; g < ngpu; ++g) {
            // create a serialized ConfigProto with the desired options
            TF_SessionOptions *session_opts = TF_NewSessionOptions();
            std::vector<uint8_t> config = {0x32, 0x1c, 0x9,  0x0,  0x0,  0x0,  0x0,  0x0,  0x0,  0xe0, 0x3f,
                                           0x20, 0x1,  0x2a, 0xf,  0x30, 0x2c, 0x31, 0x2c, 0x32, 0x2c, 0x33,
                                           0x2c, 0x34, 0x2c, 0x35, 0x2c, 0x36, 0x2c, 0x37, 0x38, 0x1};
            TF_SetConfig(session_opts, (void *)config.data(), config.size(), status_.get());
            check_status(status_.get());

            // new options for the graph import
            std::unique_ptr<TF_ImportGraphDefOptions, decltype(&TF_DeleteImportGraphDefOptions)> graph_options(
                TF_NewImportGraphDefOptions(), TF_DeleteImportGraphDefOptions);
            // prepare the device string for this GPU
            std::string device = "/device:GPU:" + std::to_string(g);
            // set the default-device option with the correct string
            TF_ImportGraphDefOptionsSetDefaultDevice(graph_options.get(), device.c_str());
            // prepare the new, modified graph
            TF_Graph *ngraph = TF_NewGraph();
            // using the previous buffer + options, generate the new graph
            TF_GraphImportGraphDef(ngraph, buffer.get(), graph_options.get(), status_.get());
            check_status(status_.get());
            // generate a new session with the newly modified graph
            TF_Session *device_session = TF_NewSession(ngraph, session_opts, status_.get());
            check_status(status_.get());
            // push the new session, options, and graph into a concurrent queue;
            // a thread will pick up a session to execute TF_SessionRun
            qsession_.push(std::make_tuple(device_session, session_opts, ngraph));
        }
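
(For the record, if I decode those config bytes correctly against the ConfigProto wire format, they mean: gpu_options { per_process_gpu_memory_fraction: 0.5, allow_growth: true, visible_device_list: "0,1,2,3,4,5,6,7" }, plus allow_soft_placement: true.)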


However, this new session crashes during the execution of TF_SessionRun. Indeed, the graph has been modified (at least all the GPUs are correctly assigned), but the data from the saved model are "missing" in the new session.

I get the following error:


Container: localhost. This could mean that the variable was uninitialized. Not found: Container localhost does not exist. (Could not find resource:


This "feature" is independent of what I am trying to do. Currently, from the current C-API, I think it is not possible to restart a working new Session from the Graph obtained from  TF_LoadSessionFromSavedModel, the graph is ok but the meta data will be always missing.


        auto session = TF_LoadSessionFromSavedModel(session_opts_.get(), run_opts_, model.c_str(), &tags, ntags,
                                                    graph_.get(), nullptr, status_.get());

        auto new_session = TF_NewSession(graph_.get(), session_opts_.get(), status_.get()); // usage of this session will fail because the metadata are missing


If you look at the source of TF_LoadSessionFromSavedModel, the TF_Session is built using all the content of the saved-model directory. So, game over or not, reading all the functions in c_api.h I do not find anything to connect the "data" of the saved model to the new graph I have generated.
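
(My understanding, to be taken with a grain of salt: TF_LoadSessionFromSavedModel runs the saved model's restore ops inside the session it creates, which is what initializes the variables; a session created later with TF_NewSession on the imported graph never runs them, hence the "uninitialized" container error above.)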


Thank you very much for the support 


++t


timothée ewart

May 28, 2020, 5:18:44 AM
to TensorFlow Developers
Hello all,

TF 1 : Tim 0.

Currently, using a modification of the protobuf to specify the device for every new session, I get this nice error message:

what():  TensorFlow device (GPU:0) is being mapped to multiple CUDA devices (1 now, and 0 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not  currently supported, see https://github.com/tensorflow/tensorflow/issues/19083


A ticket was opened two years ago about multi-session GPU in a multi-threaded environment (C++). Well, it is not supported. I am sad :(

t kevin

May 28, 2020, 9:12:45 PM
to timothée ewart, TensorFlow Developers
Hi Tim

This happens when you have an illegal set of visible devices
(ConfigProto.gpu_options.visible_device_list) when running multiple sessions.
I can't tell what the problem is from the hex code you pasted in your previous mail.

Kevin


timothée ewart

Jun 3, 2020, 4:37:23 PM
to TensorFlow Developers, timothe...@gmail.com
Hello all,

I have finally been successful. In Python I transformed my saved model the old-fashioned way: I froze the model and obtained a single protobuf where graph + data live together. Using this protobuf I was able to pin the specific device I need (gpu:0, etc.) with the method proposed in this post. At the end I created n sessions, one per GPU. Each session is managed by a single TBB thread. I have been able to run my pipeline on 8 GPUs, async, etc. It works quite well.
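
In case it helps someone else, the skeleton looks roughly like this (a sketch, not my exact code; read_file() is a placeholder that returns the .pb bytes):

// one session per GPU, all built from the same frozen GraphDef
std::string pb = read_file("frozen_model.pb");
TF_Buffer *buffer = TF_NewBufferFromString(pb.data(), pb.size());
for (int g = 0; g < ngpu; ++g) {
    TF_ImportGraphDefOptions *opts = TF_NewImportGraphDefOptions();
    std::string device = "/device:GPU:" + std::to_string(g);
    TF_ImportGraphDefOptionsSetDefaultDevice(opts, device.c_str());
    TF_Graph *graph = TF_NewGraph();
    TF_GraphImportGraphDef(graph, buffer, opts, status);
    // a frozen graph has no variables to restore, so TF_NewSession works directly
    TF_Session *session = TF_NewSession(graph, session_opts, status);
    TF_DeleteImportGraphDefOptions(opts);
    // hand (session, graph) over to one TBB thread
}
TF_DeleteBuffer(buffer);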

Thank you and all the best

++t