Hi,
I was wondering whether anyone has tried serving a model by partitioning it across different devices. From GitHub issues and serving tutorials, the community seems to suggest that the best approach is to run the whole model on a single device (without model splitting). To quote one example GitHub issue: "Probably your best solution is to build a script which loads your graph once per GPU".
However, I'd guess that model splitting could provide higher inference throughput. Consider an analogy with model training. An ICML'17 paper shows that splitting the Inception-V3 model across 1 CPU and 4 GPUs achieves 19% faster training compared to the single-GPU setup (see Table 2). Figure 5 (see attached) shows how the graph nodes are split between the CPU and the 4 GPUs. Since both training and serving are dataflow systems, where faster training (and serving) comes from tensors taking less time to pass through the graph, I would intuitively expect model splitting to accelerate serving as well. Does anyone see why this might not be true?
Since TF Serving aims to provide high-performance inference, I was wondering whether anyone has already tried model splitting for serving, or knows of ongoing work on it.
Note: my question is about serving, not training. I know training already supports graph splitting. I'm adding this clarification because most of the mailing list questions are about training. Also, there is an old (20 months ago) post about model splitting between multiple GPUs for serving. But it does not describe how splitting can be done in general; it roughly says "if the model was trained on N GPUs, serving will also use N GPUs". What if training was done with 2 GPUs and I want to serve with 6 GPUs? A general answer is what I am looking for here.
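To make the 2-vs-6 GPU question concrete, here is a rough sketch of what I mean by a "general" split: given a per-layer cost estimate, cut the layer sequence into contiguous stages for however many devices are available at serving time, independent of how the model was trained. This is plain Python, not a TF Serving API. The function name `split_into_stages` and the cost numbers are made up for illustration, and it assumes the model is a simple chain of layers and ignores the cost of moving tensors between devices.

```python
from typing import List


def split_into_stages(layer_costs: List[float], num_devices: int) -> List[List[int]]:
    """Cut a chain of layers into contiguous stages, one per device.

    Hypothetical helper: minimizes the cost of the most expensive stage
    (the bottleneck) with a small dynamic program. Returns a list of
    stages, each a list of layer indices.
    """
    n = len(layer_costs)
    k = min(num_devices, n)  # never more stages than layers

    # prefix[i] = total cost of layers 0..i-1
    prefix = [0.0]
    for cost in layer_costs:
        prefix.append(prefix[-1] + cost)

    inf = float("inf")
    # best[s][i]: smallest bottleneck when the first i layers use s stages
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    # cut[s][i]: index where the last of those s stages begins
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for s in range(1, k + 1):
        for i in range(s, n + 1):
            for j in range(s - 1, i):
                bottleneck = max(best[s - 1][j], prefix[i] - prefix[j])
                if bottleneck < best[s][i]:
                    best[s][i] = bottleneck
                    cut[s][i] = j

    # Walk the recorded cuts backwards to recover the stages.
    stages = []
    i = n
    for s in range(k, 0, -1):
        j = cut[s][i]
        stages.append(list(range(j, i)))
        i = j
    stages.reverse()
    return stages


# Made-up per-layer costs (arbitrary units), for illustration only.
layer_costs = [4, 2, 3, 1, 5, 2, 2, 1]

for devices in (2, 6):
    stages = split_into_stages(layer_costs, devices)
    slowest = max(sum(layer_costs[i] for i in stage) for stage in stages)
    print(devices, "devices:", stages, "slowest stage cost:", slowest)
```

In this toy setup the same model gets a different partition for 2 and for 6 devices, which is the kind of serving-time re-splitting I am asking about. Whether it would actually help in practice depends on the transfer costs the sketch leaves out.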
Thanks!