Serving Caffe models from GPU. Achieving parallelism.


senthil kumaran

Jun 22, 2016, 4:08:32 PM
to Caffe Users
I am looking for options to serve parallel predictions from a Caffe model on a GPU. Since a GPU comes with limited memory, what options are available to achieve parallelism while loading the net only once?

Things I have tried: I successfully wrapped my segmentation net with Tornado WSGI + Flask. But at the end of the day this is essentially equivalent to serving from a single process.
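For concreteness, the single-process version looks roughly like this (a sketch only: the blob name 'data', the model file names, and the output handling are placeholders for my actual segmentation setup):

# Single-process sketch: Flask app wrapping a Caffe segmentation net.
import numpy as np
import caffe
from flask import Flask, request, jsonify

caffe.set_device(0)
caffe.set_mode_gpu()
# Placeholder file names; 'data' below is assumed to be the input blob.
net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)

app = Flask(__name__)

@app.route('/segment', methods=['POST'])
def segment():
    # Expect a raw float32 buffer shaped like the net's input blob.
    data = np.frombuffer(request.data, dtype=np.float32)
    data = data.reshape(net.blobs['data'].data.shape)
    net.blobs['data'].data[...] = data
    out = net.forward()
    # Argmax over the channel axis gives the per-pixel label map.
    labels = out[next(iter(out))].argmax(axis=1)
    return jsonify(labels=labels.tolist())

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

Since everything runs in one process, requests are effectively serialized through this single net.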

Is having its own copy of the net a strict requirement for each process, given that the net is read-only once training is done? Is it possible to rely on fork for parallelism? I am working on a sample app which serves results from the segmentation model. It relies on copy-on-write: it loads the net in the master once and serves memory references to the forked children. I am having trouble starting this setup in a web server setting: I get a MemoryError when I try to initialize the model. The web server I am using here is uWSGI.
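Outside the web server, the sample app is structured roughly like this (again a sketch; the blob name and model files are placeholders):

# Fork/copy-on-write sketch: load the net once in the parent, then fork
# workers that only read it. Copy-on-write shares the host-side weights;
# it cannot share GPU memory across processes.
import multiprocessing as mp
import numpy as np
import caffe

# Loaded before fork, in Caffe's default CPU mode, so the parent never
# touches CUDA; the weights sit in parent RAM and are inherited by the
# children via copy-on-write.
net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)

def worker(jobs, results):
    # GPU state is created per child, after the fork.
    caffe.set_device(0)
    caffe.set_mode_gpu()
    while True:
        data = jobs.get()
        if data is None:
            return
        net.blobs['data'].data[...] = data   # net inherited from parent
        out = net.forward()
        results.put(out[next(iter(out))].copy())

if __name__ == '__main__':
    jobs, results = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(jobs, results))
               for _ in range(2)]
    for w in workers:
        w.start()
    # Feed one dummy request per worker, then shut down.
    shape = net.blobs['data'].data.shape
    for _ in workers:
        jobs.put(np.zeros(shape, dtype=np.float32))
    for _ in workers:
        print(results.get().shape)
    for _ in workers:
        jobs.put(None)
    for w in workers:
        w.join()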

Has anyone loaded the net only once (since GPU memory is limited) and still achieved parallelism in the serving layer? I would be grateful if any one of you could point me in the right direction.


xiang...@gmail.com

Jun 29, 2016, 6:30:02 AM
to Caffe Users
Same for me.

Nicolae Titus

Jun 29, 2016, 10:52:46 AM
to Caffe Users
If I recall correctly, somebody at Baidu mentioned that their machine learning prediction server would wait until 3-4 requests had arrived and only then send all of them to the GPU together, to achieve good performance.
Sorry, I don't know the details.
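I imagine the general shape is something like this (just a sketch of the idea; the names, batch size, and timeout are made up, not their actual code):

# Micro-batching sketch: a single dispatcher thread owns the net,
# collects up to BATCH_SIZE requests (waiting at most TIMEOUT seconds
# for the batch to fill), and runs one batched forward pass.
import queue
import threading
import numpy as np
import caffe

BATCH_SIZE = 4   # the "3-4 requests" from the anecdote
TIMEOUT = 0.01   # max seconds to wait for more requests

pending = queue.Queue()   # items are (input_array, reply_queue) pairs

def dispatcher(net):
    while True:
        batch = [pending.get()]             # block for the first request
        while len(batch) < BATCH_SIZE:
            try:
                batch.append(pending.get(timeout=TIMEOUT))
            except queue.Empty:
                break                       # a partial batch is fine
        inputs = np.stack([item[0] for item in batch])
        net.blobs['data'].reshape(*inputs.shape)
        net.reshape()
        net.blobs['data'].data[...] = inputs
        out = net.forward()
        scores = out[next(iter(out))]
        for i, (_, reply) in enumerate(batch):
            reply.put(scores[i].copy())     # each caller gets its slice

def predict(net_input):
    # Called from request-handler threads; blocks until the batch runs.
    reply = queue.Queue(maxsize=1)
    pending.put((net_input, reply))
    return reply.get()

caffe.set_device(0)
caffe.set_mode_gpu()
net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)
t = threading.Thread(target=dispatcher, args=(net,))
t.daemon = True
t.start()

Only the dispatcher thread ever touches the GPU, so the net exists exactly once; the parallelism comes from batching rather than from multiple processes.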

Georgiy Slobodyrev

May 18, 2017, 8:40:18 AM
to Caffe Users
Hi senthil kumaran,

Have you achieved any results in serving one Caffe model from parallel processes? We are trying to do the same and are very interested in your solution.
Thank you for any details.