I am looking for options to serve predictions from a Caffe model on the GPU in parallel. Since the GPU comes with limited memory, what options are available for achieving parallelism while loading the net only once?
Things I have tried: I successfully wrapped my segmentation net with tornado wsgi + flask. But at the end of the day, this is equivalent to serving from a single process.
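To make the single-process limitation concrete, here is a minimal sketch of that setup, using the stdlib `wsgiref` in place of tornado/flask, and a hypothetical `run_segmentation` stub standing in for the actual caffe net (which in the real app is loaded once at module import):

```python
import json
from wsgiref.simple_server import make_server

# Hypothetical stand-in for the caffe segmentation net. In the real app
# this would be a caffe.Net created once, at module import time.
def run_segmentation(payload):
    return {"mask_shape": [len(payload), 1]}

def application(environ, start_response):
    # Read the request body and run a single prediction against the net.
    try:
        size = int(environ.get("CONTENT_LENGTH") or 0)
    except ValueError:
        size = 0
    body = environ["wsgi.input"].read(size) if size else b""
    data = json.dumps(run_segmentation(body)).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(data)))])
    return [data]

if __name__ == "__main__":
    # One process, one copy of the net -- requests are served serially.
    make_server("0.0.0.0", 8000, application).serve_forever()
```

Everything here runs in a single process, so a second concurrent request has to wait for the first one to finish its forward pass.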
Is having its own copy of the net for each process a strict requirement, given that the net is read-only after training is done? Is it possible to rely on fork for parallelism? I am working on a sample app which serves results from the segmentation model. It relies on copy-on-write: it loads the net in the master once and serves memory references to the forked children. However, I am having trouble starting this setup under a web server: I get a MemoryError when I try to initialize the model. The web server I am using here is uwsgi.
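The copy-on-write idea can be sketched without caffe at all: a large read-only structure (a stand-in for the net) is loaded once in the parent, and forked workers read it without duplicating it. This is an illustrative sketch, not the real app; `NET` and `predict` are placeholders:

```python
import multiprocessing as mp

# Stand-in for the trained net: a large read-only structure loaded once
# in the master. In the real app this would be the caffe.Net.
NET = {"weights": list(range(1_000_000))}

def predict(idx):
    # Forked children see NET through copy-on-write: as long as they only
    # read it, the parent's pages are shared and the net exists once.
    return NET["weights"][idx]

if __name__ == "__main__":
    # Force fork (not spawn) so children inherit NET; fork is not
    # available on Windows.
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=4) as pool:
        print(pool.map(predict, [0, 10, 999_999]))  # -> [0, 10, 999999]
```

One caveat: copy-on-write applies to host memory. A CUDA context generally cannot be shared across fork, which is part of why the GPU case is harder than the CPU case.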
Has anyone achieved parallelism in the serving layer by loading the net only once (since GPU memory is limited)? I would be grateful if any of you could point me in the right direction.
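For reference, the kind of uwsgi configuration I am attempting looks roughly like this (the module name is a placeholder); the intent is the preforking model, where the app and net load once in the master before the workers are forked:

```ini
[uwsgi]
# Placeholder module:callable for the app that holds the net.
module = segmentation_app:application
master = true
# Without lazy-apps, workers are forked after the app (and net) is
# loaded in the master, so host memory is shared copy-on-write.
processes = 4
http = :8000
```

With `lazy-apps = true` instead, each worker would load its own copy of the net, which is exactly what I am trying to avoid.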