Hey everyone,
I am currently trying to deploy a deep learning network (U-Net) for semantic segmentation and would like to know whether there are ways to improve run time during deployment.
What I've done so far is clean up the Net.prototxt so that it contains only the layers I need for deployment. I have also increased the input tile size to the maximum the GPU (GeForce GTX 970) can process in a single pass.
Since the images I want to segment are quite large, I still need 90 seconds in total, including I/O and stitching, for an 8000x8000 output (196 passes through the net). I assign data to the network and read the result back using
net.blobs['data'].data[...] = data
net.forward()
output = net.blobs['argmax'].data[...]
in a loop.
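Roughly, the loop has the shape below. This is a simplified sketch under stated assumptions, not the real code: `forward_tile` is a hypothetical stand-in for the three Caffe lines above (here it just thresholds the tile), tiles are assumed not to overlap, and the function names and tile size are made up for illustration. It also times the forward calls separately, to show how the network time could be split from I/O and stitching.

```python
import time
import numpy as np


def forward_tile(tile):
    # Hypothetical stand-in for:
    #   net.blobs['data'].data[...] = tile
    #   net.forward()
    #   output = net.blobs['argmax'].data[...]
    # Here it only thresholds the tile so the sketch runs without Caffe.
    return (tile > 0.5).astype(np.uint8)


def segment_tiled(image, tile, forward=forward_tile):
    """Run `forward` on non-overlapping tiles and stitch the results.

    Returns the stitched label map, the number of passes, and the
    time spent inside `forward` alone (in seconds).
    """
    h, w = image.shape
    out = np.zeros((h, w), dtype=np.uint8)
    t_forward = 0.0
    passes = 0
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            # Edge tiles are smaller here; a real net with a fixed
            # input shape would need padding instead.
            patch = image[y:y + tile, x:x + tile]
            t0 = time.perf_counter()
            pred = forward(patch)
            t_forward += time.perf_counter() - t0
            out[y:y + tile, x:x + tile] = pred
            passes += 1
    return out, passes, t_forward


if __name__ == "__main__":
    img = np.random.rand(100, 100)
    labels, n_passes, seconds = segment_tiled(img, 32)
    print(n_passes, "passes,", seconds, "s spent in forward")
```

Comparing the time spent in `forward` with the total wall-clock time would show whether the forward passes or the I/O and stitching dominate.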
Apart from switching to C++, installing fb-caffe-exts (to further increase the input tile size), and upgrading the GPU (I get around 50 seconds on a GeForce Titan X), I am out of ideas on what to do.
Any help or tips are much appreciated! :)