Hi all,
I'm currently working with an embedded device with limited onboard memory. I've already applied
https://github.com/BVLC/caffe/pull/2009, which helps a little. I'm using a fully convolutional model, and right now memory usage seems to peak around 600MB. I need to get that under 450MB, so I'm considering the following options and wondering whether they are possible, or already implemented somewhere I haven't found.
Using half-precision floating point: this seems like the simplest option, and from what I understand it shouldn't affect classification error too much. I just have no idea how to do it in Caffe (the first sketch after this list is how I'd check the accuracy impact).
Discarding lower-layer blob data as the network evaluates: seems like an ugly hack, but maybe a good idea nonetheless.
Chopping up my network's convolutional fc6 layer: I'd rather not, but may have to.
Sparse matrices: something I've been reading about. I'm not sure how hard it would be to implement, and it seems like it might reduce the memory footprint but still take quite a while to evaluate.
Low-rank approximations: this seems like it would reduce the parameter memory, but not the data or buffer sizes? (The second sketch after this list shows what I have in mind.)
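
To check the accuracy impact of half precision before worrying about how to store it, I was thinking of something along these lines. This is only a rough sketch: it simulates fp16 storage by rounding every parameter through float16 and back, it doesn't actually halve the memory, and 'deploy.prototxt' / 'weights.caffemodel' are placeholders for my files.

import numpy as np
import caffe

# Load the network (file names are placeholders for my model).
net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)

# Round every parameter blob through float16 and back to float32.
# This only simulates the precision loss of fp16 storage; the blobs
# are still held in 32-bit memory.
for name, params in net.params.items():
    for p in params:
        p.data[...] = p.data.astype(np.float16).astype(np.float32)

# Forward pass with the rounded weights, to compare classification
# results against the unmodified network.
out = net.forward()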
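
And for the low-rank idea, roughly this (again just a sketch, reusing the `net` from above; the rank k = 256 is an arbitrary choice, and actually using the factorization would mean replacing fc6 with two stacked layers):

# Truncated SVD of the convolutional fc6 weights.
W = net.params['fc6'][0].data        # shape (num_output, channels, kh, kw)
W2d = W.reshape(W.shape[0], -1)      # flatten to a 2-D matrix
U, S, Vt = np.linalg.svd(W2d, full_matrices=False)

k = 256                              # arbitrary rank, for illustration
A = U[:, :k] * S[:k]                 # num_output x k
B = Vt[:k, :]                        # k x (channels * kh * kw)

# Parameter count drops from W2d.size to A.size + B.size, i.e. the
# single fc6 layer becomes two smaller stacked layers (B then A).
print(W2d.size, A.size + B.size)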
I also can't explain where all the memory is coming from - my calculations don't quite add up to the total usage. The data size is ~113MB, the param size is ~227MB, and I guess there's a col_buffer somewhere that's eating a lot. Is that really it, though? Python reports ~616MB just to feed in a single image and do a forward pass. And do the Split layers also take memory, or are they just pointers to the same data?
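
For reference, this is roughly how I'm getting the ~113MB / ~227MB numbers (same `net` as above, counting 4 bytes per float32 element; it only counts the forward data arrays and the weights, nothing else):

def mb(nbytes):
    return nbytes / (1024.0 ** 2)

# Sum of all activation blobs (net.blobs) and all parameter blobs.
data_bytes = sum(b.data.size * 4 for b in net.blobs.values())
param_bytes = sum(p.data.size * 4
                  for params in net.params.values() for p in params)

print('blob data : %.1f MB' % mb(data_bytes))
print('parameters: %.1f MB' % mb(param_bytes))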
#disclaimer - I'm not a computer scientist (I studied physics), so hopefully these aren't idiotic questions.
Thanks for your help
Ellery