> You might be able to train, if your batch size is very small (increase
> iter_size to offset this reduction). The fully connected layers usually
> take up the most memory.
For images, a 256x256 image is 64K pixels and takes 192 kilobytes raw
(one byte per color channel); Caffe stores blobs as 32-bit floats,
though, so figure about four times that on the GPU. A batch of size 256
would then require about 50 megabytes raw, or roughly 200 megabytes as
floats, which is still small compared to any modern GPU's memory, I think.
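
To make that arithmetic explicit, a quick Python sketch (the numbers
and names are just for illustration; the factor of four is the
32-bit-float assumption above):

    # Back-of-envelope memory for a batch of 3-channel 256x256 images.
    batch_size = 256
    height = width = 256
    channels = 3
    raw_bytes = height * width * channels      # 196608 bytes = 192 KB/image
    f32_bytes = raw_bytes * 4                  # stored as 32-bit floats
    print(raw_bytes / 2**10, "KB per image, raw")                 # 192.0
    print(batch_size * f32_bytes / 2**20, "MB per batch, float")  # 192.0
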
On the other hand, the connection between two 4096-neuron FC (IP)
layers requires 4096² = ~16.8M weights, which I guess will be 32-bit
floats on the GPU. That's 64 megabytes for that single layer. So unless
you have very large input data and/or a very simple network, the bulk
of the memory is likely to go into the network itself.
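
Same sort of sketch for the weights:

    # Weights connecting two 4096-neuron InnerProduct (FC) layers.
    fan_in = fan_out = 4096
    n_weights = fan_in * fan_out          # 16,777,216 parameters
    print(n_weights * 4 / 2**20, "MB")    # 4 bytes each -> 64.0 MB
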
Oh wait: the batch is forward-fed as a tensor, isn't it? So
intermediate values are generated for the whole batch at once. While
you only need to store two copies of the weights (the extra one for the
update), you also need batch-size times layer-size of activation data
for each layer, here 4096x256 floats, about 4 MB. And since those
activations have to be kept around for the backward pass, they pile up
across all the layers rather than one layer at a time; still, with FC
layers this size the weights dominate in most cases.
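
Sketching the activations the same way (and note that Caffe also keeps
a same-sized diff buffer per blob during training, which roughly
doubles this):

    # Activations of one 4096-wide FC layer for a batch of 256 (float32).
    neurons = 4096
    batch_size = 256
    print(neurons * batch_size * 4 / 2**20, "MB per layer per copy")  # 4.0
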
These are all very back-of-envelope guesstimates, so I'd be curious to
hear about practical experiences. I currently use a Titan X with 12GB
and haven't run into any problems (with AlexNet). What kind of network
and data gave you (previous poster) an out-of-memory error on 2GB? Did
anybody run out with 6GB or 8GB cards?
-k
--
If I haven't seen further, it is by standing in the footprints of giants