I've answered your question in two parts.
Part 1 clears up any confusion about what the detect utility does and how scales work.
Part 2 walks through a memory calculation for a 1282x985x3 image (3 color channels) with the SVHN architecture (eblearn/demos/svhn/).
The detection utility runs a trained convnet over the input at multiple scales, so that an object can be found whatever its size: between a minimum and a maximum size, it looks for the object at each scale step.
If you give it only one scale, the full image, it will only find objects that are the size of the network's training input at that scale.
At scale 1282x985, it will try to find all faces of size 32x32 (the input size the network was trained with) in that huge image.
You can see an example of this in the attached example_1.png: given a 288x288 input image with a single 288x288 scale, it tried to find all faces of size 32x32.
If you are trying to detect a large object that fills the whole image (for example, a face that covers the entire frame), then you want a scale close to your training input size, i.e. a single 32x32 scale in the face-detector case. The detector then resizes the input image to 32x32 and runs detection on it.
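To make the scale stepping concrete, here is a minimal Python sketch of how a multi-scale detector can walk from the full image down to the network's input size. This is not eblearn's actual code, and the 1.3 step ratio is an arbitrary assumption:

def detection_scales(img_h, img_w, net_h, net_w, step=1.3):
    """Yield (h, w) scales from full resolution down to the training input size."""
    h, w = float(img_h), float(img_w)
    while h >= net_h and w >= net_w:
        yield (int(h), int(w))
        h, w = h / step, w / step

# A 1282x985 image scanned by a 32x32 network: at each scale, the detector
# finds objects that are ~32x32 at that resolution.
for scale in detection_scales(985, 1282, 32, 32):
    print(scale)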
Memory usage for double-precision detection on a 1282x985x3 input image, with a 32x32 training input size and 10 classes (like SVHN):
Total usage (double precision) = 8361 MB
Total usage (float precision) = 4180 MB
Calculation
===========
* Input (each pixel takes 8 bytes at double precision)
===> 1282x985x3x8 = 29 MB
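For reference, the arithmetic behind this figure (and the buffer sizes below) is just width x height x channels x 8 bytes, taking 1 MB = 2^20 bytes:

print(1282 * 985 * 3 * 8 / 2**20)  # ~28.9 MB, rounded to 29 MB here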
* Contrast normalization
Input dimensions = 1282 x 985 x 3
Consists of
1. global mean subtraction and variance division (done in-place)
2. local subtractive normalization
inmean = sum_j (w_j * in_j), where w_j is a gaussian kernel, inmean is a temporary buffer, and in_j is the input buffer
out = in - inmean, so the out buffer is as big as the in buffer
So there are two buffers (inmean and out) as big as the input
Total memory relative to input size = 2 x input
3. temporary buffer (1 x input)
4. local divisive normalization
insq = in^2
invar = sum_j (w_j * insq_j)
instd = sqrt(invar) = sqrt(sum_j (w_j * in_j^2))
thstd = max(instd, mean(instd)), i.e. instd clipped from below at its mean
out = in / thstd
So, there are 5 buffers (insq, invar, instd, thstd, out) as big as the input buffer
Total memory relative to the input size = 5 x input
Total memory overhead of Contrast normalization relative to input size = 8 x input
Output dimensions = 1282 x 985 x 3
===> 29 MB * 8 = 232 MB
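If it helps, here is a rough numpy/scipy sketch of the whole normalization, with comments counting the input-sized buffers that back the 8x figure. The gaussian sigma is an arbitrary assumption; eblearn's actual kernel parameters differ.

import numpy as np
from scipy.ndimage import gaussian_filter

def contrast_norm(x, sigma=2.0):
    # Global mean subtraction and variance division (in-place in eblearn;
    # written out-of-place here for clarity).
    x = (x - x.mean()) / x.std()
    # Local subtractive normalization: 2 input-sized buffers.
    inmean = gaussian_filter(x, sigma)       # buffer 1
    out = x - inmean                         # buffer 2
    tmp = np.empty_like(x)                   # buffer 3: eblearn's temporary
    # Local divisive normalization: 5 input-sized buffers.
    insq = out ** 2                          # buffer 4
    invar = gaussian_filter(insq, sigma)     # buffer 5
    instd = np.sqrt(invar)                   # buffer 6
    thstd = np.maximum(instd, instd.mean())  # buffer 7
    return out / thstd                       # buffer 8

y = contrast_norm(np.random.rand(985, 1282))  # ~8 input-sized buffers live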
* Convolution layer with 16 output feature maps and 5x5 kernel
Input dimensions = 1282 x 985 x 3
Output size = ((1282 - 5 + 1) x (985 - 5 + 1) x 16 x 8) = 154 MB
Output dimensions = 1278 x 981 x 16
===> 154 MB
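The convolutions here are "valid", so each spatial dimension shrinks by (kernel - 1). A quick check of the output dimensions and buffer sizes for both convolution layers in this network:

def conv_out_mb(h, w, k, maps, bytes_per_el=8):
    """Output dims and buffer size (MB) of a 'valid' convolution."""
    oh, ow = h - k + 1, w - k + 1
    return oh, ow, oh * ow * maps * bytes_per_el / 2**20

print(conv_out_mb(1282, 985, 5, 16))   # (1278, 981, ~153 MB)
print(conv_out_mb(639, 490, 7, 512))   # (633, 484, ~1197 MB)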
* Bias module (addc)
Input dimensions = 1278 x 981 x 16
Just has one output buffer
Output dimensions = 1278 x 981 x 16
===> 154 MB
* Tanh module
Input dimensions = 1278 x 981 x 16
Just has one output buffer
Output dimensions = 1278 x 981 x 16
===> 154 MB
* L2-Pooling with 2x2 stride (a 2x2 stride means the output is half the size of the input in each dimension)
Input dimensions = 1278 x 981 x 16
Consists of
1. squared = in^2
2. convolved = sum_j (w_j * squared_j), a gaussian-weighted neighborhood sum (half the size of the input)
3. out = convolved^(1/p) with p = 2 (half the size of the input)
Total memory overhead relative to input = 1 + 0.5 + 0.5 = 2 x input
Output dimensions = 639 x 490 x 16
===> 154 MB * 2 = 308 MB
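As a check on the pooling overhead, using the same accounting as above (a buffer subsampled by a 2x2 stride is counted as half the input's memory):

input_mb = 154
squared   = 1.0   # in^2, full size
convolved = 0.5   # gaussian-weighted sum of squared, subsampled
out       = 0.5   # convolved^(1/2), subsampled
print(input_mb * (squared + convolved + out))  # 308.0 MB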
* Subtractive Normalization
Input dimensions = 639 x 490 x 16
The input buffer to this is 77 MB in size (see input dimensions)
Output dimensions = 639 x 490 x 16
===> 77 MB x 2 = 154 MB
* Convolution layer with 512 output feature maps and 7x7 kernel
Input dimensions = 639 x 490 x 16
Output size = ((639 - 7 + 1) x (490 - 7 + 1) x 512 x 8) = 1196 MB
Output dimensions = 633 x 484 x 512
===> 1196 MB
* Bias module (addc)
Input dimensions = 633 x 484 x 512
Output dimensions = 633 x 484 x 512
===> 1196 MB
* Tanh module
Input dimensions = 633 x 484 x 512
Output dimensions = 633 x 484 x 512
===> 1196 MB
* L2-Pooling with 2x2 stride
Input dimensions = 633 x 484 x 512
Total memory overhead relative to input = 1 + 0.5 + 0.5 = 2 x input
Output dimensions = 316 x 242 x 512
===> 1196 MB * 2 = 2392 MB
* Subtractive Normalization
Input dimensions = 316 x 242 x 512
Output dimensions = 316 x 242 x 512
===> 598 MB * 2 = 1196 MB
* Linear layer with 20 hidden units
Input dimensions = 316 x 242 x 512
The linear layer is replicated for every input window
===> too lazy to calculate
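For what it's worth, here is a rough back-of-the-envelope estimate of the step skipped above. It assumes the SVHN network reduces its 32x32 training input to a 4x4 feature map before the linear layer (32 -> 28 after the 5x5 convolution, -> 14 after pooling, -> 8 after the 7x7 convolution, -> 4 after pooling), so the replicated linear layer acts like a 4x4 convolution; treat these numbers as an approximation, not eblearn's exact figures:

# Assumed: the linear layer sees 4x4x512 windows, replicated convolutionally,
# with an assumed final 10-class linear layer on top.
h, w = 316, 242                   # feature map entering the linear layer
oh, ow = h - 4 + 1, w - 4 + 1     # 313 x 239 replicated windows
print(oh * ow * 20 * 8 / 2**20)   # ~11.4 MB for the 20 hidden units
print(oh * ow * 10 * 8 / 2**20)   # ~5.7 MB for the 10 class outputs

In other words, the linear layers add on the order of tens of MB at most, which is negligible next to the convolution buffers.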
Total memory consumption (excluding the linear layers) = 29 + 232 + 154 + 154 + 154 + 308 + 154 + 1196 + 1196 + 1196 + 2392 + 1196 = 8361 MB
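The running total and the float-precision figure can be re-checked in a couple of lines:

layers = [29, 232, 154, 154, 154, 308, 154, 1196, 1196, 1196, 2392, 1196]
total = sum(layers)
print(total)      # 8361 MB at double precision
print(total / 2)  # ~4180 MB at float precision (floats take half the bytes)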