memory problem with detect tool


Ted

Jan 28, 2013, 5:07:39 AM
to ebl...@googlegroups.com
Hello,

I am having a problem running the detect tool (EBLearn 1.12\bin - windows):
I cannot process large images. Every time I try a larger input, the detector exits with an error:

Error: cannot grow storage to 30324672 bytes (probably out of memory), in ebl::idx<float>::idx at c:\eblearn\core\libidx\include\idx.hpp:142

The detector consumes a lot of memory, and at around 1.7 GB of RAM it crashes. The limit seems to be the maximum resolution of the image: I can use plenty of input resolutions (for example, 18 different resolutions), and as long as the biggest one is not too large (600x800, for example) it works. On the other hand, if I use only one resolution but it is large (985x1292, for example) it crashes. The error also does not seem to be affected by the number of bounding boxes / detections found in the image. Do you have any idea why this is happening? This seems like too much memory for such a small task (processing one 985x1292 image).

Thank you for any help.

the_minion

Jan 28, 2013, 10:49:46 AM
to ebl...@googlegroups.com, defau...@hotmail.com
Hello Ted,

I answered your question in two parts.
Part 1 clarifies what the detect utility does and how scales work.
Part 2 gives you a memory calculation for a 985x1292x3 image (3 color channels) for the SVHN architecture (eblearn/demos/svhn/).

Part 1
====
The detection utility runs a trained convnet over the input at multiple scales and collects its responses, so that an object can be found whatever size it might be (from a minimum size to a maximum size, in scale steps).

If you give only the full image as a single scale, it will try to find all objects of the network's training input size at that scale.
For example, taking the face detector (in eblearn/demos/face/face.conf),

At scale 985x1292, it will try to find all faces of size 32x32 (the input size with which the network was trained) in that huge image.
You can see an example of this in the attached example_1.png: given a 288x288 input image with a single 288x288 scale, it tried to find all faces of size 32x32.

If you are trying to detect a large object that fills the whole image (for example, a face that covers the entire image), then you want a scale close to your trained input size, i.e. a single scale of 32x32 (in the face-detector case). The detector then resizes the input image to 32x32 and runs detection on it.
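To make the scale logic concrete, here is a minimal sketch in Python (illustrative only, not EBLearn's actual code; the 32x32 input size comes from the face-detector example, while the 1.3 scale step is an assumed value):

```python
# Sketch of the multi-scale logic described above (illustrative only,
# not EBLearn's implementation). To find objects of size "obj" with a
# network trained on 32x32 inputs, the image is resized by 32/obj so
# that those objects become 32x32.
def scale_pyramid(img_h, img_w, net_in=32, step=1.3):
    obj = net_in                        # smallest detectable object size
    while obj <= min(img_h, img_w):
        f = net_in / float(obj)         # resize factor for this scale
        yield obj, int(round(img_h * f)), int(round(img_w * f))
        obj *= step                     # next object size (scale step)

for obj, h, w in scale_pyramid(985, 1292):
    print("objects ~%4dpx -> process image at %4dx%4d" % (obj, h, w))
# first scale: the full 985x1292 image (finds 32x32 faces)
# last scale : roughly 33x43 (finds faces that fill the image)
```

With only the single full-resolution scale, only the first line of that loop exists, which is why only 32x32-sized faces are found in example_1.png.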

Part 2
=====
Memory usage for double-precision detection on a 1282x985x3 input image (3 color channels), with a 32x32 training input size and 10 classes (like SVHN):
Total usage (double precision) = 8361 MB
Total usage (float precision)  = 4181 MB

Calculation
===========
* Input (each pixel takes 8 bytes, because double precision)
===> 1282x985x3x8 = 29 MB 
* Contrast normalization
Input dimensions = 1282 x 985 x 3
consists of 
1. global mean subtraction and variance division (done in-place)
2. local subtractive normalization
    inmean = sum_j (w_j * in_j) where w_j is a gaussian kernel, inmean is a temporary buffer and in_j is the input buffer
    out = in - inmean, so out buffer is as big as in buffer
    So, there are two buffers (inmean and out) as big as input
    Total memory relative to input size = 2 x input
3. temporary buffer (1 x input)
4. local divisive normalization
   insq = in^2
   invar = sum_j (w_j * insq_j)   (gaussian-weighted neighborhood of in^2)
   instd = sqrt(invar)
   thstd = max(instd, mean(instd))   (instd thresholded at its mean)
   out = in / thstd
   So, there are 5 buffers as big as the input buffer
   Total memory relative to the input size = 5 x input
Total memory overhead of Contrast normalization relative to input size =  8 x input
Output dimensions = 1282 x 985 x 3
===> 29 MB * 8 = 232 MB
* Convolution layer with 16 output feature maps and 5x5 kernel
Input dimensions = 1282 x 985 x 3
    Output size = ((1282 - 5 + 1) x (985 - 5 + 1) x 16 x 8) = 154 MB
    Output dimensions = 1278 x 981 x 16
===> 154 MB
* Bias module (addc)
Input dimensions = 1278 x 981 x 16
Just has one output buffer
Output dimensions = 1278 x 981 x 16
===> 154 MB
* Tanh module
Input dimensions = 1278 x 981 x 16
Just has one output buffer
Output dimensions = 1278 x 981 x 16
===> 154 MB
* L2-Pooling with 2x2 stride (means that output is half the size of input in both dimensions)
Input dimensions = 1278 x 981 x 16
Consists of
1. squared = in^2
2. convolved = sum_j (w_j * squared_j), a gaussian-weighted neighborhood of in^2 (half the size of the input)
3. out = convolved^(1/p) with p = 2 (half the size of the input)
Total memory overhead relative to input = 1 + 0.5 + 0.5  = 2 x input
Output dimensions = 639 x 490 x 16
===> 154 MB * 2 = 308 MB
* Subtractive Normalization
Input dimensions = 639 x 490 x 16
Two input-sized buffers at these dimensions come to 77 MB (see input dimensions)
Output dimensions = 639 x 490 x 16
===> 77 MB x 2 = 154 MB
* Convolution layer with 512 output feature maps and 7x7 kernel
Input dimensions = 639 x 490 x 16
Output size = ((639 - 7 + 1) x (490 - 7 + 1) x 512 x 8) = 1196 MB
Output dimensions = 633 x 484 x 512
===> 1196 MB
* Bias module (addc)
Input dimensions = 633 x 484 x 512
Output dimensions = 633 x 484 x 512
===> 1196 MB
* Tanh module
Input dimensions = 633 x 484 x 512
Output dimensions = 633 x 484 x 512
===> 1196 MB
* L2-Pooling with 2x2 stride
Input dimensions = 633 x 484 x 512
Total memory overhead relative to input = 1 + 0.5 + 0.5  = 2 x input
Output dimensions = 316 x 242 x 512
===> 1196 MB * 2 = 2392 MB
* Subtractive Normalization
Input dimensions = 316 x 242 x 512
Output dimensions = 316 x 242 x 512
===> 598 MB * 2 = 1196 MB
* Linear layer with 20 hidden units
Input dimensions = 316 x 242 x 512
The linear layer is replicated for every input window
===> too lazy to calculate
Total memory consumption (excluding the linear layers) = 29 + 232 + 154 + 154 + 154 + 308 + 154 + 1196 + 1196 + 1196 + 2392 + 1196 = 8361 MB (a short script reproducing these numbers, up to rounding, follows below)
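Here is that script: a sketch that recomputes the figures above under the same assumptions (double precision, MB = 2^20 bytes, and the buffer multipliers used in this post; the layer shapes are transcribed from this thread, not parsed from svhn.conf). Small differences from the rounded figures above are expected:

```python
# Recomputes the stage-by-stage accounting above. Buffer multipliers
# (8x for contrast norm, 2x for pooling, ...) are the ones used in
# this post; shapes follow the SVHN architecture as described here.
MB, B = 2.0 ** 20, 8                   # mebibyte; bytes per double

def buf(h, w, c, mult=1):              # buffer group size in MB
    return h * w * c * B * mult / MB

h, w = 985, 1282
stages = [
    ("input",                  buf(h, w, 3)),
    ("contrast normalization", buf(h, w, 3, 8)),          # 8 x input
    ("conv 3->16, 5x5",        buf(h - 4, w - 4, 16)),
    ("bias (addc)",            buf(h - 4, w - 4, 16)),
    ("tanh",                   buf(h - 4, w - 4, 16)),
    ("l2-pooling 2x2",         buf(h - 4, w - 4, 16, 2)),  # 2 x input
]
h, w = (h - 4) // 2, (w - 4) // 2      # 490 x 639 after pooling
stages += [
    ("subtractive norm",       buf(h, w, 16, 4)),          # two buffer pairs
    ("conv 16->512, 7x7",      buf(h - 6, w - 6, 512)),
    ("bias (addc)",            buf(h - 6, w - 6, 512)),
    ("tanh",                   buf(h - 6, w - 6, 512)),
    ("l2-pooling 2x2",         buf(h - 6, w - 6, 512, 2)),
    ("subtractive norm",       buf((h - 6) // 2, (w - 6) // 2, 512, 4)),
]
for name, m in stages:
    print("%-24s %6.0f MB" % (name, m))
print("%-24s %6.0f MB" % ("total (double)", sum(m for _, m in stages)))
print("%-24s %6.0f MB" % ("total (float)", sum(m for _, m in stages) / 2))
```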

=============================================
example_1.png

the_minion

Jan 28, 2013, 10:53:51 AM
to ebl...@googlegroups.com, defau...@hotmail.com
And the part 2 calculations were done using the SVHN architecture described in eblearn/demos/svhn/svhn.conf (there is cool ASCII art showing the architecture :) )

Ted

Jan 28, 2013, 11:42:24 AM
to ebl...@googlegroups.com
Thank you very much for the comprehensive reply.

I understand part 1, but for part 2 I thought that you were using a scanning window which moves through the image and classifies each window one by one. So you replicate the classifier all over the image and then classify everything in one step? If so, is there any way to do this sequentially, even if only in one thread? Or is there any chance of getting an x64 release for Windows? Or are there other tricks, like not evaluating every position in the image?

the_minion

Jan 28, 2013, 12:03:52 PM
to ebl...@googlegroups.com, defau...@hotmail.com
So, the way convnets work, and convolutions in general, is that a computation is reused across the next n window computations.

For example, say you have a 32x32 window over a 100x100 image: when you compute the first window, parts of the calculation are reused for the next 31 windows. This is one of the reasons we do detections this way, as it is computationally waaay faster, even though it takes a little more memory. This advantage also extends to the middle layers of the convnet, not just the top layer.
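To put rough numbers on that reuse, here is a back-of-envelope sketch (a single 5x5 kernel is assumed for illustration; the real network has many kernels and layers):

```python
# Compare naive per-window evaluation of one 5x5 convolution kernel
# against evaluating it once over the whole image and sharing results.
# 32x32 windows over a 100x100 image, as in the example above.
H, W, win, k = 100, 100, 32, 5

windows = (H - win + 1) * (W - win + 1)      # 69*69 = 4761 windows
per_win = (win - k + 1) ** 2 * k * k         # conv cost inside one window
naive = windows * per_win                    # recomputed for every window

shared = (H - k + 1) * (W - k + 1) * k * k   # computed once, reused

print("naive sliding window: %11d multiplies" % naive)    # ~93 million
print("shared (convnet way): %11d multiplies" % shared)   # ~230 thousand
print("ratio: %.0fx" % (naive / float(shared)))           # ~400x
```

The flip side of the shared computation is memory: the feature maps for the whole image must be kept around at once, which is exactly what the accounting earlier in this thread adds up.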

You can do the computation sequentially like a traditional CV sliding window, but it becomes infeasible.

What you can do, however, is train the convnet itself with convolution-layer strides greater than 1, on a slightly bigger input image.

For example, for the face detector, instead of training with 32x32 inputs and two convolution layers with 1x1 strides, train with a 48x48 or 64x64 window and a 2x2 stride (or, if you'd like, train a 32x32 convnet with a 2x2 stride, or any bigger stride for that matter). This has much lower memory requirements (you can calculate the memory needed, but as a rough idea, the requirements become about half for a 2x2-stride network and one-third for a 3x3 stride).
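As a rough illustration (a sketch reusing the first-layer shapes from the memory calculation earlier in this thread; overall savings depend on the full architecture):

```python
# How stride shrinks detection-time feature maps: output area scales
# roughly as 1/stride^2. Shapes assumed from the calculation above
# (first conv layer: 16 maps, 5x5 kernel, double precision).
def conv_out(n, k, s):
    return (n - k) // s + 1

h, w = 985, 1282                     # conv input plane (after normalization)
for s in (1, 2, 3):
    oh, ow = conv_out(h, 5, s), conv_out(w, 5, s)
    mb = oh * ow * 16 * 8 / 2.0 ** 20
    print("stride %dx%d: %4dx%4dx16 -> %4.0f MB" % (s, s, oh, ow, mb))
# stride 1x1:  981x1278x16 ->  153 MB
# stride 2x2:  491x 639x16 ->   38 MB
# stride 3x3:  327x 426x16 ->   17 MB
```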

The only thing stopping an x64 Windows release is getting all the dependent libraries compiled in 64-bit mode. This is easy on Linux and OSX, but for Windows, all the libraries that are shipped seem to be 32-bit. The EBLearn code itself is written to be 64-bit compatible, and we use it in 64-bit on Linux and OSX.

-- s

the_minion

Jan 30, 2013, 12:53:59 AM
to ebl...@googlegroups.com, defau...@hotmail.com
I actually just got all the dependency libs compiled in 64-bit. Will post a 64-bit eblearn windows binary release soon.
--S

defau...@hotmail.com

Jan 30, 2013, 7:10:46 AM
to ebl...@googlegroups.com, defau...@hotmail.com
That's great. Thank you!

the_minion

Jan 30, 2013, 2:17:05 PM
to ebl...@googlegroups.com, defau...@hotmail.com
Here you go, 64-bit binaries are posted here:
http://eblearn.sourceforge.net/install.html#install_from_binaries
