SqueezeNet disappointingly slow


Alexander Weiss

Aug 17, 2016, 4:04:00 PM
to torch7
I've been a big fan of VGG-style networks for a while, but I recently decided to try out a SqueezeNet-style network for classification and see if I could reduce the size of my nets and get a nice speed boost.  As promised, the networks are significantly smaller, while still maintaining very good accuracy.  In my case, I was able to reduce my networks from ~130MB down to ~5MB.  I'm not doing any additional compression on the networks (like pruning, quantization, etc.).  All of this was implemented in Torch.

I expected that with such a large reduction in the number of neural net parameters (a 25x reduction), the nets would run significantly faster than my VGG-style nets. Disappointingly, I'm only getting a 33% speed boost. Running on an AWS K520 GPU, inference time for my old nets was ~7.5ms for a 128x128 image. For my new nets, inference time is ~5ms. Maybe this isn't surprising; the number of layers in my old and new networks is pretty similar, and the GPU can't parallelize computation across different layers. (On the CPU, the SqueezeNet-style network runs 3x faster than the VGG-style network.) So maybe for increased speed (w/ GPU) I really need to go wider instead of deeper, but are there any tricks to improve the speed of my SqueezeNet-style network within the Torch framework (or otherwise)?

soumith

Aug 17, 2016, 4:11:45 PM
to torch7 on behalf of Alexander Weiss
For the GPU, the batch size usually has to be larger than 1 image.

Try a batch size of 256 inputs, and you'll see a wider gap in performance.
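
Something like this is the usual way to time it (a rough sketch, not code from this thread; 'model.t7' is just a placeholder for your own serialized model, and the synchronize calls matter because GPU kernels launch asynchronously):

require 'cunn'

local net = torch.load('model.t7'):cuda()   -- placeholder model file
net:evaluate()

-- Dummy batch of 64 RGB images at 128x128.
local batch = torch.CudaTensor(64, 3, 128, 128):uniform()

cutorch.synchronize()
local timer = torch.Timer()
net:forward(batch)
cutorch.synchronize()   -- wait for the GPU to finish before reading the timer
print(('%.2f ms per image'):format(timer:time().real * 1000 / batch:size(1)))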

Alexander Weiss

Aug 17, 2016, 4:33:36 PM
to torch7
My GPU doesn't have enough memory to batch 256 images.  However, the inference time of the SqueezeNet-style network appears to scale linearly with batch size up to 64 images.  Interestingly, it seems like I can push through larger images (like 256x256) without any significant increase in inference time over the 128x128 images.  (Just to be clear, this is a fully convolutional architecture, so I can input images of different sizes.)
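
For concreteness, here's the kind of thing I mean by fully convolutional (minimal sketch; the model path is just a stand-in):

require 'cunn'

local net = torch.load('model.t7'):cuda()   -- stand-in for the fully convolutional model
net:evaluate()

-- The same network accepts different spatial sizes; only the output resolution changes.
print(#net:forward(torch.CudaTensor(1, 3, 128, 128):uniform()))
print(#net:forward(torch.CudaTensor(1, 3, 256, 256):uniform()))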

soumith

Aug 17, 2016, 4:51:35 PM
to torch7 on behalf of Alexander Weiss
Are you using cudnn, and are you using the option "cudnn.benchmark = true"?
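
If you do go back to cudnn, the usual recipe looks roughly like this ('model.t7' is a placeholder for your model):

require 'cudnn'

cudnn.benchmark = true   -- autotune: benchmark the available algorithms on first use and pick the fastest
cudnn.fastest = true     -- optional: prefer the fastest algorithms even if they use more memory

local net = torch.load('model.t7'):cuda()   -- placeholder model file
cudnn.convert(net, cudnn)                   -- swap nn.Spatial* modules for their cudnn equivalents
net:evaluate()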

Alexander Weiss

Aug 17, 2016, 5:12:33 PM
to torch7
No. I gave up on cudnn because of mysterious memory leaks (which apparently only affect me). I can try reinstalling it and then running some tests, but it's not an ideal solution for me. Can I really expect a significant boost from this? I honestly never noticed significant speed differences with cudnn, at least not during inference. Maybe it's because I wasn't using large enough batches.

soumith

Aug 17, 2016, 5:22:29 PM
to torch7 on behalf of Alexander Weiss
cudnn is totally worth a try. I heard that they added special inference code-paths that speed things up.

Bartosz Ludwiczuk

Aug 18, 2016, 2:01:34 AM
to torch7
I think it can be related to something else: remember that the size of a model is not strongly correlated with its speed. Just compare AlexNet and GoogLeNet:

AlexNet:    232 MB model,  ~25 ms forward pass
GoogLeNet:   51 MB model, ~140 ms forward pass

(model sizes are for the Caffe models; speeds are taken from https://github.com/soumith/convnet-benchmarks)

I think SqueezeNet is in the same situation as GoogLeNet: it uses many 1x1 and 3x3 convolutions, which don't need many parameters but still need a lot of computation. :)
So even if you increase the batch size, the gain won't be of the same magnitude as the size reduction.
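
For a rough illustration (numbers made up, not measured from your nets): a 3x3 convolution with 256 input and 256 output channels has 256 x 256 x 3 x 3 ≈ 590K parameters, but applied to a 32x32 feature map it performs about 590K x 32 x 32 ≈ 604M multiply-accumulates. Shrinking the parameter count doesn't shrink the per-image compute by the same factor, because the same small filters are reused at every spatial position.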


Eugenio Culurciello

Aug 18, 2016, 9:50:45 AM
to torch7
Try also our ENet, which is better than SqueezeNet and has great performance:

Alexander Weiss

Aug 18, 2016, 11:48:40 AM
to torch7
Thanks for all the great responses.  I haven't tried reinstalling cudnn yet, since I'm in the middle of a training session and will need to upgrade my NVIDIA drivers as well.

@Bartosz:  I think you make a good point.  Maybe the expand part of the fire modules, which runs 3x3 convolutions in parallel with 1x1, should be implemented in a smarter way.  It doesn't make sense to launch separate kernels for those two sets of convolutions.  It would be nice if Torch had an efficient module for stacking convolutions with different sized kernels.  (I'm not sure how that would mesh with cudnn.)
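
For concreteness, the expand stage is typically written with stock Torch modules something like this (channel sizes here are just illustrative, following the fire-module pattern):

require 'nn'

-- Illustrative fire-module expand stage: 16 squeeze channels expanded to
-- 64 channels of 1x1 plus 64 channels of 3x3 (padded to keep the spatial size).
local expand = nn.Concat(2)                                    -- concatenate along the channel dimension
expand:add(nn.SpatialConvolution(16, 64, 1, 1))                -- 1x1 branch
expand:add(nn.SpatialConvolution(16, 64, 3, 3, 1, 1, 1, 1))    -- 3x3 branch, pad 1

-- Each branch runs as a separate convolution kernel, which is the overhead in question.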

@Eugenio: I saw your ENet paper, but hadn't thought to use it for classification. Obviously it's straightforward to convert it to a classifier, as demonstrated in your training guide. I'll definitely try it out. Thanks for the links!