The first run of forward inference is slow on the GPU. Any ideas why?


Chris Padwick

Sep 9, 2016, 12:58:52 AM
to torch7
Hi - I am trying to understand the "warmup" behavior in a simple network.  I have tested it with both the cunn and the cudnn backends and find the same result.  The first run is very slow compared to subsequent runs.

Here are the timing results:

cunn timings

0.01 *
 7.5459
 0.0938
 0.0825
 0.0796
 0.0792
 0.0811
 0.0787
 0.0789
 0.0778
 0.0877
[torch.DoubleTensor of size 10]


cudnn times

 1.3346
 0.0008
 0.0008
 0.0008
 0.0008
 0.0008
 0.0008
 0.0008
 0.0008
 0.0009
[torch.DoubleTensor of size 10]


My interest is in understanding what is going on during that first inference call and how long the "warmup" period lasts. For example, if I deploy an app that does this forward inference, is a single "warmup" call at startup enough, or do I need to repeat it every X seconds? Or is there something deeper going on? It acts an awful lot like some state is loaded onto the GPU during the first run and reused for subsequent runs, which is fine, but I would like confirmation that this is expected behavior and that my "warmup" idea will work. The platform is a TX1.


Here is my lua file:

require 'nn'
require 'cunn'
require 'cutorch'
require 'cudnn'

-- Small two-convolution network
net = nn.Sequential()
net:add(nn.SpatialConvolution(3, 6, 5, 5))
net:add(nn.ReLU())
net:add(nn.SpatialMaxPooling(2, 2, 2, 2))
net:add(nn.SpatialConvolution(6, 16, 5, 5))
net = net:cuda()

input = torch.rand(1, 3, 1000, 1000)
input = input:cuda()

nruns = 10
times = torch.zeros(nruns)

-- Time forward passes with the cunn backend
for i = 1, nruns do
  start = os.clock()
  out = net:forward(input)
  stoptime = os.clock()
  times[i] = stoptime - start
end

print('\n\ncunn timings\n\n')
print(times)

-- Convert the network to the cudnn backend and time again
cudnn.fastest = true
cudnn.convert(net, cudnn)

for i = 1, nruns do
  start = os.clock()
  out = net:forward(input)
  stoptime = os.clock()
  times[i] = stoptime - start
end

print('\n\ncudnn times\n\n')
print(times)
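
One caveat about the timing itself: os.clock() measures CPU time, and CUDA kernels are launched asynchronously, so the per-call numbers above are only approximate. A synchronized variant of the timing loop would look something like this (a sketch; it reuses the net, input, nruns and times from the script above):

local timer = torch.Timer()
for i = 1, nruns do
  cutorch.synchronize()          -- make sure earlier kernels have finished
  timer:reset()
  net:forward(input)
  cutorch.synchronize()          -- wait for this forward pass to complete
  times[i] = timer:time().real   -- wall-clock seconds for this call
end
print(times)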

Florient Chouteau

Sep 9, 2016, 4:29:34 AM
to torch7
Wild guess, since I've noticed the same thing: isn't it because the model state is blank except for the weights, so the first pass has to allocate memory for each module's output tensor (and other buffers) at every layer?

I suspect that if you run model:clearState() between inferences, you will reproduce the same "first inference" lag every time.
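
A quick way to test that (a sketch; it assumes the net, input, nruns and times from the original script, and that nn.Module:clearState() is available in your version of nn):

local timer = torch.Timer()
for i = 1, nruns do
  net:clearState()               -- drop the cached output/gradInput buffers
  cutorch.synchronize()
  timer:reset()
  net:forward(input)             -- buffers have to be re-allocated on this call
  cutorch.synchronize()
  times[i] = timer:time().real
end
print(times)                     -- if the hypothesis holds, every entry looks like a "first" run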

Vislab

Sep 9, 2016, 5:13:09 AM
to torch7
When you first do a forward pass on a model that doesn't have all the necessary buffers allocated yet, the first time you do it then some allocation will have to happen, which costs time. If you save the model with the pre-allocated buffers then you won't need to do a first forward pass on your model to initialize the necessary buffers whenever you load it, but generally this takes too much extra space in disk and doing a pre-emptive forward pass on some random data on your model is so much "cheaper" that you can do it almost every time without worrying about it.
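
In practice the warmup is just a throwaway forward pass right after loading, along the lines of the sketch below ('model.t7' and the 1x3x1000x1000 input shape are placeholders; substitute your own model path and deployment input size):

require 'cutorch'
require 'cunn'
require 'cudnn'

-- Load the trained network ('model.t7' is a placeholder path) and move it to the GPU
local net = torch.load('model.t7'):cuda()
net:evaluate()

-- One throwaway forward pass on random data of the deployment shape allocates
-- the buffers (and lets cudnn settle on its algorithms) before real requests arrive
local warmupInput = torch.rand(1, 3, 1000, 1000):cuda()
net:forward(warmupInput)
cutorch.synchronize()

-- From here on, net:forward(realInput) should run at steady-state speed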