Hi all,
I'm observing a curious difference in behavior between Nvidia
drivers 319.37 and 331.89. Here's the setup:
I have two machines, each with two K20s. Machine A has Nvidia driver
version 319.37 installed, and machine B has driver version 331.89. I
compile a train_net.bin binary using CUDA 5.5 and copy it to each
machine, along with identical leveldb and *.prototxt files. Because
each machine has two GPUs available, I export CUDA_VISIBLE_DEVICES=0
before running train_net.bin, so each run sees only the first K20.
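(For reference, a minimal sanity check along the lines below, a
hypothetical helper that is not part of Caffe or my scripts, would
confirm that the masking works and report which CUDA version the
installed driver supports:)

// check_devices.cu -- hypothetical sketch, not part of Caffe: prints
// what the process sees once CUDA_VISIBLE_DEVICES=0 is exported.
// Build (assumed): nvcc check_devices.cu -o check_devices
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int driver_version = 0, runtime_version = 0, device_count = 0;
  cudaDriverGetVersion(&driver_version);   // e.g. 5050 = driver supports CUDA 5.5
  cudaRuntimeGetVersion(&runtime_version); // runtime the binary was built against
  cudaGetDeviceCount(&device_count);       // should be 1 with CUDA_VISIBLE_DEVICES=0
  printf("driver supports CUDA %d, runtime %d, visible devices: %d\n",
         driver_version, runtime_version, device_count);
  for (int i = 0; i < device_count; ++i) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    printf("device %d: %s\n", i, prop.name);  // should be the K20
  }
  return 0;
}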
When I run on machine A with driver 319.37, I get the following
output (with a few lines around the errors for context):
...
I0811 14:35:01.310225 6389 net.cpp:74] Creating Layer conv1
I0811 14:35:01.310230 6389 net.cpp:84] conv1 <- data
I0811 14:35:01.310241 6389 net.cpp:110] conv1 -> conv1
E0811 14:35:01.379361 6390 common.cpp:28] Cannot create Cublas
handle. Cublas won't be available.
E0811 14:35:01.379375 6390 common.cpp:29] Error is: 1
E0811 14:35:01.380518 6390 common.cpp:36] Cannot create Curand
generator. Curand won't be available.
E0811 14:35:01.380601 6390 common.cpp:37] Error is: 203
E0811 14:35:01.380612 6390 common.cpp:38] Error is: 101
I0811 14:35:01.711769 6389 net.cpp:125] Top shape: 256 96 55 55 (74342400)
I0811 14:35:01.711781 6389 net.cpp:151] conv1 needs backward computation.
I0811 14:35:01.711789 6389 net.cpp:74] Creating Layer relu1
...
<Training proceeds, taking 35 seconds per 20 iterations, as expected>
(Note: I added the "Error is: " logging lines in common.cpp like so:
https://gist.github.com/yosinski/dfd0c0c19258003e40ea#file-common-cpp-L9)
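In essence the modification just logs the status codes returned by
the cuBLAS/cuRAND init calls in Caffe::Caffe() in
src/caffe/common.cpp, roughly like the sketch below; the gist has the
exact lines.

// Sketch of the added logging inside Caffe::Caffe() in
// src/caffe/common.cpp (see the gist above for the exact lines;
// names follow Caffe's existing members).
cublasStatus_t cublas_status = cublasCreate(&cublas_handle_);
if (cublas_status != CUBLAS_STATUS_SUCCESS) {
  LOG(ERROR) << "Cannot create Cublas handle. Cublas won't be available.";
  LOG(ERROR) << "Error is: " << cublas_status;      // the "Error is: 1" line
}
curandStatus_t create_status =
    curandCreateGenerator(&curand_generator_, CURAND_RNG_PSEUDO_DEFAULT);
curandStatus_t seed_status =
    curandSetPseudoRandomGeneratorSeed(curand_generator_, cluster_seedgen());
if (create_status != CURAND_STATUS_SUCCESS ||
    seed_status != CURAND_STATUS_SUCCESS) {
  LOG(ERROR) << "Cannot create Curand generator. Curand won't be available.";
  LOG(ERROR) << "Error is: " << create_status;      // the "Error is: 203" line
  LOG(ERROR) << "Error is: " << seed_status;        // the "Error is: 101" line
}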
But, despite these errors, training of the network proceeds on the
GPU. Specifically, I observe ~100% GPU utilization using nvidia-smi
and training takes 35 seconds per 20 iterations, as expected.
The surprising part is that when I run on machine B with driver
331.89, I do not get any Cublas errors, but training is very slow:
I0811 14:37:21.392355 5748 net.cpp:74] Creating Layer conv1
I0811 14:37:21.392360 5748 net.cpp:84] conv1 <- data
I0811 14:37:21.392382 5748 net.cpp:110] conv1 -> conv1
<no errors>
I0811 14:37:21.726902 5748 net.cpp:125] Top shape: 256 96 55 55 (74342400)
I0811 14:37:21.726914 5748 net.cpp:151] conv1 needs backward computation.
I0811 14:37:21.726927 5748 net.cpp:74] Creating Layer relu1
...
<Training proceeds, taking 146 seconds per 20 iterations, much slower
than expected>
Training is using the GPU at least somewhat, since I observe 10-15%
GPU utilization in nvidia-smi. But training is about 4.2x slower (146
seconds vs. 35 per 20 iterations).
Why is training so slow? Why is the GPU not fully utilized? And why
is slow training associated with the cuBLAS init succeeding, and fast
training with it failing?
Any ideas or other tests I could run to debug this (for example,
something like the standalone cuBLAS test sketched below)? Or
suggestions to just fix the slowness outright?
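For example, would a standalone timing test along these lines, a
hypothetical sketch with nothing Caffe-specific in it (just create a
cuBLAS handle and time a few large SGEMMs), be a reasonable way to
separate the driver/cuBLAS behavior from Caffe itself? I could run it
under both 319.37 and 331.89 and compare.

// cublas_smoke_test.cu -- hypothetical standalone test, not part of
// Caffe: creates a cuBLAS handle and times 20 large SGEMMs on the
// visible GPU.
// Build (assumed): nvcc cublas_smoke_test.cu -o cublas_smoke_test -lcublas
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
  cublasHandle_t handle;
  cublasStatus_t status = cublasCreate(&handle);
  printf("cublasCreate returned %d (0 == CUBLAS_STATUS_SUCCESS)\n",
         (int)status);

  const int n = 4096;
  const size_t bytes = size_t(n) * n * sizeof(float);
  std::vector<float> host(size_t(n) * n, 1.0f);
  float *a = 0, *b = 0, *c = 0;
  cudaMalloc(reinterpret_cast<void**>(&a), bytes);
  cudaMalloc(reinterpret_cast<void**>(&b), bytes);
  cudaMalloc(reinterpret_cast<void**>(&c), bytes);
  cudaMemcpy(a, &host[0], bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(b, &host[0], bytes, cudaMemcpyHostToDevice);

  const float alpha = 1.0f, beta = 0.0f;
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  for (int i = 0; i < 20; ++i) {
    // C = alpha * A * B + beta * C, all n x n, column-major
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, a, n, b, n, &beta, c, n);
  }
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("20 SGEMMs of size %d: %.1f ms total\n", n, ms);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(a);
  cudaFree(b);
  cudaFree(c);
  cublasDestroy(handle);
  return 0;
}

If the SGEMM times differed by a similar factor between the two
drivers, that would point at the driver or cuBLAS rather than at
anything in train_net.bin.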
Thanks for any help,
jason
---------------------------
Jason Yosinski, Cornell Computer Science Ph.D. student
http://yosinski.com/ +1.719.440.1357