Cublas / initialization trouble depending on Nvidia driver versions


Jason Yosinski

Aug 11, 2014, 5:24:10 PM
to caffe...@googlegroups.com
Hi all,

I'm observing some curious differences in behavior between Nvidia
drivers 319.37 and 331.89. Here's the setup:

I have two machines, each with two K20s. Machine A has Nvidia driver
version 319.37 installed, and machine B has driver version 331.89. I
compile a train_net.bin binary using CUDA 5.5 and copy it to each
machine, along with identical leveldb and *.prototxt files. Because
each machine has two GPUs available, I first export
CUDA_VISIBLE_DEVICES=0 before running train_net.bin.



When I run on machine A with 319.37, I get the following output (with
some lines around errors for context):

...
I0811 14:35:01.310225 6389 net.cpp:74] Creating Layer conv1
I0811 14:35:01.310230 6389 net.cpp:84] conv1 <- data
I0811 14:35:01.310241 6389 net.cpp:110] conv1 -> conv1
E0811 14:35:01.379361 6390 common.cpp:28] Cannot create Cublas
handle. Cublas won't be available.
E0811 14:35:01.379375 6390 common.cpp:29] Error is: 1
E0811 14:35:01.380518 6390 common.cpp:36] Cannot create Curand
generator. Curand won't be available.
E0811 14:35:01.380601 6390 common.cpp:37] Error is: 203
E0811 14:35:01.380612 6390 common.cpp:38] Error is: 101
I0811 14:35:01.711769 6389 net.cpp:125] Top shape: 256 96 55 55 (74342400)
I0811 14:35:01.711781 6389 net.cpp:151] conv1 needs backward computation.
I0811 14:35:01.711789 6389 net.cpp:74] Creating Layer relu1
...
<Training proceeds, taking 35 seconds per 20 iterations, as expected>

(Note: I added the "Error is: " logging lines in common.cpp like so:
https://gist.github.com/yosinski/dfd0c0c19258003e40ea#file-common-cpp-L9)
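
For reference, the added logging just prints the raw status codes
returned by the cuBLAS and cuRAND create calls, roughly like this (a
minimal sketch of the pattern, not the actual gist or Caffe's
common.cpp; InitCudaHandles is a made-up name):

#include <cublas_v2.h>
#include <curand.h>
#include <glog/logging.h>

// Sketch: log the raw status codes when handle creation fails.
void InitCudaHandles() {
  cublasHandle_t cublas_handle;
  cublasStatus_t blas_status = cublasCreate(&cublas_handle);
  if (blas_status != CUBLAS_STATUS_SUCCESS) {
    LOG(ERROR) << "Cannot create Cublas handle. Cublas won't be available.";
    LOG(ERROR) << "Error is: " << blas_status;  // e.g. 1 == CUBLAS_STATUS_NOT_INITIALIZED
  }

  curandGenerator_t curand_generator;
  curandStatus_t rand_status =
      curandCreateGenerator(&curand_generator, CURAND_RNG_PSEUDO_DEFAULT);
  if (rand_status != CURAND_STATUS_SUCCESS) {
    LOG(ERROR) << "Cannot create Curand generator. Curand won't be available.";
    LOG(ERROR) << "Error is: " << rand_status;
  }
}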

But, despite these errors, training of the network proceeds on the
GPU. Specifically, I observe ~100% GPU utilization using nvidia-smi
and training takes 35 seconds per 20 iterations, as expected.



The surprising part is that when I run on machine B with driver
331.89, I do not get any Cublas errors, but training is very slow:

I0811 14:37:21.392355 5748 net.cpp:74] Creating Layer conv1
I0811 14:37:21.392360 5748 net.cpp:84] conv1 <- data
I0811 14:37:21.392382 5748 net.cpp:110] conv1 -> conv1
<no errors>
I0811 14:37:21.726902 5748 net.cpp:125] Top shape: 256 96 55 55 (74342400)
I0811 14:37:21.726914 5748 net.cpp:151] conv1 needs backward computation.
I0811 14:37:21.726927 5748 net.cpp:74] Creating Layer relu1
...
<Training proceeds, taking 146 seconds per 20 iterations, much slower
than expected>

Training is using the GPU at least somewhat, because I observe 10%-15%
GPU utilization using nvidia-smi. But training is 4.1x slower (146
seconds vs 35).



Why is training so slow? Why is the GPU not fully utilized? And why is
slow training associated with cuBLAS init succeeding, while fast
training goes with it failing?

Any ideas for other tests I could run to debug this slowness, or
suggestions to just fix it outright?
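
For concreteness, a standalone check along the following lines
(hypothetical, independent of Caffe) is the kind of test I have in
mind: create a cuBLAS handle and run a small SAXPY, so a driver-level
problem would show up outside of train_net.bin.

// cublas_check.cu -- hypothetical standalone cuBLAS sanity check.
// Build with something like: nvcc cublas_check.cu -lcublas -o cublas_check
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
  cublasHandle_t handle;
  cublasStatus_t status = cublasCreate(&handle);
  if (status != CUBLAS_STATUS_SUCCESS) {
    std::printf("cublasCreate failed with status %d\n", status);
    return 1;
  }

  // y = alpha * x + y on a million floats, just to exercise the handle.
  const int n = 1 << 20;
  std::vector<float> host_x(n, 1.0f), host_y(n, 2.0f);
  float* x = 0;
  float* y = 0;
  cudaMalloc((void**)&x, n * sizeof(float));
  cudaMalloc((void**)&y, n * sizeof(float));
  cudaMemcpy(x, &host_x[0], n * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(y, &host_y[0], n * sizeof(float), cudaMemcpyHostToDevice);

  const float alpha = 3.0f;
  status = cublasSaxpy(handle, n, &alpha, x, 1, y, 1);
  cudaDeviceSynchronize();
  std::printf("cublasSaxpy status: %d\n", status);

  cudaFree(x);
  cudaFree(y);
  cublasDestroy(handle);
  return status == CUBLAS_STATUS_SUCCESS ? 0 : 1;
}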



Thanks for any help,
jason


---------------------------
Jason Yosinski, Cornell Computer Science Ph.D. student
http://yosinski.com/ +1.719.440.1357

Evan Shelhamer

Aug 11, 2014, 6:10:27 PM
to Jason Yosinski, caffe...@googlegroups.com
There is a known driver issue with the 331.* series. NVIDIA identified this as a driver bug due to the introduction of unified virtual memory that is fixed in the 340.* series driver. This notice is included in the latest Caffe installation documentation and detailed in this issue: https://github.com/BVLC/caffe/issues/687

I don't know why your cuBLAS initialization is failing with 319.* -- I haven't observed that on any hardware I've tried. You could try upgrading to 340.*.

Good luck,

Evan Shelhamer




Jason Yosinski

Aug 11, 2014, 7:38:48 PM
to Evan Shelhamer, caffe...@googlegroups.com
> There is a known driver issue with the 331.* series. NVIDIA identified this
> as a driver bug due to the introduction of unified virtual memory that is
> fixed in the 340.* series driver.

Got it; thanks for the info. I'll upgrade to 340.*

> I don't know why your cuBLAS initialization is failing with 319.* -- I
> haven't observed that on any hardware I've tried. You could try upgrading to
> 340.*.

I'll try 340.* first and repost if that doesn't fix it.

Leaving aside why the cuBLAS init is failing, one small confusing
issue remains: the Caffe constructor logs the error message about
cuBLAS init failing, but training then proceeds to run on the GPU
anyway.

This comment
https://github.com/BVLC/caffe/blob/master/src/caffe/common.cpp#L88-L89
seems to indicate that this shouldn't be the case: if init fails, it's
assumed that CPU-only mode will be used. That made the behavior above
confusing, and I wondered whether it hints at a bug in the way the
init check is performed.
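
Just to illustrate what I naively expected (a hypothetical sketch, not
a proposed patch; RequireCublasForGpuMode is a made-up name), I would
have guessed something closer to failing hard when GPU mode is
explicitly requested but the handle can't be created:

#include <cublas_v2.h>
#include <glog/logging.h>

// Hypothetical stricter check: abort when GPU mode was explicitly
// requested but the cuBLAS handle could not be created, instead of
// quietly assuming CPU-only mode.
void RequireCublasForGpuMode(bool gpu_mode_requested) {
  cublasHandle_t handle;
  cublasStatus_t status = cublasCreate(&handle);
  if (status != CUBLAS_STATUS_SUCCESS) {
    if (gpu_mode_requested) {
      LOG(FATAL) << "GPU mode requested but cublasCreate failed with status "
                 << status;
    }
    LOG(WARNING) << "cublasCreate failed (status " << status
                 << "); assuming CPU-only mode.";
  } else {
    cublasDestroy(handle);
  }
}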

Thanks,
jason


---------------------------
Jason Yosinski, Cornell Computer Science Ph.D. student
http://yosinski.com/ +1.719.440.1357


Jason Yosinski

Aug 12, 2014, 1:14:19 AM
to Evan Shelhamer, caffe...@googlegroups.com
Hi Evan,

I have one more question:

> There is a known driver issue with the 331.* series. NVIDIA identified this
> as a driver bug due to the introduction of unified virtual memory that is
> fixed in the 340.* series driver.

How did you eventually conclude this? Do you happen to have a
reference for this handy? I googled around but couldn't find anywhere
that Nvidia addresses this bug directly. I ask because those managing
our cluster are hesitant to upgrade the driver given only
Caffe-centric references to the bug.

Thanks,
jason


---------------------------
Jason Yosinski, Cornell Computer Science Ph.D. student
http://yosinski.com/ +1.719.440.1357


Yangqing Jia

Aug 12, 2014, 1:33:42 AM
to Jason Yosinski, Evan Shelhamer, caffe...@googlegroups.com
We actually only have email threads with Nvidia people and don't have
public references :) Caffe's issue 687 talks about this:

https://github.com/BVLC/caffe/issues/687

Specifically, what we got from Nvidia was:

"A bug was introduced with the new UVM capability that CAFFE hit upon.
Thus with the older (319.82) everything was fine, but the newer
drivers (331.XX) there was a performance regression due to the bug
interaction with UVM. ... Just last week a public driver that
contains the fix was released (340.24). Any driver from 340.19 or
newer (on Linux) should have the fix."

Hope this helps. I understand that sysadmins are sometimes quite
conservative, having experienced that in the past too :)

Yangqing



Jason Yosinski

Aug 12, 2014, 1:12:04 PM
to Yangqing Jia, Evan Shelhamer, caffe...@googlegroups.com
Ok, thanks for the info, Yangqing. I'll pass it along.

jason


---------------------------
Jason Yosinski, Cornell Computer Science Ph.D. student
http://yosinski.com/ +1.719.440.1357


KING CHUNG HO

Oct 2, 2015, 3:01:17 AM
to Caffe Users
I installed 340.93 on Ubuntu 14.04, with a Tesla K40.

I'm getting the same errors:

E1001 23:58:21.889209 19147 common.cpp:93] Cannot create Cublas handle. Cublas won't be available.
E1001 23:58:21.895387 19147 common.cpp:100] Cannot create Curand generator. Curand won't be available.

Can anyone help?