TensorFlow on GPU working inefficiently


leoni...@gmail.com

Jun 22, 2016, 2:04:05 PM
to Discuss
Hi guys,

I have been using TensorFlow for some research I'm doing on Boltzmann Machines, and while running a program I noticed that the average GPU utilization is very low (around 10-20%).
While I am relatively new to TensorFlow, I have a fairly extensive background in efficient C++ programming, and I suspect my program is spending much of its time on communication between the CPU and the GPU, which is costly.

I would really like to know how I can tell whether a function I am using in TensorFlow runs directly on the GPU or requires some computation on the CPU.
More generally, are there any tips for better GPU utilization, or for avoiding data being transferred unnecessarily between the CPU and the GPU?

Thanks a lot!

Yaroslav Bulatov

Jun 22, 2016, 2:32:23 PM
to leoni...@gmail.com, Discuss
1. Run your session with log device placement
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

That'll show whether you have errant ops placed on the CPU. Those can cause large transfers: for instance, if your dataset is a tensor on the CPU and a `Rank` op on it is placed on the GPU, the entire dataset will be copied to the GPU on every invocation.
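
As a rough illustration of what that log surfaces (a minimal toy sketch; the graph and device strings here are made up, not from the original program):

  import tensorflow as tf

  # Toy graph: a dataset-like constant pinned to the CPU, and a matmul forced onto the GPU.
  with tf.device('/cpu:0'):
      data = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # lives in host memory
  with tf.device('/gpu:0'):
      result = tf.matmul(data, data)                 # forces a host->device copy of `data`

  sess = tf.Session(config=tf.ConfigProto(log_device_placement=True,
                                          allow_soft_placement=True))
  print(sess.run(result))  # the placement log lists which device each op was assigned to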

Also, there is a small number of ops that are placed on the GPU but whose data actually resides in host (CPU) memory. That's a "host memory" hack done for efficiency, to avoid crossing the logical device boundary for certain frequently used tiny ops, such as additions of shapes. AFAIK this only happens for integer ops like integer Add [here](https://github.com/tensorflow/tensorflow/blob/d42facc3cc9611f0c9722c81551a7404a0bd3f6b/tensorflow/core/kernels/cwise_op_add.cc#L30), so if you avoid large integer tensors you should be OK.

2. Don't use feed_dict. Anything that goes into feed_dict is in Python-land, hence on the CPU, and will require a copy to the GPU. If you have some Python values you need to reuse, save them into a TensorFlow variable and use the variable's value later. For instance, here's a snippet that saves an MNIST batch_size x 10 labels matrix into a variable:

  # One-time feed: copy the labels into a TF variable so later run() calls read them
  # on-device instead of re-feeding them through feed_dict each step.
  # (labels_onehot here is an op built from labels_placeholder.)
  targets = tf.Variable(tf.zeros_initializer((batchSize, 10), dtype=dtype))
  targets_init = targets.assign(labels_onehot)
  sess.run(targets_init, feed_dict={labels_placeholder: train_labels[:batchSize]})


3. Use session tracing to save your session timeline and visualize it; Paul Bar has a good summary here: https://github.com/tensorflow/tensorflow/issues/1824#issuecomment-225754659
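
For convenience, a minimal tracing sketch along the lines of that comment (TF 1.x-style API; the toy graph and output filename are made up):

  import tensorflow as tf
  from tensorflow.python.client import timeline

  # Tiny stand-in graph just to have something to trace.
  a = tf.random_normal([1000, 1000])
  b = tf.random_normal([1000, 1000])
  c = tf.matmul(a, b)

  run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
  run_metadata = tf.RunMetadata()

  with tf.Session() as sess:
      sess.run(c, options=run_options, run_metadata=run_metadata)

  # Dump a Chrome trace; open it in chrome://tracing to see per-op timing and device placement.
  tl = timeline.Timeline(run_metadata.step_stats)
  with open('timeline.json', 'w') as f:
      f.write(tl.generate_chrome_trace_format())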




Cristian Garcia

Jun 22, 2016, 5:32:31 PM
to Yaroslav Bulatov, leoni...@gmail.com, Discuss
Yaroslav Bulatov,

I was a little shocked by the advice not to use "feed_dict", since it's in every example out there. Is there an "optimization manual" one can consult?

Thanks!

Yaroslav Bulatov

Jun 22, 2016, 6:05:23 PM
to Cristian Garcia, leoni...@gmail.com, Discuss
Feed dict is OK if you don't mind copying your data from CPU to GPU on each run call. There are actually two copies: a single-threaded memcpy to copy the numpy array into TensorFlow (done to simplify memory ownership), and then another memory transfer to place it on the GPU. So if you are dealing with a couple of MB of data and a hundred session.run calls, this overhead shouldn't be significant.

Most of the official examples are unoptimized in order to keep things simple. For instance, I recall an issue with mnist_fully_connected_preloaded.py where it would copy the entire dataset from GPU to CPU on each iteration. I'm not aware of any optimization manual.

Yuxin Wu

Jun 22, 2016, 8:17:36 PM
to Discuss, leoni...@gmail.com
I'm also a bit surprised to learn that feed_dict copies the data twice. But most of the applications I work on need data to be generated on the fly in Python, so I guess I still need feed_dict?

Also, I usually have one thread feed data into a FIFOQueue through feed_dict, and another training thread compute on the dequeued data. In that case, is it correct to say that the numpy->TensorFlow copy happens in one thread and the CPU->GPU copy happens in another, so the copy overhead won't be as significant?
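
(For reference, a minimal sketch of that producer/consumer pattern; the shapes and the stand-in "model" are made up:)

  import threading
  import numpy as np
  import tensorflow as tf

  # Feeding thread pushes Python-generated batches into a queue; the training
  # thread dequeues them, so the numpy->TF copy and the compute can overlap.
  batch_ph = tf.placeholder(tf.float32, shape=[128, 784])
  queue = tf.FIFOQueue(capacity=10, dtypes=[tf.float32], shapes=[[128, 784]])
  enqueue_op = queue.enqueue(batch_ph)
  batch = queue.dequeue()          # the training graph is built on top of `batch`
  loss = tf.reduce_sum(batch)      # stand-in for a real model

  sess = tf.Session()

  def feeder():
      for _ in range(100):
          data = np.random.rand(128, 784).astype(np.float32)  # "generated on the fly"
          sess.run(enqueue_op, feed_dict={batch_ph: data})

  t = threading.Thread(target=feeder)
  t.start()
  for _ in range(100):
      sess.run(loss)               # no feed_dict on the training call
  t.join()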

zhao...@gmail.com

Jun 22, 2016, 10:46:40 PM
to Discuss, yaros...@gmail.com, leoni...@gmail.com, cris...@aristadev.com
 
I guess by saying "don't use feed_dict" he actually means "only use feed_dict if necessary". When training a big neural network on something like ImageNet, you almost certainly cannot hold a dataset that big on the GPU, so you have to feed data to the computational graph batch by batch.

But when training on a small dataset like CIFAR-10 or MNIST, most GPUs have enough memory to hold the whole dataset, and in that case feed_dict is unnecessary.
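
For that small-dataset case, a minimal TF 1.x-style sketch of preloading everything once and slicing minibatches on-device (the dataset, shapes, and names here are made up):

  import numpy as np
  import tensorflow as tf

  # Hypothetical small dataset that fits in GPU memory.
  all_images = np.random.rand(60000, 784).astype(np.float32)

  images = tf.Variable(all_images, trainable=False)            # placed on the GPU when available
  offset = tf.placeholder(tf.int32, shape=[])                  # only a scalar is fed each step
  batch = tf.slice(images, tf.stack([offset, 0]), [128, 784])  # minibatch sliced on-device

  sess = tf.Session()
  sess.run(tf.global_variables_initializer())
  first_batch = sess.run(batch, feed_dict={offset: 0})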

Yaroslav Bulatov

Jun 23, 2016, 12:53:36 PM
to zhao...@gmail.com, Discuss, Leonid Geller, cris...@aristadev.com
For a big neural network where data has to stay in main memory or on disk, the efficient solution is to feed the data in using TensorFlow input pipelines rather than feed_dict.
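
For reference, a minimal TF 1.x-style input-pipeline sketch (the TFRecord filenames and feature names here are made up):

  import tensorflow as tf

  filenames = ['train-00000.tfrecord', 'train-00001.tfrecord']
  filename_queue = tf.train.string_input_producer(filenames)

  reader = tf.TFRecordReader()
  _, serialized = reader.read(filename_queue)
  features = tf.parse_single_example(serialized, features={
      'image': tf.FixedLenFeature([784], tf.float32),
      'label': tf.FixedLenFeature([], tf.int64),
  })

  # Background queue runners batch examples on the CPU; the training op just dequeues.
  images, labels = tf.train.shuffle_batch(
      [features['image'], features['label']],
      batch_size=128, capacity=2000, min_after_dequeue=1000)

  sess = tf.Session()
  coord = tf.train.Coordinator()
  threads = tf.train.start_queue_runners(sess=sess, coord=coord)
  # ... sess.run(train_op) in a loop, with no feed_dict ...
  coord.request_stop()
  coord.join(threads)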

I think a bigger problem with feed_dict is Python's lack of concurrency. If you manipulate your data in Python-land, you have to work hard to avoid getting stuck on a single thread because of the GIL.

leoni...@gmail.com

Jul 7, 2016, 10:10:23 AM
to Discuss, leoni...@gmail.com
Finally found some time to investigate my problem. I used the method you suggested (https://github.com/tensorflow/tensorflow/issues/1824#issuecomment-225754659), which turned out to be very useful, and I would recommend it to anyone having performance issues with TensorFlow!
My problem turned out to be that I used the MatMul operation with float64 variables. The profiling I did with the method above showed clearly that these ops were running on the CPU. After some inquiry, I learned that the version installed on the server I'm running on (0.8.0) does not have a GPU kernel for this operation, which means it must run on the CPU. Matrix multiplication is rather heavy for a CPU, and on top of that there is the communication overhead between the devices.
The solution was fairly easy: simple casts to float32 did the trick, and now everything runs on the GPU, loading it to a constant 99%.
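
For anyone hitting the same thing, the fix amounts to something like this (a minimal sketch; the shapes here are made up):

  import tensorflow as tf

  a64 = tf.random_normal([512, 512], dtype=tf.float64)
  b64 = tf.random_normal([512, 512], dtype=tf.float64)

  # float64 MatMul had no GPU kernel in that TF version, so cast the operands down first.
  c = tf.matmul(tf.cast(a64, tf.float32), tf.cast(b64, tf.float32))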

Thanks a lot for your kind help!!

sau...@gmail.com

Oct 27, 2017, 8:56:59 AM
to Discuss, leoni...@gmail.com
Wow, what a useful thread this is!
Thanks a lot! I've had the same issue with my GPU load (max ~30-40%, and very fragmented, while the CPU load was also higher than expected).

Three important things helped a lot:
1. replacing feed_dict with a TensorFlow variable
2. using float32 instead of float64 (which sped everything up anyway, without losing accuracy)
3. session tracing, which showed me other things to improve

Thanks again to all of you!