2 GPUs not as fast as 1 GPU!?


Steven

Mar 10, 2017, 1:30:43 PM
to Caffe Users
I'm trying to measure the speed-up from using 2 GPUs with Caffe instead of 1. Currently, I'm observing that 1 GPU is faster, which confuses me.
My observations were made with the 1.0.0-rc3 version of Caffe.

The GPUs were configured like this:

me@ubuntu:~$ nvidia-smi topo -m
       GPU0     GPU1    GPU2    GPU3    CPU Affinity
GPU0     X       PIX    PHB     PHB      0-11
GPU1    PIX      X      PHB     PHB      0-11
GPU2    PHB     PHB       X     PIX      0-11
GPU3    PHB     PHB     PIX      X       0-11

Legend:
  X   = Self
  SOC = PCI path traverses a socket-level link (e.g. QPI)
  PHB = PCI path traverses a host bridge
  PXB = PCI path traverses multiple internal switches
  PIX = PCI path traverses an internal switch
  NV# = Path traverses # NVLinks

These are the processes they were handling before running Caffe:

me@ubuntu:~$ nvidia-smi pmon
      # gpu     pid  type    sm   mem   enc   dec   command
      # Idx       #   C/G     %     %     %     %   name
         0      1679   G      0     0     0     0   X             
         0      2740   G      0     1     0     0   compiz        
         0      3600   G      0     0     0     0   firefox       
         1       -     -      -     -     -     -   -             
         2       -     -      -     -     -     -   -             
         3      3328   C     0     0     0     0   python  


I compared 1 vs. 2 GPUs using the CIFAR10 Quick example provided with Caffe, which I ran with this script:

#!/usr/bin/env sh
set -e

TOOLS=./build/tools

$TOOLS/caffe train \
  --solver=examples/cifar10/cifar10_quick_solver.prototxt \
  --gpu=1,2 $@ >> ~/Desktop/caffe_2GPUa_out.txt 2>&1

with the slight variations of:

 --gpu=2,3

and

 --gpu=2

I would've expected the fastest result to be obtained by --gpu=2,3, followed by --gpu=1,2, then by --gpu=2. Instead, I saw the exact opposite.

Here is what I saw.

For --gpu=2:

I0227 14:41:26.948098  7712 caffe.cpp:251] Starting Optimization
I0227 14:42:04.841394  7712 caffe.cpp:254] Optimization Done.

For --gpu=1,2:

I0227 15:22:56.675775  7946 parallel.cpp:425] Starting Optimization
I0227 15:23:39.097970  7946 caffe.cpp:254] Optimization Done.

For --gpu=2,3:

I0227 14:43:13.466243  7742 parallel.cpp:425] Starting Optimization
I0227 14:43:56.215469  7742 caffe.cpp:254] Optimization Done.

So, my resulting training times are:

 --gpu=2     35 sec
 --gpu=1,2   42 sec
 --gpu=2,3   43 sec

I had expected that using 2 GPUs would give me a faster running time. What am I failing to see?

Patrick McNeil

Mar 10, 2017, 3:48:13 PM
to Caffe Users
As I understand it, when you use more than one GPU, the effective batch size for training increases (so with 2 GPUs you are processing twice as many images over the same number of iterations as with a single GPU). Could it be that you are running into a different limit (disk I/O, for example)?

Patrick

Steven

Mar 11, 2017, 2:04:15 PM
to Caffe Users
On the one hand, I suppose I must be running into some sort of limit. But if I am, why do I hit it on what (I think) is a fairly vanilla test case? If I don't see a speed-up here, where will I see one? Do I need to worry about my machine's hardware (I'd be surprised), do I need some other library installed (I'd be surprised), or is there something else going on that I don't understand (I think so)?

Patrick McNeil

Mar 14, 2017, 10:15:57 AM
to Caffe Users
Since you are using the defaults, your batch_size is 100, the maximum number of iterations is 4,000, and the CIFAR-10 dataset contains 60,000 images.

Using a single GPU, you process 100 images per iteration × 4,000 iterations = 400,000 images in 35 seconds (~11,429 images/sec).
Using two GPUs, you process 200 images per iteration × 4,000 iterations = 800,000 images in 42 seconds (~19,048 images/sec).

So, in the two-GPU case, you process twice as many images for only a 20% increase in time: about 67% more images per second, or 1.67× the single-GPU throughput.
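The arithmetic can be double-checked with a quick calculation, using only the batch size, iteration count, and wall-clock timings reported earlier in this thread:

```python
# Throughput check from the numbers reported in this thread.
batch_size = 100        # default in the CIFAR-10 quick example
iterations = 4000       # default maximum iterations

# One GPU processes one batch per iteration.
single_gpu_images = batch_size * iterations          # 400,000 images
single_gpu_rate = single_gpu_images / 35             # 35-second run

# With two GPUs, each processes a full batch per iteration.
dual_gpu_images = 2 * batch_size * iterations        # 800,000 images
dual_gpu_rate = dual_gpu_images / 42                 # 42-second run

print(round(single_gpu_rate))                     # ~11429 images/sec
print(round(dual_gpu_rate))                       # ~19048 images/sec
print(round(dual_gpu_rate / single_gpu_rate, 2))  # 1.67x throughput
```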

If you want a like-for-like comparison, change the batch_size parameter to 50 and then run on both GPUs (so the same total number of images is processed). It will likely not be a 2× improvement in performance, but it should show a lower overall time.
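That change goes in the TRAIN-phase data layer of the network definition, not in the solver. A sketch of what it might look like, assuming the stock examples/cifar10/cifar10_quick_train_test.prototxt layout (layer names and paths may differ slightly in your copy):

```
layer {
  name: "cifar"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  data_param {
    source: "examples/cifar10/cifar10_train_lmdb"
    # Halved from the default 100, so two GPUs together still
    # consume 100 images per iteration.
    batch_size: 50
    backend: LMDB
  }
}
```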

I think part of the reason the results look odd is that there is some additional overhead required to set up each GPU, and this training run is very short. If you train for a larger number of iterations, or train a larger network (GoogLeNet or AlexNet), you should see the improvement more dramatically.

Patrick