Caffe GPU Docker containers run slowly when classifying 1 image with 3 x P100


Hideaki Heng Lee

Feb 28, 2018, 2:54:51 PM
to Caffe Users

Hi everyone,



I am running into an issue that I do not know how to solve. I am trying to classify a single image with the Python code attached below. I am using NVIDIA Caffe GPU Docker containers and have tried all of the images listed below, even the official one from NGC, but classifying one image still takes more than 1 second. The model is an AlexNet model that I trained with NVIDIA DIGITS. When I converted it to a CoreML model, it ran quite fast, even on mobile.



However, when I run it from Python, it takes at least 1.9 seconds in the bvlc/caffe:gpu container.


All the other containers take more than 3 seconds, or even 5 seconds. I have already turned on GPU mode.


I am wondering whether there is a default initial delay when classifying a single image. Is there any way to improve the performance, given that I am using 3 P100s? I have also tried running just 1 Docker container.


Or maybe something is wrong with my CUDA configuration on the host?


I desperately need help. Any advice is greatly appreciated.





Here are the specs:

Docker containers I have tried:
nvidia/caffe:latest
yangcha/caffe-gpu-conda:latest
nvcr.io/nvidia/caffe:18.01-py2
bvlc/caffe:gpu



nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61



nvidia-smi
Wed Feb 28 19:06:44 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30 Driver Version: 390.30 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:00:05.0 Off | 0 |
| N/A 25C P0 28W / 250W | 2289MiB / 12198MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-PCIE... Off | 00000000:00:06.0 Off | 0 |
| N/A 27C P0 29W / 250W | 2289MiB / 12198MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P100-PCIE... Off | 00000000:00:07.0 Off | 0 |
| N/A 24C P0 29W / 250W | 2289MiB / 12198MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+



docker version
Client:
Version: 17.12.0-ce
API version: 1.35
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:11:19 2017
OS/Arch: linux/amd64

Server:
Engine:
Version: 17.12.0-ce
API version: 1.35 (minimum version 1.12)
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:09:53 2017
OS/Arch: linux/amd64
Experimental: false



dpkg -l '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-==============================================================-====================================-====================================-=================================================================================================================================
ii libnvidia-container-tools 1.0.0alpha.3-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.0.0alpha.3-1 amd64 NVIDIA container runtime library
rc nvidia-384 384.111-0ubuntu0.16.04.1 amd64 NVIDIA binary driver - version 384.111
ii nvidia-390 390.30-0ubuntu1 amd64 NVIDIA binary driver - version 390.30
ii nvidia-390-dev 390.30-0ubuntu1 amd64 NVIDIA binary Xorg driver development files
ii nvidia-container-runtime 1.1.1+docker17.12.0-1 amd64 NVIDIA container runtime
pi nvidia-cuda-dev 7.5.18-0ubuntu1 amd64 NVIDIA CUDA development files
un nvidia-cuda-doc (no description available)
un nvidia-cuda-toolkit (no description available)
un nvidia-current (no description available)
un nvidia-docker (no description available)
ii nvidia-docker2 2.0.2+docker17.12.0-1 all nvidia-docker CLI wrapper
un nvidia-driver-binary (no description available)
un nvidia-legacy-340xx-vdpau-driver (no description available)
un nvidia-libopencl1 (no description available)
un nvidia-libopencl1-384 (no description available)
un nvidia-libopencl1-390 (no description available)
un nvidia-libopencl1-dev (no description available)
ii nvidia-modprobe 390.30-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files
ii nvidia-opencl-dev:amd64 7.5.18-0ubuntu1 amd64 NVIDIA OpenCL development files
un nvidia-opencl-icd (no description available)
rc nvidia-opencl-icd-384 384.111-0ubuntu0.16.04.1 amd64 NVIDIA OpenCL ICD
ii nvidia-opencl-icd-390 390.30-0ubuntu1 amd64 NVIDIA OpenCL ICD
un nvidia-persistenced (no description available)
ii nvidia-prime 0.8.2 amd64 Tools to enable NVIDIA's Prime
ii nvidia-profiler 7.5.18-0ubuntu1 amd64 NVIDIA Profiler for CUDA and OpenCL
ii nvidia-settings 390.30-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
un nvidia-settings-binary (no description available)
un nvidia-smi (no description available)
un nvidia-vdpau-driver (no description available)
ii nvidia-visual-profiler 7.5.18-0ubuntu1 amd64 NVIDIA Visual Profiler for CUDA and OpenCL

nvidia-container-cli -V
version: 1.0.0
build date: 2018-01-11T00:16+00:00
build revision: 4a618459e8ba522d834bb2b4c665847fae8ce0ad
build compiler: gcc-5 5.4.0 20160609
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections





Here is the Python code:

`import argparse
import numpy as np
import time
import os
os.environ['GLOG_minloglevel'] = '2'
import caffe
import skimage
import cv2

def classify(caffemodel, deploy_file, image_files,
             mean_file=None, labels_file=None, batch_size=None, use_gpu=True):

    # caffe.set_mode_gpu()

    caffe.set_mode_gpu()
    caffe.set_device(0)
    caffe.set_device(1)
    caffe.set_device(2)

    net = caffe.Net(deploy_file, caffemodel, caffe.TEST)
    meanData = caffe.io.load_image(mean_file, color=True)
    # print('mean shape =====>>>>>>>>>>>>', meanData.shape)

    transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
    transformer.set_transpose('data', (2, 0, 1))
    transformer.set_raw_scale('data', 255)

    transformer.set_channel_swap('data', (2, 1, 0))

    im = caffe.io.load_image(image_files, color=True)
    im3 = cv2.imread(image_files)
    im2 = skimage.transform.resize(im3, (256, 256))

    # print('img shape =====>>>>>>>>>>>>', im3.shape)
    dataaa = im2 - meanData

    net.blobs['data'].data[...] = transformer.preprocess('data', dataaa)

    out = net.forward()

    labels = np.loadtxt(labels_file, str, delimiter='\s')
    print (labels)
    prob1 = net.blobs['softmax'].data[0].flatten().argsort()[-1: -6: -1]

    print (out['softmax'][-1])
    orderProb = out['softmax'][-1].argsort()
    print ('prob1: ' , prob1)
    # order=prob1.argsort()[0]
    highestProb = out['softmax'][-1][prob1][0]
    print ('hightestProb: ' , highestProb)

    if (highestProb < 0.6):
        print ('UNKNOWN')
    else:
        print (labels[prob1][0])

    # im4 = cv2.imshow('image',im2)
    # cv2.waitKey()

if __name__ == '__main__':
    script_start_time = time.time()

    parser = argparse.ArgumentParser(description='Classification example - DIGITS')

    # Positional arguments
    parser.add_argument('caffemodel', help='Path to a .caffemodel')
    parser.add_argument('deploy_file', help='Path to the deploy file')
    parser.add_argument('image_file', help='Path[s] to an image')

    # Optional arguments
    parser.add_argument('-m', '--mean', help='Path to a mean jpg (*.jpg)')
    parser.add_argument('-l', '--labels', help='Path to a labels file')
    parser.add_argument('--batch-size', type=int)
    parser.add_argument('--nogpu', action='store_true', help="Don't use the GPU")

    args = vars(parser.parse_args())

    classify(
        args['caffemodel'],
        args['deploy_file'],
        args['image_file'],
        args['mean'],
        args['labels'],
        args['batch_size'],
        not args['nogpu'],
    )

    print ('Script took %f seconds.' % (time.time() - script_start_time,))`

Przemek D

Mar 1, 2018, 2:36:15 AM
to Caffe Users
You should set_device before calling set_mode_gpu. And as far as I know you can only set a single device, so calling this several times will not activate multiple GPUs.
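A minimal sketch of the ordering suggested above: select one device first, then enable GPU mode. The helper name `select_gpu` is hypothetical, and the Caffe module is passed in as an argument only so that the snippet can be read and run without Caffe installed; in the real script you would pass the imported `caffe` module.

```python
def select_gpu(caffe_module, device_id=0):
    """Select a single GPU, then switch Caffe to GPU mode.

    `caffe_module` is assumed to be the imported `caffe` package
    (it only needs `set_device` and `set_mode_gpu`).
    """
    # Choose the device first, as suggested in the reply above.
    caffe_module.set_device(device_id)
    # Then enable GPU mode for this process.
    caffe_module.set_mode_gpu()
    return device_id


# Hypothetical usage inside the classification script:
#   import caffe
#   select_gpu(caffe, device_id=0)
```

To use all three cards for single-image requests, one option (an assumption, not something confirmed in this thread) is to run one process per GPU, each calling this once with a different `device_id`.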

Another thing is that you measure the execution of the whole script: import caffe, setting up a transformer, loading data, preprocessing etc. - this all introduces a high overhead. Since you're only running a single image, it might be that 95% of the whole time is spent on those operations, and only 5% is the actual inference. In this case even speeding up this operation 100x will only yield a less than 5% overall script execution speedup. To measure only the inference speed, I suggest:
  • measuring time immediately before and immediately after net.forward(), to get rid of the overhead time component,
  • running net.forward() multiple times (1000 or so) between the time measurements, to get a larger sample,
  • running net.forward() once before starting the test run, to let Caffe allocate all resources (blobs are allocated as they are needed, so the first run will always take longer).
Also, check nvidia-smi during the execution of a script (i.e. when it performs the inference) to see if the card is in use.
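The three measurement steps above can be sketched roughly as follows. The function `time_forward` and its parameters are hypothetical names for illustration; it accepts any zero-argument callable, so in the original script you would pass `net.forward`.

```python
import time


def time_forward(forward, n_runs=1000, n_warmup=1):
    """Return the average seconds per call of `forward`.

    `forward` is any zero-argument callable, e.g. `net.forward`.
    """
    # Warm-up run(s): let the framework allocate its resources first,
    # so the one-time setup cost is not counted.
    for _ in range(n_warmup):
        forward()

    # Time only the repeated forward passes, nothing else.
    start = time.time()
    for _ in range(n_runs):
        forward()
    elapsed = time.time() - start

    return elapsed / n_runs


# Hypothetical usage, assuming `net` was built as in the original script:
#   avg = time_forward(net.forward, n_runs=1000)
#   print('Average forward pass: %f seconds' % avg)
```

Comparing this average against the total script time should show how much of the 1.9 seconds is import, model loading, and preprocessing rather than inference.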

Alternatively, you could use caffe time to use Caffe's own benchmark mechanism to make those tests for you, including the per-layer measurements.
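As a rough illustration, the `caffe time` invocation can be assembled like this. The helper `caffe_time_command` is a hypothetical name, and it assumes the `caffe` binary is on the PATH inside the container; the `-model`, `-gpu`, and `-iterations` flags are the ones the `caffe` command-line tool defines.

```python
def caffe_time_command(deploy_file, gpu_id=0, iterations=100, caffe_bin="caffe"):
    """Build the argument list for Caffe's built-in benchmark."""
    return [
        caffe_bin, "time",
        "-model", deploy_file,          # network definition (.prototxt)
        "-gpu", str(gpu_id),            # which GPU to benchmark on
        "-iterations", str(iterations), # number of timed iterations
    ]


# Hypothetical usage ('deploy.prototxt' is a placeholder path):
#   import subprocess
#   subprocess.check_call(caffe_time_command('deploy.prototxt', gpu_id=0))
```

Note that `caffe time` reports both forward and backward timings, so for inference only the forward numbers are relevant.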

Hideaki Heng Lee

Mar 1, 2018, 11:18:28 AM
to Caffe Users
Hi Przemek,

Thank you so much for your reply. I realized that this could be an issue caused by initialization time. My goal is to use this classification script on a web server to classify incoming images one by one. I know that on the TensorFlow side there is TensorFlow Serving to handle this kind of production image processing.

Is there any way to run the Caffe classification model as a background daemon process that handles incoming images one by one? I tried to look up resources online but have not found any yet. Can you help?


Thank you,
Heng 

Przemek D

Mar 2, 2018, 2:50:21 AM
to Caffe Users
Sure you can! I did something like this in a quick and dirty way by running one Python script which instantiated the model and waited for a specific signal (use Python's standard signal module). I then manually ran another script which wrote the path to an image into a special file and sent a signal to the first process. Upon receiving the signal, the "daemon" would read the path from this file, load the image and process it. You could probably make it neater by spawning the server process using multiprocessing and communicating the image path using pipes or sockets.
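A minimal sketch of the multiprocessing-and-pipes variant mentioned above, not the signal-based script actually used. All names here (`serve`, `load_dummy_model`, `start_server`) are hypothetical, and `load_dummy_model` is a stand-in so the example runs without Caffe; a real loader would build the `caffe.Net` and `Transformer` once and return a function that classifies an image path.

```python
import multiprocessing as mp


def serve(conn, load_model):
    """Load the model once, then answer image paths received over `conn`."""
    model = load_model()      # expensive setup, paid only at startup
    while True:
        path = conn.recv()    # blocks until a request arrives
        if path is None:      # sentinel value: shut down cleanly
            break
        conn.send(model(path))
    conn.close()


def load_dummy_model():
    """Hypothetical stand-in for the real Caffe model loader."""
    return lambda path: {"path": path, "label": "UNKNOWN"}


def start_server(load_model=load_dummy_model):
    """Start the worker process; returns (process, parent end of the pipe)."""
    parent_conn, child_conn = mp.Pipe()
    proc = mp.Process(target=serve, args=(child_conn, load_model))
    proc.start()
    return proc, parent_conn


# Hypothetical usage from the web server process
# ('cat.jpg' is a placeholder path):
#   proc, conn = start_server()
#   conn.send('cat.jpg'); result = conn.recv()
#   conn.send(None); proc.join()
```

The point of the design is that the model load and the first, slower forward pass happen once in the worker, so each incoming image only pays for preprocessing and inference.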
