OpenCL Future Support


Artyom Beilis

Aug 21, 2021, 6:05:50 PM
to Caffe Users
Hello Caffe Users,

One of the main reasons I started using Caffe in the first place was its decent out-of-the-box OpenCL support.

Unfortunately, Caffe is no longer developed. Another project that provided OpenCL support, PlaidML + Keras, was killed by Google once Keras dropped multi-backend support.

So I started working on dlprimitives:  https://github.com/artyom-beilis/dlprimitives

The project's goals:

1. Create an OpenCL alternative to cuDNN that provides decent performance
2. Create an inference library with minimal dependencies
3. Provide a micro deep-learning framework as a proof of concept
4. Long-shot goal: integrate an OpenCL backend into existing frameworks like PyTorch/TF/MXNet

It is similar in many ways to Caffe: a static graph with layers that provide forward/backward functionality, but with some major improvements: better memory-management optimization, minimal dependencies (just CBLAS and an OpenCL SDK), out-of-the-box Windows support, JSON network definitions instead of prototxt, and more.
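To give a taste of the JSON format, here is a rough sketch of what a small network definition might look like. The field and operator names below are illustrative guesses based on the description above, not the exact schema; see the repository for real examples:

{
  "inputs": [
    { "name": "data", "shape": [ 16, 3, 32, 32 ] }
  ],
  "operators": [
    { "name": "conv1", "type": "Convolution2D",
      "inputs":  [ "data" ],  "outputs": [ "conv1" ],
      "options": { "channels_out": 32, "kernel": [3, 3], "pad": 1 } },
    { "name": "relu1", "type": "Activation",
      "inputs":  [ "conv1" ], "outputs": [ "relu1" ],
      "options": { "activation": "relu" } }
  ]
}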

It is an early work in progress, but it already:

1. Outperforms both Caffe/OpenCL and Keras/PlaidML OpenCL by significant margins
2. Provides performance similar to Caffe+cuDNN on some networks like ResNet18, while using much less memory, and is comparable to TensorFlow

One of the reasons Caffe remains a very valuable project is its OpenCL support, so I thought this might be of interest to Caffe users.

Artyom Beilis

P.S.: Sorry if you think this isn't appropriate here, but the current situation with Caffe/OpenCL is one of the reasons I started the project, after contributing several fixes to Caffe.

Artyom Beilis

Aug 29, 2021, 4:34:13 PM
to Caffe Users
Small update.

I have started integrating the dlprim library as an alternative to libdnn for the OpenCL branch.
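To give an idea of the shape of the integration, here is a minimal sketch of how a Caffe convolution layer's GPU forward pass can delegate to a dlprim kernel. All dlprim-side identifiers below (dlprim::Tensor, wrap_blob, conv_, exec_ctx_) are hypothetical placeholders for illustration, not the actual dlprim API:

// Sketch only: Caffe's OpenCL ConvolutionLayer forwarding to dlprim.
// The dlprim-side names are assumed, not the real interface.
template <typename Dtype>
void ConvolutionLayer<Dtype>::Forward_gpu(
    const std::vector<Blob<Dtype>*>& bottom,
    const std::vector<Blob<Dtype>*>& top) {
  // Wrap Caffe's existing OpenCL buffers as dlprim tensors without copying;
  // wrap_blob is an assumed adapter over the underlying cl_mem handles.
  dlprim::Tensor x = wrap_blob(*bottom[0]);        // input activations
  dlprim::Tensor w = wrap_blob(*this->blobs_[0]);  // convolution weights
  dlprim::Tensor y = wrap_blob(*top[0]);           // output activations
  // Enqueue the dlprim convolution kernel on the same command queue that
  // Caffe uses, so both libraries share one in-order OpenCL stream.
  conv_->enqueue({x, w}, {y}, exec_ctx_);
}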

So far so good. The benchmarks below are in ms ("bs" is the batch size column), comparing against native Caffe+cuDNN and standalone dlprim. As a POC I have so far integrated dlprim for the convolution layer only; I'll continue with deconvolution and batch normalization, since Caffe's BN implementation has very poor performance.

network     bs  gpu     caffe/cudnn     caffe/libdnn    caffe/dlprim    dlprim
alexnet     16  gtx960  49.6            118.697         90.0764         83.942
resnet18    16  gtx960  211.372         410.811         311.831         197.387
vgg         8   gtx960  399.394         1030.95         582.993         574.845


In summary, caffe+dlprim achieves 55%, 67%, and 68% of cuDNN's performance on alexnet, resnet18, and vgg16,
and improves on caffe/libdnn by 31%, 32%, and 77% for these networks.
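(These figures follow directly from the table: for alexnet, 49.6 / 90.0764 ≈ 0.55, i.e. 55% of cuDNN speed, and 118.697 / 90.0764 ≈ 1.32, i.e. the ~31% improvement over libdnn. Since these are times, lower is faster and speed ratios are inverse time ratios.)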


To be continued.

Artyom

Artyom Beilis

Sep 22, 2021, 3:39:30 PM
to Caffe Users
Additional update

I tested Caffe/OpenCL with the dlprimitives [1] convolution improvements [2] on several
GPUs and compared it to several frameworks, including the dlprimitives micro-framework itself:
pytorch, tensorflow2, caffe+cudnn, caffe-opencl, caffe-opencl+dlprim, and keras/plaidml.
All times are in ms for batch size 16.

Summary
=======

Using dlprimitives convolutions instead of caffe-opencl's native ones gives the following performance boost:

gpu        alexnet resnet18  vgg16
rtx2060s    9%     26%       41%
gtx1080    23%     34%       84%
rx6600xt   16%     11%       54%
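(Boost here means the caffe/ocl time divided by the caffe/dlprim time, minus one, taken from the full tables below; e.g. vgg16 on the gtx1080: 657.81 / 358.25 ≈ 1.84, i.e. 84% faster.)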

I still don't reach cuda+cudnn performance, but in the best cases I get ~75-78% of it:

gpu         alexnet resnet18    vgg16
rtx2060s    42%     70%         54%
gtx1080     54%     78%         77%

I also have to note that one of the issues that makes Caffe much weaker than tf/pytorch
and dlprim itself on ResNet-like networks is Caffe's implementation of batch
normalization, which is split into two independent layers.
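For reference, this is the usual Caffe pattern in prototxt: a BatchNorm layer that only normalizes, followed by a separate Scale layer for the learned gamma/beta, so every batch goes through two independent layers with their own kernel launches:

layer {
  name: "bn1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
}
layer {
  # A second, independent layer just for the learned scale/shift.
  name: "scale1"
  type: "Scale"
  bottom: "conv1"
  top: "conv1"
  scale_param { bias_term: true }
}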


Full Benchmarks
================

RTX 2060S       alexnet/16  resnet18/16     vgg16/16
pt/cuda         11.078      30.969          148.916
tf/cuda         26.42       55.62           157.13
caffe/cuda      14.36       61.06           232.97  
dlprim          35.58       66.30           421.93
caffe/dlprim    33.83       86.79           431.19
caffe/ocl       36.71       109.21          606.29
keras/plaidml   69.1        199.53          911.29


GTX 1080        alexnet/16  resnet18/16     vgg16/16
pt/cuda         15.763      38.359          125.902
tf/cuda         33.38       69.52           196.94
caffe/cuda      17.39       79.97           274.12
dlprim          30.65       69.32           353.19
caffe/dlprim    32.24       102.54          358.25
caffe/ocl       39.58       137.12          657.81
keras/plaidml   89.57       229.14          972.82

RX 6600XT       alexnet/16  resnet18/16     vgg16/16
dlprim          26.760      59.881          295.190
caffe/dlprim    26.238      85.880          302.403
caffe/ocl       30.364      95.523          465.640
keras/plaidml   183.230     429.874         3612.744

Best Regards,
Artyom Beilis
