Streams - Results


Johannes Gilger

Feb 2, 2011, 9:04:49 AM
to engine-cuda
Hi Paolo,

I thought I'd give you an update on streams, since I recently
implemented them in engine-cuda.

First off, implementing streams is not that straightforward and does
_not_ work with the existing transferDtoH and transferHtoD functions.
That's because each call to one of the encrypt/decrypt functions of the
CUDA modules (bf, des, etc.) copies the input to the device, executes
the kernel, and then copies the result back from the device. With
page-locked memory, these copy calls also include a host-side memcpy,
which is always blocking, so no call to a crypt function can start
before the previous one has finished, i.e. has done its own transferDtoH
memcpy.

What I did was to create an array with one entry per stream, each entry
holding a pointer to host memory and a pointer to device memory,
initialize that array as usual, and then manually memcpy the input data
to the individual host pointers in a for-loop in e_cuda.c. After this,
all the host-side data resides in page-locked buffers and the
non-blocking calls to transferHtoD and the subsequent kernel calls can
begin. When all the streams have finished (after calling
cudaThreadSynchronize) I simply memcpy the output from page-locked
memory to the output area given by OpenSSL.
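
A rough sketch of the host-side staging described above, in plain C so
it builds without the CUDA toolkit. All names here (stream_slot,
chunk_len, stage_input, process_slot, gather_output, NUM_STREAMS) are
made up for illustration and are not taken from e_cuda.c. malloc stands
in for the page-locked allocation, and a XOR loop stands in for the
asynchronous transferHtoD / kernel / transferDtoH sequence, so this only
shows the split, copy-in and copy-out steps, not the actual overlap.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NUM_STREAMS 4   /* hypothetical stream count */
#define BLOCK_SIZE  8   /* cipher block size in bytes, e.g. DES or Blowfish */

/* One staging slot per stream. In a CUDA build, host would point to
 * page-locked memory and a matching device pointer would live here too. */
typedef struct {
    unsigned char *host;
    size_t len;
} stream_slot;

/* Bytes assigned to stream i when total bytes (a multiple of BLOCK_SIZE)
 * are spread over NUM_STREAMS as evenly as whole blocks allow. */
static size_t chunk_len(size_t total, int i)
{
    size_t blocks = total / BLOCK_SIZE;
    size_t per    = blocks / NUM_STREAMS;
    size_t extra  = blocks % NUM_STREAMS;
    return (per + ((size_t)i < extra ? 1 : 0)) * BLOCK_SIZE;
}

/* Step 1: copy the caller's input into the per-stream staging buffers.
 * Returns 0 on success, -1 if an allocation fails. */
static int stage_input(stream_slot slots[NUM_STREAMS],
                       const unsigned char *in, size_t total)
{
    size_t off = 0;
    assert(total % BLOCK_SIZE == 0);
    for (int i = 0; i < NUM_STREAMS; i++) {
        slots[i].len  = chunk_len(total, i);
        slots[i].host = malloc(slots[i].len ? slots[i].len : 1);
        if (slots[i].host == NULL) {
            for (int j = 0; j < i; j++)
                free(slots[j].host);
            return -1;
        }
        memcpy(slots[i].host, in + off, slots[i].len);
        off += slots[i].len;
    }
    return 0;
}

/* Step 2 (stand-in only): the real code would issue an asynchronous
 * host-to-device copy, a kernel launch and an asynchronous device-to-host
 * copy on this slot's stream. Here each byte is XORed with 0x5A. */
static void process_slot(stream_slot *s)
{
    for (size_t k = 0; k < s->len; k++)
        s->host[k] ^= 0x5A;
}

/* Step 3: once every stream is done, copy the results back into the
 * caller's output buffer in order and release the staging buffers. */
static void gather_output(stream_slot slots[NUM_STREAMS], unsigned char *out)
{
    size_t off = 0;
    for (int i = 0; i < NUM_STREAMS; i++) {
        memcpy(out + off, slots[i].host, slots[i].len);
        off += slots[i].len;
        free(slots[i].host);
        slots[i].host = NULL;
    }
}
```

A caller would run stage_input once, process_slot for every slot, then
gather_output. The two extra memcpy passes in steps 1 and 3 are the
host-side cost of this approach.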

Now, another approach I tried was simply mlocking the memory pages
supplied by OpenSSL. This requires super-user privileges and did not
turn out to be any faster.
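
For reference, pinning a caller-supplied buffer with POSIX mlock could
look roughly like the following. try_pin and unpin are hypothetical
helper names, not functions from engine-cuda. The call can fail with
EPERM or ENOMEM when the process lacks the privilege or exceeds its
locked-memory limit, so the sketch reports the failure instead of
assuming success.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Try to lock the pages covering [buf, buf + len) into RAM.
 * Returns 0 on success, -1 on failure (reason printed to stderr).
 * Linux rounds buf down to a page boundary; other systems may
 * require a page-aligned address. */
static int try_pin(const void *buf, size_t len)
{
    assert(buf != NULL);
    if (mlock(buf, len) != 0) {
        fprintf(stderr, "mlock failed: %s\n", strerror(errno));
        return -1;
    }
    return 0;
}

/* Undo a successful try_pin on the same range. */
static void unpin(const void *buf, size_t len)
{
    (void)munlock(buf, len);
}
```

The buffer contents are untouched either way; only the paging behaviour
of the underlying pages changes.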

So, I ran some tests using streams and have uploaded the corresponding
graphs to http://avalon.hoffentlich.net/~heipei/tmp/engine-cuda/. The
00_streams plot shows the "before" case, using page-locked memory. As
you can see, the performance gain is tiny at best. The tests were
averaged over 5 runs on a GTX 295.

Since I don't really see any benefit in using streams, I'm going to
store that commit in a dormant branch and not pursue the idea further.
Implementing multi-gpu support would make more sense for usability, and
the only reason I'm not doing it is because I'm only going to measure
using a single GPU anyway.

So far,
greetings,
Jojo

--
Johannes Gilger <hei...@hackvalue.de>
http://heipei.net
GPG-Key: 0xD47A7FFC
GPG-Fingerprint: 5441 D425 6D4A BD33 B580 618C 3CDC C4D0 D47A 7FFC

Paolo Margara

Feb 3, 2011, 10:24:16 AM
to engine-...@googlegroups.com
Hi Johannes,
I had figured that the transferDtoH and transferHtoD functions wouldn't work with streams, but I didn't expect the performance increase to be so low.
I find your solution interesting; it would be even more interesting to look at the code.
I was wondering when you will publish the changes you made to the code. Do you already have a date in mind?
I'm very curious to see your work, so please let me know (or at least send me a preview).

greetings,
    Paolo Margara
