Like what I see and some feedback

Christ Schlacta

Jun 13, 2011, 2:53:44 AM

to Engine-cudamrg discussion group

I'm very interested in getting hold of engine OpenCL for my backup
server, which includes an OpenCL-friendly integrated graphics chip and
backs up to Amazon S3. I hope to be able to encrypt my backups before
sending them to Amazon.

Knowing that CBC encryption can't be parallelized with CUDA or OpenCL,
can you add CTR support? It's a good alternative to CBC that's
parallel-friendly.
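
To illustrate why CTR is parallel-friendly, here is a minimal Python sketch. It is not code from the engine: `toy_block_encrypt` is a hypothetical hash-based stand-in for AES (not secure), and the 8-byte nonce plus 8-byte counter layout is an assumption. The point is only that each block depends on the key, the nonce and its own index, so blocks can be processed in any order or all at once.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 16  # bytes per block, as in AES


def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Hypothetical stand-in for a block cipher's forward direction (NOT AES)."""
    return hashlib.sha256(key + block).digest()[:BLOCK]


def ctr_block(key: bytes, nonce: bytes, index: int, chunk: bytes) -> bytes:
    """Process block `index` using only the key, the nonce and the index."""
    counter = nonce + index.to_bytes(8, "big")  # assumed layout: nonce || counter
    keystream = toy_block_encrypt(key, counter)
    return bytes(c ^ k for c, k in zip(chunk, keystream))


def ctr_crypt(key: bytes, nonce: bytes, data: bytes, workers: int = 4) -> bytes:
    """Encrypt or decrypt `data`; in CTR both directions are the same operation."""
    chunks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Every block is an independent task; map() returns results in order.
        out = pool.map(lambda ic: ctr_block(key, nonce, ic[0], ic[1]),
                       enumerate(chunks))
        return b"".join(out)
```

On a GPU each thread would take the role of one `ctr_block` call. Note that the nonce is an extra argument besides the key, and it must never be reused with the same key.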

Using data from /dev/urandom or /dev/random for any benchmark will of
course perform poorly. Both devices spend cycles in a kernel-space
crypto engine to generate numbers, which takes significant time even at
100% CPU and is therefore a bottleneck. Compare
dd if=/dev/sda of=/dev/null
to
dd if=/dev/urandom of=/dev/null
to see the difference. I read that you were experiencing slowdowns
using /dev/urandom for benchmarks, and I hope this helps you normalize
your results. Use a decently fast raw disk drive to store your data, as
OS caching on filesystems will also skew your results.
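
The same comparison can be made from a script. The following is a rough Python sketch, assuming a Linux system where /dev/zero and /dev/urandom exist; `read_throughput` is a hypothetical helper, and the 64 MiB default is an arbitrary choice.

```python
import time


def read_throughput(path: str, total_bytes: int = 64 * 1024 * 1024,
                    chunk_size: int = 1024 * 1024) -> float:
    """Read up to `total_bytes` from `path` and return the rate in MiB/s."""
    done = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as src:
        while done < total_bytes:
            chunk = src.read(min(chunk_size, total_bytes - done))
            if not chunk:  # a regular file ended early
                break
            done += len(chunk)
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
    return done / (1024 * 1024) / elapsed


# Example on Linux:
#   read_throughput("/dev/zero") compared with read_throughput("/dev/urandom")
```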

Johannes Gilger

Jun 13, 2011, 3:08:52 AM

to engine-...@googlegroups.com

On 12/06/11 23:53, Christ Schlacta wrote:
> I'm very interested in getting hold of engine OpenCL for my backup
> server, which includes an OpenCL-friendly integrated graphics chip and
> backs up to Amazon S3. I hope to be able to encrypt my backups before
> sending them to Amazon.
>
> Knowing that CBC encryption can't be parallelized with CUDA or OpenCL,
> can you add CTR support? It's a good alternative to CBC that's
> parallel-friendly.

Hi Christ,

Well, that's another one of those nice problems: OpenSSL does not support
CTR mode for any cipher. There are some hints that people were
planning to support it, but nothing has happened yet. I don't know how
easy it would be for an engine to support a cipher and block mode which
is not present in OpenSSL, and which needs arguments besides the key.
This is one of the reasons I'm stuck with my work after having completed
my thesis. I want to implement something which has an immediate use, but
neither ECB nor decryption-only CBC fits that profile. The CUDA/OpenCL
kernels could be used in other contexts with little effort, but right
now I'm not sure where to start integrating support for CUDA/OpenCL into
existing software.
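
Why CBC only parallelizes for decryption can be shown with a small Python sketch. This is an illustration under assumptions, not the engine's kernels: `toy_encrypt` and `toy_decrypt` are a hypothetical 4-round Feistel construction standing in for AES (not secure).

```python
import hashlib

BLOCK = 16  # bytes per block, as in AES
HALF = BLOCK // 2


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def _round(key: bytes, r: int, half: bytes) -> bytes:
    return hashlib.sha256(key + bytes([r]) + half).digest()[:HALF]


def toy_encrypt(key: bytes, block: bytes) -> bytes:
    """4-round Feistel network: an invertible toy cipher (NOT AES)."""
    left, right = block[:HALF], block[HALF:]
    for r in range(4):
        left, right = right, xor(left, _round(key, r, right))
    return left + right


def toy_decrypt(key: bytes, block: bytes) -> bytes:
    """Inverse of toy_encrypt: undo the rounds in reverse order."""
    left, right = block[:HALF], block[HALF:]
    for r in reversed(range(4)):
        left, right = xor(right, _round(key, r, left)), left
    return left + right


def cbc_encrypt(key: bytes, iv: bytes, blocks: list) -> list:
    """Serial: block i needs ciphertext i-1, so the loop cannot be split up."""
    out, prev = [], iv
    for plain in blocks:
        prev = toy_encrypt(key, xor(plain, prev))
        out.append(prev)
    return out


def cbc_decrypt_block(key: bytes, iv: bytes, cipher_blocks: list, i: int) -> bytes:
    """Independent: block i needs only ciphertexts i and i-1, both already known."""
    prev = iv if i == 0 else cipher_blocks[i - 1]
    return xor(toy_decrypt(key, cipher_blocks[i]), prev)
```

Each call to `cbc_decrypt_block` could run in its own GPU thread, while `cbc_encrypt` has to walk the chain one block at a time.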

> Using data from /dev/urandom or /dev/random for any benchmark will of
> course perform poorly. Both devices spend cycles in a kernel-space
> crypto engine to generate numbers, which takes significant time even at
> 100% CPU and is therefore a bottleneck. Compare
> dd if=/dev/sda of=/dev/null
> to
> dd if=/dev/urandom of=/dev/null
> to see the difference. I read that you were experiencing slowdowns
> using /dev/urandom for benchmarks, and I hope this helps you normalize
> your results. Use a decently fast raw disk drive to store your data, as
> OS caching on filesystems will also skew your results.

Ok, to clarify this point: I'm not that stupid ;)

I wrote a file generator which uses dd to create files from /dev/zero
or /dev/urandom before any call to OpenSSL takes place; it's just a
small script. Furthermore, when benchmarking the pure kernel performance,
nothing besides the kernel call was taken into account. So the
time needed to copy data to or from secondary storage is completely
irrelevant to my micro-benchmarks.

The reason for the difference between /dev/zero and /dev/urandom is very
simple: if you use /dev/zero, all your payload will be zero bytes, which
means that for every byte in every block, the same entry in the lookup
tables (the T-tables for AES) will be queried. Now, if you use constant
memory, this means that the entry is cached once and then the cache is
hit for every subsequent byte. If you use shared memory, it means that all
threads access the same address in the same memory bank, so no bank
conflicts occur and no warp is serialized.

With random data you will have cache misses with constant memory and
bank conflicts with shared memory. No way around that.
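
The effect can be counted with a simplified Python model. Everything in it is an assumption made for illustration: a warp of 32 threads, 32 shared-memory banks with 4-byte table entries, a constant cache that serves one distinct address per pass, and lookups indexed directly by the payload byte (key mixing is ignored, since the key is the same in every thread).

```python
import random
from collections import defaultdict

WARP_SIZE = 32    # threads that issue a lookup together (assumption)
NUM_BANKS = 32    # assumption; older GPUs use 16 banks per half-warp
TABLE_SIZE = 256  # one T-table: 256 entries of 4 bytes


def warp_lookup(payload_bytes: bytes) -> list:
    """Each thread looks up the table entry selected by its own payload byte."""
    return [b % TABLE_SIZE for b in payload_bytes]


def constant_memory_passes(indices: list) -> int:
    """Model: one distinct address is broadcast per pass."""
    return len(set(indices))


def shared_memory_passes(indices: list) -> int:
    """Model: passes = largest number of distinct addresses hitting one bank."""
    per_bank = defaultdict(set)
    for i in indices:
        per_bank[i % NUM_BANKS].add(i)  # 4-byte entry i lives in bank i mod 32
    return max(len(addrs) for addrs in per_bank.values())


rng = random.Random(2011)  # fixed seed so the run is repeatable
zero_warp = warp_lookup(bytes(WARP_SIZE))
random_warp = warp_lookup(bytes(rng.randrange(256) for _ in range(WARP_SIZE)))

for name, warp in (("zero", zero_warp), ("random", random_warp)):
    print(name, constant_memory_passes(warp), shared_memory_passes(warp))
```

In this model a zero-filled payload needs a single pass in both memory types, while random payload needs more than one.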

Hope I explained it sufficiently.

Greetings,
Jojo

--
Johannes Gilger <hei...@hackvalue.de>
http://heipei.net
GPG-Key: 0xD47A7FFC
GPG-Fingerprint: 5441 D425 6D4A BD33 B580 618C 3CDC C4D0 D47A 7FFC

Christ Schlacta

Jun 13, 2011, 3:32:00 AM

to engine-...@googlegroups.com

As for CBC being useless because of its lack of parallelism, I'd like to
see how it compares to the CPU, without parallelism. If it sucks, it
sucks... but if it's marginal, the CPU usage may be a deciding factor.
I know that for my purposes, anything I can do to free up the CPU to
compress is a huge help!

Johannes Gilger

Jun 13, 2011, 4:01:00 AM

to engine-...@googlegroups.com

On 13/06/11 00:32, Christ Schlacta wrote:
> As for CBC being useless because of its lack of parallelism, I'd like to
> see how it compares to the CPU, without parallelism. If it sucks, it
> sucks... but if it's marginal, the CPU usage may be a deciding factor.
> I know that for my purposes, anything I can do to free up the CPU to
> compress is a huge help!

Hm, well, I haven't done any benchmarks, but let me put it like this:

- On a really good GPU you can achieve about a 3-5x speedup when taking
into account all the necessary steps (data transfer etc.). Depending on
the GPU, a lot of threads are run in parallel. My GPU is
really old, but it still has four streaming multiprocessors, which
keep 768 threads in flight each, for 3072 threads in total. Now compare
that to one single thread and you get the idea.

- The sheer processing speed of a CUDA core is no measure either. While
it may be clocked at 1 GHz or faster, it will take more cycles for
operations the CPU can do in one cycle. Furthermore, consider this:
global memory access (where your payload data resides) takes around 400
cycles. On a GPU, if one thread requests access, it's simply suspended
and another thread is executed instead. If your thread is the only one
running, though, it will experience frequent gaps of 400 cycles during
which no other work is done.

- Keep in mind that executing stuff on the GPU also entails hefty CPU
use, at least for one core. The CPU keeps polling the GPU for the return
of the data, so you'd lose one CPU core to employ your GPU, which is not
a good tradeoff.
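
The arithmetic behind the first two points can be sketched in Python. The round-robin model and the figure of 20 compute cycles between memory accesses are assumptions for illustration only; the 4 x 768 threads and the 400-cycle latency are the numbers given above.

```python
def threads_in_flight(multiprocessors: int, threads_per_mp: int) -> int:
    """Total threads the GPU keeps resident at once."""
    return multiprocessors * threads_per_mp


def core_utilization(threads: int, compute_cycles: int,
                     memory_latency: int = 400) -> float:
    """Fraction of cycles spent on useful work in a simple round-robin model.

    Each thread alternates `compute_cycles` of work with one memory access
    of `memory_latency` cycles; while it waits, other threads may run.
    """
    per_thread = compute_cycles / (compute_cycles + memory_latency)
    return min(1.0, threads * per_thread)


print(threads_in_flight(4, 768))             # threads on the GPU described above
print(core_utilization(1, 20))               # one lone thread (e.g. CBC encryption)
print(core_utilization(768, 20))             # one fully loaded multiprocessor
```

Under these assumptions a single thread keeps its multiprocessor busy less than 5% of the time, while 768 threads hide the memory latency completely.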

Sorry, but I don't see any easy solution for your case. If we had a
different framework it might be easy. But you really have to take a lot
of details into account when considering the GPU; it is not a
one-size-fits-all solution for existing problems.

Paolo Margara

Jun 13, 2011, 4:25:11 AM

to engine-...@googlegroups.com

Hi all,
I think Johannes has explained the core of the problem quite well:
OpenSSL does not support CTR mode for any cipher.
I don't think implementing a cipher that OpenSSL does not support is a
big issue, but the task requires time that I currently don't have (as
you can see from the commit frequency in the SVN repository).

About the testing methodology we used: here, too, Johannes has
explained it in detail, including why the plugin performs slightly
differently depending on the data source used, a
random-filled file or a zero-filled file.

Regards,
Paolo Margara

Paolo Margara

Jun 13, 2011, 4:28:15 AM

to engine-...@googlegroups.com

I completely agree with your list ;-)
