Hi Christ,
well, that's another of these nice problems: OpenSSL does not support
the CTR mode for any cipher. There are some hints that people were
planning to support it, but nothing has happened yet. I don't know how
easy it would be for an engine to support a cipher and block mode which
is not present in OpenSSL, and which needs arguments besides the key.
This is one of the reasons I'm stuck with my work after having completed
my thesis. I want to implement something which has an immediate use, but
neither ECB nor CBC-only-decryption fit that profile. The CUDA/OpenCL
kernels could be used in other contexts with little effort, but right
now I'm not sure where to start integrating support for CUDA/OpenCL into
existing software.
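To make the point about CTR needing arguments besides the key a bit more concrete, here is a minimal Python sketch. It is not OpenSSL code and not AES: toy_block is a hypothetical stand-in built on SHA-256 so the example runs with the standard library only, and the 8-byte nonce plus 8-byte counter layout is just one assumed convention. What it shows is that every block depends only on (key, nonce, block index), which is why the mode would map well onto many GPU threads.

```python
import hashlib

BLOCK = 16  # block size in bytes, as for AES

def toy_block(key, block_input):
    # Stand-in for a block cipher's encrypt function (NOT AES and not
    # meant to be secure); it only keeps the sketch self-contained.
    return hashlib.sha256(key + block_input).digest()[:BLOCK]

def ctr_block(key, nonce, index, chunk):
    # Counter block = nonce || big-endian block index (assumed layout).
    counter = nonce + index.to_bytes(8, "big")
    keystream = toy_block(key, counter)
    # zip() stops at the shorter input, so a short final chunk is fine.
    return bytes(a ^ b for a, b in zip(chunk, keystream))

def ctr_crypt(key, nonce, data):
    chunks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    # No block depends on another, so this loop could run in any order
    # or be spread over many threads.
    return b"".join(ctr_block(key, nonce, i, c) for i, c in enumerate(chunks))
```

Encryption and decryption are the same operation here: calling ctr_crypt twice with the same key and nonce returns the original data.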
> Using data from /dev/urandom or /dev/random for any benchmark will of
> course perform poorly. {u}random dedicates cycles on a crypto engine
> in kernel space to generate numbers, which even at 100% CPU takes
> significant time, and is therefore a bottleneck. Compare
> dd if=/dev/sda of=/dev/null to dd if=/dev/urandom of=/dev/null to see
> the difference. I read that you were experiencing slowdowns using
> /dev/urandom for benchmarks, and I hope this helps you normalize your
> results. Use a decently fast raw disk drive to store your data as the
> OS caching on filesystems will also skew your results.
Ok, to clarify this point: I'm not that stupid ;)
I wrote a file-generator which uses dd to generate files from /dev/zero
or /dev/urandom before any call to OpenSSL takes place, it's just a
small script. Furthermore, when benchmarking the pure kernel performance
nothing else besides the kernel call was taken into account. So, the
time needed to copy to/from background memory is completely irrelevant
to my micro-benchmarks.
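For illustration, a rough Python equivalent of such a file generator could look like the following. This is only a sketch and not the actual script (which uses dd); the function name and parameters are made up for the example. The point is simply that the payload file exists on disk before any timing starts.

```python
import os

def make_payload_file(path, size_bytes, source="zero", chunk=1 << 20):
    """Write size_bytes of zero or random bytes to path; return bytes written."""
    if source not in ("zero", "urandom"):
        raise ValueError("source must be 'zero' or 'urandom'")
    written = 0
    with open(path, "wb") as f:
        while written < size_bytes:
            n = min(chunk, size_bytes - written)
            # bytes(n) is n zero bytes; os.urandom(n) asks the OS random
            # source, comparable to reading /dev/urandom on Linux.
            f.write(bytes(n) if source == "zero" else os.urandom(n))
            written += n
    return written
```

A call such as make_payload_file("payload.bin", 64 * 1024 * 1024, "urandom") would then be made once, ahead of the benchmark run.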
The reason for the difference between /dev/zero and /dev/urandom is very
simple: If you use /dev/zero, all your payload will be zero bytes, which
means that for every byte in every block, the same entry in the lookup
tables (T-tables for AES) will be queried. Now, if you use constant
memory this means that the entry is cached once and then the cache is
hit for every subsequent byte. If you use shared memory it means that all
threads access the same memory bank (at the same spot), so no bank
conflicts occur and there is no warp serialization.
With random data you will have cache misses with constant memory and
bank conflicts with shared memory. No way around that.
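The effect can be imitated with a small Python model. This is a deliberately simplified sketch, not a description of any particular GPU: the 16 banks, the group of 16 threads, the key byte, and all function names are assumptions for the example. Each thread looks up one table entry; identical addresses count as one access (broadcast), while distinct addresses in the same bank, or distinct addresses in constant memory, have to be served one after another.

```python
import random
from collections import defaultdict

def shared_conflict_degree(indices, banks=16):
    """Largest number of distinct table words that map to a single bank."""
    per_bank = defaultdict(set)
    for i in indices:
        per_bank[i % banks].add(i)  # 4-byte table words, bank = index mod banks
    return max(len(words) for words in per_bank.values())

def constant_serialization(indices):
    """Distinct addresses requested; the model serves one address per step."""
    return len(set(indices))

def lookups(payload_bytes, key_byte=0x2B):
    # First-round table index = payload byte XOR key byte. The key byte
    # is an arbitrary example value; equal payload gives equal indices.
    return [b ^ key_byte for b in payload_bytes]

rng = random.Random(0)
zero = lookups([0] * 16)                              # like /dev/zero
rand = lookups([rng.randrange(256) for _ in range(16)])  # like /dev/urandom

print("zero payload:  ", shared_conflict_degree(zero), constant_serialization(zero))
print("random payload:", shared_conflict_degree(rand), constant_serialization(rand))
```

For the zero payload both numbers are 1, since every thread asks for the same entry. For the random payload the second number is larger, and the first one usually is too.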
Hope I explained it sufficiently.
Greetings,
Jojo
--
Johannes Gilger <hei...@hackvalue.de>
http://heipei.net
GPG-Key: 0xD47A7FFC
GPG-Fingerprint: 5441 D425 6D4A BD33 B580 618C 3CDC C4D0 D47A 7FFC
Hm, well, I haven't done any benchmarks, but let me put it like that:
- On a really good GPU you can achieve about a 3-5x speedup when taking
into account all the necessary steps (data transfer etc). Depending on
  the number of GPUs, a lot of threads are run in parallel. My GPU is
  really old, but still it has four streaming multiprocessors, which
  each keep 768 threads in flight, for 3072 threads in total. Now
  compare that to one single thread and you get the idea.
- The sheer processing speed of a CUDA core is no measure either. While
  it may be clocked at 1 GHz or faster, it will take more cycles for
operations the CPU can do in one cycle. Furthermore, consider this:
global memory access (where your payload data resides) takes 400
cycles. On a GPU, if one thread requests access, it's simply suspended
and another thread executed instead. If your thread is the only one
running though, it will experience frequent gaps of 400 cycles during
which no other work is done.
- Keep in mind that executing stuff on the GPU also entails hefty CPU
  use, at least for one core. The CPU keeps polling the GPU for a return
  of the data, so you'd lose one CPU core to employ your GPU, which is
  not a good tradeoff.
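To put rough numbers on the first two points, here is a toy model in Python. The 4 x 768 threads and the 400-cycle memory latency are the figures from above; the 20 compute cycles between two memory accesses is purely an assumed value, and the function names are made up for the sketch. The model only says: while one thread waits for memory, other threads can use the core.

```python
def threads_in_flight(multiprocessors, threads_per_mp):
    """Total number of threads the GPU keeps resident at once."""
    return multiprocessors * threads_per_mp

def busy_fraction(threads, compute_cycles, memory_latency=400):
    """Fraction of cycles spent on useful work in the toy model.

    Each thread alternates compute_cycles of work with one memory wait
    of memory_latency cycles; waits are filled by other threads if any.
    """
    per_thread = compute_cycles / (compute_cycles + memory_latency)
    return min(1.0, threads * per_thread)

print(threads_in_flight(4, 768))   # 4 multiprocessors, 768 threads each
print(busy_fraction(1, 20))        # a single thread mostly waits
print(busy_fraction(768, 20))      # many threads hide the latency
```

With a single thread the core is idle for most of the time in this model, which is the "frequent gaps of 400 cycles" case; with hundreds of threads per multiprocessor the gaps are covered.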
Sorry, but I don't see any easy solution for your case. If we had a
different framework it might be easy. But you really have to take into
account a lot of details when considering using the GPU; it is not a
one-size-fits-all solution for existing problems.
About the testing methodology that we used: here too, Johannes has
explained it in detail, including why the plugin performs slightly
differently depending on the data source used, a random-filled file or
a zero-filled file.
Regards,
Paolo Margara