Short status-update

Johannes Gilger

Mar 25, 2011, 8:14:50 AM

to engine-cuda

Hi Paolo,

quick status update; I'm super busy at the moment.

I've implemented all my ciphers (AES, BF, CAST, Camellia, DES, IDEA) in
CUDA and OpenCL, for ECB for now. They all work reasonably fast, even
with OpenCL.

I'm currently exploring a lot of details about the algorithms and
especially the way they are benchmarked, all of which will go into my
thesis.

I just wanted to inform you that I discovered a sad reality about your
AES implementation, which also exists for other ciphers but not as much.
When 'openssl speed' is used it will benchmark the algorithms using a
buffer of zero-bytes. I suspect it does so to conserve time when
creating the buffers, and it doesn't matter for CPUs because memory
access is equally expensive. However, for the GPU this makes a
significant difference. I've found that AES-128 performs 3.5 times
better if it's used with zero-bytes than it does with random data
(/dev/urandom). I don't know what you'll do with this information, just
wanted to let you know, in case you weren't aware of it.

Anyway, here is a table which I will include in my thesis, not to
discredit your implementation but rather to point out that all the
papers I've read about cryptography on GPUs seem to miss this fact.

http://avalon.hoffentlich.net/~heipei/tmp/zero_random.png

Time is in ms, speed in megabit/s, just for the kernel invocation, no
memory transfer to or fro.

Greetings,
Jojo

P.S.: My AES for OpenCL performs better for random data because I don't
use texture memory, at least that's what I suspect. This advantage is
negated when taking the whole chain of encryption into consideration
again.

--
Johannes Gilger <hei...@hackvalue.de>
http://heipei.net
GPG-Key: 0xD47A7FFC
GPG-Fingerprint: 5441 D425 6D4A BD33 B580 618C 3CDC C4D0 D47A 7FFC

Paolo Margara

Mar 25, 2011, 12:51:53 PM

to engine-...@googlegroups.com

Hi Johannes,

Il 25/03/2011 13:14, Johannes Gilger ha scritto:
> Hi Paolo,
>
> quick status update; I'm super busy at the moment.

As you can see from the commit frequency in the project's svn, I'm
busy at the moment too ;-)

> I've implemented all my ciphers (AES, BF, CAST, Camellia, DES, IDEA) in
> CUDA and OpenCL, for ECB for now. They all work reasonably fast, even
> with OpenCL.

Have you implemented them for both encryption and decryption?

> I'm currently exploring a lot of details about the algorithms and
> especially the way they are benchmarked, all of which will go into my
> thesis.

That's good. Would it be a problem for you to send me a PDF copy of
your work to put on the project wiki (or for my personal use, if you
would rather not make your thesis publicly available)?

> I just wanted to inform you that I discovered a sad reality about your
> AES implementation, which also exists for other ciphers but not as much.
> When 'openssl speed' is used it will benchmark the algorithms using a
> buffer of zero-bytes.

I was already aware that the 'openssl speed' command uses a zero-filled
buffer for the benchmark.

> I suspect it does so to conserve time when
> creating the buffers, and it doesn't matter for CPUs because memory
> access is equally expensive. However, for the GPU this makes a
> significant difference.

I think this is true when you build the engine with the default option,
which puts the T-table into constant memory. If you build the engine
with the option '--disable-ttableconstant' (which puts the T-table into
shared memory), I think there won't be much difference. If you have some
time to spend verifying it...

> I've found that AES-128 performs 3.5 times
> better if it's used with zero-bytes than it does with random data
> (/dev/urandom).

This is because the T-table resides in constant memory, which is cached,
so performance improves through data locality (and if the buffer is
zero-filled, the whole buffer contains the same data). If you put the
T-table into shared memory (which is faster but uncached), you get no
performance gain from data locality, but performance no longer depends
on the kind of input file.
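
The data-locality argument above can be pictured with a small toy model. This is only a sketch under stated assumptions, not a description of engine-cuda's kernels: it assumes one thread per 16-byte block, looks only at the first-round table lookup (plaintext byte XOR key byte), and groups threads into warps of 32. The helper names are made up for illustration. The idea is that a zero-filled ECB buffer makes every thread in a warp ask for the same table entry, while random data spreads the requests over many entries.

```python
import os

WARP_SIZE = 32      # assumed number of threads issuing a lookup together
BLOCK_BYTES = 16    # AES block size in bytes

def first_round_indices(data: bytes, key: bytes, byte_pos: int = 0):
    """Table index each thread would use for one byte position in round 1.

    Thread i handles block i; the first lookup index is plaintext XOR key.
    """
    return [data[i * BLOCK_BYTES + byte_pos] ^ key[byte_pos]
            for i in range(len(data) // BLOCK_BYTES)]

def distinct_per_warp(indices):
    """Number of distinct table entries requested by each warp."""
    return [len(set(indices[w:w + WARP_SIZE]))
            for w in range(0, len(indices), WARP_SIZE)]

key = bytes(range(16))                          # arbitrary demo key
zero_buf = bytes(WARP_SIZE * BLOCK_BYTES * 4)   # four warps of zero blocks
rand_buf = os.urandom(len(zero_buf))            # four warps of random blocks

print("zero  :", distinct_per_warp(first_round_indices(zero_buf, key)))
print("random:", distinct_per_warp(first_round_indices(rand_buf, key)))
```

With the zero buffer every warp requests exactly one entry; with random data the count per warp is usually close to 32. How much each extra distinct address costs depends on the hardware and is not modelled here.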

> I don't know what you'll do with this information, just
> wanted to let you know, in case you weren't aware of it.

This is why the option '--disable-ttableconstant' is available.

> Anyway, here is a table which I will include in my thesis, not to
> discredit your implementation but rather to point out that all the
> papers I've read about cryptography on GPUs seem to miss this fact.
>
> http://avalon.hoffentlich.net/~heipei/tmp/zero_random.png
>
> Time is in ms, speed in megabit/s, just for the kernel invocation, no
> memory transfer to or fro.
>
> Greetings,
> Jojo
>
> P.S.: My AES for OpenCL performs better for random data because I don't
> use texture memory, at least that's what I suspect. This advantage is
> negated when taking the whole chain of encryption into consideration
> again.

I don't think it is due to my use of texture memory: in texture memory
(which is cached) I put only the round keys and the number of rounds.
I'd be interested in repeating your test if I can. Is the code that you
used in your git repository? Is my access still valid?

Greetings,
Paolo Margara

Paolo Margara

Mar 28, 2011, 4:54:12 AM

to engine-...@googlegroups.com

Hi Johannes,

last night I ran a quick test to back up what I told you; the
computer I used is still my own.
For my test I used two 1 GB files created with dd, reading from
/dev/zero and /dev/urandom. I used the test-enc script provided with the
project and ran it four times: zero-filled data and urandom-filled data,
each with the T-table in constant memory (requires running configure
with the default options) and with the T-table in shared memory
(requires running configure with the --disable-ttableconstant option).
At the end I plotted the results with a different gnuplot script.
These are the links to the results:
* http://engine-cuda.googlecode.com/svn/wiki/constant-vs-shared/aes-128-ecb.png
* http://engine-cuda.googlecode.com/svn/wiki/constant-vs-shared/aes-192-ecb.png
* http://engine-cuda.googlecode.com/svn/wiki/constant-vs-shared/aes-256-ecb.png
Since I consider this test "quick and dirty", I suggest you neither
include it in your thesis nor cite it. Because the test encrypts a file
from disk and writes the result to /dev/null only once, we cannot
consider the result very reliable; there are many factors to consider.
However, I think it is enough to see that when the engine uses constant
memory, which is cached, performance differs between zero-filled and
urandom-filled data, and that this difference virtually disappears when
using shared memory.
But in the real world the difference between zero-filled data and
urandom-filled data is about 80%, which is high but not as high as
what you reported.
Could you tell me how you produced the results shown in the
table that you linked previously? Do you think that we could agree
on a common test methodology?
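
For anyone wanting to repeat this kind of comparison without dd, the two input files can be produced with a short script. This is a minimal sketch, not part of the project: the function name and file names are made up for illustration, `os.urandom` stands in for reading /dev/urandom, and the demo uses a small size in a temporary directory rather than the 1 GB files described above.

```python
import os
import tempfile

CHUNK = 1 << 20  # write 1 MiB at a time to keep memory use small

def make_test_file(path: str, size: int, random_data: bool) -> None:
    """Write `size` bytes of zeros or of OS-provided random bytes to `path`."""
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            n = min(CHUNK, remaining)
            f.write(os.urandom(n) if random_data else bytes(n))
            remaining -= n

# Small demo in a temporary directory (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    zero_path = os.path.join(tmp, "zero.bin")
    rand_path = os.path.join(tmp, "urandom.bin")
    make_test_file(zero_path, 4 * CHUNK, random_data=False)
    make_test_file(rand_path, 4 * CHUNK, random_data=True)
    print(os.path.getsize(zero_path), os.path.getsize(rand_path))
```

Both files would then be fed to the same encryption run, once per build configuration, so that only the data contents differ between measurements.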

Greetings,
Paolo Margara

Johannes Gilger

Mar 28, 2011, 5:19:01 AM

to engine-...@googlegroups.com

> Could you tell me how you produced the results shown in the
> table that you linked previously? Do you think that we could agree
> on a common test methodology?

Hi Paolo,

thanks for the test. I just sent an email about half an hour ago, and
now I'll follow up since I did some further testing. I already mentioned
my methodology in the last email, but I will repeat since it really is
important: Right now I'm only measuring kernel execution time using
cudaEventRecord. I put differently sized blocks into the engine, which
are copied to the device, and only the kernel execution itself is timed.
So, these speeds are theoretical, a fact I stress strongly in my thesis.
I've found that for developing and improving kernels, kernel execution
time is crucial, since it stays constant across runs and speed can
quickly be measured by looking at the execution time.

So, I've added another table which, just for AES-128-ECB, tests constant
and shared memory with random data and zero bytes. The results speak for
themselves. The "Kernel" column is in milliseconds, the Speedup column
is simply column $6 / $4, meaning the speedup by using zero-bytes
instead of random bytes. This also shows why kernel-execution is
interesting: The small differences for shared memory would completely
vanish if one measured the whole chain of encryption, including
operations like memcpy which really do differ across runs.

The table can be seen here:
http://avalon.hoffentlich.net/~heipei/tmp/zero_byte_performance.png

As for methodology regarding engine-cuda: I suggest you keep advertising
the speed of the engine using the whole chain of computation, like you
do now. For developing new algorithms, comparing just the execution time
of the kernel is the only way to notice subtle differences. When you
change one tiny statement and want to know whether it performs faster,
you just can't use the 'openssl speed' command imho.

Greetings,
Jojo

Johannes Gilger

Mar 28, 2011, 4:07:22 AM

to engine-...@googlegroups.com

On 25/03/11 17:51, Paolo Margara wrote:
> > I've implemented all my ciphers (AES, BF, CAST, Camellia, DES, IDEA) in
> > CUDA and OpenCL, for ECB for now. They all work reasonably fast, even
> > with OpenCL.
> Have you implemented them for both encryption and decryption?

I've implemented ECB encryption for all of the algorithms, but only
implemented decryption for all AES modes as well as DES. I didn't spend
further time on the other algorithms, but in most cases decryption comes
down to simply reversing the operations performed for encryption, so in
most cases it should be half an hour's worth of programming.

> > I'm currently exploring a lot of details about the algorithms and
> > especially the way they are benchmarked, all of which will go into my
> > thesis.
> That's good. Would it be a problem for you to send me a PDF copy of
> your work to put on the project wiki (or for my personal use, if you
> would rather not make your thesis publicly available)?

No, I don't think this will be a problem. I hope I'll be happy enough
with my results to make them public ;)

> > I suspect it does so to conserve time when
> > creating the buffers, and it doesn't matter for CPUs because memory
> > access is equally expensive. However, for the GPU this makes a
> > significant difference.
> I think this is true when you build the engine with the default option,
> which puts the T-table into constant memory. If you build the engine
> with the option '--disable-ttableconstant' (which puts the T-table into
> shared memory), I think there won't be much difference. If you have some
> time to spend verifying it...

Yes, I did those tests with the constant-table version of AES. I've
since repeated them with the shared-table approach. Just as an example:

AES-128-ECB with zero-bytes:
- Constant table: 3376 Mb/s
- Shared table: 2861 Mb/s

AES-128-ECB with random bytes:
- Constant table: 956 Mb/s
- Shared table: 2932 Mb/s

These tests were performed on my 8600 GT using 8 MB blocks of data. I
only measure the kernel execution time, so the megabit/s figures reflect
the theoretical performance when memcpy to and from the device is not
taken into account (I'm doing this for my thesis at the moment, to
supplement the benchmarks using openssl-speed).
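
As a worked check of the figures above, the zero-versus-random ratio can be computed directly from the four quoted throughputs. This is only arithmetic on the numbers in this message; the dictionary layout and function name are made up for illustration.

```python
# Kernel-only throughput in Mb/s, copied from the figures quoted above
# (AES-128-ECB, 8600 GT, 8 MB blocks).
results = {
    "constant": {"zero": 3376, "random": 956},
    "shared":   {"zero": 2861, "random": 2932},
}

def zero_speedup(row: dict) -> float:
    """How many times faster zero-bytes run compared with random bytes."""
    return row["zero"] / row["random"]

for table, row in results.items():
    print(f"{table:8s} zero/random = {zero_speedup(row):.2f}")
# constant zero/random = 3.53
# shared   zero/random = 0.98
```

The constant-table ratio matches the factor of roughly 3.5 mentioned at the start of the thread, while the shared-table ratio is close to 1.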

So, yeah, for zero bytes, which is openssl speed, the constant tables
work better. But for production use the constant tables are next to
useless and we should really consider making shared tables the default,
so that configure needs an explicit argument to enable constant tables.

> > P.S.: My AES for OpenCL performs better for random data because I don't
> > use texture memory, at least that's what I suspect. This advantage is
> > negated when taking the whole chain of encryption into consideration
> > again.
> I don't think it is due to my use of texture memory: in texture memory
> (which is cached) I put only the round keys and the number of rounds.
> I'd be interested in repeating your test if I can. Is the code that you
> used in your git repository? Is my access still valid?

Yes, your access is still valid. However, I've talked to my supervisor
and he feels that my contribution to the engine-cuda project should
remain closed for a while after I hand in my thesis, in order to have
some kind of advantage over other teams of researchers. It's not like
I'll be working on engine-cuda after I finish my thesis, and I'll bring
this up again with my professor once I'm done, since I don't see anyone
else at the chair picking up my work. In any case, you'll have access
and can use whatever I've done as a source of inspiration, if you so
wish to ;)

I worked on the AES recently and mostly cleaned up the code using
preprocessor-macros and the like. The result is a smaller file which
should be more readable (have a look at the AES_ENC_ROUND macro for
example). The aes_cuda.cu is now half the lines of your last version,
while retaining the same functionality. Another thing I did was to split
up kernels for AES-128, AES-192 and AES-256 and got rid of the rounds
variable. This will increase throughput and free at least one register.
But you're free to do your own test of course.

Greetings,
Jojo

Paolo Margara

Mar 30, 2011, 5:05:35 AM

to engine-...@googlegroups.com

Hi Johannes,
for some unknown reason your message was sent to the moderation queue.
I'm sorry, but I only saw it this morning; I apologise for the
inconvenience.

For now, constant memory is the default because I have tested this
option more and the kernel functions require fewer registers and less
shared memory, but I agree with you that shared memory should be the
default, and I think it will be in future releases (I remind you that a
major bug affecting the shared version was fixed just before the release
of version 0.1.1, and more testing was done after that release).

For me it's not a problem to wait some time before merging your
contribution into the project; I know how things work in a university.
But remember that contributions to an open source project should be made
publicly available.

I was also thinking about splitting the kernel into three versions for
AES-128, AES-192 and AES-256 to eliminate the rounds variable, but so
far I haven't had time to do so. What percentage improvement have you
achieved? How many registers have you saved for each implementation?
How much does the resulting library grow?

I agree that the 'openssl speed' command should not be the only test
used during development, but for now I don't have time to develop
something better. One thing that could be done quickly is to patch the
speed.c source file in the apps directory of the openssl project,
replacing the zero-filled buffer with randomly generated values. But
then the performance of engine-cudamrg could no longer be compared
directly with that of any other engine that accelerates the currently
supported ciphers. For this reason I think it is better for the project
to show benchmarks done with 'openssl speed'.

Greetings,
Paolo Margara

Johannes Gilger

Apr 4, 2011, 3:34:55 AM

to engine-...@googlegroups.com

On 30/03/11 11:05, Paolo Margara wrote:
> For now, constant memory is the default because I have tested this
> option more and the kernel functions require fewer registers and less
> shared memory, but I agree with you that shared memory should be the
> default, and I think it will be in future releases (I remind you that a
> major bug affecting the shared version was fixed just before the release
> of version 0.1.1, and more testing was done after that release).

In my latest commits I finally learned how to use autoconf and made the
shared-memory approach the default, so now you have to request constant
memory explicitly. Here is a list of the new options I included, since
I found myself passing them through CFLAGS and NVCFLAGS all the time:

--enable-timing enable timing support (default=disabled)

--enable-ttableconstant choose where T-tables are stored (constant or shared
memory) (default=disabled)

--disable-libopencl disable building libOpenCL.so (default=enabled)

--with-rregcount specify the max number of regs that a thread can use
(default no restriction)

--with-maxthreads specify the number of threads per block (default:
256 threads)

--with-gpuarch supply the lowest compute capability the build
product should run on (default: sm_10)

I also tweaked the configure to run on OS X, since the library there
will be called libcudamrg.dylib.

> I was also thinking about splitting the kernel into three versions for
> AES-128, AES-192 and AES-256 to eliminate the rounds variable, but so
> far I haven't had time to do so. What percentage improvement have you
> achieved? How many registers have you saved for each implementation?
> How much does the resulting library grow?

I wrote that in the private mail to you. For pre-2.0 cards, register use
is the factor limiting occupancy for AES. So, to restate it as far as
registers go: your latest patch to aes_cuda.cu had the AES kernel for
Compute Capability 1.1 with 18/19 registers (enc/dec) ECB and 22
registers CBC. I was able to bring that down to 9 registers ECB
(encryption and decryption) and 11 registers CBC.
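
To see why these register counts matter for occupancy, here is a rough estimate. It is a simplified sketch, not output from any NVIDIA tool: it uses the published per-multiprocessor limits for compute capability 1.0/1.1 (8192 registers, 768 resident threads, 8 resident blocks), assumes the default of 256 threads per block mentioned above, and ignores shared-memory limits and register allocation granularity.

```python
# Per-multiprocessor limits for compute capability 1.0/1.1.
REGS_PER_SM = 8192
MAX_THREADS_PER_SM = 768
MAX_BLOCKS_PER_SM = 8

def occupancy(regs_per_thread: int, threads_per_block: int = 256) -> float:
    """Fraction of the maximum resident threads that fit on one multiprocessor."""
    regs_per_block = regs_per_thread * threads_per_block
    blocks_by_regs = REGS_PER_SM // regs_per_block
    blocks_by_threads = MAX_THREADS_PER_SM // threads_per_block
    blocks = min(blocks_by_regs, blocks_by_threads, MAX_BLOCKS_PER_SM)
    return blocks * threads_per_block / MAX_THREADS_PER_SM

cases = [
    ("ECB enc, before", 18),
    ("ECB dec, before", 19),
    ("CBC, before", 22),
    ("ECB, after", 9),
    ("CBC, after", 11),
]
for label, regs in cases:
    print(f"{label:16s} {regs:2d} regs -> {occupancy(regs):.0%}")
# ECB enc, before  18 regs -> 33%
# ECB dec, before  19 regs -> 33%
# CBC, before      22 regs -> 33%
# ECB, after        9 regs -> 100%
# CBC, after       11 regs -> 67%
```

Under these assumptions, only one 256-thread block fits per multiprocessor at 18 to 22 registers per thread, while 9 registers allow three blocks and 11 allow two.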

> I agree that the 'openssl speed' command should not be the only test
> used during development, but for now I don't have time to develop
> something better. One thing that could be done quickly is to patch the
> speed.c source file in the apps directory of the openssl project,
> replacing the zero-filled buffer with randomly generated values. But
> then the performance of engine-cudamrg could no longer be compared
> directly with that of any other engine that accelerates the currently
> supported ciphers. For this reason I think it is better for the project
> to show benchmarks done with 'openssl speed'.

Yes, I think that is completely OK. But tests with random data should
always ensure that we are not optimizing our algorithms to zero-bytes
unknowingly ;)

In my commits I've recently included a simple timing mechanism for CUDA
and OpenCL in the form of preprocessor-macros (CUDA_START_TIME and
CUDA_STOP_TIME). These only have an effect if ./configure
--enable-timing is active, and they will measure kernel execution time
and print the number of bytes, the microseconds for the kernel and the
calculated performance in Mb/s. This has the advantage that test-runs
are easily possible by just encrypting files (which can be filled with
zero-bytes or random data) and that the results stay fairly constant
across runs, so it is a good technique to evaluate small changes.
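
The start/stop pattern described above can be sketched in a few lines. This is a host-side Python stand-in, not the CUDA_START_TIME and CUDA_STOP_TIME macros themselves: it uses a wall-clock timer instead of CUDA events, the workload is a placeholder, the names are made up for illustration, and "Mb" is assumed to mean 10^6 bits.

```python
import time
from contextlib import contextmanager

def megabits_per_second(num_bytes: int, microseconds: float) -> float:
    """Throughput in Mb/s; bits per microsecond equals 10^6 bits per second."""
    return num_bytes * 8 / microseconds

@contextmanager
def timed_region(num_bytes: int, report: list):
    """Time only the enclosed region and record (bytes, microseconds, Mb/s)."""
    start = time.perf_counter()
    yield
    elapsed_us = (time.perf_counter() - start) * 1e6
    report.append((num_bytes, elapsed_us,
                   megabits_per_second(num_bytes, elapsed_us)))

data = bytes(1 << 16)   # placeholder input; set up outside the timed region
report = []
with timed_region(len(data), report):
    out = bytes(b ^ 0x5A for b in data)   # stand-in for the kernel call

n, us, mbps = report[0]
print(f"{n} bytes, {us:.0f} us, {mbps:.1f} Mb/s")
```

Keeping buffer setup outside the timed region mirrors the idea of measuring the kernel alone, without the transfers around it.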

I was able to detect really small improvements when simply reordering
some operations of a Feistel-function which were independent. Or I was
able to quickly determine whether using __mul24 made an impact for IDEA
(it did), without having to run five runs of 'openssl speed' which takes
forever ;) And I was able to directly compare my OpenCL port, which was
exactly as fast as CUDA in some cases, and the only performance hit
stems from the lack of page-locked host memory and DMA.

So much for now,
greetings,
