I feel like I have to ask a few questions about the general architecture
of this OpenSSL engine. I've managed to get DES working (ECB for now,
encryption and decryption), and even with my basic skills and
entry-level GPU (8600 GT) it outperforms my CPU for big blocks.
I do have a few questions though:
- How does one best use the cuda_init method? It is set with
ENGINE_set_init_function and only takes an ENGINE pointer. This was
fine as long as there was only one cipher in the engine, but now that
there are more it's hard to just allocate memory which will be usable
by every cipher later on, and not waste any. As far as I can tell
there is no way to detect which cipher is being requested as early as
cuda_init, so I'm wondering how I should proceed (for now I simply
commented out all the AES init calls).
- Why did you use texture memory with AES? I didn't study your code that
closely, but I don't remember any emphasis on texture memory in the
books I've read so far. For DES I mostly use constant memory (SBoxes and
round keys) and plain device memory for the data (it needs to be read
and written only once anyway, so there's no point in caching).
- Did the --enable-gprof configure option ever work? It seems to me that
OpenSSL itself would have to be compiled with profiling enabled as well.
- Why do the verbose-flags not work? When using -v (or -v -v) nothing
happens.
- Any idea why the 'ciphers' command of OpenSSL doesn't work?
- Did you plan on implementing multi-GPU support with host threads? The
way I see it, the spot where the input is broken into chunks would
make that relatively easy, judging from superficial inspection.
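To make the first question concrete: the init hook is registered roughly like
this (the API shape is OpenSSL's ENGINE interface; the function bodies are
only my hypothetical sketch of the compromise I'm considering, not actual
engine code):

```c
#include <openssl/engine.h>

/* OpenSSL hands the init hook nothing but the ENGINE pointer, so at
 * this point there is no way to tell which cipher will be requested. */
static int cuda_init(ENGINE *e)
{
    /* Hypothetical compromise: allocate only cipher-agnostic resources
     * here (e.g. one shared device/host buffer) and defer per-cipher
     * setup to each cipher's own init callback. */
    return 1;
}

static int bind_helper(ENGINE *e)
{
    if (!ENGINE_set_id(e, "cudamrg")
        || !ENGINE_set_init_function(e, cuda_init))
        return 0;
    return 1;
}
```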
Other than that I've found my way around and hope that after I clear
these issues I can release my early stab at DES (at least to you for the
time being).
Greetings,
Jojo
--
Johannes Gilger <hei...@hackvalue.de>
http://heipei.net
GPG-Key: 0xD47A7FFC
GPG-Fingerprint: 5441 D425 6D4A BD33 B580 618C 3CDC C4D0 D47A 7FFC
Greetings,
Paolo Margara
here's a short update from me, as well as some further questions.
Right now I feel like I'm on the right path. I have implemented DES (ECB
for now) and IDEA, and got a more-than-4x speedup for DES and an
almost-10x speedup for IDEA on the GPU. FYI, I run an 8600 GT (compute
capability 1.1) on a Pentium D 2.8 GHz. Nice prospects!
Anyway, on to the technical stuff:
On 22/12/10 15:23, Paolo Margara wrote:
> I think this point will require some work in future commits.
> Since this piece of code will be changed soon, I think it is better for
> you to continue your work by simply commenting out the code that manages
> the AES initialization.
I have some further headaches about the whole initialization. First off,
I'm using constant memory for my DES implementation (which I'm thinking
of replacing with texture memory, but that's for later), and it just hit
me that, like other variables, this memory is probably allocated and
initialized every time the engine is started, even if it's just to run
IDEA or AES, since the DES CUDA file is compiled into the whole binary.
How would one even start to divide this? Dynamic memory allocation is
simple, and might even work out for us since we mostly use the same
buffer size for the different algorithms, as far as I can tell, but all
these variables (__device__ uint32_t *rk, stuff like that) are present
every time some component of the engine is started.
> You have used constant memory also for the key?!
> I chose to store the expanded key in texture memory because it is
> cached, random texture memory accesses are available directly to all
> the GPU threads, and the CPU can write the key into it; furthermore,
> the limited size of the expanded key ensures it is always completely
> stored in the texture cache.
Ok, I admit I didn't start with texture memory because it looks rather
complicated, at least compared to plain old constant memory. Right now
DES uses a set of SBoxes (borrowed from the OpenSSL guys) which is
placed in constant memory at startup and gets copied to shared memory on
each kernel invocation. This turned out to perform better than just
leaving it in constant memory, especially because none of the accesses
to this memory are coalesced in any way.
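For illustration, the constant-to-shared staging I mean looks roughly like
this (a simplified sketch with made-up names and sizes, not the actual DES
kernel):

```cuda
__constant__ uint32_t des_sbox[8][64];  /* filled once via cudaMemcpyToSymbol() */

__global__ void des_encrypt_kernel(const uint64_t *in, uint64_t *out, int nblocks)
{
    /* Stage the SBoxes in shared memory: the constant cache is fast only
     * when all threads of a warp read the same address, but SBox lookups
     * are data-dependent and effectively random. */
    __shared__ uint32_t sbox[8][64];
    for (int i = threadIdx.x; i < 8 * 64; i += blockDim.x)
        ((uint32_t *)sbox)[i] = ((const uint32_t *)des_sbox)[i];
    __syncthreads();

    /* ... per-thread DES rounds using sbox[][] on in[]/out[] ... */
}
```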
> > - Why do the verbose-flags not work? When using -v (or -v -v) nothing
> > happens.
> Do you mean the verbose flag in the plugin configuration?
I mean the fact that there seems to be some way to pass verbosity to the
program on invocation. Right now, if I want verbose output, I set
"verbose = 2;" in e_cuda.c.
> > - Any idea why the 'ciphers' command of OpenSSL doesn't work?
> No, in my build of openssl it works correctly.
To clarify: I mean "openssl ciphers -engine cudamrg", which gives me
Error in cipher list
140292317238952:error:1410D0B9:SSL routines:SSL_CTX_set_cipher_list:no cipher match:ssl_lib.c:1282:
> I had already thought about the possibility of implementing this feature.
> In my opinion, to get real benefits, you must have two or more cards of
> comparable speed. At that point it would probably be more useful to
> drive the cards in SLI, which is also easier to manage.
Hm, I do have 8 identical cards available in the machine in the lab, but
seeing the performance my single el-cheapo card delivers, I'll reserve
the multi-GPU functionality for when I really need to crank out more
speed ;)
> I think it might be more interesting to implement encryption and
> decryption using multiple streams, but this is only my opinion.
I haven't looked at streams yet.
By the way, why do you have CPU-only modes for the engine? When I
compare my CUDA implementation with the CPU algorithms I simply omit
-engine cudamrg, and the original OpenSSL implementation takes over,
right? Why the hassle with these preprocessor conditionals?
Another thing I find rather strange is that running openssl with the
CUDA engine gives me 100% CPU load as well. Since the memory is pinned,
copying it shouldn't impact the CPU much, right? And the key schedule
doesn't take that long either. Any idea where that might come from?
It also worries me that nvidia-smi -a reports 0% GPU and 5% memory usage
while I'm running my kernels. Is the computation/memory-transfer ratio
really that low? The Visual Profiler suggests that the GPU is busy with
my kernel for at least 50% of the time, with the rest being memcpy. OK,
running "glxgears" has the same effect, i.e. none, but it's still
strange.
I think I'll be able to give you private access to my repo at some point
this week; the sooner the better. I saw your changes on the main repo
but haven't merged them yet, since they will require some manual
intervention.
Thanks for your answers so far,
greetings,
still no open repo from me, and for that I'm sorry. Right now I don't
feel like the repository is in a state to be released; there are a lot
of files and debug statements I still have to get rid of. Furthermore,
the algorithms are still unoptimized, working only for encryption and
only in ECB mode. But I'll try to keep you updated with this mail.
An overview over what works:
- DES ECB encryption
- Blowfish ECB encryption
- Camellia-128 ECB encryption
- IDEA ECB encryption
- CAST5 ECB encryption
Each of these is faster than my CPU by at least a factor of 2, most of
the time even more, though it heavily depends on which numbers you
trust. I do tests using 'openssl speed', which reports very fast
results, but also plain file-encryption tests with a 100 MB file, which
probably include some overhead and don't yield results quite as
impressive.
Most importantly: I have merged all your changes into my repo, so up
until now no work has been lost on either side. I would greatly
appreciate better commit messages, meaning that the first line should be
a short description (<= 50 chars), followed by an empty line and then a longer
description, wrapped to 80 chars. This is the git style for commit
messages. I rewrote your commit-messages and included some additional
info, like in r31, where you fixed bugs but also included AES
key-scheduling on the GPU, as far as I could tell.
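Concretely, the layout I mean (with a made-up subject line) is:

```
Fix AES key schedule and move it onto the GPU

Longer description of what was wrong and why the change is correct,
wrapped to 80 characters. Side changes (like the GPU key scheduling
that went into r31) belong here instead of staying undocumented.
```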
On 11/01/11 14:58, Paolo Margara wrote:
> Hi Johannes, congratulations on the achievements! Do you plan to
> support CBC as well (at least for decryption)?
Yes, I do, but that will be a while. I try to optimize each algorithm
_before_ I do decryption and CBC, since most of the code is copied
anyway, and doing CBC/decryption too early would mean I'd have to
re-copy-paste everything whenever I change one kernel.
About the whole initialization:
> What you say is true, but we cannot do otherwise; I would not worry
> about a few wasted bytes.
Yesterday I wrote unified init and finish functions for the engine.
What they do is allocate memory (buffer_size) on the device and host in
e_cuda.c and then pass these pointers to the single encrypt/decrypt
wrapper functions for each of the algorithms. So I can now test all the
algorithms without commenting anything out in between ;)
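Roughly, the unified functions look like this (a simplified sketch with
illustrative names and a made-up buffer_size; the actual identifiers in
e_cuda.c differ):

```c
#include <cuda_runtime.h>
#include <openssl/engine.h>

static const size_t buffer_size = 8 * 1024 * 1024;  /* made-up value */

static unsigned char *host_buf;  /* page-locked host staging buffer     */
static unsigned char *dev_buf;   /* device buffer shared by all ciphers */

static int cuda_init(ENGINE *e)
{
    /* One allocation for every algorithm, done once per engine start;
     * the pointers are then handed to each cipher's wrapper function. */
    if (cudaHostAlloc((void **)&host_buf, buffer_size,
                      cudaHostAllocDefault) != cudaSuccess)
        return 0;
    if (cudaMalloc((void **)&dev_buf, buffer_size) != cudaSuccess) {
        cudaFreeHost(host_buf);
        return 0;
    }
    return 1;
}

static int cuda_finish(ENGINE *e)
{
    cudaFree(dev_buf);
    cudaFreeHost(host_buf);
    return 1;
}
```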
About constant memory and pre-initialized variables: I am pretty sure
that constant memory (say, in des_cuda.cu) is only copied when I invoke
a kernel from that file. I base this on the observation that shared
memory usage matches only the amount specified in the corresponding .cu
file when I call a kernel like the CAST5 encryption; the Visual Profiler
says so.
> > By the way, why do you CPU-only modes for the engine? When I compare my
> > CUDA implementation with the CPU algorithms I simply omit the -engine
> > cudamrg, and the original OpenSSL implementation takes over, right? Why
> > the hassle with these preprocessor conditionals?
> What you say is true, but when I started to develop the engine code I
> used that option for testing purposes; currently it's quite useless.
I've gotten rid of all the CPU_ONLY statements for the algorithms I
implemented, but I haven't touched AES yet. This is because I consider
it "your code" and because it will make it easier for me to merge your
changes in the future ;)
I really hope I'll be able to give you a glimpse of my repo in the next
few weeks. If the code I produce should ever be released/included in the
original project, we would have to think about licensing. To implement
the algorithms I used a lot of code-snippets (like preprocessor macros,
variable names, etc) from the original OpenSSL. Basically, I copied
everything I needed from OpenSSL and then made it run on CUDA. After
that I optimized it with the profiler. I don't think it will be a
problem, but it should be kept in mind for the future.
Looking forward to questions and comments,
regards, Jojo