I feel like I have to ask a few questions about the general architecture
of this OpenSSL engine. I've managed to get DES working (ECB for now,
encryption and decryption), and even with my basic skills and
entry-level GPU (8600 GT) it outperforms my CPU for big blocks.
I do have a few questions though:
- How does one best use the cuda_init method? It is set with
ENGINE_set_init_function and only takes an ENGINE pointer. This was
fine as long as there was only one cipher in the engine, but now that
there are more it's hard to just allocate memory which will be usable
by every cipher later on, and not waste any. As far as I can tell
there is no way to detect which cipher is being requested as early as
cuda_init, so I'm wondering how I should proceed (for now I simply
commented out all the AES init calls).
- Why did you use texture memory with AES? I didn't study your code that
closely, but I don't remember any emphasis on texture memory in the
books I've read so far. For DES I mostly use constant memory (SBoxes and
round keys) and plain device memory for the data (it needs to be read
and written only once anyway, so there's no point in caching).
- Did the --enable-gprof configure option ever work? It seems to me that
OpenSSL itself would have to be compiled with profiling enabled as well.
- Why do the verbose-flags not work? When using -v (or -v -v) nothing
happens.
- Any idea why the 'ciphers' command of OpenSSL doesn't work?
- Did you plan on implementing multi-GPU support with host threads? The
way I see it, the spot where the input is broken into chunks would
make that relatively easy, judging from superficial inspection.
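To make the first question concrete: the init hook is registered roughly like
this (the API shape is OpenSSL's ENGINE interface; the function bodies are
only my hypothetical sketch of the compromise I'm considering, not actual
engine code):

```c
#include <openssl/engine.h>

/* OpenSSL hands the init hook nothing but the ENGINE pointer, so at
 * this point there is no way to tell which cipher will be requested. */
static int cuda_init(ENGINE *e)
{
    /* Hypothetical compromise: allocate only cipher-agnostic resources
     * here (e.g. one shared device/host buffer) and defer per-cipher
     * setup to each cipher's own init callback. */
    return 1;
}

static int bind_helper(ENGINE *e)
{
    if (!ENGINE_set_id(e, "cudamrg")
        || !ENGINE_set_init_function(e, cuda_init))
        return 0;
    return 1;
}
```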
Other than that I've found my way around and hope that after I clear
these issues I can release my early stab at DES (at least to you for the
time being).
Greetings,
Jojo
--
Johannes Gilger <hei...@hackvalue.de>
http://heipei.net
GPG-Key: 0xD47A7FFC
GPG-Fingerprint: 5441 D425 6D4A BD33 B580 618C 3CDC C4D0 D47A 7FFC
Greetings,
Paolo Margara
here's a short update from me, as well as some further questions.
Right now I feel like I'm on the right path. I have implemented DES (ECB
for now) and IDEA, and got a more-than-4x speedup for DES and an
almost-10x speedup for IDEA on the GPU. FYI, I run an 8600 GT (compute
capability 1.1) on a Pentium D 2.8 GHz. Nice prospects!
Anyway, on to the technical stuff:
On 22/12/10 15:23, Paolo Margara wrote:
> I think this point will require some work in future commits.
> Since this piece of code will be changed soon, I think it is better for
> you to continue your work by simply commenting out the code that manages
> the AES initialization.
I have some further headaches about the whole initialization. First off,
I'm using constant memory for my DES implementation (which I'm thinking
of replacing with texture memory, but that's for later), and it just hit
me that, like other variables, this memory is probably allocated and
initialized every time the engine is started, even if it's just to run
IDEA or AES, since the DES CUDA file is compiled into the whole binary.
How would one even start to divide this? Dynamic memory allocation is
simple, and might even work out for us since we mostly use the same
buffer size for the different algorithms, as far as I can tell, but all
these variables (__device__ uint32_t *rk, stuff like that) are present
every time some component of the engine is started.
> You have used constant memory also for the key?!
> I chose to store the expanded key in texture memory because it is
> cached, random texture memory accesses are available directly to all
> the GPU threads, and the CPU can write the key into it; furthermore,
> the limited size of the expanded key ensures it is always completely
> stored in the texture cache.
Ok, I admit I didn't start with texture memory because it looks rather
complicated, at least compared to plain old constant memory. Right now
DES uses a set of SBoxes (borrowed from the OpenSSL guys) which is
placed in constant memory at startup and gets copied to shared memory on
each kernel invocation. This turned out to perform better than just
leaving it in constant memory, especially because none of the accesses
to this memory are coalesced in any way.
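For illustration, the constant-to-shared staging I mean looks roughly like
this (a simplified sketch with made-up names and sizes, not the actual DES
kernel):

```cuda
__constant__ uint32_t des_sbox[8][64];  /* filled once via cudaMemcpyToSymbol() */

__global__ void des_encrypt_kernel(const uint64_t *in, uint64_t *out, int nblocks)
{
    /* Stage the SBoxes in shared memory: the constant cache is fast only
     * when all threads of a warp read the same address, but SBox lookups
     * are data-dependent and effectively random. */
    __shared__ uint32_t sbox[8][64];
    for (int i = threadIdx.x; i < 8 * 64; i += blockDim.x)
        ((uint32_t *)sbox)[i] = ((const uint32_t *)des_sbox)[i];
    __syncthreads();

    /* ... per-thread DES rounds using sbox[][] on in[]/out[] ... */
}
```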
> > - Why do the verbose-flags not work? When using -v (or -v -v) nothing
> > happens.
> Do you mean the verbose flag in the plugin configuration?
I mean the fact that there seems to be some way to pass verbosity to the
program on invocation. Right now, if I want verbose output, I set
"verbose = 2;" in e_cuda.c.
> > - Any idea why the 'ciphers' command of OpenSSL doesn't work?
> No, in my build of openssl it works correctly.
To clarify: I mean "openssl ciphers -engine cudamrg", which gives me
Error in cipher list
140292317238952:error:1410D0B9:SSL routines:SSL_CTX_set_cipher_list:no cipher match:ssl_lib.c:1282:
> I had already thought about the possibility of implementing this feature.
> In my opinion, to get real benefits, you must have two or more cards of
> comparable speed. At that point it would probably be more useful to
> drive the cards in SLI, which is also easier to manage.
Hm, I do have 8 identical cards available in the machine in the lab, but
seeing the performance my single el-cheapo card delivers, I'll reserve
the multi-GPU functionality for when I really need to crank out more
speed ;)
> I think it might be more interesting to implement encryption and
> decryption using multiple streams, but this is only my opinion.
I haven't looked at streams yet.
By the way, why do you have CPU-only modes for the engine? When I
compare my CUDA implementation with the CPU algorithms I simply omit
-engine cudamrg, and the original OpenSSL implementation takes over,
right? Why the hassle with these preprocessor conditionals?
Another thing I find rather strange is that running openssl with the
CUDA engine gives me 100% CPU load as well. Since the memory is pinned,
copying it shouldn't impact the CPU much, right? And the key schedule
doesn't take that long either. Any idea where that might come from?
It also worries me that nvidia-smi -a reports 0% GPU and 5% memory usage
while I'm running my kernels. Is the computation/memory-transfer ratio
really that low? The Visual Profiler suggests that the GPU is busy with
my kernel for at least 50% of the time, with the rest being memcpy. OK,
running "glxgears" has the same effect, i.e. none, but it's still
strange.
I think I'll be able to give you private access to my repo at some point
this week; the sooner the better. I saw your changes on the main repo
but haven't merged them yet, since they will require some manual
intervention.
Thanks for your answers so far,
greetings,
still no open repo from me, and for that I'm sorry. Right now I don't
feel like the repository is in a state to be released; there are a lot
of files and debug statements I still have to get rid of. Furthermore,
the algorithms are still unoptimized, working only for encryption and
only in ECB mode. But I'll try to keep you updated with this mail.
An overview over what works:
- DES ECB encryption
- Blowfish ECB encryption
- Camellia-128 ECB encryption
- IDEA ECB encryption
- CAST5 ECB encryption
Each of these is faster than my CPU by at least a factor of 2, most of
the time even more, though it heavily depends on which numbers you
trust. I do tests using 'openssl speed', which reports very fast
results, but also plain file-encryption tests with a 100 MB file, which
probably include some overhead and don't yield results quite as
impressive.
Most importantly: I have merged all your changes into my repo, so up
until now no work has been lost on either side. I would greatly
appreciate better commit messages, meaning that the first line should be
a short description (<= 50 chars), followed by an empty line and then a longer
description, wrapped to 80 chars. This is the git style for commit
messages. I rewrote your commit-messages and included some additional
info, like in r31, where you fixed bugs but also included AES
key-scheduling on the GPU, as far as I could tell.
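Concretely, the layout I mean (with a made-up subject line) is:

```
Fix AES key schedule and move it onto the GPU

Longer description of what was wrong and why the change is correct,
wrapped to 80 characters. Side changes (like the GPU key scheduling
that went into r31) belong here instead of staying undocumented.
```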
On 11/01/11 14:58, Paolo Margara wrote:
> Hi Johannes, congratulations on the achievements! Do you plan to
> support CBC as well (at least for decryption)?
Yes, I do, but that will be a while. I try to optimize each algorithm
_before_ I do decryption and CBC, since most of the code is copied
anyway, and doing CBC/decryption too early would mean I'd have to
re-copy-paste everything whenever I change one kernel.
About the whole initialization:
> What you say is true, but we cannot do otherwise; I would not worry
> about a few wasted bytes.
Yesterday I wrote unified init and finish functions for the engine.
What they do is allocate memory (buffer_size) on the device and host in
e_cuda.c and then pass these pointers to the single encrypt/decrypt
wrapper functions for each of the algorithms. So I can now test all the
algorithms without commenting anything out in between ;)
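Roughly, the unified functions look like this (a simplified sketch with
illustrative names and a made-up buffer_size; the actual identifiers in
e_cuda.c differ):

```c
#include <cuda_runtime.h>
#include <openssl/engine.h>

static const size_t buffer_size = 8 * 1024 * 1024;  /* made-up value */

static unsigned char *host_buf;  /* page-locked host staging buffer     */
static unsigned char *dev_buf;   /* device buffer shared by all ciphers */

static int cuda_init(ENGINE *e)
{
    /* One allocation for every algorithm, done once per engine start;
     * the pointers are then handed to each cipher's wrapper function. */
    if (cudaHostAlloc((void **)&host_buf, buffer_size,
                      cudaHostAllocDefault) != cudaSuccess)
        return 0;
    if (cudaMalloc((void **)&dev_buf, buffer_size) != cudaSuccess) {
        cudaFreeHost(host_buf);
        return 0;
    }
    return 1;
}

static int cuda_finish(ENGINE *e)
{
    cudaFree(dev_buf);
    cudaFreeHost(host_buf);
    return 1;
}
```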
About constant memory and pre-initialized variables: I am pretty sure
that constant memory (say, in des_cuda.cu) is only copied when I invoke
a kernel from that file. I base this on the observation that shared
memory usage matches only the amount specified in the corresponding .cu
file when I call a kernel like the CAST5 encryption; the Visual Profiler
says so.
> > By the way, why do you CPU-only modes for the engine? When I compare my
> > CUDA implementation with the CPU algorithms I simply omit the -engine
> > cudamrg, and the original OpenSSL implementation takes over, right? Why
> > the hassle with these preprocessor conditionals?
> What you say is true, but when I started to develop the engine code I
> used that option for testing purposes; currently it's quite useless.
I've gotten rid of all the CPU_ONLY statements for the algorithms I
implemented, but I haven't touched AES yet. This is because I consider
it "your code" and because it will make it easier for me to merge your
changes in the future ;)
I really hope I'll be able to give you a glimpse of my repo in the next
few weeks. If the code I produce should ever be released/included in the
original project, we would have to think about licensing. To implement
the algorithms I used a lot of code-snippets (like preprocessor macros,
variable names, etc) from the original OpenSSL. Basically, I copied
everything I needed from OpenSSL and then made it run on CUDA. After
that I optimized it with the profiler. I don't think it will be a
problem, but it should be kept in mind for the future.
Looking forward to questions and comments,
regards, Jojo