OpenSSL and CUDA

Miele Andrea

Nov 10, 2012, 5:38:43 AM

Dear all,
I am a PhD student at EPFL Lausanne and, some time ago, I implemented RSA 1024/2048 decryption on NVIDIA GPUs.
My software achieved quite high throughput when decryption involves a single private key or a few keys.
Unfortunately, the latency is not very low.
I would like to integrate my code into OpenSSL to allow GPU acceleration of RSA decryption.
The problem is that, to benefit from it, it must be possible to batch decryptions.
Assuming it is realistic that real SSL-based applications may receive thousands of handshake requests at once (could you shed some light on this?), would it be hard to allow batch decryption in SSL?
I am working on reducing the latency of my code to make it worthwhile to offload just a few decryptions to the GPU, but even if I succeed in that I would still need some batching facility...

Cheers,

Andrea

Andy Polyakov

Nov 10, 2012, 4:19:56 PM

> I am a PhD student at EPFL Lausanne and I implemented, some time ago,
> RSA 1024/2048 decryption on NVIDIA GPUs.
> My software achieved a quite high throughput when decryption involves a
> single private key or a few.
> The latency is not very low unfortunately.
> I would like to integrate my code in OpenSSL to allow GPU acceleration
> of RSA decryption.
> The problem is that to benefit from that, it should be possible to batch
> decryptions.

To minimize confusion, it's probably more appropriate to refer to the
operation as a "private key operation" or "sign" rather than "decryption".

> Provided that it is realistic to assume that real SSL based applications
> may have thousands of handshake requests at once (could you shed some
> light on this?), would it be hard to allow batch decryption in SSL?
> I am working on reducing the latency of my code to make it worthwhile to
> offload just a few decryptions to the GPU, but even if I succeed in that
> I would need some batching facility...

See the discussion at http://marc.info/?t=118825449500017&r=1&w=2.
Personally, I'm skeptical that it's feasible in the general SSL case, such as
a web server, in the sense that it would be hard [if even possible] to justify
the effort and additional complexity. It would probably be more
appropriate to target specific cases. DNSSEC comes to mind...
______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List opens...@openssl.org
Automated List Manager majo...@openssl.org

andm...@gmail.com

Nov 12, 2012, 10:49:16 AM

Thanks a lot for your reply, Andy.
I actually tried to come up with a proof-of-concept multi-threaded implementation.
The GPU is not used if the system load (measured through getloadavg) is below a certain threshold.
Otherwise, each thread puts the message on which the private key operation has to be performed into a buffer and, if the buffer is then full, runs the batch on the GPU.
If the buffer is not full, the thread sleeps for some time.
The first thread to wake up runs the batch on the GPU even if the buffer is not full.
It then wakes up the other threads.
What do you think about this approach?
Can you give more insight about developing the idea for DNSSEC?
How can I go about that?

Miele Andrea

Nov 14, 2012, 5:21:53 AM

Thanks a lot for your reply, Andy.
Some time ago I came up with a proof-of-concept multi-threaded implementation.
The GPU is not used if the system load (measured through getloadavg under Linux) is below a certain threshold.
Otherwise, each thread puts its message (on which the private key operation has to be performed) into a shared buffer.
If the buffer is full after inserting the message, the current thread runs the private key operation batch on the GPU.
If the buffer is not full, the thread sleeps for some time.
The first thread to wake up runs the batch on the GPU even if the buffer is not full.
The thread running the batch wakes up the others afterwards.
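
For readers who want to experiment with the scheme described above, here is a minimal sketch in Python rather than C/CUDA. It is not the author's code: the names BatchQueue, run_batch, capacity, max_wait and gpu_worthwhile are all hypothetical, and run_batch is just a stand-in for whatever call would launch the GPU batch. For simplicity the sketch holds the lock while the batch runs, so new requests queue up behind it.

```python
import os
import threading
import time


def gpu_worthwhile(threshold, load_fn=None):
    """Return True when the 1-minute load average reaches the threshold."""
    if load_fn is None:
        load_fn = os.getloadavg  # Unix only
    return load_fn()[0] >= threshold


class BatchQueue:
    """Collects messages from many threads and processes them as one batch."""

    def __init__(self, run_batch, capacity, max_wait):
        self._run_batch = run_batch  # stand-in for the GPU batch call
        self._capacity = capacity    # buffer size that triggers a flush
        self._max_wait = max_wait    # seconds a thread sleeps before flushing
        self._cond = threading.Condition()
        self._pending = []           # list of (message, result slot) pairs

    def _flush_locked(self):
        # Must be called with the condition's lock held.
        batch, self._pending = self._pending, []
        if not batch:
            return
        results = self._run_batch([message for message, _ in batch])
        for (_, slot), result in zip(batch, results):
            slot["result"] = result
        self._cond.notify_all()      # wake up the threads that were sleeping

    def submit(self, message):
        """Queue one message and block until its result is available."""
        slot = {}
        with self._cond:
            self._pending.append((message, slot))
            if len(self._pending) >= self._capacity:
                self._flush_locked()  # buffer full: run the batch right away
            deadline = time.monotonic() + self._max_wait
            while "result" not in slot:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    # First thread whose sleep expires runs a partial batch.
                    self._flush_locked()
                else:
                    self._cond.wait(remaining)
        return slot["result"]
```

A caller would check gpu_worthwhile(threshold) first and fall back to the ordinary CPU path when it returns False. The max_wait value is the knob that trades added latency for a fuller batch.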

Andy Polyakov

Nov 15, 2012, 11:40:58 AM

As I wrote in 2007 "I don't mean to discourage anybody from looking for
answer." It implies that I don't actually provide answers either. My
objective is discussion, not debunking or anything like that. I merely
attempt to provide additional perspective on the problem.

> Some time ago I came up with a proof of concept multi-threaded
> implementation.
> The GPU is not used if the system load (measured through getloadavg under
> Linux) is below a certain threshold.
> Otherwise each thread puts its message (on which the private key
> operation has to be performed) into a shared buffer.
> If the buffer is full after inserting the message, the current thread
> runs the private key operation batch on the GPU.
> If the buffer is not full it sleeps for some time.
> The first thread to wake up runs the batch on the GPU even if the buffer
> is not full.
> The thread running the batch wakes up the others afterwards.
> What do you think about this approach?

The problem here is that you're likely to end up in a time domain that
contradicts users' expectations. You didn't put a number on your
latency, but in the previous thread ~200ms was mentioned for [up to] 2048
operations. Note that a CPU takes ~1ms to perform one operation. As
you have no way of knowing whether or not you'll have 200 additional
requests (per core) within the next millisecond, you would probably prefer
to go for the 1ms operation, because you don't want to sleep for, say, 100ms
just to find out that the number of requests is not high enough and that
you'd have done better running on the CPU. In your example, let's say the
thread that woke up found that it's still the only one. Should it go for
the GPU and let the user wait an additional 200ms, or perform the operation
in 1ms? Of course, as you mentioned, this doesn't account for load. But even
then you would probably bet on the GPU only when the load is ... at least
200 (per core). The question then is whether users perceive a system under
such load as usable. You don't spend most of the time on encryption, you
spend it generating content, and a load of 200 would mean that a particular
user would experience it as 200 times slower than "normal." Well, normal
load might be 50...
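
The break-even argument above can be written out as a few lines of arithmetic. This is only a sketch using the rough figures quoted in this thread (~1ms per operation on a CPU core, ~200ms for a GPU batch of up to 2048 operations). It assumes, as a simplification, that a GPU batch costs the same 200ms however full it is. The function names are made up for illustration.

```python
CPU_MS_PER_OP = 1.0    # ~1 ms per private key operation on one core (from the thread)
GPU_BATCH_MS = 200.0   # ~200 ms per GPU batch (from the thread)
GPU_BATCH_MAX = 2048   # operations per GPU batch (from the thread)


def cpu_latency_ms(pending):
    """Time until the last of `pending` queued requests finishes on one core."""
    return pending * CPU_MS_PER_OP


def gpu_latency_ms(pending):
    """Time until all requests finish, assuming a flat cost per batch."""
    batches = -(-pending // GPU_BATCH_MAX)  # ceiling division
    return batches * GPU_BATCH_MS


def prefer_gpu(pending):
    """True when the GPU finishes the queued requests sooner than one core."""
    return gpu_latency_ms(pending) < cpu_latency_ms(pending)


for n in (1, 200, 201, 2048):
    print(n, cpu_latency_ms(n), gpu_latency_ms(n), prefer_gpu(n))
```

With these numbers a lone request costs 1ms on the CPU against 200ms on the GPU, and the GPU only comes out ahead once more than 200 requests per core are already waiting.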

> Can you give more insight about developing the idea for DNSSEC?
> How can I go about that?

Once again, I'm not claiming that I possess answers. DNSSEC is simply
the case where you know the number of operations to be performed in
*advance*. I.e. unlike SSL, you don't have to guess whether or not there
are 200 additional requests coming in the next moment.
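
To illustrate what knowing the count in advance buys you, here is a small sketch that splits a workload of known size into GPU batches with no waiting or guessing. The batch size of 2048 and the ~200ms per batch are the rough figures from this thread. The function names are hypothetical, and the sketch assumes that a partly filled last batch still costs a full 200ms.

```python
GPU_BATCH_MAX = 2048   # operations per GPU batch (figure from the thread)
GPU_BATCH_MS = 200.0   # ~200 ms per batch (figure from the thread)


def plan_batches(total_ops, batch_max=GPU_BATCH_MAX):
    """Split a known number of operations into full batches plus a remainder."""
    full, rest = divmod(total_ops, batch_max)
    return [batch_max] * full + ([rest] if rest else [])


def estimated_gpu_ms(total_ops):
    """Estimated total GPU time, assuming a flat cost per batch."""
    return len(plan_batches(total_ops)) * GPU_BATCH_MS


print(plan_batches(5000))      # sizes of the batches to submit
print(estimated_gpu_ms(5000))  # estimated total time in milliseconds
```

For 5000 signatures this gives three batches (2048, 2048 and 904), roughly 600ms under the stated assumptions, against about 5000ms at ~1ms per operation on a single core.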