Gmail Calendar Documents Reader Web more »
Recently Visited Groups | Help | Sign in
Google Groups Home
Montgomery multiplication on nVidia GPUs
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  9 messages - Expand all  -  Translate all to Translated (View all originals)
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
"Eben"  
View profile  
 More options Aug 27 2007, 6:46 pm
Newsgroups: mailing.openssl.dev
From: e...@podfun.com ("Eben")
Date: Tue, 28 Aug 2007 06:46:03 +0800 (CST)
Local: Mon, Aug 27 2007 6:46 pm
Subject: Montgomery multiplication on nVidia GPUs
I've put together some code to do parallel 512-bit montgomery
multiplication for nVidia GPUs. On my 8800GTX I get about 12k of these pe=
r
second in 2k batches, so enough to do 6k 1024-bit RSA private decrypts. I
guess this Christmas's generation of cards (nV9) should push past the 10k
mark. On this basis it looks like GPUs make pretty competitive cheap
crypto accelerators :)

I don't have the time or expertise to integrate this as a patch. Is there
anyone who would be interested in taking this on? A significant challenge
would be to provide a sensible interface for batch processing of requests=
;
the current interface doesn't look well suited to this task.

Cheers
Eben Upton
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-...@openssl.org
Automated List Manager                           majord...@openssl.org


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Eric Fritzges  
View profile  
 More options Aug 27 2007, 8:43 pm
Newsgroups: mailing.openssl.dev
From: efrit...@cisco.com (Eric Fritzges)
Date: Tue, 28 Aug 2007 08:43:39 +0800 (CST)
Local: Mon, Aug 27 2007 8:43 pm
Subject: Re: Montgomery multiplication on nVidia GPUs
Eben,
  I don't have nearly the resources to take it over, but I'd be very
curious to hear more about it!

Thanks,

-Eric Fritzges
efrit...@cisco.com

On 8/27/2007 5:55 PM Eben spoke thusly:

______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-...@openssl.org
Automated List Manager                           majord...@openssl.org

    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Andy Polyakov  
View profile  
 More options Aug 28 2007, 11:06 am
Newsgroups: mailing.openssl.dev
From: ap...@fy.chalmers.se (Andy Polyakov)
Date: Tue, 28 Aug 2007 23:06:24 +0800 (CST)
Local: Tues, Aug 28 2007 11:06 am
Subject: Re: Montgomery multiplication on nVidia GPUs

> I've put together some code to do parallel 512-bit montgomery
> multiplication for nVidia GPUs. On my 8800GTX I get about 12k of these per
> second in 2k batches,

What does "12k in 2k" mean? I mean given that 512 bits is 64 bytes does
2k mean that you process 32 vectors at once and 12k means that is takes
1/(12000/32) seconds to perform 32 private key operations?

> so enough to do 6k 1024-bit RSA private decrypts.

Why do you think that it will be 6k? Complexity is n^2 so it should be
12/4, not 12/2.

> On this basis it looks like GPUs make pretty competitive cheap
> crypto accelerators :)

> I don't have the time or expertise to integrate this as a patch. Is there
> anyone who would be interested in taking this on? A significant challenge
> would be to provide a sensible interface for batch processing of requests;
> the current interface doesn't look well suited to this task.

Trouble with crypto, especially with asymmetric, is that there is no
guaranteed data flow. I mean there is no guarantee that there will be 32
operations every 32/12 milliseconds and given lower average request rate
batching 32 requests will incur longer average service times you might
not be willing to tolerate: when a request is posed it's expected to be
served instantly, not wait till 31 extra ones are posed. So you'd have
to be adaptive, i.e. grab as many requests as currently available and
process sometimes 1, sometimes 5, 2, 15, 9, etc. operations at once. Now
what if average load is less than 12000/32 requests per second? Then
you'd go with single request at a time practically all the time. But
single operation performance is *far* from impressive. So you'd have to
play even smarter, grab currently outstanding requests and decide if it
would pay off to pack them together and send to GPU or just process them
on CPU. Such dispatching per se implies certain requirements on
application design (as it would be hardly appropriate to delegate this
role to say OpenSSL engine), so it's likely to be impossible to "just
use GPU with arbitrary application."

Please note that my remark does not mean that I'm condemning the idea.
It only means that this class of problems is more complicated than
commonly presented. General purpose computing on graphic processor is
compelling idea, but normally you'd have to have guaranteed data flow
(say XGB of seismic data to process in shortest amount of minutes) to
justify the effort. A.
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-...@openssl.org
Automated List Manager                           majord...@openssl.org


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
"Eben"  
View profile  
 More options Aug 28 2007, 6:53 pm
Newsgroups: mailing.openssl.dev
From: e...@podfun.com ("Eben")
Date: Wed, 29 Aug 2007 06:53:50 +0800 (CST)
Local: Tues, Aug 28 2007 6:53 pm
Subject: Re: Montgomery multiplication on nVidia GPUs

>> I've put together some code to do parallel 512-bit montgomery
>> multiplication for nVidia GPUs. On my 8800GTX I get about 12k of these
>> per second in 2k batches,
>What does "12k in 2k" mean? I mean given that 512 bits is 64 bytes does =
2k
>mean that you process 32 vectors at once and 12k means that is takes
>1/(12000/32) seconds to perform 32 private key operations?

Worse than that. I perform montgomery exponentiation (not multiplication =
-
my bad) in batches of 2048 operations at a time, with each batch taking
approximately 160ms. That gives an average latency of 240ms for a fully
loaded system which processes one batch whilst building another.

>> so enough to do 6k 1024-bit RSA private decrypts.
>Why do you think that it will be 6k? Complexity is n2 so it should be

12/4, >not 12/2.

If you have the 512-bit factors p and q of the 1024-bit private key, the
majority of the time in the openssl implementation seems to go into a pai=
r
of 512-bit mont exp operations for each private decrypt, followed by a bi=
t
of CRT to bang them together. Obviously if you don't have p and q you nee=
d
a single 1024-bit exponentiation, which you would expect to be roughly
twice as costly, and may run into memory limitations on the GPU. I haven'=
t
tackled that case.

> On this basis it looks like GPUs make pretty competitive cheap
> crypto accelerators :)

> I don't have the time or expertise to integrate this as a patch. Is the=
re
> anyone who would be interested in taking this on? A significant challen=
ge
> would be to provide a sensible interface for batch processing of reques=
ts;
> the current interface doesn't look well suited to this task.
> Trouble with crypto, especially with asymmetric, is that there is no

guaranteed data flow. I mean there is no guarantee that there will be 32
operations every 32/12 milliseconds and given lower average request rate
batching 32 requests will incur longer average service times you might
not be willing to tolerate: when a request is posed it's expected to be
served instantly, not wait till 31 extra ones are posed. So you'd have
to be adaptive, i.e. grab as many requests as currently available and
process sometimes 1, sometimes 5, 2, 15, 9, etc. operations at once. Now
what if average load is less than 12000/32 requests per second? Then
you'd go with single request at a time practically all the time. But
single operation performance is *far* from impressive. So you'd have to
play even smarter, grab currently outstanding requests and decide if it
would pay off to pack them together and send to GPU or just process them
on CPU. Such dispatching per se implies certain requirements on
application design (as it would be hardly appropriate to delegate this
role to say OpenSSL engine), so it's likely to be impossible to "just
use GPU with arbitrary application."

I agree with most of that. However, based on benchmarks on my desktop (a
Core 2 Duo E6400) the 32-bit x86 assembler mont exp implementation in
OpenSSL seems a _lot_ slower than my GPU. There are plenty of application=
s
(web serving for example) which can tolerate 240ms of latency in return
for a fivefold increase in throughput.

I suppose my question is whether anyone has considered providing an
alternative, asynchronous, interface to the OpenSSL crypto libraries. The
current API is not a suitable abstraction for a crypto device which
exhibits latency comparable to or greater than its execution time, unless
you're prepared to fire off a lot of threads in the client. Either batche=
d
submission or callbacks would address this.

Cheers
Eben
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-...@openssl.org
Automated List Manager                           majord...@openssl.org


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Andy Polyakov  
View profile  
 More options Aug 29 2007, 3:26 am
Newsgroups: mailing.openssl.dev
From: ap...@fy.chalmers.se (Andy Polyakov)
Date: Wed, 29 Aug 2007 15:26:46 +0800 (CST)
Local: Wed, Aug 29 2007 3:26 am
Subject: Re: Montgomery multiplication on nVidia GPUs

> I agree with most of that. However, based on benchmarks on my desktop (a
> Core 2 Duo E6400) the 32-bit x86 assembler mont exp implementation in
> OpenSSL seems a _lot_ slower than my GPU.

Of course your CPU is a lot slower to perform 2048 signs, but it's a lot
faster to perform one. I mean if you simply don't get more than 1 sign
request within 240ms and if you insist on always using GPU, you'd have
to ask it to perform 1 real and 2047 bogus signs. And so you'll have GPU
spending 240ms on one sign instead of having CPU spending portion of
millisecond. Resulting in ... lower throughput.

> I suppose my question is whether anyone has considered providing an
> alternative, asynchronous, interface to the OpenSSL crypto libraries. The
> current API is not a suitable abstraction for a crypto device which
> exhibits latency comparable to or greater than its execution time, unless
> you're prepared to fire off a lot of threads in the client. Either batched
> submission or callbacks would address this.

I understand the question and once again I don't mean to discourage
anybody from looking for the answer. I'm just saying that solution is
likely to be more complex than anticipated and that outcome might be
limited to specific kind of applications (e.g. only multi-threaded ones,
as opposite to "fork-on-accept"). A.
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-...@openssl.org
Automated List Manager                           majord...@openssl.org

    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
"Eben"  
View profile  
 More options Aug 29 2007, 4:05 am
Newsgroups: mailing.openssl.dev
From: e...@podfun.com ("Eben")
Date: Wed, 29 Aug 2007 16:05:41 +0800 (CST)
Local: Wed, Aug 29 2007 4:05 am
Subject: Re: Montgomery multiplication on nVidia GPUs
> Of course your CPU is a lot slower to perform 2048 signs, but it's a lo=

t
faster to perform one. I mean if you simply don't get more than 1 sign
request within 240ms and if you insist on always using GPU, you'd have
to ask it to perform 1 real and 2047 bogus signs. And so you'll have GPU
spending 240ms on one sign instead of having CPU spending portion of
millisecond. Resulting in ... lower throughput.

Absolutely, although I'm not convinced you actually get lower throughput,
just lousy latency; remember, requests can pile up during your 240ms, so
the next batch is fuller. As you say, a good solution would transition
from CPU to GPU at (or near) the point of CPU exhaustion; this sounds
fairly simple to implement given an asynchronous API.

The obvious application is web serving, which I assume (perhaps naively)
accounts for a substantial majority of private decrypts performed using
openssl, is fairly tolerant of latency, and potentially very hungry for
throughput. People buy SSL accelerator cards for a reason, right?

Cheers
Eben
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-...@openssl.org
Automated List Manager                           majord...@openssl.org


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Chris Rapier  
View profile  
 More options Aug 29 2007, 10:12 am
Newsgroups: mailing.openssl.dev
From: rap...@psc.edu (Chris Rapier)
Date: Wed, 29 Aug 2007 22:12:50 +0800 (CST)
Local: Wed, Aug 29 2007 10:12 am
Subject: Re: Montgomery multiplication on nVidia GPUs

> The obvious application is web serving, which I assume (perhaps naively)
> accounts for a substantial majority of private decrypts performed using
> openssl

While this data isn't applicable to the internet as a whole it may
provide you with some insight into this.

http://netflow.internet2.edu/weekly/20070820/

This is netflow data for the I2 networks (used by a large number of
universities, research institutions, government and state agencies and
so forth). The encrypted traffic breakdown is probably the most
interesting to you. I must warn you that the application identification
is relatively naive - it assumes traffic on well known ports is what
you'd expect it to be. This doesn't always match up - I can say with
some confidence that the 2nd fastest non-measurement data flow listed
(400Mb/s between Switch in Bern, CH and PSC in Pittsburgh, PA) is an ssh
flow.

______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-...@openssl.org
Automated List Manager                           majord...@openssl.org


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Andy Polyakov  
View profile  
 More options Aug 29 2007, 10:59 am
Newsgroups: mailing.openssl.dev
From: ap...@fy.chalmers.se (Andy Polyakov)
Date: Wed, 29 Aug 2007 22:59:23 +0800 (CST)
Local: Wed, Aug 29 2007 10:59 am
Subject: Re: Montgomery multiplication on nVidia GPUs

>> The obvious application is web serving, which I assume (perhaps naively)
>> accounts for a substantial majority of private decrypts performed using
>> openssl

> While this data isn't applicable to the internet as a whole it may
> provide you with some insight into this.

> http://netflow.internet2.edu/weekly/20070820/

Given this context I have to admit that I have effectively misused
"throughput" term in my posts. I should have written "amount of private
key operation requests per time unit" and speculate about its effect on
overall service time becoming unnecessary high at low request rate.

Moving back to the original context the relevant data would be amount of
connections being established and *among them* amount of connections
triggering private key operations, not summed up TCP throughput or even
  packet size distribution. As for "amount of connections triggering
private key operations." In SSH context every connection established
triggers one, but connection rate is rather low as connections are
commonly held for relatively long times. In HTTPS context connection
rate can be high, but not every connection triggers private key
operation thanks to session context being reused for connections to same
server. A.
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-...@openssl.org
Automated List Manager                           majord...@openssl.org


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Chris Rapier  
View profile  
 More options Aug 29 2007, 11:16 am
Newsgroups: mailing.openssl.dev
From: rap...@psc.edu (Chris Rapier)
Date: Wed, 29 Aug 2007 23:16:57 +0800 (CST)
Local: Wed, Aug 29 2007 11:16 am
Subject: Re: Montgomery multiplication on nVidia GPUs

Andy Polyakov wrote:
>> http://netflow.internet2.edu/weekly/20070820/

> Given this context I have to admit that I have effectively misused
> "throughput" term in my posts. I should have written "amount of private
> key operation requests per time unit" and speculate about its effect on
> overall service time becoming unnecessary high at low request rate.

Oh I understood the context you meant. Basically I was just trying to
show, for at least this segment of the network, how much traffic was
being handled by which application. Obviously, as you point out, this is
an imperfect way to estimate things. It wasn't really meant to be
anything more than a data point as I'm not involved enough to weigh in
on one side or another on the issue.

> Moving back to the original context the relevant data would be amount of
> connections being established and *among them* amount of connections
> triggering private key operations, not summed up TCP throughput or even
>  packet size distribution.

The data on the number of connections being made is likely being
captured as well. I could ask to make it available.

> private key operations." In SSH context every connection established
> triggers one, but connection rate is rather low as connections are
> commonly held for relatively long times.   In HTTPS context connection
> rate can be high, but not every connection triggers private key
> operation thanks to session context being reused for connections to same
> server. A.

______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-...@openssl.org
Automated List Manager                           majord...@openssl.org

    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
End of messages
« Back to Discussions « Newer topic     Older topic »

Create a group - Google Groups - Google Home - Terms of Service - Privacy Policy
©2009 Google