LTC + ECC + VROOM == v1.12


tomst...@gmail.com

May 24, 2006, 11:36:00 PM
I've updated LTC to sport a variable (compile-time) fixed point ECC
point multiplier. It uses an LRU-like scheme to cache recently used
base points (e.g. the NIST point or your public key) and uses FP wherever
possible. I should point out my ECC-224 scores tie with DJB's
"non-portable" Pentium-M optimized timings :-) [sorry, I had to ...]

The code has passed all local testing (NIST + random + tv_gen against
older copies). This code will speed up keygen, encrypt, sign and
verify. Decrypt uses random base points so it won't be sped up.

I still have to test it more and make sure it can use more than just
one window width (currently set to 8).

There are improvements to be had: currently there is one mutex (which by
default evaluates to a NOP) around the ptmul function, which means no
threading. Ideally it should only lock entries in the cache. I'm
working on that for v1.13.

A window of 8 means 256 points in the cache per base point (only
allocated if the entry in the cache is populated) which is by default
kinda wasteful. TFM by default uses 4096 bits per bignum [512 bytes]
so that's ~3x520x256, or about 400KB of cache per base point. Decent
for a server or desktop. For smaller devices you probably want a 4-6
bit window (25KB and 100KB respectively). Also, tuning TFM to use
smaller bignums would save memory. Using 1280-bit bignums reduces the
memory to 135KB for the 8-bit window and 34KB for the 6-bit window.

If you're only doing [say] ECC-224 then you can trim things further.
512-bit bignums are all you need, making an 8-bit window take 62KB and a
6-bit window 16KB.
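
A quick back-of-the-envelope sketch of that sizing arithmetic. The
per-bignum byte counts are rough stand-ins (exact header overhead
depends on your TFM build); this is not LTC code:

#include <stdio.h>

/* cache bytes per base point: 3 Jacobian coordinates x 2^w points
   x bytes per bignum (approximate) */
static unsigned long fp_cache_bytes(int w, unsigned long bignum_bytes)
{
    return 3UL * (1UL << w) * bignum_bytes;
}

int main(void)
{
    printf("w=8, 4096-bit bignums: ~%lu KB\n", fp_cache_bytes(8, 520) / 1024);
    printf("w=8,  512-bit bignums: ~%lu KB\n", fp_cache_bytes(8, 80) / 1024);
    printf("w=6,  512-bit bignums: ~%lu KB\n", fp_cache_bytes(6, 80) / 1024);
    return 0;
}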

I'm adding the ability to save/restore the state of the cache so you
can train it (e.g. for embedded devices).

ECC + TFM from v1.12 of LTC on my Opteron [885] box:
ECC-192 make_key took 525065 cycles
ECC-224 make_key took 656215 cycles
ECC-256 make_key took 808368 cycles
ECC-384 make_key took 1596507 cycles
ECC-520 make_key took 2871832 cycles

from an earlier version:

LTC 1.11 + TFM
ECC-192 make_key took 1626431 cycles
ECC-224 make_key took 2035252 cycles
ECC-256 make_key took 2599250 cycles
ECC-384 make_key took 5383834 cycles
ECC-520 make_key took 9427837 cycles

LTC 1.11 + GMP
ECC-192 make_key took 2740705 cycles
ECC-224 make_key took 3838721 cycles
ECC-256 make_key took 4399707 cycles
ECC-384 make_key took 8704742 cycles
ECC-520 make_key took 17555395 cycles

Tom

tomst...@gmail.com

May 25, 2006, 8:29:32 PM
tomst...@gmail.com wrote:
> A window of 8 means 256 points in the cache per base point (only
> allocated if the entry in the cache is populated) which is by default
> kinda wasteful. TFM by default uses 4096 bits per bignum [512 bytes]
> so that's ~3x520x256, or about 400KB of cache per base point. Decent
> ECC + TFM from v1.12 of LTC on my Opteron [885] box:
> ECC-192 make_key took 525065 cycles
> ECC-224 make_key took 656215 cycles
> ECC-256 make_key took 808368 cycles
> ECC-384 make_key took 1596507 cycles
> ECC-520 make_key took 2871832 cycles

Using a 12 bit window I got

ECC-192 make_key took 391225 cycles
ECC-224 make_key took 494494 cycles
ECC-256 make_key took 611546 cycles
ECC-384 make_key took 1164086 cycles
ECC-520 make_key took 2065745 cycles

By default that takes 3x520x4096 = 6240KB of memory per base point.
If I tuned it for, say, ECC-224, I'd use 3x80x4096 = 960KB. Still usable
by desktops and some embedded devices (which can afford the temporary
use of memory).

At 2.6GHz this is 5257 point muls/sec, which means you could get close
to 5K signatures/sec, or about 0.2ms per signature :-)

Tom

Phil Carmody

May 26, 2006, 2:45:14 AM
tomst...@gmail.com writes:


Nice work, Tom.
The memory usage does look a bit high; can you give a breakdown of
what that footprint covers?
Cheers,
Phil
--
The man who is always worrying about whether or not his soul would be
damned generally has a soul that isn't worth a damn.
-- Oliver Wendell Holmes, Sr. (1809-1894), American physician and writer

tomst...@gmail.com

May 26, 2006, 4:36:14 AM
Phil Carmody wrote:
> Nice work, Tom.
> The memory usage does look a bit high; can you give a breakdown of
> what that footprint covers?

Sure.

The fixed point algorithm works by creating a pattern over the
multiplicand, spread out by a fixed number of bits. If your multiplicand
is k bits and you are using a w-bit "window", think of it as splitting
the multiplicand into w words of k/w bits each. Say we are computing T = rG.

So you split r into w words. The algorithm works by computing T = 2T +
pG, where p is the k-bit pattern found by taking the h'th bit of
each of the w words (for h = k/w - 1 down to 0).

E.g. if r = 1011 0111 0001 1100 in binary and k=16, w=4

You'd first compute T = pG where p = 0001 0000 0000 0001.

Next, double T [now it's equal to 0010 0000 0000 0010G] and add pG
where p is now 0000 0001 0000 0001.

Again, double T [now it's equal to 0100 0010 0000 0110G] and add pG
where p is now 0001 0001 0000 0000.

Double T, [now it's equal to 1010 0110 0000 1100G] and add pG where p
is now 0001 0001 0001 0000. You now have T = 1011 0111 0001 1100G with
only 3 doubles and 3 adds.
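
To make that concrete in code, here's a toy sketch with plain integers
standing in for curve points, so "double" is *2 and "add pG" is + p*G.
It reproduces the walk above and checks the answer at the end:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const int w = 4, d = 4;       /* w words of d = k/w bits, k = 16 */
    uint32_t r = 0xB71C;          /* 1011 0111 0001 1100 */
    uint32_t G = 7;               /* any stand-in for the base point */
    uint32_t T = 0;

    for (int h = d - 1; h >= 0; h--) {
        uint32_t p = 0;
        /* gather the h'th bit of each of the w words into the pattern */
        for (int j = 0; j < w; j++)
            p |= ((r >> (j * d + h)) & 1U) << (j * d);
        T = 2 * T + p * G;        /* T = 2T + pG */
    }
    printf("%s\n", T == r * G ? "ok" : "bug");   /* prints ok */
    return 0;
}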

The table basically has the pattern

000X 000Y 000Z 000W*G

for all 16 patterns of XYZW. So it has 16 entries [2^w entries]. Each
entry [in LTC] is a Jacobian projective point, so it has three components
[x,y,z]. So that's 3*2^w for "size".

Each component is a bignum so you have a used_count + sign + contents
or 8+FP_SIZE/8. By default FP_SIZE is 4096 so that's 8+512 = 520
bytes. So that's 3*2^w*520.

FP_SIZE is the max size for a number in TFM [e.g. twice the size of
your max multiplication inputs]. So if we KNOW we are doing ECC-224 we
can set it to 512 bits [nice multiple of 64 and over twice as large].
That's 8+512/8 = 72. So that's 3*2^w*72.

So for ECC-224 we have

w=4 => 3456 bytes
w=6 => 13824 bytes
w=8 => 55296 bytes
w=12 => 884736 bytes

There is a bit more overhead (a trivial amount, <1KB) for pointers, but
that's the gist of it. As for speed, w=6 is faster than the previous
method [haven't timed w=4], w=8 is definitely faster and w=12 is crazy
fast.

w=4 gets
ECC-192 make_key took 810932 cycles
ECC-224 make_key took 1020540 cycles
ECC-256 make_key took 1281894 cycles
ECC-384 make_key took 2626950 cycles
ECC-520 make_key took 4775715 cycles

w=6 gets
ECC-192 make_key took 623910 cycles
ECC-224 make_key took 793165 cycles
ECC-256 make_key took 968627 cycles
ECC-384 make_key took 1949808 cycles
ECC-520 make_key took 3517040 cycles

And we know what w=8 and w=12 get :-)

Even at w=4 we're still about 2x faster than previous LTC
releases, and the memory usage is trivial at 4KB or so.

Tom

Phil Carmody

May 26, 2006, 6:17:12 AM
tomst...@gmail.com writes:
> Phil Carmody wrote:
> > Nice work, Tom.
> > The memory usage does look a bit high; can you give a breakdown of
> > what that footprint covers?
>
> Sure.
>
> The fixed point algorithm works by creating a pattern over the
> multiplicand, spread out by a fixed number of bits. If your multiplicand
> is k bits and you are using a w-bit "window", think of it as splitting
> the multiplicand into w words of k/w bits each. Say we are computing T = rG.
>
> So you split r into w words. The algorithm works by computing T = 2T +
> pG, where p is the k-bit pattern found by taking the h'th bit of
> each of the w words (for h = k/w - 1 down to 0).

So this is more like Pippenger's algorithm, rather than standard
windowed exponentiation? Sounds good.

> E.g. if r = 1011 0111 0001 1100 in binary and k=16, w=4
>
> You'd first compute T = pG where p = 0001 0000 0000 0001.
>
> Next, double T [now it's equal to 0010 0000 0000 0010G] and add pG
> where p is now 0000 0001 0000 0001.
>
> Again, double T [now it's equal to 0100 0010 0000 0110G] and add pG
> where p is now 0001 0001 0000 0000.
>
> Double T, [now it's equal to 1010 0110 0000 1100G] and add pG where p
> is now 0001 0001 0001 0000. You now have T = 1011 0111 0001 1100G with
> only 3 doubles and 3 adds.

OK, yup, that's perfectly sensible.

> The table basically has the pattern
>
> 000X 000Y 000Z 000W*G
>
> for all 16 patterns of XYZW. So it has 16 entries [2^w entries]. Each
> entry [in LTC] is a Jacobian projective point, so it has three components
> [x,y,z]. So that's 3*2^w for "size".

Ack.



> Each component is a bignum so you have a used_count + sign + contents
> or 8+FP_SIZE/8. By default FP_SIZE is 4096 so that's 8+512 = 520
> bytes. So that's 3*2^w*520.

Ack.



> FP_SIZE is the max size for a number in TFM [e.g. twice the size of
> your max multiplication inputs]. So if we KNOW we are doing ECC-224 we
> can set it to 512 bits [nice multiple of 64 and over twice as large].
> That's 8+512/8 = 72. So that's 3*2^w*72.

Ack.

> So for ECC-224 we have
>
> w=4 => 3456 bytes
> w=6 => 13824 bytes
> w=8 => 55296 bytes
> w=12 => 884736 bytes
>
> There is a bit more overhead (a trivial amount, <1KB) for pointers, but
> that's the gist of it. As for speed, w=6 is faster than the previous
> method [haven't timed w=4], w=8 is definitely faster and w=12 is crazy
> fast.

[SNIP times]


> Even at w=4 we're still about 2x faster than previous LTC
> releases, and the memory usage is trivial at 4KB or so.

Does your implementation always
a) allocate memory for
b) calculate values for
all 2^w intermediates? You'd never use them all for large w.
And if you don't use them, you don't even need to allocate memory
for them if you're prepared to add an extra test and dereference.
I doubt it would change the speed much (tiny pluses and minuses
would probably cancel), but it would cap the memory usage figures
to less extreme values at high w, perhaps?
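
Something like this lazy scheme, as a sketch (hypothetical names and
layout, not LTC's actual structures):

#include <stdlib.h>

#define W 12                      /* window width */

typedef struct {
    int dummy;                    /* stand-in for the x,y,z bignums */
} fp_point;

static fp_point *table[1 << W];   /* 2^W pointers, mostly NULL at high W */

static fp_point *get_entry(unsigned p)
{
    if (table[p] == NULL) {       /* the extra test + dereference */
        table[p] = calloc(1, sizeof *table[p]);
        /* ... compute pattern p's point on first use ... */
    }
    return table[p];
}

int main(void)
{
    fp_point *e = get_entry(0x123);   /* allocated on first use */
    return e == NULL;
}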

I guess I should just download it and have a peek myself!

tomst...@gmail.com

May 27, 2006, 9:43:09 AM
Phil Carmody wrote:
> > So you split r into w words. The algorithm works by computing T = 2T +
> > pG, where p is the k-bit pattern found by taking the h'th bit of
> > each of the w words (for h = k/w - 1 down to 0).
>
> So this is more like Pippenger's algorithm, rather than standard
> windowed exponentiation? Sounds good.

Perhaps. I heard it was called [in the ECC world] "fixed point", which
makes sense since you're inserting bits of the multiplicand at "fixed
points" [e.g. spaced k/w bits apart].

> Does your implementation always
> a) allocate memory for
> b) calculate values for
> all 2^w intermediates? You'd never use them all for large w.
> And if you don't use them, you don't even need to allocate memory
> for them if you're prepared to add an extra test and dereference.
> I doubt it would change the speed much (tiny pluses and minuses
> would probably cancel), but it would cap the memory usage figures
> to less extreme values at high w, perhaps?

The problem is that which values inside the table get used depends on the
multiplicand. Computing them as required leaks a good deal of timing
information. You would have to pre-compute random entries, so that an
attacker may know that pre-comp work occurred but not for which values.

Beyond tuning TFM there are other ways to save memory. The
co-ordinates could be mapped back to Affine space [z=1] and the
Jacobian routines would still work. That would give you 2^{w+1}
bignums [e.g. 40KB for ECC-224 @ w=8].
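
(Concretely, with the 72-byte ECC-224 bignums from the breakdown
above: 2 x 2^8 x 72 = 36864 bytes, i.e. roughly that 40KB figure.)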

> I guess I should just download it and have a peek myself!

Hehehehe, LTC 1.12 is the first release to sport this and it's still a
couple weeks out. I actually have to cram on the LTM book when I get
back home on Monday so that delays it a bit.

Tom

Phil Carmody

May 27, 2006, 10:52:19 AM
tomst...@gmail.com writes:
> > I guess I should just download it and have a peek myself!
>
> Hehehehe, LTC 1.12 is the first release to sport this and it's still a
> couple weeks out. I actually have to cram on the LTM book when I get
> back home on Monday so that delays it a bit.

In your own time - keep up the good work.

tomst...@gmail.com

May 27, 2006, 12:58:17 PM

Phil Carmody wrote:
> tomst...@gmail.com writes:
> > > I guess I should just download it and have a peek myself!
> >
> > Hehehehe, LTC 1.12 is the first release to sport this and it's still a
> > couple weeks out. I actually have to cram on the LTM book when I get
> > back home on Monday so that delays it a bit.
>
> In your own time - keep up the good work.

Yeah I have some hacks in mind besides space savings.

The way I presented it here [and implemented it so far], the routine
processes the entire multiplicand at once. There is no reason why two
threads couldn't do halves of the multiplicand and add their products.
The tables stay the same; the upper/lower half of the bits just
disappear [they go to zero].
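
A minimal sketch of that split, with plain integers standing in for
points [only the bookkeeping, no threads or real ECC here]:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t r = 0xB71C, G = 7;   /* arbitrary stand-ins */
    uint32_t lo = r & 0x00FF;     /* lower half, upper bits zeroed */
    uint32_t hi = r & 0xFF00;     /* upper half, lower bits zeroed */
    /* each half-product walks the same table; the zeroed bits just
       select empty patterns.  Run the halves on separate threads and
       add the two partial products at the end: */
    printf("%s\n", hi * G + lo * G == r * G ? "ok" : "bug");
    return 0;
}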

So the 391K cycles I got for ECC-192 could become something as low as
200K cycles on dual-core processors which would rock.

Tom

Phil Carmody

May 28, 2006, 6:03:15 AM

I'd be surprised if its Amdahl coefficient was quite that high.
But I would be happy to be surprised!

tomst...@gmail.com

May 28, 2006, 11:53:32 AM
Phil Carmody wrote:
> > So the 391K cycles I got for ECC-192 could become something as low as
> > 200K cycles on dual-core processors which would rock.
>
> I'd be surprised if its Amdahl coefficient was quite that high.
> But I would be happy to be surprised!

In the case of my dual-threaded RSA plugin for LTC the speedup was
close to linear. The trick is to have the threads spun up already;
that way you avoid the thread-creation overhead.

As for runtime, they're essentially two very isolated threads. Aside
from cache-sharing issues they never talk until the end, where you
have to add up the results.

Tom

D. J. Bernstein

May 28, 2006, 2:13:00 PM
Anyone who cares about speed will be reusing one Diffie-Hellman secret
key to compute many shared secrets. This means that one fixed-point
scalar multiplication is accompanied by a huge number of variable-point
scalar multiplications. That's why

http://cr.yp.to/ecdh/reports.html

lists only variable-point timings, not fixed-point timings.

Tom now reports 656215 Athlon-64 cycles for fixed-point scalar
multiplication on the NIST P-224 curve. That's a fairly serious step
backwards from the NIST P-224 code that I published five years ago,
which took 595537 Athlon cycles for variable-point scalar multiplication
on the same curve. I'm not quite sure what Tom means by ``tie'' here.

Meanwhile, the state of the art has moved on to better choices of
curves, specifically Curve25519, achieving considerably higher security
for the same speed. See http://cr.yp.to/ecdh.html for details.

---D. J. Bernstein, Professor, Mathematics, Statistics,
and Computer Science, University of Illinois at Chicago

tomst...@gmail.com

May 28, 2006, 2:23:37 PM
D. J. Bernstein wrote:
> Anyone who cares about speed will be reusing one Diffie-Hellman secret
> key to compute many shared secrets. This means that one fixed-point
> scalar multiplication is accompanied by a huge number of variable-point
> scalar multiplications. That's why

Again wrong. Signatures are a big part of many embedded and server
applications. Things like remote sensors which need to authenticate
data could now use ECC instead of just a MAC with some shared key.
Servers running SSL [or SSL-like protocols] could make use of this in
key exchange and setup.

Also, in some embedded cases you can make things asymmetric, where you
do encryption on the low-power device and decryption on the more
powerful one [e.g. a cell phone or video game console].

> Tom now reports 656215 Athlon-64 cycles for fixed-point scalar
> multiplication on the NIST P-224 curve. That's a fairly serious step
> backwards from the NIST P-224 code that I published five years ago,
> which took 595537 Athlon cycles for variable-point scalar multiplication
> on the same curve. I'm not quite sure what Tom means by ``tie'' here.

656K when using 62K of memory. If I bump that up to just under 1MB I
get 494K cycles. That's plenty reasonable for a large segment of
crypto users. Furthermore, at least for server markets I haven't even
threaded this yet. I'm reasonably confident I can hit around 265K
cycles for ECC-224 with a dual-threaded approach.

> Meanwhile, the state of the art has moved on to better choices of
> curves, specifically Curve25519, achieving considerably higher security
> for the same speed. See http://cr.yp.to/ecdh.html for details.

But it's not a standard. What's worse, it's not even portable. You
sacrificed usefulness to get speed; I'm merely trading off memory to
get speed. My fixed point code is entirely portable, agnostic even to
the bignum library you are using, and can achieve these results. On my
Pentium M laptop, for instance, I saw nearly the same improvement ratio
even though I moved from 64-bit to 32-bit mode [and a different TFM
configuration too].

Tom

Mike Scott

May 28, 2006, 4:23:51 PM
tomst...@gmail.com wrote:

> D. J. Bernstein wrote:
>> Tom now reports 656215 Athlon-64 cycles for fixed-point scalar
>> multiplication on the NIST P-224 curve. That's a fairly serious step
>> backwards from the NIST P-224 code that I published five years ago,
>> which took 595537 Athlon cycles for variable-point scalar multiplication
>> on the same curve. I'm not quite sure what Tom means by ``tie'' here.
>
> 656K when using 62K of memory. If I bump that up to just under 1MB I
> get 494K cycles. That's plenty reasonable for a large segment of
> crypto users. Furthermore, at least for server markets I haven't even
> threaded this yet. I'm reasonably confident I can hit around 265K
> cycles for ECC-224 with a dual-threaded approach.
>

Wait a minute. It's clearly "not fair" (if not grossly misleading!) to
compare fixed-point with variable-point multiplication.

Sure, signatures will be faster using a fixed point algorithm. In fact I
can do them in virtually zero cycles: first I do an off-line point
multiplication, then I get to see the message to be signed, and then I
apply the signature almost immediately.

I am sure that people read sci.crypt to be enlightened, not to be
misled. I can understand that people might get enthusiastic about their
own work, which is fair enough, but let's compare like with like.


Mike


> Tom
>

tomst...@gmail.com

May 29, 2006, 12:28:07 AM
Mike Scott wrote:
> Wait a minute. It's clearly "not fair" (if not grossly misleading!) to
> compare fixed-point with variable-point multiplication.

Well, he can post fixed point timings then. My comment was really
meant to be just off the cuff. I don't care to strike up another
useless conversation with DJB, simply because they go nowhere fast and
do more public harm than good.

It's fair for him to belittle my RSA endeavours because his
non-portable ECC code is "faster", but I can't post an obvious
rhetorical comment about my ECC code?

> Sure, signatures will be faster using a fixed point algorithm. In fact I
> can do them in virtually zero cycles: first I do an off-line point
> multiplication, then I get to see the message to be signed, and then I
> apply the signature almost immediately.

That's a clearly contrived argument that has little merit. There are
many cases where you are applying signatures on a rapid basis or don't
want to have pre-computed parameters lying around. So the times for
encrypt, sign and verify can and do matter.

And still, you have to compute [or precompute] the values at some
point. So in the grand scheme of things you still have a very moot
point.

> I am sure that people read sci.crypt to be enlightened, not to be
> misled. I can understand that people might get enthusiastic about their
> own work, which is fair enough, but let's compare like with like.

This is obviously a pointless conversation. Clearly, DJB is just a god,
so he must be right. In his world all food is apples, I guess.

Tom

tomst...@gmail.com

Jun 9, 2006, 6:27:52 PM
tomst...@gmail.com wrote:
> tomst...@gmail.com wrote:
> > A window of 8 means 256 points in the cache per base point (only
> > allocated if the entry in the cache is populated) which is by default
> > kinda wasteful. TFM by default uses 4096 bits per bignum [512 bytes]
> > so that's ~3x520x256, or about 400KB of cache per base point. Decent
> > ECC + TFM from v1.12 of LTC on my Opteron [885] box:
> > ECC-192 make_key took 525065 cycles
> > ECC-224 make_key took 656215 cycles
> > ECC-256 make_key took 808368 cycles
> > ECC-384 make_key took 1596507 cycles
> > ECC-520 make_key took 2871832 cycles
>
> Using a 12 bit window I got
>
> ECC-192 make_key took 391225 cycles
> ECC-224 make_key took 494494 cycles
> ECC-256 make_key took 611546 cycles
> ECC-384 make_key took 1164086 cycles
> ECC-520 make_key took 2065745 cycles

I've optimized it again by eating some cycles during the pre-comp
stage: I converted all the points back to affine and I allow z == NULL
in the point addition code (for the second operand only).

With TFM 0.09 and the current LTC and a window of 8 bits I got

ECC-192 make_key took 374768 cycles
ECC-224 make_key took 461397 cycles
ECC-256 make_key took 572110 cycles
ECC-384 make_key took 1124445 cycles
ECC-520 make_key took 1982343 cycles

And with a window of 12 bits I got

ECC-192 make_key took 297831 cycles
ECC-224 make_key took 369044 cycles
ECC-256 make_key took 452169 cycles
ECC-384 make_key took 858204 cycles
ECC-520 make_key took 1491073 cycles

For some reason the encrypt function is a lot slower than I would
expect. It should be point muls of the NIST base point and the
recipient's public point. Both should be cacheable, but it looks like
the latter is not being cached. It could be a bug in the caching logic
or in the demo. I'll resolve it before v1.13...

I've checked the code and it looks like this is indeed the order of
events...

Anyone willing to chip in and review it would be thanked :-) I can
provide a snapshot of the 1.13 release cycle to anyone interested.

Tom

Salami Man

Jun 12, 2006, 9:53:47 AM
Got Vroom v1.12? Vroom Vroom?
Zoom zoom!

