[PATCH 0/7] Phase one of sparc crypto opcode support.

David Miller

Sep 19, 2012, 11:36:49 PM

This is the first phase of changes to support the new cryptographic
opcodes found starting in the SPARC-T4 processor.

It first builds the infrastructure for feature presence detection,
then adds support for all of the hashing functions implemented in
current cpus (MD5, SHA1, SHA256, SHA512).

Here are some benchmarks on a SPARC T4-2 with these changes applied.

type         16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md5         14423.71k    50416.70k   173663.49k   445940.05k   816587.27k
sha1        33231.78k   115492.48k   318273.91k   579320.83k   759701.50k
sha256      46641.41k   157805.85k   419859.54k   708643.16k   889514.67k
sha512      50184.57k   202770.99k   529172.57k  1023763.11k  1405414.06k

These numbers, with crypto-opcode-disabled numbers for comparison,
are duplicated in the relevant patch log messages.

I have cipher patches for AES, DES, and CAMELLIA as well, but I would
like to refine them a bit before I make a formal submission. Once
I get those changes refined I will work on Montgomery multiply,
Montgomery square, etc., which these chips also support directly.

I've tested these changes on all of {static,shared}/linux{,64}-sparcv9.

Oracle provided me with programmer's manuals that document these
instructions, and I've been promised that these would be made public
at some point in the not too distant future. But these instructions
are very straightforward, and when I post the AES changes later on
you will see that the AES instructions are virtually identical to
the AESNI stuff from Intel.

All of the patches are against mainline, but should be backportable
to 1.0.x without much difficulty.
______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List opens...@openssl.org
Automated List Manager majo...@openssl.org

Andy Polyakov

Sep 20, 2012, 5:23:03 AM

There is no need to send me a personal copy.

> This is the first phase of changes to support the new cryptographic
> opcodes found starting in the SPARC-T4 processor.

Cool.

> Oracle provided me with programmer's manuals that document these
> instructions, and I've been promised that these would be made public
> at some point in the not too distant future.

Could you ask your contact if they could provide a second copy for
OpenSSL? You mentioned Montgomery BN. There will be intersections with
other platforms. I mean there is interest in providing an alternative
framework for exponentiation that would benefit such cases, and having
a look at multiple platforms including T4 would help to choose a better
strategy.

There will be more replies, but not right away.

David Miller

Sep 20, 2012, 3:25:30 PM

From: Andy Polyakov <ap...@openssl.org>
Date: Thu, 20 Sep 2012 11:23:03 +0200

> There is no need to send me a personal copy.

Ok, I was simply acknowledging the author of the code I was
touching :-)

> Could you ask your contact if they could provide a second copy for
> OpenSSL?

I'll see what I can do. It took me more than a year of tireless
work and daily poking to get a copy for myself from people I've
been interacting with for a decade.

> You mentioned Montgomery BN. There will be intersections with
> other platforms. I mean there is interest in providing an alternative
> framework for exponentiation that would benefit such cases, and having
> a look at multiple platforms including T4 would help to choose a better
> strategy.

Here is how the instructions work.

The basic model is that there is a range of sizes supported by the
instruction, and all of the data is loaded into a combination of
the floating point registers and all of the register windows of
the cpu.

For example, the montmul (Montgomery Multiply) instruction simply has
a 5-bit immediate field which indicates the size of the operands.
If it is set to N, the operands are (N + 1) * 64 bits in size.
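
As a small illustration of that encoding, here is a sketch in C. The
helper name is made up for this example and is not part of the patches;
it only captures the "(N + 1) * 64 bits" rule and the 5-bit range.

```c
/*
 * Hypothetical helper: map an operand length in bits to the 5-bit
 * immediate N, where the operand size is (N + 1) * 64 bits.
 * Returns -1 if the length cannot be encoded.
 */
static int crypto_imm_for_bits(unsigned int bits)
{
    if (bits == 0 || bits % 64 != 0)
        return -1;                  /* must be a whole number of 64-bit words */
    if (bits / 64 > 32)
        return -1;                  /* 5-bit field: N is 0..31, so at most 32 words */
    return (int)(bits / 64) - 1;
}
```

With this, a 2048-bit operand maps to N = 31, and anything larger than
2048 bits is rejected because 31 is the largest value a 5-bit field holds.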

Nprime is stored in register %f60.

A[] values are stored in float and integer registers (integers go into
register window 5), in this order:

%l0, %l1, %l2, %l3, %l4, %l5, %l6, %l7
%o0, %o1, %o2, %o3, %o4, %o5, %f24, %f26
%f28, %f30, %f32, %f34, %f36, %f38, %f40, %f42
%f44, %f46, %f48, %f50, %f52, %f54, %f56, %f58

B[] values are stored in integer registers (3 register windows, 2 to 0):

%o0, %o1, %o2, %o3, %o4, %o5, (register window 2)
%l0, %l1, %l2, %l3, %l4, %l5, %l6, %l7 (register window 1)
%o0, %o1, %o2, %o3, %o4, %o5 (register window 1)
%l0, %l1, %l2, %l3, %l4, %l5, %l6, %l7 (register window 0)
%o0, %o1, %o2, %o3 (register window 0)

Similarly for the other inputs, you can see the pattern in use here.
The result is left in register window 5. If an internal ECC error
occurs on the register file during the operation, %fcc3 will be set to
unordered. This means there needs to be a limited retry loop over
this condition.

So basically the implementation starts at register window zero, loads
all the initial values of B[], does a 'save', loads the middle values
of B[], does a 'save', loads the last values of B[].

Then it moves on to N[], which goes into register windows 2, 3, and
4.

Next comes A[], in floating point registers and register window 5.

And finally M[], in floating point registers and register window 6.

Nprime is loaded into %f60 and the montmul instruction is executed.

This instruction can essentially be used directly via the
bn_mul_mont() function signature in OpenSSL. I don't think
any special changes are necessary to facilitate the use of these
instructions.
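
For reference, here is a portable C sketch of the operation that
signature describes: r = a * b * 2^(-64*num) mod n. It is a plain
word-serial Montgomery multiply, not the T4 code path, and the names
mont_mul_ref() and mont_n0() are made up for this example. The parameter
order mirrors bn_mul_mont() as I understand it, with n0[0] holding
-n^-1 mod 2^64. Returning 0 for an unsupported size is an assumption
about how a caller would be told to fall back to generic code.
`unsigned __int128` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MONT_MAX_WORDS 32   /* (31 + 1) words: the largest size a 5-bit N encodes */

__extension__ typedef unsigned __int128 u128;   /* GCC/Clang extension */

/* -n^-1 mod 2^64 for odd n, by Newton iteration. */
static uint64_t mont_n0(uint64_t n)
{
    uint64_t x = n;             /* n * n == 1 (mod 8), so x is correct to 3 bits */
    int i;
    for (i = 0; i < 5; i++)
        x *= 2 - n * x;         /* each step doubles the number of correct bits */
    return 0 - x;
}

/*
 * rp = ap * bp * 2^(-64*num) mod np, little-endian 64-bit words.
 * Requires np odd and ap, bp < np. Returns 1 on success, 0 if num
 * is outside 1..MONT_MAX_WORDS.
 */
static int mont_mul_ref(uint64_t *rp, const uint64_t *ap, const uint64_t *bp,
                        const uint64_t *np, const uint64_t *n0, int num)
{
    uint64_t t[MONT_MAX_WORDS + 2] = {0};
    uint64_t m, carry, borrow;
    u128 acc;
    int i, j;

    assert(rp && ap && bp && np && n0);
    if (num < 1 || num > MONT_MAX_WORDS)
        return 0;

    for (i = 0; i < num; i++) {
        /* t += ap * bp[i] */
        carry = 0;
        for (j = 0; j < num; j++) {
            acc = (u128)ap[j] * bp[i] + t[j] + carry;
            t[j] = (uint64_t)acc;
            carry = (uint64_t)(acc >> 64);
        }
        acc = (u128)t[num] + carry;
        t[num] = (uint64_t)acc;
        t[num + 1] = (uint64_t)(acc >> 64);

        /* t = (t + m * np) / 2^64, with m chosen so the low word cancels */
        m = t[0] * n0[0];
        acc = (u128)m * np[0] + t[0];
        carry = (uint64_t)(acc >> 64);
        for (j = 1; j < num; j++) {
            acc = (u128)m * np[j] + t[j] + carry;
            t[j - 1] = (uint64_t)acc;
            carry = (uint64_t)(acc >> 64);
        }
        acc = (u128)t[num] + carry;
        t[num - 1] = (uint64_t)acc;
        t[num] = t[num + 1] + (uint64_t)(acc >> 64);
    }

    /* final step: rp = t - np if t >= np, else rp = t */
    borrow = 0;
    for (j = 0; j < num; j++) {
        u128 d = (u128)t[j] - np[j] - borrow;
        rp[j] = (uint64_t)d;
        borrow = (uint64_t)(d >> 64) & 1;
    }
    if (t[num] == 0 && borrow)  /* t < np: undo the subtraction */
        memcpy(rp, t, (size_t)num * sizeof(uint64_t));
    return 1;
}
```

A model like this is mainly useful for cross-checking a hardware path
against known-good results on the same inputs.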

The 'montsqr' (Montgomery Square) instruction uses the same scheme
and layout as 'montmul' for inputs and outputs.

Finally 'mpmul' (Multiple Precision Multiply) has a similar flavor
to montmul and montsqr, in that multiple register windows and the
floating point registers are used to load the inputs all at once for
the operation.

Again, a 5-bit immediate field 'N' encodes the size of the operands,
as "(N + 1) * 64-bits".

The multiplier goes into a mixture of float regs and integer registers
in register window 6. The multiplicand goes into a mixture of float
regs and integer registers in register window 5, and the product goes
into integer registers in register windows 4, 3, 2, 1, and 0.

For example, to do a 2048 bit multiply given a pointer to the
multiplier in %g1, a pointer to the multiplicand in %g2, and
a pointer to the place to store the product in %g3 one would
go:

/* Register window 6 */
ldd [%g1 + 0x000], %f22
ldd [%g1 + 0x008], %f20
ldd [%g1 + 0x010], %f18
ldd [%g1 + 0x018], %f16
ldd [%g1 + 0x020], %f14
ldd [%g1 + 0x028], %f12
ldd [%g1 + 0x030], %f10
ldd [%g1 + 0x038], %f8
ldd [%g1 + 0x040], %f6
ldd [%g1 + 0x048], %f4
ldx [%g1 + 0x050], %i5
ldx [%g1 + 0x058], %i4
ldx [%g1 + 0x060], %i3
ldx [%g1 + 0x068], %i2
ldx [%g1 + 0x070], %i1
ldx [%g1 + 0x078], %i0
ldx [%g1 + 0x080], %l7
ldx [%g1 + 0x088], %l6
ldx [%g1 + 0x090], %l5
ldx [%g1 + 0x098], %l4
ldx [%g1 + 0x0a0], %l3
ldx [%g1 + 0x0a8], %l2
ldx [%g1 + 0x0b0], %l1
ldx [%g1 + 0x0b8], %l0
ldd [%g1 + 0x0c0], %f2
ldd [%g1 + 0x0c8], %f0
ldx [%g1 + 0x0d0], %o5
ldx [%g1 + 0x0d8], %o4
ldx [%g1 + 0x0e0], %o3
ldx [%g1 + 0x0e8], %o2
ldx [%g1 + 0x0f0], %o1
ldx [%g1 + 0x0f8], %o0

save

/* Register window 5 */
ldd [%g2 + 0x000], %f58
ldd [%g2 + 0x008], %f56
ldd [%g2 + 0x010], %f54
ldd [%g2 + 0x018], %f52
ldd [%g2 + 0x020], %f50
ldd [%g2 + 0x028], %f48
ldd [%g2 + 0x030], %f46
ldd [%g2 + 0x038], %f44
ldd [%g2 + 0x040], %f42
ldd [%g2 + 0x048], %f40
ldd [%g2 + 0x050], %f38
ldd [%g2 + 0x058], %f36
ldd [%g2 + 0x060], %f34
ldd [%g2 + 0x068], %f32
ldd [%g2 + 0x070], %f30
ldd [%g2 + 0x078], %f28
ldd [%g2 + 0x080], %f26
ldd [%g2 + 0x088], %f24
ldx [%g2 + 0x090], %o5
ldx [%g2 + 0x098], %o4
ldx [%g2 + 0x0a0], %o3
ldx [%g2 + 0x0a8], %o2
ldx [%g2 + 0x0b0], %o1
ldx [%g2 + 0x0b8], %o0
ldx [%g2 + 0x0c0], %l7
ldx [%g2 + 0x0c8], %l6
ldx [%g2 + 0x0d0], %l5
ldx [%g2 + 0x0d8], %l4
ldx [%g2 + 0x0e0], %l3
ldx [%g2 + 0x0e8], %l2
ldx [%g2 + 0x0f0], %l1
ldx [%g2 + 0x0f8], %l0

save
save
save
save
save

/* Register window 0 */
mpmul 0x1f

stx %l7, [%g3 + 0x000]
stx %l6, [%g3 + 0x008]
stx %l5, [%g3 + 0x010]
stx %l4, [%g3 + 0x018]
stx %l3, [%g3 + 0x020]
stx %l2, [%g3 + 0x028]
stx %l1, [%g3 + 0x030]
stx %l0, [%g3 + 0x038]

restore

/* Register window 1 */
stx %o5, [%g3 + 0x040]
stx %o4, [%g3 + 0x048]
stx %o3, [%g3 + 0x050]
stx %o2, [%g3 + 0x058]
stx %o1, [%g3 + 0x060]
stx %o0, [%g3 + 0x068]
stx %l7, [%g3 + 0x070]
stx %l6, [%g3 + 0x078]
stx %l5, [%g3 + 0x080]
stx %l4, [%g3 + 0x088]
stx %l3, [%g3 + 0x090]
stx %l2, [%g3 + 0x098]
stx %l1, [%g3 + 0x0a0]
stx %l0, [%g3 + 0x0a8]

restore

/* Register window 2 */
stx %o5, [%g3 + 0x0b0]
stx %o4, [%g3 + 0x0b8]
stx %o3, [%g3 + 0x0c0]
stx %o2, [%g3 + 0x0c8]
stx %o1, [%g3 + 0x0d0]
stx %o0, [%g3 + 0x0d8]
stx %l7, [%g3 + 0x0e0]
stx %l6, [%g3 + 0x0e8]
stx %l5, [%g3 + 0x0f0]
stx %l4, [%g3 + 0x0f8]
stx %l3, [%g3 + 0x100]
stx %l2, [%g3 + 0x108]
stx %l1, [%g3 + 0x110]
stx %l0, [%g3 + 0x118]

restore

/* Register window 3 */
stx %o5, [%g3 + 0x120]
stx %o4, [%g3 + 0x128]
stx %o3, [%g3 + 0x130]
stx %o2, [%g3 + 0x138]
stx %o1, [%g3 + 0x140]
stx %o0, [%g3 + 0x148]
stx %l7, [%g3 + 0x150]
stx %l6, [%g3 + 0x158]
stx %l5, [%g3 + 0x160]
stx %l4, [%g3 + 0x168]
stx %l3, [%g3 + 0x170]
stx %l2, [%g3 + 0x178]
stx %l1, [%g3 + 0x180]
stx %l0, [%g3 + 0x188]

restore

/* Register window 4 */
stx %o5, [%g3 + 0x190]
stx %o4, [%g3 + 0x198]
stx %o3, [%g3 + 0x1a0]
stx %o2, [%g3 + 0x1a8]
stx %o1, [%g3 + 0x1b0]
stx %o0, [%g3 + 0x1b8]
stx %l7, [%g3 + 0x1c0]
stx %l6, [%g3 + 0x1c8]
stx %l5, [%g3 + 0x1d0]
stx %l4, [%g3 + 0x1d8]
stx %l3, [%g3 + 0x1e0]
stx %l2, [%g3 + 0x1e8]
stx %l1, [%g3 + 0x1f0]
stx %l0, [%g3 + 0x1f8]

restore
restore
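
To check the output of a sequence like the one above, a portable
reference model can help. The C sketch below computes the full
double-width product for (N + 1)-word operands with a schoolbook
multiply. The function name and the little-endian word order in memory
are assumptions of this sketch, and `unsigned __int128` is a GCC/Clang
extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

__extension__ typedef unsigned __int128 u128;   /* GCC/Clang extension */

/*
 * prod[0 .. 2*words-1] = a[0 .. words-1] * b[0 .. words-1], where
 * words = n_imm + 1 and all arrays are little-endian 64-bit words.
 * prod must not overlap a or b. Returns 0 if n_imm does not fit in
 * a 5-bit field, 1 otherwise.
 */
static int mp_mul_ref(uint64_t *prod, const uint64_t *a, const uint64_t *b,
                      unsigned int n_imm)
{
    size_t words, i, j;

    if (n_imm > 31)
        return 0;
    assert(prod && a && b);
    words = (size_t)n_imm + 1;

    for (i = 0; i < 2 * words; i++)
        prod[i] = 0;

    for (i = 0; i < words; i++) {
        uint64_t carry = 0;
        for (j = 0; j < words; j++) {
            /* cannot overflow: (2^64-1)^2 + 2*(2^64-1) == 2^128 - 1 */
            u128 acc = (u128)a[j] * b[i] + prod[i + j] + carry;
            prod[i + j] = (uint64_t)acc;
            carry = (uint64_t)(acc >> 64);
        }
        prod[i + words] = carry;    /* this word has not been written yet */
    }
    return 1;
}
```

For the 2048-bit case, n_imm is 0x1f and prod needs room for 64 words,
which matches the 0x000 through 0x1f8 store offsets in the listing.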

Of course, you might quickly ask what happens in 32-bit mode? If we
were to take a window save trap, it would clobber the upper 32-bits of
the 64-bit values we are loading into the register file.

You have to do a trick in this case by loading a cookie of some sort
(say, simply 0xffffffffffffffff) into one of the unused registers
in the initial register window. If, after the instruction executes,
the top 32-bits are zeroed out, you know that a window trap happened
and therefore you must retry.

This retry logic can be combined with the tests for ECC errors on
%fcc3.
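
The control flow of such a combined retry loop might look like the
following C sketch. Everything here is a stand-in: the real attempt
would be the assembly sequence itself, and the struct, the callback
type, and the retry bound are made up for illustration. Falling back to
software after the bound is reached is also an assumption.

```c
#include <stdint.h>

#define COOKIE UINT64_C(0xffffffffffffffff)

/* What one attempt reports back; both fields are hypothetical stand-ins. */
struct attempt_result {
    uint64_t cookie_after;  /* spare register read back after the instruction */
    int ecc_unordered;      /* nonzero if %fcc3 came back unordered */
};

typedef struct attempt_result (*attempt_fn)(void *ctx);

/* A window trap in 32-bit mode zeroes the top 32 bits of the cookie. */
static int cookie_intact(uint64_t v)
{
    return (v >> 32) == (COOKIE >> 32);
}

/* Retry until one attempt is clean, giving up after max_tries. */
static int run_with_retry(attempt_fn attempt, void *ctx, int max_tries)
{
    int i;
    for (i = 0; i < max_tries; i++) {
        struct attempt_result r = attempt(ctx);
        if (cookie_intact(r.cookie_after) && !r.ecc_unordered)
            return 1;       /* result registers can be trusted */
    }
    return 0;               /* caller would fall back to software */
}

/* Test double: fails *ctx times, alternating the two failure kinds. */
static struct attempt_result fake_attempt(void *ctx)
{
    int *failures_left = ctx;
    struct attempt_result r = { COOKIE, 0 };
    if (*failures_left > 0) {
        if (*failures_left % 2)
            r.cookie_after &= UINT64_C(0xffffffff);  /* simulated window trap */
        else
            r.ecc_unordered = 1;                     /* simulated ECC error */
        (*failures_left)--;
    }
    return r;
}
```

Checking both conditions in one place keeps a single bounded loop
around the whole load/execute sequence.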

Andy Polyakov

Sep 21, 2012, 5:36:16 AM

>> You mentioned Montgomery BN.

>
> Here are how the instructions work.
>
> The basic model is that there is a range of sizes supported by the
> instruction, and all of the data is loaded into a combination of
> the floating point registers and all of the register windows of
> the cpu.

Ouch!

> ...
>
> save
>
> ...
>
> restore
> ...

>
> Of course, you might quickly ask what happens in 32-bit mode?

No, before thinking about 32-bit mode, I quickly ask what's with save-s
without arguments? I quickly ask what happens if a context switch strikes
in the middle? save without arguments means that %sp will be effectively
uninitialized, and attempts to refer to the stack [during context switch
or asynchronous signal delivery] are either doomed or corrupt the stack.
So save-s ought to allocate frames. But even then, [and in 64-bit mode],
do the instructions in question ensure that register windows are loaded
prior to execution? I mean, consider a context switch between a save and,
say, montmul. The kernel dumps all windows on the stack, and when
execution resumes it normally brings in only the top window and lets
window traps bring in the remaining ones on demand. So before the
instructions in question can start actual processing, all windows have
to be loaded. Presumably the instructions can trigger a window trap, and
then the kernel would have to see that it's one of these instructions
that triggered it and act accordingly, i.e. bring in all the windows.
Does it work that way? Or do I get it backwards? I assume that the
instructions in question are uninterruptible, so that a trap can be
generated only prior to calculation...

David Miller

Sep 21, 2012, 12:15:50 PM

From: Andy Polyakov <ap...@openssl.org>
Date: Fri, 21 Sep 2012 11:36:16 +0200

> No, before thinking about 32-bit mode, I quickly ask what's with save-s
> without arguments?

Sorry, I just wrote that code as pseudo-code off the top of my
head without attending to all of the necessary details.

We would indeed need to allocate a minimal stack frame in each
save instruction.

It's just an oversight in my example code, that's all.

Andy Polyakov

Sep 22, 2012, 1:09:27 PM

>> No, before thinking about 32-bit mode, I quickly ask what's with save-s
>> without arguments?
>
> Sorry, I just wrote that code as pseudo-code off the top of my
> head without attending to all of the necessary details.
>
> We would indeed need to allocate a minimal stack frame in each
> save instruction.
>
> It's just an oversight in my example code, that's all.

But the main question was about how a context switch is handled between
save and, say, montmul. I mean the part after "save-s ought to allocate
frames."

David Miller

Sep 22, 2012, 1:27:31 PM

From: Andy Polyakov <ap...@openssl.org>
Date: Sat, 22 Sep 2012 19:09:27 +0200

>>> No, before thinking about 32-bit mode, I quickly ask what's with
>>> save-s without arguments?
>> Sorry, I just wrote that code as pseudo-code off the top of my
>> head without attending to all of the necessary details.
>> We would indeed need to allocate a minimal stack frame in each
>> save instruction.
>> It's just an oversight in my example code, that's all.
>
> But the main question was about how a context switch is handled between
> save and, say, montmul. I mean the part after "save-s ought to allocate
> frames."

I'm confused.

The cpu has 8 register windows.

This means that we can save down 7 times and fill all of the
registers in each window with the values we need.

At each save we allocate the minimal stack frame, at least
enough for the spill/fill trap handlers to save the register
window if needed.

The montmul instruction occurs in the deepest register window.

The cpu will force all 7 register windows to be restored, if
needed, if some spills have occurred due to context switches
or similar.

Andy Polyakov

Sep 22, 2012, 2:11:11 PM

>> But the main question was about how a context switch is handled between
>> save and, say, montmul. I mean the part after "save-s ought to allocate
>> frames."
>
> I'm confused.
>
> The cpu has 8 register windows.
>
> This means that we can save down 7 times and fill all of the
> registers in each window with the values we need.
>
> At each save we allocate the minimal stack frame, at least
> enough for the spill/fill trap handlers to save the register
> window if needed.
>
> The montmul instruction occurs in the deepest register window.
>
> The cpu will force all 7 register windows to be restored, if
> needed, if some spills have occurred due to context switches
> or similar.

The question was whether that's actually the case, i.e. that all the
register windows are *in fact* restored. And you say they are. Just
wanted to hear it. I wondered about the specific mechanism by which it's
achieved (does montmul trigger a window trap), but it's more out of
curiosity, i.e. the question is optional and you don't have to answer.

David Miller

Sep 22, 2012, 2:24:23 PM

From: Andy Polyakov <ap...@openssl.org>
Date: Sat, 22 Sep 2012 20:11:11 +0200

> I wondered about the specific mechanism by which it's achieved (does
> montmul trigger a window trap),

Yes, this is exactly what the instruction does.

It issues fill traps until the CANRESTORE register is NWINDOWS-2.
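
As a rough model of that behaviour, here is a C sketch that counts the
fill traps needed. The function name is made up, and treating each fill
trap as raising CANRESTORE by exactly one is a simplification of what
the fill handler does. With 8 windows the target is CANRESTORE == 6,
which together with the current window covers the seven windows (0
through 6) used in the register layout described earlier.

```c
/*
 * Model of the pre-execution check: fill traps are taken until
 * CANRESTORE reaches nwindows - 2. Each trap brings one more window
 * back from the stack. Returns the number of traps taken.
 */
static unsigned int fill_until_ready(unsigned int nwindows,
                                     unsigned int *canrestore)
{
    unsigned int traps = 0;

    if (nwindows < 3)           /* SPARC V9 requires at least 3 windows */
        return 0;
    while (*canrestore < nwindows - 2) {
        (*canrestore)++;        /* one window restored by the fill handler */
        traps++;
    }
    return traps;
}
```

For example, if a context switch left CANRESTORE at 0 on an 8-window
cpu, six fill traps would be taken before the instruction proceeds; if
nothing was spilled, none are.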