
hmac hardware acceleration...


Chris M. Thomasson
Dec 12, 2023, 12:21:58 AM

Humm... I am wondering if hardware-based HMAC could possibly help out
one of my encryption experiments, just for fun... Here is a hyper-crude
little write-up; it has some crude Python 3 code in there. It's not all
that fast, yikes!

http://funwithfractals.atspace.cc/ct_cipher

Online version of the experiment:

http://fractallife247.com/test/hmac_cipher/ver_0_0_0_1

First of all, never actually use this cipher, simply because it has not
been properly peer reviewed yet! If you're interested, experiment with it,
but don't trust it until it has been deemed worthy of protecting a pet's
life, your Mom's life, your own life, etc.
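
For anyone who wants to poke at the idea without digging through the
write-up, the underlying primitive is just plain HMAC; a minimal sketch
using only the Python standard library (this is not the cipher itself,
just the building block):

# Plain HMAC-SHA256 via the standard library -- the primitive the
# experiment builds on, not the cipher itself.
import hashlib
import hmac

key = b"do not use a hard-coded key outside of toy experiments"
msg = b"hello, comp.arch"

tag = hmac.new(key, msg, hashlib.sha256).digest()
print(tag.hex())

# Constant-time comparison when verifying a received tag.
assert hmac.compare_digest(tag, hmac.new(key, msg, hashlib.sha256).digest())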

MitchAlsup
Dec 14, 2023, 3:11:17 PM

When I looked into this a while back, I came to the conclusion that
incorporating something like SHA256, SHA512, DES, AES, ... encryption
stuff suits an attached processor a lot better than putting it into the
ISA directly.

Why: it is fundamentally difficult to chop up the units of work
to fit in GPRs, and if you run the data through the GPRs (or any
CPU register) you open up holes in your security blanket that
are never open in the attached-processor implementation. Performance
will be better in the attached-processor version unless the
width of the en/decryption is small.
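
A rough sketch of the arithmetic behind the GPR-width point (the sizes
below are the architectural state sizes from the algorithm specs; real
implementations also need temporaries):

# Rough working-set sizes for SHA-256 and AES-128, expressed in 64-bit GPRs.
sha256_state_bits = 8 * 32          # eight 32-bit working variables
sha256_block_bits = 512             # one message block
aes128_state_bits = 128             # one block / state matrix
aes128_round_key_bits = 11 * 128    # expanded AES-128 key schedule

for name, bits in [("SHA-256 state+block", sha256_state_bits + sha256_block_bits),
                   ("AES-128 state+keys ", aes128_state_bits + aes128_round_key_bits)]:
    print(f"{name}: {bits} bits = {bits // 64} x 64-bit GPRs")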

Terje Mathisen
Dec 15, 2023, 4:19:17 AM

I disagree, specifically because these algorithms are used a lot on
short inputs: for bulk processing an attached coprocessor is an excellent
idea, but when you just want to verify the hash of a very short message,
or encrypt a single packet, you do want this to be very close to the CPU.
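
A crude way to see why short inputs matter: per-call overhead dominates
when the message is only a few dozen bytes. A rough Python timing sketch
(the absolute numbers are meaningless; only the ratio of fixed cost to
per-byte cost matters):

# Crude timing sketch: per-call cost vs. bytes hashed.
import hashlib
import hmac
import time

key = b"k" * 32

def time_hmac(msg_len, iterations=2000):
    msg = b"x" * msg_len
    start = time.perf_counter()
    for _ in range(iterations):
        hmac.new(key, msg, hashlib.sha256).digest()
    return (time.perf_counter() - start) / iterations

for n in (64, 1500, 65536):
    per_call = time_hmac(n)
    print(f"{n:6d} bytes: {per_call * 1e6:8.2f} us/call, "
          f"{n / per_call / 1e6:8.1f} MB/s")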

Terje


--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

EricP
Dec 15, 2023, 10:22:52 AM

An issue I see is in thread switching. You don't want a user process
to be able to block the OS thread switching for an arbitrary time
while it syncs with this coprocessor.

It needs a coprocessor which is fully asynchronous for bulk jobs from
multiple processes and threads in the background and for high-priority
communication packets from drivers, yet, like the x87, is available as a
semi-asynchronous resource to the current thread on zero notice for
limited-size jobs.

Or make the coprocessor jobs interruptible.

Or maybe like a barrel processor where the OS can allocate
as many tasks as it wants, and assign one to each user thread plus
some to itself for high priority comms and low priority background.
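
As a toy software analogy of those two usage modes, a thread pool can
stand in for the attached coprocessor (hypothetical interface only,
nothing to do with any real device):

# Thread pool standing in for the "attached processor": bulk jobs are
# submitted asynchronously, tiny jobs are cheaper to do inline.
import hashlib
import hmac
from concurrent.futures import ThreadPoolExecutor

key = b"k" * 32
coprocessor = ThreadPoolExecutor(max_workers=4)   # stand-in coprocessor

def hmac_job(data: bytes) -> bytes:
    return hmac.new(key, data, hashlib.sha256).digest()

# Asynchronous bulk job: submit it and keep running; completion comes later.
bulk = coprocessor.submit(hmac_job, b"A" * (1 << 20))

# Small job on zero notice: cheaper to do inline (the x87-like case)
# than to pay the submit/notify round trip.
small_tag = hmac_job(b"short packet")

# Collect the bulk result; this blocks only if the job isn't done yet.
bulk_tag = bulk.result()
print(small_tag.hex()[:16], bulk_tag.hex()[:16])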


Scott Lurndal
Dec 15, 2023, 3:21:10 PM

Our coprocessors are 'virtualized', such that they provide
a physical function and a number of virtual functions; a
virtual function can be assigned (mapped into the
address space directly) to a process and
it can directly access the coprocessor from user mode.

There are no worries about the host scheduling threads
in the process - the process owns the virtual function.

(See PCI Express single-root I/O virtualization (SR-IOV), which
is the model used for standard OS compatibility.)

This model is used in DPDK and ODP, for example.

Thomas Koenig
Dec 16, 2023, 7:18:41 AM

Terje Mathisen <terje.m...@tmsw.no> wrote:
> MitchAlsup wrote:

>> When I looked into this a while back, I came to the conclusion that
>> incorporating something like SHA256, SHA512, DES, AES, ... encryption
>> stuff suits an attached processor a lot better than putting it in ISA
>> directly.

That is the solution that IBM Z is using.

>> Why: It is fundamentally difficult to chop up the units of work
>> to fit in GPRs, and if you run the data through the GPRs (or any CPU
>> register) you open up holes in your security blanket that
>> are never open in the attached processor implementation. Perf
>> will be better in the attached processor version unless the
>> width of the en/decryption is small.
>
> I disagree, specifically because these algorithms are used a lot on
> short inputs: For a bulk process an attached coprocessor is an excellent
> idea, but when you just want to verify the hash of a very short message,
> or encrypt a single packet, you do want this to be very close to the cpu.

And that is what Power does with its vcipher and vcipherlast
instructions, which do a single round of AES.

POWER9 has six cycles of latency and at most one operation per cycle,
Power10 between four and seven cycles, but four in parallel (I
guess they invested some of their silicon there).

AES operates on blocks of 128 bits, so 128-bit registers are quite
natural there. For My 66000, this would require either register
pairs or a variant of Carry, so this is probably not an easy fit.
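
To put rough numbers on that: AES-128 is ten rounds, so a dependent chain
such as CBC encryption pays the full round latency every round, while a
mode with independent blocks (CTR, or CBC decryption) can overlap them.
A sketch using the figures quoted above, assuming perfect scheduling:

# Rough per-block cycle estimates for AES-128 (10 rounds), using the
# quoted latency/throughput figures; lower bounds, not measurements.
ROUNDS = 10

def cycles_per_block(latency, units, independent_blocks):
    # Per-round cost: either the dependent-chain latency spread across the
    # independent blocks, or the issue-width limit, whichever dominates.
    per_round = max(latency / independent_blocks, 1 / units)
    return ROUNDS * per_round

print("POWER9,  CBC encrypt (1 block):", cycles_per_block(6, 1, 1))   # 60.0
print("POWER9,  CTR with 8 blocks    :", cycles_per_block(6, 1, 8))   # 10.0
print("Power10, CTR with 8 blocks    :", cycles_per_block(7, 4, 8))   # 8.75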

EricP
Dec 16, 2023, 11:30:01 AM

Unfortunately the PCIe specs are all paywalled so I can't get the
real poop on it. Linux doesn't seem to have any documentation on it.
Microsoft only has the Windows driver development guides which
I've had a look at.

Presenting the coprocessor as a Virtual Function (VF) could work but,
from the limited info I have seen, using a VF does seem to be limited
because the SR-IOV device only exports a fixed number of VF's
(e.g. 16, 32, 64), as it is the device that maps from the
Physical Function (PF) to the VF's. As SR-IOV was mostly intended for
optimizing paravirtualized network cards, in that context it is a
reasonable limitation. However, it would not be acceptable for a
coprocessor to say "access denied, all are in use".

I also could not find out how Windows delivers virtual interrupts
signaling IO completion for SR-IOV devices. Assuming it would use
something called an APC, similar to a *nix signal, that would be
an expensive way to be notified of coprocessor completion.
Again, as SR-IOV was intended for IO virtualization, in that
context that overhead is reasonable.

Otherwise one would have to use SR-IOV polling in a spin loop to
detect completion, whereas a coprocessor like the x87 has the FWAIT
instruction to halt the processor until completion.


MitchAlsup
Dec 16, 2023, 4:47:15 PM

Each Guest OS gets one VF.

> I also could not find out how Windows delivers virtual interrupts
> signaling IO completion for SR-IOV devices. Assuming it would use
> something call an APC's, similar to a *nix signal, that would be
> an expensive way to be notified of coprocessor completion.

A sufficiently privileged interrupt-dispatching thread receives control.
It examines the pending interrupts and dispatches the interrupt handler.
The interrupt handler then services the interrupt and queues DPCs/softIRQs
for the cleanup activities.
A stack of DPCs/softIRQs wanders through the cleanup work and finally
schedules the user thread (synch) or sends the user thread a signal (asynch).
The scheduler receives control and sooner or later delivers control back
to the user.

EricP
Dec 17, 2023, 12:22:42 PM

Right, that is the intended purpose for network cards in virtual machines
although the SR-IOV specs are generalized.

Then there also was this movement that wants to do "zero-copy" network
IO directly from user-space IO buffers, which means a VF per process
opening the device.

Then the two combine and it becomes a VF per process opening the
device per guest OS, and that fixed device quota of 16, 32, or 64 VF's
starts looking a little sparse.

In either of these cases the IO device has to be opened, so it would be OK
to return a status of "denied, device not available", as that is already a
possible IO open status.

A coprocessor is intended to be implicitly and immediately available,
under OS control, to the current processor context, be it user threads,
the OS, or drivers. That implies a huge quota of VF's for all threads plus
sundry other uses on every guest OS, just in case they want one.
And, just guessing at the device internals, it implies huge management
tables, CAMs instead of SRAMs, caches, blah, blah, etc.

>> I also could not find out how Windows delivers virtual interrupts
>> signaling IO completion for SR-IOV devices. Assuming it would use
>> something call an APC's, similar to a *nix signal, that would be
>> an expensive way to be notified of coprocessor completion.
>
> A sufficiently privileged interrupt dispatching thread receives control.
> It examines the pending interrupts and dispatches the interrupt handler.
> The interrupt handler then services the interrupt and DPCs/softIRQs cleanup
> activities.
> A stack of PDCs/softIRQs wander through the cleanup work and finally
> schedule the user thread (synch) or send user thread a signal (asynch)
> Scheduler receives control and sooner or later delivers control back
> to user.

I'm familiar with the OS mechanisms; it's the overhead I'm pointing out.
To do this the hypervisor has to dispatch a virtual interrupt to the
guest OS, which converts it to its local delivery mechanism
(on Windows, DPC->UAPC; on *nix, softIRQ->signal),
and delivers it to the guest thread on the guest OS.

The overhead of the async completion signal would likely be much greater
than the cost of the original coprocessor hash/encrypt.
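
Back-of-the-envelope numbers make the point; the constants below are
assumptions picked only to show the orders of magnitude involved, not
measurements of any particular system:

# Offloaded work vs. the cost of being told it finished (all rough guesses).
packet_bytes = 1500
sw_cycles_per_byte = 10       # assumed software SHA-256-ish cost
hw_cycles_per_byte = 1        # assumed coprocessor cost, ignoring setup
notify_cycles = 20_000        # assumed virtual interrupt -> DPC/softIRQ ->
                              # signal -> reschedule round trip

inline_sw = packet_bytes * sw_cycles_per_byte
offloaded = packet_bytes * hw_cycles_per_byte + notify_cycles

print(f"inline software hash : ~{inline_sw:,} cycles")
print(f"offload + async done : ~{offloaded:,} cycles")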

MitchAlsup
Dec 17, 2023, 1:01:04 PM

EricP wrote:

> MitchAlsup wrote:
>> EricP wrote:
>>
>>> Presenting the coprocessor as a Virtual Function (VF) could work but,
>>> from the limited info I have seen, using a VF does seem to be limited
>>> because the SR-IOV device only export a fixed number of VF's,
>>> (eg 16, 32, 64) as it is the device that maps from the
>>> Physical Function (PF) to the VF. As SR-IOV was mostly intended for
>>> optimizing paravirtualized network cards, in that context it is a
>>> reasonable limitation. However this would not be suitable for a
>>> coprocessor to say "access denied, all are in use".
>>
>> Each Guest OS gets one VF.

> Right, that is the intended purpose for network cards in virtual machines
> although the SR-IOV specs are generalized.

> Then there is also was this movement that wants to do "zero-copy" network
> IO directly from user space IO buffers, which is a VF per process opening
> the device.

Why is this not an I/O MMU mapping? The kernel still does setup and
teardown, but the device does DMA directly into user (requestor) memory.

> Then the two combine and it becomes a VF per guest processes opening the
> device per guest OS and that fixed device quota of 16,32,64 VF's starts
> looking a little sparse.

Which is why direct user access to devices will never win.

Scott Lurndal
Dec 17, 2023, 2:51:15 PM

Indeed, and the number of VF's is limited by the PCIe specification to
65535 with one PF.

The device is dividing its resources amongst the VF's, so the maximum number
of VF's is controlled by the amount of resources available on the device
and the implementation of the logic on the device.

The number of VF's exposed to the host is controlled by the host driver
(up to the max supported by the device) via stores to the device
configuration space SR-IOV capability.


>A coprocessor is intended to be implicitly immediately available,
>under OS control, to the current processor context, be it threads, OS
>or drivers. That implies huge quota of VF's for all threads plus
>sundry other uses on all guest OS just in case they want one.

That assumes that the coprocessor will be used by all processes, which
is rarely the case aside from legacy coprocessors like FPUs (and even
then, most applications didn't actually use floating point, and there are
hooks in most major operating systems to detect whether an application
uses floating point so they don't need to save the FPRs over context switches).

>And, just guessing at the device internals, implies huge management tables,
>CAMs instead of SRAMs, caches, blah, blah, etc.

Certainly in many cases CAMs are quite useful, particularly on
networking hardware that performs hardware packet classification
based on header fields.


>The overhead of the async completion signal would likely be much greater
>that the cost of the original coprocessor hash/encrypt.

That, again, depends on the coprocessor. If the amount of work
that is offloaded isn't large enough to subsume the slight extra
cost of the virtio interrupt (particularly on CPUs where the
interrupt overhead is low, e.g. ARMv8), you probably should
couple the coprocessor closer to the CPU, much like ARM Neoverse
cores, where the RND instruction interacts with an off-CPU random
number generator (via MMIO).

Here's what our chips look like to the kernel/software:

https://doc.dpdk.org/guides-20.05/platform/octeontx2.html

Packet comes in, hardware allocates packet storage from the
NPA (network pool allocator) hardware block, passes it to
NCPC for classification (big CAMs), and queues it to the scheduler;
the scheduler may or may not involve a processor or
one of the many blocks that can be added to the processing
flow for a packet (crypto for IPsec, compression, etc.) before
queuing the packet for egress (where shaping occurs) on
a network port.

Scott Lurndal
Dec 17, 2023, 2:52:35 PM

mitch...@aol.com (MitchAlsup) writes:
>EricP wrote:
>
>> MitchAlsup wrote:
>>> EricP wrote:
>>>
>>>> Presenting the coprocessor as a Virtual Function (VF) could work but,
>>>> from the limited info I have seen, using a VF does seem to be limited
>>>> because the SR-IOV device only export a fixed number of VF's,
>>>> (eg 16, 32, 64) as it is the device that maps from the
>>>> Physical Function (PF) to the VF. As SR-IOV was mostly intended for
>>>> optimizing paravirtualized network cards, in that context it is a
>>>> reasonable limitation. However this would not be suitable for a
>>>> coprocessor to say "access denied, all are in use".
>>>
>>> Each Guest OS gets one VF.
>
>> Right, that is the intended purpose for network cards in virtual machines
>> although the SR-IOV specs are generalized.
>
>> Then there is also was this movement that wants to do "zero-copy" network
>> IO directly from user space IO buffers, which is a VF per process opening
>> the device.
>
>Why is this not an I/O MMU mapping. Kernel still does setup and teardown
>but device does DMA directly into user (requestor) memory.

It is an IOMMU mapping.

>
>> Then the two combine and it becomes a VF per guest processes opening the
>> device per guest OS and that fixed device quota of 16,32,64 VF's starts
>> looking a little sparse.
>
>Which is why direct user access to devices will never win.

Sorry, they already have won for the use cases where it makes sense.

EricP
Dec 17, 2023, 3:44:11 PM

MitchAlsup wrote:
> EricP wrote:
>
>> MitchAlsup wrote:
>>> EricP wrote:
>>>
>>>> Presenting the coprocessor as a Virtual Function (VF) could work but,
>>>> from the limited info I have seen, using a VF does seem to be limited
>>>> because the SR-IOV device only export a fixed number of VF's,
>>>> (eg 16, 32, 64) as it is the device that maps from the
>>>> Physical Function (PF) to the VF. As SR-IOV was mostly intended for
>>>> optimizing paravirtualized network cards, in that context it is a
>>>> reasonable limitation. However this would not be suitable for a
>>>> coprocessor to say "access denied, all are in use".
>>>
>>> Each Guest OS gets one VF.
>
>> Right, that is the intended purpose for network cards in virtual machines
>> although the SR-IOV specs are generalized.
>
>> Then there is also was this movement that wants to do "zero-copy" network
>> IO directly from user space IO buffers, which is a VF per process opening
>> the device.
>
> Why is this not an I/O MMU mapping. Kernel still does setup and teardown
> but device does DMA directly into user (requestor) memory.

It is. My understanding from looking at the Windows Driver documents
was that it would have to be allocated when the VF pseudo-device is opened,
instead of for each individual IO. At FileOpen the OS would need to be
told of one or more virtual buffers the pseudo-device will work within.

Leaving HV's out for the moment, the SR-IOV device needs to pre-allocate
and prepare any buffer physical memory at the time of pseudo-device open.
Then when you write the pseudo-device control register referencing
a byte range in the virtual buffer, it can validate it and initiate
the IO without a trip through the OS.

At pseudo-device open it would check any pinning quotas, fault in the
buffer pages and pin them, create a virtual-buffer-to-physical-fragment
map, and set up the IOMMU DMA registers (PTE's) which the
device HW uses later.

For networks this is slightly complicated because network cards want to
do lots of scatter-gather IO from many byte-sized and byte-aligned buffers,
to assemble the TCP/IP packet headers, merge those with the app's payload,
and possibly add a packet trailer for the checksum (which the card
usually adds automatically).

The IO operation should consist of just writing to the VF control register
a pointer to an operation descriptor, which points to a user-space
scatter-gather list of byte buffers inside the pre-allocated and prepared
memory areas.

The HV adds one more indirection layer to this, because all those
pinned physical addresses and fragments above are actually guest OS
addresses, which the HV converts to real physical fragments,
pinning real physical frames and setting up real IOMMU maps for them.
Then when you write the VF register, the HW card can assemble the
packet directly from the guest user's virtual buffers, as all the
guest OS and HV management work was done at FileOpen.
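
In user-space pseudo-code the sequence looks roughly like this; every
name here is hypothetical, sketched from the description above rather
than taken from any real driver API:

# Hypothetical sketch: slow-path setup at open time, fast path per IO.
class FakeVF:
    """Stand-in for a mapped SR-IOV virtual function. Not a real API."""
    def __init__(self):
        self.registered = {}

    def open_and_register(self, buffers):
        # Done once at FileOpen: check pinning quotas, fault in and pin the
        # pages, build the virtual-to-physical fragment map, program the
        # IOMMU (and, under a hypervisor, the second-level mappings too).
        self.registered = {id(b): b for b in buffers}

    def submit(self, sg_list):
        # Per-IO fast path: one doorbell write pointing at a scatter-gather
        # list that lives inside the pre-registered buffers.
        for buf, offset, length in sg_list:
            assert id(buf) in self.registered, "buffer was not registered at open"
        return "job-handle"

payload = bytearray(2048)
vf = FakeVF()
vf.open_and_register([payload])              # slow path, once per open
handle = vf.submit([(payload, 0, 1500)])     # fast path, once per packet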





Chris M. Thomasson
Dec 17, 2023, 3:44:26 PM

On 12/17/2023 9:57 AM, MitchAlsup wrote:
> EricP wrote:
>
>> MitchAlsup wrote:
>>> EricP wrote:
>>>
>>>> Presenting the coprocessor as a Virtual Function (VF) could work but,
>>>> from the limited info I have seen, using a VF does seem to be limited
>>>> because the SR-IOV device only export a fixed number of VF's,
>>>> (eg 16, 32, 64) as it is the device that maps from the
>>>> Physical Function (PF) to the VF. As SR-IOV was mostly intended for
>>>> optimizing paravirtualized network cards, in that context it is a
>>>> reasonable limitation. However this would not be suitable for a
>>>> coprocessor to say "access denied, all are in use".
>>>
>>> Each Guest OS gets one VF.
>
>> Right, that is the intended purpose for network cards in virtual machines
>> although the SR-IOV specs are generalized.
>
>> Then there is also was this movement that wants to do "zero-copy" network
>> IO directly from user space IO buffers, which is a VF per process opening
>> the device.
>
> Why is this not an I/O MMU mapping. Kernel still does setup and teardown
> but device does DMA directly into user (requestor) memory.

Not exactly sure if this is relevant, but are you familiar with the Cell
processor back on the PlayStation 3? A PPE and several SPE's? IIRC there
was a DMA mechanism to communicate with the SPE's. Although some games
did not even use them because they were too difficult to program for.


[...]

Chris M. Thomasson
Dec 17, 2023, 3:48:33 PM

Wrt my cipher, well, the problem is that it's not really parallel at all.
I cannot compute step a without first processing step a - 1. So, shit.
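
That chained structure (a minimal sketch of the dependency, not the actual
cipher) is the same reason CBC encryption can't be parallelized; a
counter-mode style construction over HMAC would be embarrassingly parallel
instead:

# Sketch of the dependency problem, not the actual cipher.
import hashlib
import hmac

key = b"k" * 32

def chained_keystream(n_blocks, seed=b"\x00" * 32):
    # Block i depends on block i-1, so this is inherently serial.
    prev, out = seed, []
    for _ in range(n_blocks):
        prev = hmac.new(key, prev, hashlib.sha256).digest()
        out.append(prev)
    return out

def counter_keystream(n_blocks):
    # Each block depends only on its index, so blocks can be computed
    # independently (and therefore in parallel).
    return [hmac.new(key, i.to_bytes(8, "big"), hashlib.sha256).digest()
            for i in range(n_blocks)]

print(chained_keystream(3)[-1].hex()[:16])
print(counter_keystream(3)[-1].hex()[:16])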