Hi All,
This a note about some new work related to SLH-DSA hardware performance and side-channel security. tldr; you can make SLH-DSA run ten times faster with a regular hash accelerator, but you can make it about a hundred times faster with hardware built specifically for SLH-DSA (like my little prototype, "SLotH"). The special hardware isn't that much bigger than than the general-purpose hardware, unless you want do side-channel security fast.
I think any competent cryptography architect who understands hash-based signatures should be able to make the same observations and use similar (or better) tricks, so this "100x" is more in line with the performance that we will see once FIPS 205 becomes a finalized standard.
The write-up "Accelerating SLH-DSA by Two Orders of Magnitude with a Single Hash Unit" has been accepted to the 5th NIST PQC Standardization Conference next month. I just uploaded it to ePrint:
https://eprint.iacr.org/2024/367
Of related interest, that repo also has a new pure C implementation of SLH-DSA (under "slh" directly) that you can run on your PC without my hardware :) I think most people been using the SPHINCS+ reference implementation thus far. This one is structured much more like the FIPS 205 spec, as it is much more recent.
At RISC-V International we've been discussing dropping a full-Keccak
instruction into the official ISA, so things may change radically with
software too. (However, for this to happen, multiple RISC-V
architectural committees need to be convinced, but we'll at least try. I
will talk about the RISC-V stuff at Real World Crypto in Toronto.)
This is how I describe the situation to hardware architects:
Hash functions, especially SHA3/Keccak, are much, much faster in hardware than they are in software. A straightforward implementation of the Keccak core permutation f1600 takes 24 cycles in hardware and works even at very high frequencies thanks to a very short critical path -- f1600 has massive parallelism and no carry chains. A single iteration of f1600 is sufficient to compute SHAKE256 of a 136-byte block (or to produce 136 bytes of output.)
In software the situation is dramatically different: A processor implementation of the Keccak f1600 permutation is a couple of thousand cycles on current CPUs. A microcontroller such as Cortex M4 requires more than 10,000 cycles.
Especially in the microcontroller world, one can already get dramatic speed-ups ("one order of magnitude") in SLH-DSA end-to-end testing just by making a hardware implementation of the core hash available via memory-mapped or DMA streaming interfaces. This is universally done in Root-of-Trust units.
Such a hardware architecture is a totally reasonable approach when the hash is being used with traditional algorithms such as RSA or ECDSA that just use a hash function to hash messages to be signed or verified. This is the use case traditional hash accelerators were designed for. However, SLH-DSA spends only a tiny fraction of time dealing with the input message; most of the work is spent (essentially) hashing other hashes.
The main "trick" in the paper is pretty simple:
The paper contains quantitative counts on the types of hashing required by SLH-DSA (Table 2 -- with the frame formats summarized in Table 1). It is useful to observe that some of those are invoked only a couple of times during a signing operation, while some others are invoked tens of thousands of times.
I'd especially like to point out that roughly 90% of the operations are spent on chain() which is essentially an iterative computation of hash function F() -- i.e. essentially F(F(F(x..))) with a slight change in the frame (ADRS) of each hash. For the chain() function, SLotH implements in hardware the step that takes the previous hash output, increments a counter in the frame, and pads a new message out of it. This is a really simple operation in hardware, consisting of a bunch of wires and muxes and an 8-bit counter, and is accomplished in a single cycle. The hash never has to leave the hashing unit and each iteration is now 24+1=25 cycles instead of the hundreds of cycles required to arrange the correct padding with the CPU in the loop. Since the hash core itself is not affected, the hardware size does not grow much.
SLotH is still just a hash accelerator, not an end-to-end SLH-DSA implementation with everything hard-wired. However, such features increase the utilization of the core hash so that clh/h rate for some parameters is less than 50, within 50% of the absolute maximum.
There are additional security features, notably implementing the "secret key expander" function PRF so that the processor doesn't need to handle the secret key SK.seed once it has been loaded into the hardware unit. This helps to eliminate side-channel leakage; there is a TVLA test in there that shows that CPU implementations are extremely leaky.
Cheers,
-markku