[pqc-forum] Dilithium & Saber unified in Hardware


AIKATA

Oct 31, 2021, 4:14:34 PM
to pqc-...@list.nist.gov

Hello PQC Community,


We announce the implementation of a unified cryptoprocessor that performs both lattice-based digital signature (CRYSTALS-Dilithium) and key encapsulation (Saber) operations. The architecture has been optimized to exploit algorithmic and structural synergies between Dilithium and Saber. Our cryptoprocessor is implemented fully in hardware, i.e., no computation is offloaded to software.


We have uploaded the manuscript to IACR ePrint. 

It is currently available at: https://github.com/acmert/acmert.github.io/blob/master/files/saber-dil-processor.pdf



In this unified cryptoprocessor, the two cryptographic schemes (Dilithium and Saber) share a common polynomial arithmetic unit and a common Keccak core. Polynomial multiplication and pseudorandom number generation are the most expensive operations in both Dilithium and Saber, so having optimized cores for these two operations is important for both performance and area. Our cryptoprocessor is also capable of executing data-independent operations in parallel.


As Dilithium makes the NTT an integral part of the protocol, the unified polynomial multiplier in the cryptoprocessor uses the NTT method. For Saber, a 24-bit prime p’ = 2^24 - 2^14 + 1 is used as the NTT modulus. It has the same structure as the 23-bit prime q = 2^23 - 2^13 + 1 of Dilithium, which leads to an optimized data-path design. The finite state machine (FSM) that controls the internal steps of the NTT is also shared, as we use the same type of NTT (with full CRT splitting) for both schemes. We also add a very efficient modular reduction unit that benefits from the structure of the two primes.
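To illustrate why the shared structure of the two primes helps, here is a minimal software sketch of a division-free reduction. It is not the reduction unit from the manuscript; the function name `reduce_special` and the folding loop are assumptions made for illustration. It only relies on the fact that for q = 2^k - 2^(k-10) + 1 we have 2^k ≡ 2^(k-10) - 1 (mod q), which holds for both k = 23 (Dilithium) and k = 24 (Saber NTT modulus).

```python
# Sketch only: a software model of a shift-and-add reduction for primes of
# the form q = 2^k - 2^(k-10) + 1. Hypothetical helper, not the hardware unit.

Q_DILITHIUM = 2**23 - 2**13 + 1   # 8380417, k = 23
P_SABER_NTT = 2**24 - 2**14 + 1   # 16760833, k = 24


def reduce_special(x: int, k: int) -> int:
    """Reduce a non-negative integer x modulo q = 2^k - 2^(k-10) + 1."""
    q = (1 << k) - (1 << (k - 10)) + 1
    mask = (1 << k) - 1
    while x >> k:
        hi, lo = x >> k, x & mask
        # 2^k = 2^(k-10) - 1 (mod q), so hi * 2^k folds to (hi << (k-10)) - hi.
        x = lo + (hi << (k - 10)) - hi
    # Here x < 2^k < 2q, so a single conditional subtraction is enough.
    return x - q if x >= q else x


if __name__ == "__main__":
    a, b = 1234567, 7654321
    print(reduce_special(a * b, 23) == (a * b) % Q_DILITHIUM)
    print(reduce_special(a * b, 24) == (a * b) % P_SABER_NTT)
```

The same routine serves both moduli with only the parameter k changed, which is the kind of sharing a common data path can exploit.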



Area and Timing results:

The NTT unit of the cryptoprocessor uses only two butterfly cores. Using more (or fewer) cores would reduce (or increase) the cycle count at the cost of a larger (or with the benefit of a smaller) area.
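The area/latency trade-off can be made concrete with an idealized count. The sketch below assumes a radix-2 NTT over n = 256 coefficients (the polynomial degree used by both schemes) and one butterfly per core per cycle; it ignores pipeline latency, memory bandwidth, and control overhead, so it is not a cycle count from the manuscript. The function name `ideal_ntt_cycles` is hypothetical.

```python
# Sketch only: idealized butterfly-cycle model for a radix-2 NTT.
# Real designs add pipeline fill/drain and memory-access overheads.

def ideal_ntt_cycles(n: int, butterfly_cores: int) -> int:
    """Lower bound on cycles if each core completes one butterfly per cycle."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    if butterfly_cores < 1:
        raise ValueError("need at least one butterfly core")
    stages = n.bit_length() - 1            # log2(n) stages
    butterflies = (n // 2) * stages        # n/2 butterflies per stage
    return -(-butterflies // butterfly_cores)  # ceiling division


if __name__ == "__main__":
    for cores in (1, 2, 4, 8):
        print(cores, ideal_ntt_cycles(256, cores))
```

Under these assumptions a 256-point transform needs 1024 butterflies, so two cores give a floor of 512 cycles and doubling the cores halves that floor.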


The entire cryptoprocessor architecture occupies 18,040 LUTs, 9,101 flip-flops, 4 DSP units, and 14.5 BRAMs on the Xilinx Zynq UltraScale+ ZCU102 FPGA.


The FPGA implementation of the cryptoprocessor runs at a 200 MHz clock frequency and completes the CCA-secure key generation, encapsulation, and decapsulation operations for Saber in 54.9, 72.5, and 94.7 µs, respectively. For Dilithium-II, the key generation, signature generation, and signature verification operations take 78.0, 164.8, and 88.5 µs, respectively, in the best-case scenario where a valid signature is generated after the first loop iteration.


The memory requirement of the implementation is determined by Dilithium as it requires much more memory than Saber. 


The cryptoprocessor has also been synthesized for ASIC with the UMC 65nm library. It achieves a 370 MHz clock frequency and occupies 0.301 mm^2 (~ 200.6 kGE), excluding on-chip memory. The ASIC implementation performs the key generation, encapsulation, and decapsulation operations for Saber in 29.6, 39.2, and 51.2 µs, respectively, and the key generation, signature generation, and signature verification operations for Dilithium-II in 42.2, 89.1, and 47.8 µs, respectively.
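As a quick sanity check on the figures above, latency multiplied by clock frequency gives an approximate cycle count, and the FPGA and ASIC numbers should agree if the same design is simply clocked faster. The sketch below uses only the latencies and frequencies quoted in this post; the helper names are hypothetical, and the results are rounded estimates rather than cycle counts reported in the manuscript.

```python
# Sketch only: derive approximate cycle counts from the quoted latencies.
# cycles ~= latency [us] * clock [MHz]

FPGA_MHZ, ASIC_MHZ = 200, 370

# operation -> (FPGA latency in us, ASIC latency in us), as quoted above
LATENCY_US = {
    "Saber keygen":     (54.9, 29.6),
    "Saber encaps":     (72.5, 39.2),
    "Saber decaps":     (94.7, 51.2),
    "Dilithium keygen": (78.0, 42.2),
    "Dilithium sign":   (164.8, 89.1),
    "Dilithium verify": (88.5, 47.8),
}


def cycles(latency_us: float, clock_mhz: float) -> int:
    """Approximate cycle count for a latency in microseconds at a clock in MHz."""
    return round(latency_us * clock_mhz)


def cycle_table() -> dict:
    """Map each operation to its (FPGA cycles, ASIC cycles) estimate."""
    return {
        op: (cycles(fpga_us, FPGA_MHZ), cycles(asic_us, ASIC_MHZ))
        for op, (fpga_us, asic_us) in LATENCY_US.items()
    }


if __name__ == "__main__":
    for op, (fpga, asic) in cycle_table().items():
        print(f"{op:18s} FPGA ~{fpga:6d} cycles   ASIC ~{asic:6d} cycles")
```

For example, Saber key generation comes out at about 10,980 cycles from the FPGA figures and about 10,952 from the ASIC figures; all six operations agree to well within 1%, which is consistent with rounding of the quoted latencies.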



Regards

Aikata, Ahmet, Sujoy
