Zero-dependency Implementations of NIST LWC Finalists


Anjan Roy

Aug 29, 2022, 2:26:48 PM
to lwc-...@list.nist.gov
Hi LWC forum,

I'm Anjan Roy, a software engineer from India. Today I'm writing to inform you about an initiative I took a few months ago, which has recently come to an end.

Around five months ago, I decided to start implementing all finalists of the NIST LWC standardization effort as zero-dependency, header-only C++ libraries, which should be fairly easy to use in production-grade projects. Last week I finished implementing all the final-round candidates, including all (non-primary) variants proposed under each candidate cipher suite. As a concrete example, for Romulus I implemented all three AEAD variants along with the single proposed hash function. These implementations come with test suites, benchmarks and API usage examples. For testing functional correctness, I use the KATs present in the NIST submission packages. CPU benchmarking is done using the google-benchmark framework. These implementations don't specifically target microcontrollers; instead they broadly target the CPUs found in desktop/server/mobile devices, which is why you might notice they are optimised for speed rather than code size or power consumption.

I decided to write to you hoping that this helps you and the larger community, who might be looking for alternative implementations/benchmarks on different compute platforms. Below I'm attaching links to the repositories where these ten projects are maintained.

I'll be eager to learn about any suggestions/ comments/ ideas that you might have for me.

Have a good day. Regards

List of repositories:

Ascon: https://github.com/itzmeanjan/ascon

Anjan Roy

Alex Max

Aug 30, 2022, 10:17:53 AM
to Anjan Roy, lwc-...@list.nist.gov
Dear Anjan, all,

Thanks a lot for your hard work and the results!

I did not look into all your implementations but decided to compare your implementation of Grain-128AEADv2 vs. ours.
Here is the comparison table -- I have tested all variants under the same platform, test environment, and same conditions.

PS: some of your performance numbers are a bit confusing (see the cells marked with '?' in the table below)

EXTERN CODE: https://github.com/itzmeanjan/grain-128aead
OUR CODE:    https://github.com/Noxet/Grain-128AEAD/tree/master/NIST/optimized

EXTERN TEST ENV & EXTERN CODE. Speeds are in Mega-bits per second (Mbps).
Speed of encryption (Mbps) using 32 bytes of AD and a plaintext of X bytes (as below):

Platform                   |     64 |     128 |    256 |    512 |   1024 |   2048 |   4096
---------------------------+--------+---------+--------+--------+--------+--------+-------
Intel Core i5-8...@2.40GHz | 134.96 | 143.76? | 137.2? | 141.84 | 147.28 | 153.28 | 156.16
AWS Graviton3              |  92.24 | 117.68  | 144.48 | 166.96 | 182    | 192.32 | 197.6
Intel Xeon E5-2...@2.30GHz |  52.16 |  63.12  |  73.36 |  81.28 |  85.84 |  89.04 |  90.8

OUR TEST ENV (3 sec loop) on platform: Intel Core i5-1...@2.60GHz, Win10, VS 2022 Professional.
Speed of encryption (Mbps) using 32 bytes of AD and a plaintext of X bytes (as below):

Variant          |     64 |    128 |     256 |     512 |    1024 |    2048 |    4096
-----------------+--------+--------+---------+---------+---------+---------+--------
EXTERN CODE, x64 |  85.9  | 112.02 |  131.77 |  145.4  |  152.39 |  156.58 |  162.18
OUR CODE, x64    | 350.83 | 453.16 |  527.43 |  574.12 |  599.78 |  615.16 |  620.76
OUR CODE, SSE    | 619.62 | 861.05 | 1066.05 | 1212.15 | 1298.83 | 1344.97 | 1375.73
OUR CODE, AVX512 | 658.07 | 937.6  | 1190.13 | 1369.44 | 1486.18 | 1551.89 | 1582.65


Best regards,
/Alexander



Anjan Roy

Sep 1, 2022, 12:50:45 AM
to Alex Max, lwc-...@list.nist.gov
Hi Alex and LWC community,

Thanks for taking a look at my Grain-128 AEAD implementation and comparing its performance with your optimized work.

The benchmark result table that you see in the repository https://github.com/itzmeanjan/grain-128aead
uses 32 bytes of associated data and N bytes of plain/cipher text, while bandwidth is reported in bytes per second. The two fields
which you've marked as confusing seem to suffer from the target machine being busy with other work, as I see the CPU time there is smaller than the real time.

Looking at the comparison table you provided, it looks like my implementation could be sped up ~4x on x86_64 by authenticating and encrypting/decrypting
32 consecutive message bits in parallel, because currently I process only 8 bits in parallel (except during cipher initialization, where 32 cycles are executed in parallel).

Inspired by your results, I decided to update my implementation of Grain-128 AEAD to preferably process 32 consecutive message bits in parallel,
which, as expected, has brought a ~4.5x performance boost on x86_64. I've worked on this feature in PR https://github.com/itzmeanjan/grain-128aead/pull/5.

Note that this optimization also benefits from the BMI2 capability (if available) on x86_64, for performing faster even/odd bit deinterleaving, which is required for
extracting the key stream bits used for encryption/decryption and authentication. Please see https://github.com/itzmeanjan/grain-128aead/commit/35717de .
For CPU systems where this capability is not available, I settled for a somewhat more expensive deinterleaving technique
https://github.com/itzmeanjan/grain-128aead/blob/706c2e3/include/aead.hpp#L45-L99 , which still seems to do better than the trivial loop-based solution.

These changes bring a ~2.2x performance boost on ARM CPU systems, specifically on AWS Graviton{2,3}. I'd be interested in adding SSE and AVX support
to my implementation.

For the sake of better understanding, I'm attaching a diff of the benchmark table on x86_64 (specifically an Intel(R) Core(TM) i5-8279U CPU @ 2.40GHz).


Have a good day. Regards

Anjan Roy
bench_diff.png

Alex Max

Sep 14, 2022, 5:59:44 AM
to Anjan Roy, lwc-...@list.nist.gov

Hello Anjan and everyone,

I have done some more programming of Grain utilising AVX512 and managed to improve on my own fastest previous code by 50%; it now runs at 2.3 Gbps. While this was a good exercise, we should bear in mind that Grain is a hardware-oriented cipher, and thus I did not have big expectations for the software side; still, there is room for a number of interesting tricks, and the final performance looks decent.

I focused mostly on improving the keystream (preoutput) function but also made some minor improvements to other parts -- thanks a lot for pointing me to the _pext_u64() instruction, I did not know about it. I also removed the GF2 instructions by using vpshufbitqmb for 64-bit reversal instead, so there is now no dependency on GF2-NI.

In the attached zip you will find the new code and 5 implementation attempts at the keystream (preoutput) generation. The current settings in the sources select the fastest code (which is version 5). The updated performance table is as follows:

OUR TEST ENV (3 sec loop) on platform: Intel Core i5-1...@2.60GHz, Win10, VS 2022 Professional
Speed of Encryption in (Mbps -- Megabits per second) using 32 bytes of Associated Data
and the Plaintext of length N bytes (as below)
NEW CODE with AVX512     |  64 |  128 |  256 |  512 | 1024 | 2048 | 4096
-------------------------+-----+------+------+------+------+------+-----
v5 (our kmask expansion) | 943 | 1359 | 1727 | 2022 | 2212 | 2321 | 2377


A short description of what I have tried in 5 versions:

V1. This was my very first attempt to use full 512-bit registers with a better alignment of the 64-bit input arguments. I heavily used the instruction vpshrdvq, which makes it possible to get 8x64-bit offsets from two zmm registers that represent the high and low 64-bit parts of 128-bit lanes, all with just a single call. I also used parallel calls for ternary logic and kmasks -- the latter is a feature of AVX512; kmasks have their own register bank, and I tested both my own expansion of the kmasks and letting the compiler expand these k-registers. The speed achieved was around 1800 Mbps.

V2. In v1 I realised that the biggest problem is that most of the instructions are spent aligning the input arguments into 512-bit registers while the logic part is short, so I tried a version where I do the alignment in x64 style; but this did not help to speed things up, and the speed dropped to only 638 Mbps. The lesson learnt is that loading/storing between RAM and registers is quite costly.

V3. This is a refactored version of v1 where reads/writes from RAM are better optimized. I also started to use the instruction vpermq for quicker data alignment in registers; although it has a latency of 3, its throughput is 1, so it should still be ok. However, no significant speedup vs v1.

V4. This was my last attempt to generate 64 bits of the preoutput with all the knowledge gathered in the previous attempts, focusing on reducing the number of k-masks, which effectively saved some clocks but still showed no visible improvement compared to v1 and v3.

V5. In all previous versions I was computing 64 bits of the preoutput, and thus I had to keep wide 64-bit values, even though the computation of the upper 32 bits is shorter than the computation of the lower half. In this very last attempt I just wanted to see what my best try would be if we focus only on generating 32 bits of the preoutput, obtaining 64 bits by simply calling the function twice. I used instruction stitching and interleaving techniques, leveraging the instructions' throughput rather than their latency, and utilised only 4 k-masks, which overall gave an impressive 2377 Mbps -- much better than when I was trying to make 64 bits in one go. I provide more comments and details in the v5 code itself.

Best regards,

/Alexander

 

 


grain128-AEADv2-newopt-AVX512.zip

Alex Max

Sep 19, 2022, 7:50:05 AM
to Anjan Roy, lwc-...@list.nist.gov
Dear Anjan, all,

Here are some more improvements. My 9th attempt at speeding up Grain-128AEADv2 runs at 2953 Mbps, for 32 bytes of AD and 4096 bytes of plaintext, on the same platform and under the same test conditions as before.

Best regards,
/Alexander

grain-aeadv2-v9.zip

Anjan Roy

Jul 29, 2023, 10:42:51 AM
to lwc-forum, Alex Max, lwc-...@list.nist.gov
Dear NIST LWC community,

Good day. I hope you're doing well.

Today I'm writing to inform you about some recent developments in the header-only C++ library implementation of the Ascon cipher suite.

As the Ascon cipher suite is being standardized by NIST, I decided to extend my existing Ascon C++ library implementation, which previously implemented only three AEAD candidates, two hash functions and two extendable-output functions, by including authentication schemes based on the Ascon permutation. Ascon-based authentication schemes such as Ascon-PRF, Ascon-MAC and Ascon-PRFShort were proposed later, in the paper https://ia.cr/2021/1574, and feature interesting performance characteristics.

Right now the Ascon cipher suite's C++ header-only library implements the following:
  • AEAD: Ascon-128, Ascon-128a and Ascon-80pq.
  • Hashing: Ascon-Hash, Ascon-Hasha with incremental message absorption; Ascon-Xof, Ascon-Xofa with incremental message absorption and output squeezing.
  • Authentication: Ascon-PRF and Ascon-MAC with incremental message absorption and output squeezing; and Ascon-PRFShort for authenticating/verifying short messages.
This library implementation of the Ascon cipher suite ensures conformance with the Ascon specification by using the Known Answer Tests from https://github.com/ascon/ascon-c. This implementation might be of interest because it deviates from exposing traditional C-style pointer/length based interfaces; instead it features a C++20 std::span ( more @ https://en.cppreference.com/w/cpp/container/span ) based interface, which can be made strongly typed by encoding static length information in the type itself, at least for the secret key, nonce, authentication tag, message digest etc. Also, the Ascon permutation is implemented as a constexpr ( more @ https://en.cppreference.com/w/cpp/language/constexpr ) function, making it possible to apply an n-round permutation instance at compile time, which enables static assertions for testing the correctness of the permutation.

Finally, I'd like to end by presenting benchmark results, collected on a machine running GNU/Linux kernel 6.2.0-26-generic on a 12th Gen Intel Core i7-1260P CPU, with the benchmark suite compiled using GCC-12.2.0, optimized for speed with the flags -O3 -march=native -mtune=native.

Note, the google-benchmark ( more @ https://github.com/google/benchmark ) library is used as the benchmark harness. I also use libPFM ( more @ https://perfmon2.sourceforge.net ) for collecting h/w performance counters such as CPU cycles and retired instructions.
ascon-perf-on-12th-gen-intel-i7.png

The header-only C++ library implementation of the Ascon cipher suite is maintained @ https://github.com/itzmeanjan/ascon, where you can find tests/benchmarks/API usage examples etc. I'd love to get your suggestions for improving this library implementation.

Best regards,
Anjan

Anjan Roy

Oct 3, 2023, 9:04:55 AM
to lwc-forum, Anjan Roy, Alex Max, lwc-...@list.nist.gov
Dear LWC community,

Hope you're doing well. 

I just wanted to let you know that my C++{>=20} header-only library implementation of the Ascon cipher suite now features constexpr functions for the Ascon permutation-based hashing schemes Ascon-{Hash, HashA, Xof, XofA}. This means that, given a constant input message (known at program compilation time), one can evaluate any of the aforementioned hash functions on it during compile time itself. I demonstrate that with the following example screen capture (notice the static_assert).
You can find more about C++ constexpr functions @ https://en.cppreference.com/w/cpp/language/constexpr.

compile-time-eval-ascon-hash.png

I invite you to take a look @ https://github.com/itzmeanjan/ascon.git. I'd appreciate any feedback regarding the library implementation of the Ascon cipher suite as I continue maintaining it.

Regards,
Anjan