Yet Another Implementation of NIST PQC Selected Candidates

Anjan Roy

unread,

Feb 18, 2023, 8:13:50 AM2/18/23

to pqc-forum

Hi NIST PQC community,

Hope you're doing well.

I'm writing to you for informing you about an initiative I took few months ago, which has recently come to an end. I decided to implement all NIST selected PQC candidates as minimal, header-only, zero-dependency, easy-to-use C++ library. I've implemented Kyber KEM ( and PKE ), Dilithium, SPHINCS+ and Falcon DSA.

> From above ordering you might have already guessed why Falcon is listed at the end 😉. It took me sometime to figure.

Below I attach few details regarding implementation of these four selected candidates. In their respective repository, you may find how to test/ benchmark/ use these libraries.

Before moving to next part, I'd like to mention that though my Falcon implementation is functionally correct it is not yet tested to be conformant with Known Answer Tests found in NIST submission. One shoutout, during Falcon implementation I found Thomas Prest's https://github.com/tprest/falcon.py very helpful along with the specification and reference implementation. Other three implementations ( i.e. Kyber, Dilithium and SPHINCS+ ) are tested with Known Answer Tests ( found in respective NIST submission packages ) for conformance and correctness.

Following benchmark results were collected on an Intel Core i5-8279U CPU @ 2.40GHz Macbook Pro, using google-benchmark ( more @ https://github.com/google/benchmark ) as harness and Clang as C++ compiler. You'll find more benchmark results in respective repository.

1. Kyber PKE and KEM implementation lives @ https://github.com/itzmeanjan/kyber

2. Dilithium PQ DSA implementation lives @ https://github.com/itzmeanjan/dilithium

3. SPHINCS+ PQ DSA implementation lives @ https://github.com/itzmeanjan/sphincs

4. Falcon DSA implementation lives @ https://github.com/itzmeanjan/falcon

If you visit the repository of Falcon implementation, you will notice that it is not truly zero-dependency, because for key generation step, when solving NTRU equation, I make use of GNU MP library ( more @ https://gmplib.org/manual/ ) for arbitrary-precision integer arithmetic. In future, I'd like to trim it down to some minimal big-integer arithmetic implementation. Another thing to notice in benchmark results of Falcon key generation - it's pretty slow, well that's because it uses recursive Karatsuba algorithm for multiplying two polynomials with integer coefficients. As suggested in https://github.com/tprest/falcon.py/blob/88d01ede/ntrugen.py#L139-L144 it's worth optimizing key generation by implementing Toom-Cook-3, which I plan to do in sometime future.

All these candidates depend of SHA3 hash or extendable output functions, for which I implemented SHA3 specification @ https://github.com/itzmeanjan/sha3 which itself is a slim, zero-dependency, header-only C++ library.

That's all. Though the quest for implementing NIST PQC selected candidates has come to an end, I hope to continue improving and maintaining them, with your suggestion/ comments.

Thanks. Have a good day.

Anjan Roy

Thom Wiggers

unread,

Feb 18, 2023, 9:33:08 AM2/18/23

to Anjan Roy, pqc-forum

Hi,

It’s nice to have more implementations than the reference onces. However, could you maybe clarify your security goals in these implementations? For example if your implementations run in secret-independent/constant time?

Now, I do not have a lot of experience with C++, but I also see a Mersenne Twister in your PRNG functions. My understanding is that MT isn’t up to cryptographic standards, so can you maybe clarify that as well? Related:

https://stackoverflow.com/questions/44867500/is-stdrandom-device-cryptographic-secure.

Cheers,

Thom

Op za 18 feb. 2023 om 14:13 schreef Anjan Roy <anjan...@gmail.com>

--
You received this message because you are subscribed to the Google Groups "pqc-forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pqc-forum+...@list.nist.gov.
To view this discussion on the web visit https://groups.google.com/a/list.nist.gov/d/msgid/pqc-forum/f56ea1ff-4e3d-48c8-a787-dd893b778256n%40list.nist.gov.

Anjan Roy

unread,

Feb 19, 2023, 12:47:17 AM2/19/23

to pqc-forum, Thom Wiggers, pqc-forum, Anjan Roy

Hi Thom,

> However, could you maybe clarify your security goals in these implementations? For example if your implementations run in secret-independent/constant time?

I target desktop/ server grade CPU systems where binary size is not much of a concern, rather performance is priority, while just using plain C++ and its standard library features, keeping it as readable as possible. Speaking of security goals, I try to not decide which intructions to be executed or which memory locations to be accessed based on some secret input. In general, I try to reduce branching as much as possible while *almost always* removing secret dependent branching. I say almost because I'd ideally like these implementations to be audited - because it's totally possible that I've missed something. And also compilers are certainly free to do various sorts of optimizations so I won't claim these implementations are fully tested to be constant-time. Though making that happen would be my eventual goal.

> Now, I do not have a lot of experience with C++, but I also see a Mersenne Twister in your PRNG functions. My understanding is that MT isn’t up to cryptographic standards, so can you maybe clarify that as well?

I use Mersenne Twister seeded with system randomness ( std::random_device, more @ https://en.cppreference.com/w/cpp/numeric/random/random_device ) for generating random input data for testing/ benchmarking these implementations. That was mostly for ease of use. Somewhat vaguely I note that in https://github.com/itzmeanjan/kyber/#important-implementation-note. I'd certainly like to remove them and rather use NIST's RNG function. I also note that system randomness in C++ is an implementation detail and it shouldn't probably be used in cryptographic context.

In this context I've one question --- you may have also noticed I make use of SHAKE256 XOF, Mersenne Twister and C++ implementation of system randomness ( i.e. std::random_device ) for generating random bytes in Falcon implementation ( more @ https://github.com/itzmeanjan/falcon/blob/44424d3df64805602003051b6b92e31ae7787cb6/include/prng.hpp ). I'd like to change it such that it takes some sort seed ( say 32 -bytes seed ) from API user rather than depending on C++'s std::random_device and Mersenne Twister engine. But I don't have a background in cryptography, so I'm not sure whether it's helpful or not. If someone can confirm that using SHAKE256 XOF ( seeded with 32B seed value ) as source of randomness works - that will be great.

Thanks.

Regards

Anjan Roy

Thom Wiggers

unread,

Feb 19, 2023, 10:18:09 AM2/19/23

to Anjan Roy, pqc-forum

Hi Anjan,

I do like your implementations: they are pretty readable, the code is well-organized, and I think you've joined a short list of people that have actually implemented Falcon. However, I am afraid that I need to encourage you to clearly label your implementations, at the top of your READMEs, as experimental and likely insecure. There are a lot of extremely subtle things in cryptographic implementations, and it is extremely easy to make devastating mistakes.

People often say "don't roll your own cryptography", and you stated that you don't have a cryptography background. While "don't roll your own" might thus be used against your work, I do not agree with the sentiment: this is instead just one of your steps to becoming a cryptographer, so I do appreciate your work and hope you continue on this path. However, at this point in time, your implementation is too easy to use and too shiny (proper README, docs, Python bindings...), so there is a risk that people will take your code and integrate it into their products.

Regarding constant-time behaviour of your code, I have concerns. For example, in the `decapsulate` operation of KEMs using the Fujisaki-Okamoto transform, it is critical that the timing does not reveal if the decapsulation actually failed when the KEM returns the random value instead of the decapsulated shared secret. In your implementation

for (size_t i = 0; i < ctlen; i++) {
flg |= static_cast<bool>(cipher[i] ^ c_prime[i]);
}
std::memcpy(kdf_in, g_out, 32 * !flg);
std::memcpy(kdf_in, z, 32 * flg);

flg is thus a secret. The way the code is written, however, invites compilers to generate a conditional jump, which for example seems to be happening with MSVC here: https://godbolt.org/z/hY919xfd1

Another example is your xgcd implementation, which is clearly not constant-time. Falcon's reference implementation contains an example of a constant-time implementation with lots of documentation. GMP also probably does not provide, on all platforms, the constant-time 64-bit floating point arithmetic that Falcon requires.

I am afraid I don't have enough crypto implementation expertise to properly evaluate your code. I suggest clarifying the status of the constant-time-ness of your implementation in your READMEs, along with the warning.

Regarding RNG: It's probably better to explicitly rely on the system random number generator than the C++ stdlib. As I read your implementation, while the random_device you use to seed the PRNG *may* be the system RNG, you then extract the used random bytes from a non-cryptographically secure RNG, the Mersenne Twister. You may draw some inspiration from / take the implementation from Sprenkels at https://github.com/dsprenkels/randombytes/. Putting a seed into SHAKE, which is then used as a XOF for all other randomness needs is pretty common across many implementations, and fine. Linux, for example, does something similar under the hood, using ChaCha20.

I hope this helps, and again, I do like many things about your implementations. Good luck with the work.

Cheers,

Thom

Op zo 19 feb. 2023 om 06:47 schreef Anjan Roy <anjan...@gmail.com>:

Anjan Roy

unread,

Feb 20, 2023, 2:15:30 AM2/20/23

to pqc-forum, Thom Wiggers, pqc-forum, Anjan Roy

Hi Thom,

First of all thanks for looking at these implementations and providing me with such constructive feedback. It's really helpful.

> However, I am afraid that I need to encourage you to clearly label your implementations, at the top of your READMEs, as experimental and likely insecure. There are a lot of extremely subtle things in cryptographic implementations, and it is extremely easy to make devastating mistakes.

In all four of these implementation's top portion of README, I now add a warning note. Also now that NIST has selected Ascon in LWC standardization effort, I followed same suggestion in my implementation of Ascon.

> flg is thus a secret. The way the code is written, however, invites compilers to generate a conditional jump,

Yes it's totally correct and GCC/ Clang also generates a conditional jump. I will update them such that they don't use boolean rather comparison result is accumulated in a uint64_t or uint32_t value. I found Golang's subtle package ( more @ https://pkg.go.dev/crypto/subtle ) pretty helpful.

> Another example is your xgcd implementation, which is clearly not constant-time. Falcon's reference implementation contains an example of a constant-time implementation with lots of documentation. GMP also probably does not provide, on all platforms, the constant-time 64-bit floating point arithmetic that Falcon requires.

Falcon reference implementation is really well documented with various tricks/ techniques that it employs for preserving contant-timeness. I hope to bring those in my implementation of Falcon too. I use GMP only for arbitrary precision integer arithmetic ( in NTRUSolve algorithm ), for any sorts of floating point arithmetic I fall back to double precision floating point numbers ( C++'s primitive type double ). I believe, on some CPU targets it can be emulated or non-constant-time too due to optimizations that FPU can bring in. I think I should just look at Falcon reference implementation and read https://eprint.iacr.org/2019/893.

> You may draw some inspiration from / take the implementation from Sprenkels at https://github.com/dsprenkels/randombytes/.

Thanks I'll definitely take a look. This is going to be helpful.

> Putting a seed into SHAKE, which is then used as a XOF for all other randomness needs is pretty common across many implementations, and fine. Linux, for example, does something similar under the hood, using ChaCha20.

Thanks for confirming that. I'll update it to use SHAKE256 initialized with seed value. I saw using ChaCha20 cipher in a Python implementation of Falcon ( more @ https://github.com/tprest/falcon.py/blob/88d01ede1d7fa74a8392116bc5149dee57af93f2/rng.py ) for generating random bytes.

Thank you very much.

Regards

Anjan Roy

unread,

Mar 10, 2023, 3:34:33 AM3/10/23

to pqc-forum, Anjan Roy, Thom Wiggers, pqc-forum

Hi all,

Good day. Hope you're doing well.

> For example, in the `decapsulate` operation of KEMs using the Fujisaki-Okamoto transform, it is critical that the timing does not reveal if the decapsulation actually failed

> when the KEM returns the random value instead of the decapsulated shared secret. In your implementation

>

> for (size_t i = 0; i < ctlen; i++) {
> flg |= static_cast<bool>(cipher[i] ^ c_prime[i]);
> }
> std::memcpy(kdf_in, g_out, 32 * !flg);
> std::memcpy(kdf_in, z, 32 * flg);

>

> flg is thus a secret. The way the code is written, however, invites compilers to generate a conditional jump, which for example seems to be happening with MSVC here: https://godbolt.org/z/hY919xfd1

I just wanted to let you know that I've updated Kyber implementation ( more @ https://github.com/itzmeanjan/kyber ) while retaining constant-timeness in case decapsulation procedure fails. For which I wrote a light-weight, header-only, template-based C++ library for performing constant-time comparison, conditional selection and conditional swapping of unsigned integer ( of various bitwidths ) values - called subtle. Find more @ https://github.com/itzmeanjan/subtle.

Previous implementation of how decapsulation failure was handled lives @ https://github.com/itzmeanjan/kyber/blob/1bde099f/include/decapsulation.hpp#L60-L66

While current one makes use of subtle for comparison and conditional selection @ https://github.com/itzmeanjan/kyber/blob/7b2a82e7/include/decapsulation.hpp#L61-L69

Thanks to Thom Wiggers for helping me in catching that.

Regards

Anjan Roy

unread,

Jun 23, 2024, 6:53:00 AMJun 23

to pqc-forum

Dear all,

Good day ! I hope you're doing well.

I'm happy to inform you, I've updated my C++20 header-only library implementation of Kyber to be conformant with NIST FIPS 203 draft standard. The project lives @ https://github.com/itzmeanjan/ml-kem.git, implementing all parameter sets suggested in the draft standard and it's licensed under MPL-2.0.

✅ It passes all the Known Answer Tests, which I derive from ML-KEM "official" reference implementation. Reproducible steps @ https://gist.github.com/itzmeanjan/c8f5bc9640d0f0bdd2437dfe364d7710.

✅ I've integrated dudect -based constant-timing leakage tests. I've not found any problems so far.

❌ I've not yet finished integrating extensive KAT tests from wycheproof i.e. https://github.com/C2SP/wycheproof/pull/110. But my limited testing shows encapsulation fails for malformed public key, as expected.

✅ I've implemented all the functions of ML-KEM library as constexpr, meaning you should be able to evaluate ML-KEM key generation/ encapsulation/ decapsulation at program compile-time itself, given all inputs are known at program compile-time. I keep a demo @ https://github.com/itzmeanjan/ml-kem?tab=readme-ov-file#usage.

Finally, I decided to benchmark my portable C++20 header-only library implementation against "official" reference and AVX2 optimized implementation of ML-KEM and I present result 👇.

I see a huge room for improvement and I'd like to decrease the size of the room 😉. Also I'd like to know your thoughts.

Kind regards,

Anjan Roy

unread,

Jun 23, 2024, 7:01:41 AMJun 23

to pqc-forum, Anjan Roy

Dear all,

Hi again 👋

Speaking of benchmark comparison in above message, I noticed, in HEAD of `standard` branch of https://github.com/pq-crystals/kyber/tree/standard, it doesn't implement modulus-check ( see line 786-792 of https://doi.org/10.6028/NIST.FIPS.203.ipd ) when accepting public key for encapsulation. But in my implementation @ https://github.com/itzmeanjan/ml-kem, I'm already doing it. If it changes, benchmark comparison result will get affected to some extent.

I've opened an issue, inquiring about it @ https://github.com/pq-crystals/kyber/issues/80.