More details on the new XOF API in FIPS 203 and 204


Moody, Dustin (Fed)

Apr 19, 2024, 2:06:05 PM
to pqc-forum

Hi All,

 

This forum post outlines a planned change in how we will describe some uses of the XOFs SHAKE128 and SHAKE256 in FIPS 203 and FIPS 204. This will not change how the algorithms in FIPS 203 and FIPS 204 work; it will only change how they are described. In particular, the algorithm SampleNTT in FIPS 203 and the algorithms RejNTTPoly and RejBoundedPoly in FIPS 204 do rejection sampling on bytes sampled from the output of an XOF. Since the number of output bytes that must be produced depends on the values of the bytes already produced, the API given by FIPS 202 for calling SHAKE128 and SHAKE256 is not ideal. Instead, for SampleNTT, RejNTTPoly, and RejBoundedPoly, we intend to use an alternate API for calling SHAKE that will consist of three functions:

 

SHAKEXXX.Init() \\ Initializes a hash function context.

SHAKEXXX.Absorb(ctx, str) \\ Inject data from a byte string str to be used in the absorbing phase of SHAKEXXX; output a hash function context that stores the current 1600-bit state of the Keccak sponge and any potentially incomplete data block.

SHAKEXXX.Squeeze(ctx, B) \\ Extract and output B bytes produced during the squeezing phase of SHAKEXXX; in addition to the requested B bytes, output a context that stores the current 1600-bit state of the Keccak sponge and any squeezed bytes that have not yet been output.

 

We intend to publish pseudocode in terms of the Keccak-f permutation for these API calls in an update of SP 800-185 at roughly the same time as we publish the final versions of FIPS 203, 204, 205.

It is our intent that these three functions will be defined in such a way that the following call to SHAKEXXX:

 

output ← SHAKEXXX(str_1| . . . |str_m, B_1 + . . . + B_k),

 

will produce the same output as the following steps:

 

1. ctx ← SHAKEXXX.Init()

2. For i = 1 to m:

ctx ← SHAKEXXX.Absorb(ctx, str_i)

3. For j = 1 to k:

(ctx, out_j) ← SHAKEXXX.Squeeze(ctx, B_j)

4. output ← out_1| . . . |out_k
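The equivalence above can be illustrated with a toy Python simulation. This only models the *semantics* of the proposed API on top of hashlib's one-shot SHAKE128 (re-hashing on every Squeeze and slicing the stream); a real implementation would keep the 1600-bit Keccak state and squeeze incrementally. The dictionary-based context and the function names are illustrative, not the pseudocode NIST will publish.

```python
import hashlib

def shake128_init():
    # The "context" here is just the absorbed input plus a squeeze offset.
    return {"absorbed": b"", "offset": 0}

def shake128_absorb(ctx, data):
    assert ctx["offset"] == 0, "absorbing after squeezing is not modeled here"
    ctx["absorbed"] += data
    return ctx

def shake128_squeeze(ctx, n):
    # Re-derive the full output stream and slice out the next n bytes.
    stream = hashlib.shake_128(ctx["absorbed"]).digest(ctx["offset"] + n)
    out = stream[ctx["offset"]:]
    ctx["offset"] += n
    return ctx, out

# Equivalence check: incremental absorbs/squeezes match one one-shot call.
parts, lengths = [b"rho", b"\x01\x02"], [5, 11, 3]
ctx = shake128_init()
for p in parts:
    ctx = shake128_absorb(ctx, p)
outs = []
for n in lengths:
    ctx, o = shake128_squeeze(ctx, n)
    outs.append(o)
one_shot = hashlib.shake_128(b"".join(parts)).digest(sum(lengths))
assert b"".join(outs) == one_shot
```

The check relies on SHAKE's defining property that a shorter output is a prefix of a longer one, which is exactly what makes the four-step decomposition above well-defined.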

 

Please let us know if you have any questions or feedback on this plan. 

 

In addition, there is one more change we wanted to mention, which is relevant to FIPS 203 and FIPS 204.  We plan to explicitly allow implementers to limit the number of iterations in while loops. We intend to give a minimum allowable iteration limit, such that deviation from the behavior of the while loop is cryptographically rare for a correct implementation.
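A bounded rejection-sampling loop in the style of SampleNTT (FIPS 203) might look like the sketch below. The cap of 280 three-byte reads is an illustrative number chosen so that hitting it is astronomically unlikely, not a limit taken from the standards, and the seed is a placeholder.

```python
import hashlib

Q = 3329  # the ML-KEM modulus

def sample_poly(seed, n=256, max_reads=280):
    # Fixed upper bound on XOF consumption: 3 bytes per read.
    stream = hashlib.shake_128(seed).digest(3 * max_reads)
    coeffs = []
    for i in range(max_reads):
        b0, b1, b2 = stream[3 * i : 3 * i + 3]
        d1 = b0 + 256 * (b1 % 16)    # first 12-bit candidate
        d2 = (b1 // 16) + 16 * b2    # second 12-bit candidate
        for d in (d1, d2):
            if d < Q and len(coeffs) < n:
                coeffs.append(d)     # keep candidates below q, reject the rest
        if len(coeffs) == n:
            return coeffs
    # Reached only with cryptographically negligible probability.
    raise RuntimeError("iteration limit exceeded")

poly = sample_poly(b"example seed")
assert len(poly) == 256 and all(0 <= c < Q for c in poly)
```

Each 3-byte read yields two 12-bit candidates accepted with probability 3329/4096 ≈ 0.81, so 280 reads (560 candidates) leave an enormous margin over the ~158 reads needed on average.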

 

 

Best,

Ray Perlner (NIST PQC)

Simon Hoerder

Apr 19, 2024, 3:03:52 PM
to pqc-...@list.nist.gov
Hi,

with a separate implementation of absorb() and squeeze() functions it's
at least theoretically possible to alternate between absorbing and
squeezing. How do you expect that to be handled in a FIPS 140 crypto
module? Is a FIPS 140 crypto module even allowed to expose the init(),
squeeze() and absorb() subfunctions to the host via its API?

Best,
Simon

Markku-Juhani O. Saarinen

Apr 19, 2024, 4:03:34 PM
to pqc-forum, Simon Hoerder
Hi Simon,

(Caveat: I'm not a NIST person but I had to deal with these things.)

There are all kinds of fun & useful things we could do with the Keccak permutation but are not allowed to (like having more secure AEADs), but this functionality shouldn't be a problem. Hash function APIs in FIPS compliant modules have not been required to be atomic (i.e. one hash() function call). In FIPS you can be atomic or incremental, but it's safer not to offer both interfaces! :D  (yes, this is what FIPS is like.) Let me explain..

Old crypto APIs have not had an incremental XOF output (i.e. squeeze() function) but I don't think that's a FIPS problem per se. Incremental input API has been there since the beginning of time of course ( RFC 1186 from 1990 had an incremental update API ).

For FIPS 140-3 compliant modules implementing FIPS 202 (which has everything we need for ML-KEM and ML-DSA) one simply needs to pass the CAVP tests.  There are tests in there (such as the large data test) that basically require the module to support incremental hashing -- in effect, separate absorb(): https://pages.nist.gov/ACVP/draft-celi-acvp-sha3.html#name-large-data-tests-for-sha-3

Now the thing from the test description above:

Quote: "There are multiple ways hash functions can be implemented in an IUT. The most common are via a single Hash() call on the message or via a series of Init(), any number of Update(), Final() calls. As noted in [LDT], the difference between these hashing techniques can have consequences in the cryptographic module. If both interfaces are offered and accessible for testing, the IUT MUST only utilize a single Update() call for the large message."

Reading that last sentence -- my interpretation is that you shouldn't make the one-call interface available for testing if one wants to support the incremental interfaces too. Of course in software modules the one-call interface is often just a wrapper for the incremental one so this has very little consequence outside certification paperwork. For an actual hardware module a one-call interface is not a thing.

It is possible for a programmer to go back to update()/absorb() after finalize()/squeeze(), but I guess this can simply be considered a programming error. Sponge-based cryptography security proofs would actually support this -- so it is not necessarily a security problem but a compliance problem, as FIPS 202 doesn't define it.

Cheers,
-markku

Markku-Juhani O. Saarinen

Apr 19, 2024, 4:31:26 PM
to pqc-forum, Markku-Juhani O. Saarinen, Simon Hoerder
Hi again,

Btw.. on the same topic, if one uses hash modules through some kind of FIPS module testing interface for FIPS PQC standards, one is seriously shooting themselves in the foot.

A sensible strategy is to try to simply get a CAVP certificate for a signature module using the higher-level interfaces (i.e. the API for a signature scheme) and let the module internally deal with hashing in an optimized manner. I hope FIPS IG documents will have language explicitly allowing this (it has been implicitly allowed -- OpenSSL-based FIPS modules do this). This is because public-key algorithms were explicitly designed to use hash functions internally in a somewhat more complicated manner, as anyone who has actually read the Kyber, Dilithium, and SPHINCS+ specifications knows.

One can then perhaps get a separate CAVP for the hash module.. but to use those testing interfaces inside a hash-based signature module would be madness (which doesn't mean that some vendors have not tried to do that). An SLH-DSA software module using SHA2 wastes at least 50%, as the PK.seed block hash state can't be cached. And a hardware-software codesign wastes a truly astonishingly large amount of cycles and energy because of primitive interfacing -- 90% or 99%, as I discussed at the NIST PQC workshop last week: https://csrc.nist.gov/Presentations/2024/accelerating-slh-dsa

Cheers,
-markku

Bas Westerbaan

Apr 22, 2024, 7:09:35 AM
to Moody, Dustin (Fed), pqc-forum
This is a good idea.

I would also like to echo Markku's wish that it'd be good to explicitly allow schemes to implement hashing internally, to avoid potentially costly interface boundaries.

Best,

 Bas

John Mattsson

May 18, 2024, 5:53:41 AM
to Bas Westerbaan, Moody, Dustin (Fed), pqc-forum

Hi,

 

Explicitly allowing splitting up SHAKE into Init(), Absorb(), and Squeeze() is a great idea. I strongly suggest that NIST, in the update to SP 800-185, explicitly allow mixing calls to Absorb() and Squeeze(), i.e.,

 

Init()

Absorb()

Squeeze()

Absorb()

Squeeze()

...

 

Such use would be perfect for implementing e.g. transcript hashes in future security protocols. Today you need to output the secret state and then use that as input in the next absorb. This not only increases specification and implementation complexity, but also increases the risk that the secret state leaks. I.e., allowing mixed calls to Absorb() and Squeeze() significantly increases security and robustness. The security of the duplex construction can be shown to be equivalent to the sponge construction, so I don't see any reason for not allowing this.

 

Cheers,

John Preuß Mattsson

 


niux_d...@icloud.com

May 19, 2024, 5:49:10 AM
to John Mattsson, Bas Westerbaan, Moody, Dustin (Fed), pqc-forum
Hi Mattsson,

I suggest limiting interleaved absorb-squeeze to Duplex objects only. Duplex objects are envisioned for permutation-based ciphers such as Ascon, Xoodyak, and (the now broken) Gimli, as well as reseeding PRNGs.

Cheers,
DannyNiu/NJF.

Mike Hamburg

May 19, 2024, 6:37:14 AM
to niux_d...@icloud.com, John Mattsson, Bas Westerbaan, Moody, Dustin (Fed), pqc-...@list.nist.gov
Wait, is Gimli really broken now?  I thought it had full-round distinguishers but no full-round attacks on standard-model security goals.

D. J. Bernstein

May 19, 2024, 7:07:38 AM
to pqc-...@list.nist.gov
Mike Hamburg writes:
> Wait, is Gimli really broken now?

No.

---D. J. Bernstein

John Mattsson

May 19, 2024, 8:11:35 AM
to niux_d...@icloud.com, pqc-forum

Hi Danny,

 

DannyNiu/NJF wrote:

>I suggest limiting interleaved absorb-squeeze to Duplex objects only

 

Do you have any motivation for this? I.e., is there any reason to force people to use SHAKE with a multi-sponge pattern instead of a single duplex pattern?

 

Pattern 1 (multi-sponge):

---------------

Init()

Absorb(info1)

key1 = Squeeze()

secret_state_1 = Squeeze()

Init()

Absorb(secret_state_1)

Absorb(info2)

key2 = Squeeze()

secret_state_2 = Squeeze()

Init()

Absorb(secret_state_2)

Absorb(info3)

key3 = Squeeze()

secret_state_3 = Squeeze()

...

 

 

Pattern 2 (single-duplex):

---------------

Init()

Absorb(info1)

key1 = Squeeze()

Absorb(info2)

key2 = Squeeze()

Absorb(info3)

key3 = Squeeze()

...

 

Given that the security of the duplex construction is shown to be equivalent to that of the sponge construction [1-2], I can only see disadvantages with Pattern 1 using multiple sponges.

 

[1] https://keccak.team/sponge_duplex.html

[2] https://keccak.team/files/SpongeDuplex.pdf
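Pattern 1 can be written today with nothing but one-shot SHAKE, which makes the extra state handling concrete. The sketch below uses hashlib's SHAKE128; the 32-byte key and state lengths are illustrative choices, not from any specification. Pattern 2 cannot be expressed with hashlib at all, since it requires squeezing and then absorbing into the same live sponge state.

```python
import hashlib

def multi_sponge_keys(infos, key_len=32, state_len=32):
    """Pattern 1 (multi-sponge): each round squeezes a key plus an extra
    secret state, and the secret state must be exported and re-absorbed
    into a fresh sponge for the next round."""
    keys, state = [], b""
    for info in infos:
        out = hashlib.shake_128(state + info).digest(key_len + state_len)
        keys.append(out[:key_len])
        state = out[key_len:]  # secret chaining state crosses the API boundary
    return keys

keys = multi_sponge_keys([b"info1", b"info2", b"info3"])
assert len(keys) == 3 and all(len(k) == 32 for k in keys)
```

The `state` variable crossing the loop boundary is exactly the exported secret state that the single-duplex pattern avoids.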

 

Cheers,

John Preuß Mattsson

 

D. J. Bernstein

May 19, 2024, 9:14:09 AM
to pqc-...@list.nist.gov
I'm confused as to the overall approach being taken here.

The background makes sense: Kyber, following previous work, has its
matrix of integers mod q generated from a per-user 32-byte seed so that
attacks can't share per-matrix precomputations across users. Sure, the
matrix generation has a measurable cost, but there's ample evidence that
this cost is negligible in application contexts. Cheap insurance; great!

The problems start with the specific way that Kyber generates this
matrix, namely partitioning variable lengths of SHAKE output into 12-bit
pieces and rejecting integers larger than q = 3329. These details are

* annoying for software tests (see various previous threads on
painful test-generation mechanisms that still allow KyberBleed),

* annoying for software verification (this is exactly the optimized
code that https://eprint.iacr.org/2023/215 _didn't_ verify), and

* annoying for software APIs (pushing people into the complications
of adding incremental XOF APIs, as this discussion illustrates).

Since the evidence says that the CPU time here doesn't matter, why not
change Kyber to simply use SHAKE to generate the right number of 32-bit
integers and reduce those mod q? A matrix of uniform random 32-bit
integers reduced mod q has negligible divergence compared to a uniform
random matrix mod q, so it can't have noticeably more chance of
producing a weak matrix.
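The simple alternative described here is a few lines: squeeze a fixed 1024 bytes, read them as 256 32-bit words, reduce mod q. The per-coefficient divergence from uniform is at most about q/2^32 ≈ 2^-20. The seed and the little-endian byte order below are illustrative choices.

```python
import hashlib
import struct

Q = 3329

def sample_poly_mod(seed):
    # Fixed-length XOF output: always exactly 4 * 256 bytes, no rejection.
    stream = hashlib.shake_128(seed).digest(4 * 256)
    words = struct.unpack("<256I", stream)  # byte order is an arbitrary choice
    return [w % Q for w in words]           # negligible bias: ~q/2^32 per word

poly = sample_poly_mod(b"example seed")
assert len(poly) == 256 and all(0 <= c < Q for c in poly)
```

Because the output length is a constant, there is no data-dependent loop to test, verify, or expose through an incremental XOF API.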

Or, _if_ there's an argument to trade some simplicity for speed, why not
generate a smaller _and still constant_ number of bytes, and use
conventional arithmetic-coding techniques to convert those bytes into
well-distributed integers mod q? I would expect this to be faster than
Kyber's current approach, while avoiding the problems with tests etc.

For me the obvious way to answer these questions is with the procedural
rule of not making cryptosystem changes between the final submission and
the standard. Such changes are an assault against public review; public
review is our top protection against things going wrong.

But NIST has made clear that it doesn't agree with this rule. NIST has
made various non-interoperable changes to Kyber that have far less
obvious benefit than switching to fixed-length SHAKE output for the
matrix generation. I don't understand what principles are driving this.

---D. J. Bernstein
signature.asc

John Mattsson

May 19, 2024, 10:56:10 AM
to pqc-...@list.nist.gov, D. J. Bernstein

Hi,

I agree with Bernstein that the specification in Draft FIPS 203 on how to generate Â

 

4. for (i ← 0; i < k; i++)

5:     for ( j ← 0; j < k; j ++)

6:         Â[i, j] ← SampleNTT(XOF(ρ, i, j))

7:     end for

8: end for

 

seems both complex and inefficient, and NIST's proposed solution does not seem optimal.

 

Draft FIPS 203 currently uses what NIST calls "The Simple Discard Method" [1] a whopping 256 * k * k times. Is there any reason not to use what NIST calls "The Complex Discard Method" [1] to generate 256 * k * k numbers directly? (I assume this is what Bernstein refers to as arithmetic-coding techniques.)

 

4.  β ← 0

5.  do

6.      α ← SHAKE128(ρ | β) // digest length = bit length of q^(256 * k * k) − 1

7.      if α < q^(256 * k * k) then

8.          Â ← the 256 * k * k base-q digits of α

9.          break

10.     end if

11.     β ← β + 1

12. end do

 

Using what NIST calls "The Simple Modular Method" would change the distribution, but as Bernstein writes, “it can't have noticeably more chance of producing a weak matrix.”
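A toy version of this big-integer approach, scaled down to a single 256-coefficient vector rather than all 256 * k * k coefficients at once, might look as follows. Drawing α with the bit length of q^256 − 1 makes each try succeed with probability above 1/2 (about 0.63 for q = 3329), so the retry loop terminates almost immediately; the single-byte counter encoding and the seed are illustrative simplifications.

```python
import hashlib

Q, N = 3329, 256
BOUND = Q ** N
NBITS = (BOUND - 1).bit_length()      # 2996 bits for q = 3329, N = 256
NBYTES = (NBITS + 7) // 8

def sample_vector(rho):
    beta = 0
    while True:
        raw = hashlib.shake_128(rho + bytes([beta])).digest(NBYTES)
        # Truncate to exactly NBITS bits, so alpha is uniform in [0, 2^NBITS).
        alpha = int.from_bytes(raw, "big") >> (8 * NBYTES - NBITS)
        if alpha < BOUND:
            digits = []
            for _ in range(N):        # read alpha off as base-q digits
                alpha, d = divmod(alpha, Q)
                digits.append(d)
            return digits
        beta += 1                     # discard and retry with a new counter

vec = sample_vector(b"example rho")
assert len(vec) == 256 and all(0 <= d < Q for d in vec)
```

The arbitrary-precision divmod chain is the cost of this method; the benefit is a constant amount of XOF output per accepted try and an exactly uniform result.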

 

(Note that I still think NIST should introduce absorb(), squeeze(), and allowing interleaved calls in SP 800-185).

 

Cheers,

John


John Mattsson

May 19, 2024, 11:43:55 AM
to pqc-...@list.nist.gov

There is a quite recent paper looking at speeding up the generation of Â in ML-KEM.
“Applying the Simple Partial Discard Method to Crystals-Kyber”
https://ieeexplore.ieee.org/document/10379070

 

Cheers,

John

 


niux_d...@icloud.com

May 20, 2024, 7:36:50 AM
to John Mattsson, pqc-forum
To support algorithm agility.

While "transcript hash" is a useful idea in some protocols, it should be used on top of some abstraction. The interleaved API is specific to Sponge/Duplex over permutations; using it directly would break the algorithm boundary. So I think we should formalize the requirements (which include the API) before developing actual modes for "transcript hash", and not hasten to amend a specification that wasn't intended for the said use in the first place.

The Duplex, on the other hand, is already specified and studied in the literature, and the interleaved API is a natural fit for it.

niux_d...@icloud.com

May 20, 2024, 9:50:10 AM
to niux_d...@icloud.com, John Mattsson, pqc-forum
I might be overlooking something about real-world protocols. If there is indeed an **overwhelmingly widespread need** for a transcript hash, similar to the widespread need for a KDF that resulted in HKDF, then the effort should by all means start.

But pqc-forum serves no greater role than coordination; transcript hashing isn't even PKC, and amending SP 800-185 to permit interleaved absorb-squeeze is IMHO out of scope for now.

Sönke Jendral

May 24, 2024, 2:51:16 AM
to niux_d...@icloud.com, John Mattsson, pqc-forum

Hi,

 

I am working with John, who asked me to look into the matrix generation.

The issue with using the complex discard method seems to be that it requires arbitrary-precision arithmetic. It might be possible to implement that with negligible performance impact, but the complexity would almost certainly be higher.

 

As an alternative, I have implemented arithmetic coding with Integer Arithmetic as described in https://arxiv.org/pdf/2302.00819 and compared it to the reference implementation. My implementation is here (in lines 450-487): https://godbolt.org/z/db7qvbcfx

 

Based on my quick comparison, the arithmetic coding seems to be about the same speed (or a tiny bit faster, but the benchmark is somewhat flawed) and has similar implementation complexity. The big benefit is that the arithmetic coding reads a constant number of blocks from the RNG, so this could help with uncertainty about the loops.

It is of course not clear what the comparison between two optimized implementations would look like, and I also haven’t really tested the code at all, so it is possible that the implementation is incorrect, biased, or overly complicated.

 

Best,
Sönke

D. J. Bernstein

May 24, 2024, 8:06:22 AM
to pqc-...@list.nist.gov
'Sönke Jendral' via pqc-forum writes:
> It is of course not clear what the comparison between two optimized
> implementations would look like

See supercop/crypto_decode/*/*/decodegen.py for scripts to generate
vectorized code (and non-vectorized code) for the parallelizable version
of arithmetic coding from the NTRU Prime submission. The scripts are for
the case of vectors mod q being mapped injectively to bit strings, but
if one simply rounds moduli down to multiples of 256 rather than up then
bit strings are mapped injectively to vectors mod q. This plugs in a
modular way into any application randomly or deterministically
generating such vectors. For most choices of q and the vector length,
the divergence is <2, meaning that any particular output is generated
with probability <2 times uniform.

An injective map from bit strings to cryptosystem objects (such as these
arithmetic-coding maps) also enables the main applications of Elligator,
such as having keys and ciphertexts indistinguishable from uniform
random strings. For, e.g., keys, one simply has to retry keygen until a
key is found in the image of the map, and then send the key as the
corresponding bit string. The average number of retries is <2 for
typical parameters. (The SABER documentation is wrong in suggesting that
such indistinguishability needs power-of-2 moduli.)

More generally, arithmetic coding isn't limited to vectors mod q; one
can freely take any sequence of moduli, for example to generate
well-distributed vectors of integers mod n-t, n-t+1, n-t+2, etc. to use
in Fisher--Yates shuffling, https://eprint.iacr.org/2024/548/,
https://cr.yp.to/2024/insertionseries-20240515.py (which is designed for
efficient vectorization, unlike Fisher--Yates), etc.

One can also drop the injectivity constraint, allowing larger input
ranges than output ranges, unifying arithmetic coding with the simpler
idea of taking a large integer mod q. This makes it easy to guarantee
that every vector mod q can appear. Obviously this can be tweaked to
guarantee a distribution of vectors indistinguishable from uniform.

I brought up arithmetic coding in email dated 19 May 2024 15:13:52 +0200
as avoiding all of the Kyber software hassles surrounding variable SHAKE
lengths. I said I expected this to save time. The most obvious reason is
that Kyber often needs extra SHAKE blocks while arithmetic coding never
does. One expects Kyber to keep 81% (3329/4096) of the 12-bit results,
and thus to use 472 bytes on average to generate 256 integers mod q; but
there's a lot of variance, and Kyber often spills over into another
block, while arithmetic coding is guaranteed to convert 374 bytes into a
well-distributed vector of 256 integers mod q (with divergence <1.35 per
vector, so <2^7 for 16 such vectors). This also makes clear how much
extra hashing time Kyber is consuming by having each size-256 vector
generated from separate SHAKE output, using >=48 blocks of output for
Kyber-1024 where 36 would have sufficed.
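The byte and block counts quoted above can be checked in a few lines (SHAKE128 has a rate of 168 bytes per block; Kyber-1024 generates 16 size-256 vectors). This is only arithmetic on the figures in the paragraph, taking the 374-byte arithmetic-coding figure as given.

```python
from math import ceil

RATE = 168      # SHAKE128 rate in bytes per block
Q, N = 3329, 256
VECTORS = 16    # Kyber-1024: k*k = 16 matrix entries

# Rejection sampling: 12-bit candidates, 3329/4096 of them kept on average.
avg_bytes = (N * 1.5) / (Q / 4096)
assert round(avg_bytes) == 472

# Each vector uses its own SHAKE stream, so each needs ceil(472/168) = 3 blocks.
blocks_rejection = VECTORS * ceil(avg_bytes / RATE)

# Fixed-length alternative: 374 bytes per vector (figure quoted above),
# which could share one stream across the matrix.
blocks_fixed = ceil(VECTORS * 374 / RATE)

assert blocks_rejection == 48 and blocks_fixed == 36
```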

The amount of hash output isn't the full story. There's some arithmetic
in arithmetic coding. But, again, NTRU Prime already showed that this is
efficiently vectorizable when details are chosen sensibly. Meanwhile a
comparison has to account for the cost of Kyber's semi-vectorized
rejection sampling, which can be viewed as simulating the contortions
that hardware goes through for frequent branch mispredictions. Overall
I'd expect arithmetic coding to be the fastest option.

To be clear, I'm not saying that the savings of CPU time is important.
On the contrary, I would go the opposite direction, simply reducing
separate 32-bit integers mod q. Whether one takes this simple approach
or the fast arithmetic-coding approach, there's no need for a designer
to incur the hassles of variable amounts of SHAKE output.

Also, let me again emphasize that all of these changes violate the
procedural rule of not making cryptosystem changes between the final
submission and the standard. New designs should be published and go
through years of security review before they're considered for
standardization. But what's baffling is the following juxtaposition:

* NIST is violating this rule regarding, e.g., Kyber's RNGs and FO.

* NIST is incurring clear software problems and spec complications
to preserve Kyber's matrix generation.

Is there some coherent explanation for this?

---D. J. Bernstein

Sönke Jendral

May 24, 2024, 8:31:35 AM
to D. J. Bernstein, pqc-...@list.nist.gov
Hi,

Mike Hamburg helpfully pointed out a mistake in my earlier code which was calling the reference implementation both times. The correct implementation would have been https://godbolt.org/z/6WY7bnf79 where the arithmetic coding is much slower, but I understand that this is due to a poor implementation and not a limitation of the technique.
My motivation for looking into arithmetic coding was due to the suggestion by Bernstein from May 19th, and I second the observation about limiting the number of blocks consumed, but I can't comment on general performance based on my (slow) implementation.

If one were to make a change, the simplicity of reducing 32-bit integers mod q indeed seems hard to beat (including with already standardized techniques like the Simple Partial Discard Method or the Complex Discard Method). I can't comment on whether making a change at all is a good idea.

Best,
Sönke


J. T.

May 24, 2024, 8:53:17 AM
to Moody, Dustin (Fed), pqc-forum

Dear NIST,

I am writing to express my sincere gratitude for the valuable work you do in the field of cryptography. Your efforts to ensure the security and integrity of our digital infrastructure are commendable.

As a person who is relatively new to the technical aspects of cryptography, I find your resources and initiatives highly informative and accessible. Your commitment to providing clear and concise explanations of complex concepts is greatly appreciated.

While I may not fully grasp the intricacies of every algorithm or protocol, I understand the importance of robust cryptographic standards and their role in protecting sensitive information. It is reassuring to know that organizations like yours are dedicated to advancing the science of cryptography and setting rigorous standards for the industry.

I am eager to learn more and contribute to the field in any way I can. If there are any opportunities for non-technical individuals to participate in your initiatives or provide feedback, please let me know. I would be honored to be a part of the NIST community and support your efforts in any capacity.

Thank you again for your dedication and service.

Best regards,

Jennifer Trokey



John Mattsson

May 24, 2024, 1:14:55 PM
to pqc-...@list.nist.gov

Hi,

 

At first glance, both the arithmetic coding and the reduction of 32-bit integers mod q suggested by Bernstein seem superior to the current version and to the updated version suggested by NIST. I don't have any strong opinions on what NIST should do, but NIST is planning to make changes anyway. I think it is hard to have a strong opinion about late changes fixing problems. It is true that a change risks introducing a new problem, but if you do not change, you are 100% sure that you have the old problem. Delaying publication would be good for review, but I think the vast majority of people (including me) do not want a delay.

 

Some reflections on the NIST PQC project. I think it is good that NIST allowed changes and mergers, and I think there could have been more collaboration and more changes early on; how to derive random vectors modulo q is, for example, quite independent of the algorithm. A longer review period for the winners would also have been good: many people do not have time to review if there are too many candidates. I am happy that NIST is aiming for a "collaborative approach" when standardizing the accordion cipher, even if it is hard to know exactly what this will entail. Collaborative approaches, such as the recent standardization of TLS 1.3 and EDHOC in the IETF with public calls for review, have attracted a lot of interest from both industry and academia.

 

Cheers,

John

 

From: 'Sönke Jendral' via pqc-forum <pqc-...@list.nist.gov>
Date: Friday, 24 May 2024 at 14:31
To: D. J. Bernstein <d...@cr.yp.to>, pqc-...@list.nist.gov <pqc-...@list.nist.gov>
Subject: RE: [pqc-forum] More details on the new XOF API in FIPS 203 and 204

Hi,

Mike Hamburg helpfully pointed out a mistake in my earlier code, which was calling the reference implementation both times. The corrected comparison is at https://godbolt.org/z/6WY7bnf79, where the arithmetic coding is much slower; I understand that this is due to a poor implementation and not a limitation of the technique.


My motivation for looking into arithmetic coding was Bernstein's suggestion from May 19th, and I second the observation about limiting the number of blocks consumed, but I can't comment on general performance based on my (slow) implementation.

If one were to make a change, the simplicity of reducing 32-bit integers mod q indeed seems hard to beat (including with already standardized techniques like the Simple Partial Discard Method or the Complex Discard Method). I can't comment on whether making a change at all is a good idea.
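To make the comparison concrete, here is an illustrative sketch of the rejection-sampling ("simple discard") approach that FIPS 203's SampleNTT uses: read 12-bit candidates from SHAKE-128 output and keep those below q. The function name and the over-provisioned buffer size are my own choices for a self-contained example; real implementations squeeze blocks incrementally, which is exactly the variable-length behavior under discussion.

```python
import hashlib

Q = 3329  # ML-KEM modulus

def sample_ntt_rejection(seed: bytes, n: int = 256) -> list[int]:
    """Illustrative rejection sampling: unpack 12-bit values from
    SHAKE-128 output and keep those < Q. The number of bytes needed
    is variable, so this sketch simply over-requests up front."""
    need = 3 * n // 2            # 3 bytes yield two 12-bit candidates
    buf = hashlib.shake_128(seed).digest(2 * need)  # over-provision
    out = []
    i = 0
    while len(out) < n and i + 3 <= len(buf):
        b0, b1, b2 = buf[i], buf[i + 1], buf[i + 2]
        d1 = b0 + 256 * (b1 % 16)    # low 12 bits of the 3-byte group
        d2 = (b1 // 16) + 16 * b2    # high 12 bits
        if d1 < Q:
            out.append(d1)
        if d2 < Q and len(out) < n:
            out.append(d2)
        i += 3
    return out
```

On average about 81% (3329/4096) of candidates are kept, so roughly one in five 12-bit reads is discarded.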

Best,
Sönke

-----Original Message-----
From: pqc-...@list.nist.gov <pqc-...@list.nist.gov> On Behalf Of D. J. Bernstein
Sent: Friday, 24 May 2024 14:06
To: pqc-...@list.nist.gov
Subject: Re: [pqc-forum] More details on the new XOF API in FIPS 203 and 204

'Sönke Jendral' via pqc-forum writes:
> It is of course not clear what the comparison between two optimized
> implementations would look like

See supercop/crypto_decode/*/*/decodegen.py for scripts to generate vectorized code (and non-vectorized code) for the parallelizable version of arithmetic coding from the NTRU Prime submission. The scripts are for the case of vectors mod q being mapped injectively to bit strings, but if one simply rounds moduli down to multiples of 256 rather than up then bit strings are mapped injectively to vectors mod q. This plugs in a modular way into any application randomly or deterministically generating such vectors. For most choices of q and the vector length, the divergence is <2, meaning that any particular output is generated with probability <2 times uniform.

An injective map from bit strings to cryptosystem objects (such as these arithmetic-coding maps) also enables the main applications of Elligator, such as having keys and ciphertexts indistinguishable from uniform random strings. For, e.g., keys, one simply has to retry keygen until a key is found in the image of the map, and then send the key as the corresponding bit string. The average number of retries is <2 for typical parameters. (The SABER documentation is wrong in suggesting that such indistinguishability needs power-of-2 moduli.)

More generally, arithmetic coding isn't limited to vectors mod q; one can freely take any sequence of moduli, for example to generate well-distributed vectors of integers mod n-t, n-t+1, n-t+2, etc. to use in Fisher--Yates shuffling, https://eprint.iacr.org/2024/548/, https://cr.yp.to/2024/insertionseries-20240515.py (which is designed for efficient vectorization, unlike Fisher--Yates), etc.



One can also drop the injectivity constraint, allowing larger input ranges than output ranges, unifying arithmetic coding with the simpler idea of taking a large integer mod q. This makes it easy to guarantee that every vector mod q can appear. Obviously this can be tweaked to guarantee a distribution of vectors indistinguishable from uniform.

I brought up arithmetic coding in email dated 19 May 2024 15:13:52 +0200 as avoiding all of the Kyber software hassles surrounding variable SHAKE lengths. I said I expected this to save time. The most obvious reason is that Kyber often needs extra SHAKE blocks while arithmetic coding never does. One expects Kyber to keep 81% (3329/4096) of the 12-bit results, and thus to use 472 bytes on average to generate 256 integers mod q; but there's a lot of variance, and Kyber often spills over into another block, while arithmetic coding is guaranteed to convert 374 bytes into a well-distributed vector of 256 integers mod q (with divergence <1.35 per vector, so <2^7 for 16 such vectors). This also makes clear how much extra hashing time Kyber is consuming by having each size-256 vector generated from separate SHAKE output, using >=48 blocks of output for Kyber-1024 where 36 would have sufficed.
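The 374-byte figure can be checked with a simple non-parallel sketch: treat the bytes as one big integer and peel off 256 base-q digits. This is the "large integer" flavor rather than the vectorized arithmetic coding from NTRU Prime, and the function name is my own; but it shows the injectivity arithmetic, since 2^(8*374) = 2^2992 <= 3329^256 ≈ 2^2995.

```python
Q = 3329  # ML-KEM modulus

def decode_vector(data: bytes, n: int = 256, q: int = Q) -> list[int]:
    """Interpret the input bytes as a single integer and extract n
    base-q digits. When 2**(8*len(data)) <= q**n the map from byte
    strings to vectors is injective, so a uniform byte string gives a
    near-uniform vector, with a fixed input length -- no rejection,
    no variable-length squeeze."""
    assert 2 ** (8 * len(data)) <= q ** n, "input range must fit in q**n"
    x = int.from_bytes(data, "little")
    v = []
    for _ in range(n):
        v.append(x % q)
        x //= q
    return v
```

For q = 3329 and n = 256, any data length up to 374 bytes passes the assertion, matching the byte count above.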

The amount of hash output isn't the full story. There's some arithmetic in arithmetic coding. But, again, NTRU Prime already showed that this is efficiently vectorizable when details are chosen sensibly. Meanwhile a comparison has to account for the cost of Kyber's semi-vectorized rejection sampling, which can be viewed as simulating the contortions that hardware goes through for frequent branch mispredictions. Overall I'd expect arithmetic coding to be the fastest option.

To be clear, I'm not saying that the savings of CPU time is important.
On the contrary, I would go the opposite direction, simply reducing separate 32-bit integers mod q. Whether one takes this simple approach or the fast arithmetic-coding approach, there's no need for a designer to incur the hassles of variable amounts of SHAKE output.
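The "simple approach" mentioned here can be sketched in a few lines: squeeze a fixed 4 bytes per coefficient and reduce mod q. The function name and use of SHAKE-128 are my own choices for illustration; the point is that the output length is fixed, so none of the variable-length squeeze machinery is needed, at the cost of a small bias (each residue is within roughly q/2^32 of uniform).

```python
import hashlib

Q = 3329  # ML-KEM modulus

def sample_mod_q(seed: bytes, n: int = 256) -> list[int]:
    """Fixed-length sampling: read 4*n bytes of XOF output and reduce
    each 32-bit word mod Q. No rejection, no variable-length squeeze;
    the bias per coefficient is on the order of Q / 2**32."""
    buf = hashlib.shake_128(seed).digest(4 * n)
    return [int.from_bytes(buf[4 * i : 4 * i + 4], "little") % Q
            for i in range(n)]
```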

Also, let me again emphasize that all of these changes violate the procedural rule of not making cryptosystem changes between the final submission and the standard. New designs should be published and go through years of security review before they're considered for standardization. But what's baffling is the following juxtaposition:

   * NIST is violating this rule regarding, e.g., Kyber's RNGs and FO.

   * NIST is incurring clear software problems and spec complications
     to preserve Kyber's matrix generation.

Is there some coherent explanation for this?

---D. J. Bernstein





John Mattsson

unread,
Jun 1, 2024, 7:47:22 AM
to niux_d...@icloud.com, pqc-forum

Hi Danny,

DannyNiu/NJF wrote:

>I might be overlooking something about the real-world protocols.
>If there's indeed a **overwhelmingly wide-spread need** for a
>transcript hash, similar to how we had a wide-spread need
>for KDF that resulted in HKDF, then effort should by all mean start.

 

I would say that there is an **overwhelmingly wide-spread use** of the multi-sponge / multi-extract-expand pattern, illustrated below with a keyed hash function.

 

Multi-sponge / multi-extract-expand pattern:

---------------
Absorb(init_secret, info1)
key1 = Squeeze()
secret_state_1 = Squeeze()
Absorb(secret_state_1, info2)
key2 = Squeeze()
secret_state_2 = Squeeze()
Absorb(secret_state_2, info3)
key3 = Squeeze()
secret_state_3 = Squeeze()
...

Single-duplex pattern:

---------------
Absorb(init_secret, info1)
key1 = Squeeze()
Absorb(info2)
key2 = Squeeze()
Absorb(info3)
key3 = Squeeze()
...

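The multi-sponge pattern can be emulated today with the fixed-length SHAKE API: each stage is a fresh sponge that absorbs the chaining secret and the stage info, then squeezes a key and the next chaining secret. This is only a sketch (the function name and 32-byte state size are my own choices), but it shows why the pattern needs the secret state to cross the hash-function boundary between stages, unlike the single-duplex pattern.

```python
import hashlib

def expand(secret: bytes, info: bytes, key_len: int = 32):
    """One stage of the multi-sponge pattern: absorb the chaining
    secret and the stage info into a fresh SHAKE256 instance, then
    squeeze a key and the next chaining secret."""
    x = hashlib.shake_256()
    x.update(secret)
    x.update(info)
    out = x.digest(key_len + 32)
    return out[:key_len], out[key_len:]  # (key_i, secret_state_i)

# Chaining across stages exposes each secret_state_i to the caller:
# key1, s1 = expand(init_secret, b"info1")
# key2, s2 = expand(s1, b"info2")
```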
 

I think usability and usable security are very important design principles. NIST writes in [1] that its cryptographic standards should be designed to "minimize the demands on users and implementers as well as the adverse consequences of human mistakes and equipment failures". Following this, I strongly believe keyed hashing APIs should be designed so that the secret state never leaves the keyed hash function boundary. Compromise of secret_state_1 has more serious consequences than compromise of key1.

 

DannyNiu/NJF wrote:

>But pqc-forum serves no greater role than coordination,
>as transcript hash isn't even PKC, and amending SP-800-185
>to permit interleaved absorb-squeeze is IMHO out-of-scope for now.

 

NIST explicitly asked for feedback on its plan to update SP 800-185. My comment was just that. I agree that it is not PQC.

 

Cheers,

John Preuß Mattsson

[1] NISTIR 7977, "NIST Cryptographic Standards and Guidelines Development Process"

https://nvlpubs.nist.gov/nistpubs/ir/2016/nist.ir.7977.pdf

 

 
