Footguns as an axis for security analysis


Mike Ounsworth

Aug 28, 2019, 9:55:06 AM
to pqc-...@list.nist.gov

I strongly support the comment that Scott Fluhrer made at the microphone during the NIST PQC workshop last week; “footguns” or “ease of mis-implementation” ought to be considered as an axis for security analysis of candidate schemes. CVEs don’t (frequently) happen because of mathematical attacks; they happen when a non-cryptographer is handed a complex spec to implement, and nobody on their team has the expertise needed to notice a subtle deviation from the spec during code review. Simplicity of the spec leads to fewer vulnerabilities.

 

I would extend Scott’s sentiment to cover both the ease of implementing the primitive itself, and the ease of designing / implementing secure protocols and modes of operation around the primitive.

 

- - -

Mike Ounsworth | Software Security Architect

Entrust Datacard

 

 

David Jao

Aug 28, 2019, 11:09:52 AM
to pqc-...@list.nist.gov
Ease of implementing the primitive matters for misuse resistance, but
perhaps not in the obvious way. I have seen the argument made (with
respect to RSA vs. ECC) that more complicated primitives are less prone
to foot-guns because people don't dare to implement them on their own.

Quoting from https://blog.trailofbits.com/2019/07/08/ [1]:

"the math behind ECC is so complicated that very few people feel
confident enough to actually implement it. In other words, it
intimidates people into using libraries built by cryptographers who know
what they’re doing. RSA on the other hand is so simple that it can be
(poorly) implemented in an hour."

-David

[1] Click through the link to get to the real article. I didn't link to
the real article because the URL of the real article contains an expletive.

Watson Ladd

Aug 28, 2019, 11:45:42 AM
to Mike Ounsworth, pqc-...@list.nist.gov


On Wed, Aug 28, 2019, 6:55 AM Mike Ounsworth <Mike.Ou...@entrustdatacard.com> wrote:

> I strongly support the comment that Scott Fluhrer made at the microphone during the NIST PQC workshop last week; “footguns” or “ease of mis-implementation” ought to be considered as an axis for security analysis of candidate schemes. CVEs don’t (frequently) happen because of mathematical attacks; they happen when a non-cryptographer is handed a complex spec to implement, and nobody on their team has the expertise needed to notice a subtle deviation from the spec during code review. Simplicity of the spec leads to fewer vulnerabilities.


Just use a good implementation. Non-cryptographers should use libraries.

 

> I would extend Scott’s sentiment to cover both the ease of implementing the primitive itself, and the ease of designing / implementing secure protocols and modes of operation around the primitive.


All the primitives are KEMs. They are identical for protocol purposes.

 


Jonathan Katz

Aug 28, 2019, 11:48:18 AM
to Mike.Ou...@entrustdatacard.com, pqc-...@list.nist.gov
How in the world would one go about evaluating “ease of mis-implementation” of a given scheme, or comparing two schemes? I ask this non-rhetorically.


D. J. Bernstein

Aug 28, 2019, 12:41:33 PM
to pqc-...@list.nist.gov
Jonathan Katz writes:
> How in the world would one go about evaluating “ease of
> mis-implementation”of a given scheme, or comparing between two
> schemes? I ask this non-rhetorically.

https://cr.yp.to/papers.html#nistecc includes a survey of what's known
about this topic in the case of ECC.

The broader literature on software engineering has many more analyses of
engineering cost and error rates, and recognizes these as important
topics of scientific research.

---Dan

Scott Fluhrer (sfluhrer)

Aug 28, 2019, 1:10:31 PM
to Jonathan Katz, Mike.Ou...@entrustdatacard.com, pqc-...@list.nist.gov

I’ll give a couple of examples where there is a difference.

 

The first example is the error vectors in lattice-based schemes:

 

  1. In some systems, the error vectors have to be chosen from a very tightly defined probability distribution (e.g. a discrete Gaussian), and the system becomes vulnerable if you accidentally sample from a different distribution.
  2. Other systems are secure with an easier-to-implement error distribution (e.g. binomial).
  3. Still other systems use quite easy-to-implement error distributions (e.g. rounding, where you literally chop a few bits off).

 

The ‘avoid-foot-cannon’ criterion would prefer (3) over (2) and (2) over (1).
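
To make the contrast concrete, here is a rough sketch of a type-(2) sampler next to a type-(3) rounding step (illustrative only, not taken from any submission; ETA, Q and D are placeholder parameters):

  #include <stdint.h>

  #define ETA 2      /* binomial parameter (placeholder) */
  #define Q   3329   /* modulus (placeholder) */
  #define D   10     /* bits kept after rounding (placeholder) */

  /* Type (2): centered binomial sample in [-ETA, ETA], here with ETA = 2.
     Each coefficient consumes 2*ETA fresh random bits; too few bits, or
     reused bits, silently narrows the distribution. */
  static int32_t cbd_sample(uint8_t coins)
  {
      int32_t a = (coins >> 0 & 1) + (coins >> 1 & 1);
      int32_t b = (coins >> 2 & 1) + (coins >> 3 & 1);
      return a - b;
  }

  /* Type (3): rounding.  Map x in [0, Q) to a D-bit value; the "error" is
     simply whatever gets chopped off, so there is no randomness to
     generate incorrectly. */
  static uint32_t round_coeff(uint32_t x)
  {
      return (uint32_t)((((uint64_t)x << D) + Q / 2) / Q) & ((1u << D) - 1);
  }

A distribution bug can hide inside the first routine and still produce plausible-looking outputs; there is no analogous knob to get wrong in the second.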

 

Another place where it may come up is error conditions (or other low-probability cases); a scheme that must perform extra logic on an error in order to remain secure is less attractive than one whose main logic already does the right thing.

 

The third place is the issue of CPA KEMs vs CCA KEMs.  One concern I have with a CPA KEM is that, while we know it means generating fresh keys for every negotiation, we might run into a clever implementor who tries to optimize things by reusing keys across multiple sessions.  This is Yet Another foot cannon (even if it is not within the primitive itself); but it does open the question of whether NIST would want to approve CPA at all…

 

Rafael Misoczki

Aug 28, 2019, 1:56:29 PM
to pqc-forum, jk...@cs.umd.edu, Mike.Ou...@entrustdatacard.com
Regarding CCA KEMs vs CPA KEMs:

We have seen a major effort by industry to push for SSL/TLS cipher suites that achieve Forward Secrecy (FS). In order to achieve FS, ephemeral keys are used (e.g. ECDHE). This perfectly matches what CPA KEMs can offer, without the inherent overhead of the CCA machinery.

In short, I clearly see great value in IND-CPA KEMs and would like NIST to continue considering them for standardization moving forward.


Scott Fluhrer (sfluhrer)

Aug 28, 2019, 2:01:22 PM
to Rafael Misoczki, pqc-forum, jk...@cs.umd.edu, Mike.Ou...@entrustdatacard.com

CCA KEMs can provide FS; you just generate a fresh key each time (or alternatively periodically, e.g. once a minute).

 

Now, a CPA KEM may be cheaper than a CCA KEM in this scenario; we’d need to consider: a) how much cheaper (both in computation time and bandwidth), b) how much this cost delta matters to implementations, and c) how that compares to the safety we get by sticking with CCA.

 

The answers to these questions are considerably harder (and will depend on the primitive we’re talking about); however (IMHO) they need to be asked…

 


Alperin-Sheriff, Jacob (Fed)

Aug 28, 2019, 2:12:20 PM
to Scott Fluhrer (sfluhrer), Jonathan Katz, Mike.Ou...@entrustdatacard.com, pqc-forum

I’m not aware of any Round 2 KEM that could conceivably fall into your (1) there (all of the signature schemes do).

And it’s plausibly as easy if not easier to screw up category 3, IMO. Rounding is possibly more sensitive to someone doing something stupid, e.g. rounding off one fewer bit than specified (in order to fit it into a constrained device, or for whatever reason).

 

And in particular, most of the schemes also round off bits (regardless of whether they add additional error) to reduce ciphertext sizes. Even here it’s not so clear.

Christopher J Peikert

Aug 28, 2019, 2:24:46 PM
to Scott Fluhrer (sfluhrer), Jonathan Katz, Mike.Ou...@entrustdatacard.com, pqc-...@list.nist.gov
> The first example is the error vectors in Lattice Based schemes:
>
> In some systems, the error vectors have to be chosen out of a very tightly-defined probability distribution (e.g. Discrete Gaussian), and become vulnerable if you accidentally select from a different probability distribution.

I'll concur with Jacob here: this isn't true for any of the round-2
KEMs (whether CPA or CCA); small changes in the error distributions
are *not* known to introduce vulnerabilities. For example, FrodoKEM is
specified to use a Gaussian-derived error distribution, but it doesn't
suddenly become vulnerable if that's replaced with something similar
(e.g., binomial).

Of course, using a completely wrong error distribution -- like "always
zero" -- totally breaks security. But that goes for all of the KEMs,
regardless of which distribution they're defined to use.

Signature schemes are much more sensitive to the choice of probability
distribution, because wrong distributions can leak information about
the secret key.

Sincerely yours in cryptography,
Chris

David A. Cooper

Aug 28, 2019, 2:54:50 PM
to Scott Fluhrer (sfluhrer), Rafael Misoczki, pqc-forum, jk...@cs.umd.edu, Mike.Ou...@entrustdatacard.com
Hi Scott,

On the IETF TLS mailing list you said that "The security proof of TLS 1.3 (at least, the ones I’ve seen) assume that the key exchange is CCA secure." I haven't seen these security proofs, and so don't know the reason they made this assumption -- it was my understanding that the security proofs assume fresh keys are used for each key exchange. However, wouldn't this be another issue to consider when trying to determine whether it is acceptable to use a CPA KEM rather than a CCA KEM?

Thomas Prest

Aug 28, 2019, 3:24:24 PM
to pqc-...@list.nist.gov

I agree with Jacob that simpler distributions do not necessarily prevent implementation blunders.
Even something as simple as the uniform distribution can be screwed up, as illustrated by countless ECDSA-related examples (like this recent one presented at the WAC2 workshop before CRYPTO: https://ia.cr/2019/023).
I think we need to come up with systematic measures to prevent bad implementations regardless of the error distribution used.
To some extent, deterministic signatures can be such a measure (as pointed out in https://ia.cr/2019/023), but I would not go as far as making them mandatory.
Statistical tests could be another one: there already exist standardized tests for testing uniform randomness, so it makes sense to define analogous tests for other distributions (for example, something like this for Gaussians: https://ia.cr/2017/438).
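
For illustration only (this is a rough sketch, not taken from either paper), such a test could be as simple as a chi-squared goodness-of-fit check of the sampler's output against the specified probabilities:

  #include <stdlib.h>

  #define ETA  2                /* placeholder: sampler outputs values in [-ETA, ETA] */
  #define BINS (2 * ETA + 1)
  #define N    1000000L         /* number of samples to draw */

  extern int sample(void);      /* sampler under test (assumed interface) */

  /* Returns 1 if the empirical distribution is consistent with expected[]
     (chi-squared statistic below the chosen quantile for BINS-1 degrees of
     freedom), 0 otherwise. */
  int chi_squared_check(const double expected[BINS], double threshold)
  {
      long counts[BINS] = {0};
      for (long i = 0; i < N; i++) {
          int s = sample();
          if (s < -ETA || s > ETA) return 0;   /* out of range: fail immediately */
          counts[s + ETA]++;
      }
      double chi2 = 0;
      for (int b = 0; b < BINS; b++) {
          double e = expected[b] * N;
          double d = counts[b] - e;
          chi2 += d * d / e;
      }
      return chi2 <= threshold;
  }

Such a test only catches gross distribution errors, of course; it certifies nothing and says nothing about side channels.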

Thomas

Serge Vaudenay

Aug 28, 2019, 5:38:25 PM
to pqc-...@list.nist.gov
Implementations of the Fujisaki-Okamoto transform easily leak through a side channel, which makes the IND-CCA and IND-CPA variants equivalent in practice. Experts make mistakes too; even some implementations submitted to NIST suffered from this.

See reference [17] from https://eprint.iacr.org/2019/525

Key reuse with IND-CPA variants often leads to key recovery. This is a
critical issue.

Best regards

Serge

Peter Schwabe

Aug 28, 2019, 5:52:04 PM
to Alperin-Sheriff, Jacob (Fed), Scott Fluhrer (sfluhrer), Jonathan Katz, Mike.Ou...@entrustdatacard.com, pqc-forum
"'Alperin-Sheriff, Jacob (Fed)' via pqc-forum" <pqc-...@list.nist.gov> wrote:

Dear Jacob,

> I’m not aware of any Round 2 KEM that could conceivably fall into your
> (1) there (all of the signature schemes do).

Dilithium uses uniform sampling precisely to be more foot-cannon-secure,
or, as we phrase it in the submission document

"...it is unreasonable to assume that a universally-deployed scheme
containing many subtleties will always be expertly implemented.
Dilithium therefore only uses uniform sampling"

All the best,

Peter

Alperin-Sheriff, Jacob (Fed)

Aug 28, 2019, 5:58:20 PM
to Peter Schwabe, Scott Fluhrer (sfluhrer), Jonathan Katz, Mike.Ou...@entrustdatacard.com, pqc-forum
Yes, but it does use rejection sampling, and screwing up or eliminating that step ("for efficiency", or out of carelessness) would be pretty fatal.

Peter Schwabe

Aug 29, 2019, 6:36:53 AM
to Alperin-Sheriff, Jacob (Fed), Peter Schwabe, Scott Fluhrer (sfluhrer), Jonathan Katz, Mike.Ou...@entrustdatacard.com, pqc-forum
"'Alperin-Sheriff, Jacob (Fed)' via pqc-forum" <pqc-...@list.nist.gov> wrote:

Dear Jacob, dear all,

> Yes, but it does use rejection sampling and screwing up and
> eliminating that step "for efficiency" or carelessness would be pretty
> fatal.

That's true, but at least omission of the rejection step is
something that can be easily covered by simple unit tests. Testing for
correct Gaussian sampling at the required precision is much harder.
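
For instance (a schematic sketch with placeholder names and bounds, not the actual Dilithium test code), a unit test can sign many messages and check that every revealed coefficient stays inside the bound that the rejection step is supposed to enforce:

  #include <assert.h>
  #include <stdint.h>
  #include <stdlib.h>
  #include <string.h>

  #define L      4              /* placeholder dimensions and bounds */
  #define N      256
  #define GAMMA1 (1 << 17)
  #define BETA   78

  /* assumed test hooks into the implementation under test */
  extern void keygen(uint8_t *pk, uint8_t *sk);
  extern void sign_exposing_z(int32_t z[L][N], const uint8_t *msg, size_t msglen,
                              const uint8_t *sk);

  void test_rejection_bound(void)
  {
      uint8_t pk[2048], sk[4096], msg[32];
      int32_t z[L][N];
      keygen(pk, sk);
      for (int t = 0; t < 10000; t++) {
          memset(msg, 0, sizeof msg);
          msg[0] = (uint8_t)t;
          msg[1] = (uint8_t)(t >> 8);
          sign_exposing_z(z, msg, sizeof msg, sk);
          for (int i = 0; i < L; i++)
              for (int j = 0; j < N; j++)
                  /* if the rejection step has been dropped, this fires quickly */
                  assert(abs(z[i][j]) < GAMMA1 - BETA);
      }
  }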

All the best,

Peter

Mike Ounsworth

Aug 29, 2019, 9:45:02 AM
to Serge Vaudenay, pqc-...@list.nist.gov
My favourite historical example of this is the Bleichenbacher attack on RSA PKCS#1 v1.5 padding (1998). The "simple" fix was documented in RFC 5246, TLS 1.2, section 7.4.7.1 (2008). In 2017 the ROBOT Attack project found a pile of TLS libraries, some of them quite well known, that didn't apply the mitigation properly.

So there's an example of the experts falling victim to a foot-cannon, en masse.

- - -
Mike Ounsworth | Office: +1 (613) 270-2873


D. J. Bernstein

Aug 29, 2019, 7:42:55 PM
to pqc-...@list.nist.gov
Christopher J Peikert writes:
> Of course, using a completely wrong error distribution -- like "always
> zero" -- totally breaks security. But that goes for all of the KEMs,
> regardless of which distribution they're defined to use.

If a scheme adds noise to a ring element r by communicating Round(r),
encoded in less space than r, then how exactly would you imagine an
implementor managing to use zero noise?

---Dan

Jacob Alperin-Sheriff

Aug 29, 2019, 8:16:58 PM
to pqc-...@list.nist.gov
Rounding off fewer bits would do it  

--
-Jacob Alperin-Sheriff

D. J. Bernstein

Aug 29, 2019, 9:45:52 PM
to pqc-...@list.nist.gov
Jacob Alperin-Sheriff writes:
> Rounding off fewer bits would do it  

How exactly do you imagine this happening? The rounding spec describes
some sort of arithmetic that converts the ring element r into Round(r),
and some sort of packing/encoding process that squeezes the coefficients
of Round(r) into a compact fixed-length string

Round(r)[0] Round(r)[1] Round(r)[2] ...

sent as part of a ciphertext (or key). You're imagining an implementor
taking extra bits of r beyond Round(r) and then putting them... um...
where? What would the code look like, and how would this pass even the
most trivial tests?

---Dan

Alperin-Sheriff, Jacob (Fed)

Aug 30, 2019, 9:14:56 AM
to D. J. Bernstein, pqc-forum
It would probably require intentional deviation from the spec (to reduce the modulus to fit into some custom device or something), yes. I assume there have been cryptographic implementations that have done dumb things like that; maybe I'm giving the lowest common denominator too little credit.


D. J. Bernstein

Sep 2, 2019, 10:15:48 PM
to pqc-...@list.nist.gov
'Alperin-Sheriff, Jacob (Fed)' via pqc-forum writes:
> It would probably require intentional deviation from the spec (to
> reduce the modulus to fit into some custom device or something), yes.
> I assume there have been cryptographic implementations that have done
> dumb things like that; maybe I'm giving the lowest common denominator
> too little credit.

The Dilithium implementors aren't the "lowest common denominator" and
surely weren't _trying_ to deviate from the spec, but their original
software misused randomness in a way that destroyed all security:

https://groups.google.com/a/list.nist.gov/d/msg/pqc-forum/aWxC2ynJDLE/YOsMJ2ewAAAJ

It's important to analyze what sort of tests, verification, etc. could
have prevented this security failure. More broadly, there are processes
that convert designs into implementations, and it's important to analyze
the security of the implementations when the processes and designs vary.
This isn't making the naive designer's assumption that implementations
will magically work. It also isn't making the worst-case assumption that
implementors are trying to deviate from the design. Instead it's

* taking the normal case of implementors trying to do their jobs;
* recognizing that mistakes happen; and
* asking what we can do to prevent and/or catch those mistakes.

If insecure implementations are easier to write than secure
implementations but still seem to work then it's unsurprising that
insecure implementations are deployed, as illustrated by a 2018 attack

https://www.cs.technion.ac.il/~biham/BT/

breaking most Bluetooth implementations. If a design makes insecure
implementations _harder_ than secure implementations then the risks are
lower. (Note that the question here is about a whole cloud of potential
implementations surrounding each secure implementation, and a designer
who simply tries to minimize the cost of a secure implementation could
end up encouraging insecure implementations if those become too easy.)

Regarding lattice noise, the correct procedures for generating _random_
noise are obviously surrounded in many directions by noise-generation
procedures that are easier to break. An extreme case is failing to add
_any_ noise, but it's also easy to obtain considerably smaller nonzero
noise by tweaking, e.g., psi() inside ThreeBears or cbd() inside Kyber.
How do we catch this type of failure? One answer is that if

* the noise is wrong for a considerable fraction of ciphertexts and
* implementors try decrypting ciphertexts with correct software and
* the system includes CCA reencryption

then a considerable fraction of tests will fail. From this perspective,
CPA systems are problematic, and even for CCA systems one has to analyze
how an implementation could end up occasionally producing much smaller
noise. Test vectors for encryption would handle CPA systems but require
all randomness to be generated in a controlled way. Formal verification
can eliminate occasional exceptions but will need much better ease of
use before we can reasonably ask for all post-quantum implementations to
be verified.
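
To spell out the reencryption check being relied on here: a generic FO-style decapsulation looks roughly like the following sketch (placeholder sizes and assumed subroutines, not any submission's actual code):

  #include <stdint.h>
  #include <string.h>

  #define CT_BYTES 1088         /* placeholder sizes */
  #define SS_BYTES 32
  #define M_BYTES  32

  /* assumed subroutines of the underlying CPA scheme */
  extern void cpa_dec(uint8_t m[M_BYTES], const uint8_t ct[CT_BYTES], const uint8_t *sk);
  extern void cpa_enc_derand(uint8_t ct[CT_BYTES], const uint8_t m[M_BYTES],
                             const uint8_t *pk, const uint8_t coins[32]);
  extern void derive_coins(uint8_t coins[32], const uint8_t m[M_BYTES], const uint8_t *pk);
  extern void kdf(uint8_t out[SS_BYTES], const uint8_t *in, size_t inlen);

  int cca_decaps(uint8_t ss[SS_BYTES], const uint8_t ct[CT_BYTES],
                 const uint8_t *pk, const uint8_t *sk)
  {
      uint8_t m[M_BYTES], coins[32], ct2[CT_BYTES];
      cpa_dec(m, ct, sk);
      derive_coins(coins, m, pk);          /* all encryption randomness re-derived from m */
      cpa_enc_derand(ct2, m, pk, coins);   /* deterministic re-encryption */
      if (memcmp(ct, ct2, CT_BYTES) != 0)  /* real code must reject in constant time,
                                              typically returning an implicit-rejection key */
          return -1;
      kdf(ss, m, M_BYTES);
      return 0;
  }

A sender whose noise generation deviates from the spec produces ciphertexts that fail this comparison at a correct receiver, which is why CCA interoperability tests catch a class of noise bugs that CPA tests never exercise.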

For comparison, how could an implementor accidentally misimplement
_rounding_ in a way that loses security? There's no closer rounded
object---no way to communicate something with less noise. Tweaking a
constant could produce occasional rounding to something _farther_ away,
but this isn't giving away a less noisy version of the original element.
The "always zero noise" and "rounding off one fewer bit" stories are
imagining a design of a cryptosystem that will fail all interoperability
tests---it doesn't even have the same key+ciphertext sizes as the
original system.

---Dan

Simon Hoerder

Sep 3, 2019, 2:27:41 AM
to pqc-...@list.nist.gov
Hi,

I'm working as a Senior Security Analyst, but the following is entirely my personal opinion.

I don't believe the issue is with the complexity (or simplicity) of a scheme as much as with the precision of implementation guidance and testing procedures.

1. No matter whether you're a cryptographer or a developer, you need to put in significant effort to create a secure, efficient and usable implementation. And you'll fail to achieve your target unless you possess or acquire both skill sets. Crypto implementations are never simple.

2. Anyone aiming to create a production-quality implementation of a crypto system will read the standard. So that's the point where you need to guide people to avoid crypto footguns: the standard must specify all the important implementation aspects. It's OK to do that in an appendix of the standard, but it needs to be done. In my opinion RFC 7748 (https://tools.ietf.org/html/rfc7748) is a positive example of this. RSA key generation is a negative example, as ROCA (https://crocs.fi.muni.cz/_media/public/papers/nemec_roca_ccs17_preprint.pdf) shows.

3. A cryptographic standard needs to provide test vectors / routines for functional correctness _and_ to detect specific cryptographic pitfalls as far as possible. The document specifying these should clearly state any known cryptographic pitfalls that are not covered by the prescribed testing; engineers will then know to pay special attention to these untestable pitfalls during manual review. (I specifically mention "test vectors / routines" because not everything can be tested sufficiently by test vectors alone. Implementation bugs are, of course, outside the scope of a cryptographic standard.)

[Optional]
4. If you can provide a reference implementation of the test vectors / routines, it'll increase the number of people who actually test their implementations thoroughly. Not everyone will submit their implementation to the rigour of a certification scheme, so helping those development efforts by providing a reference implementation of the tests is going to improve the ecosystem.

[Optional]
5. It is advisable to discuss system integration requirements. For example, there should be guidance on how much entropy from a TRNG (as opposed to PRNG entropy) is needed, how much secure memory is required for key storage, etc. This would probably fit better into a companion guidance document than into the standard itself.

[Optional]
6. If you have a standard and a certification scheme for that standard, then the webpage for the standard should contain a big fat link to certified libraries / products. Distinguishing good from bad implementations is often a non-trivial task in itself, so any help guiding developers / integrators towards certified implementations will improve the ecosystem. Not everyone who looks for the standard will know that there is a certification scheme and where to find the list of certified products. (Even among developers I find a surprising number of non-native English speakers who struggle to navigate large English-language websites like the NIST website.)

Best,
Simon Hoerder

Alperin-Sheriff, Jacob (Fed)

Sep 3, 2019, 10:31:16 AM
to D. J. Bernstein, pqc-forum
Um from a week ago (emphasis has been added here)

"I’m not aware of any Round 2 KEM that could conceivably fall into your (1) there (*************************ALL OF THE SIGNATURES SCHEMES DO**************).

And it’s plausibly as easy if not easier to screw up category 3 IMO. Rounding is possibly more sensitive doing something stupid in the form of, e.g. rounding off one fewer bit (in order to fit it into your constrained device or for whatever reason) than specified.

And in particular, most of the schemes do rounding off as well (regardless of whether they add additional error) for reducing ciphertext sizes. Even here it’s not so clear.

From: "'Scott Fluhrer (sfluhrer)' via pqc-forum" <pqc-...@list.nist.gov>
Reply-To: "Scott Fluhrer (sfluhrer)" <sflu...@cisco.com>
Date: Wednesday, August 28, 2019 at 1:10 PM
To: Jonathan Katz <jk...@cs.umd.edu>, "Mike.Ou...@entrustdatacard.com" <Mike.Ou...@entrustdatacard.com>, pqc-forum <pqc-...@list.nist.gov>
Subject: RE: [pqc-forum] Footguns as an axis for security analysis

I’ll give a couple of examples where there is a difference

The first example is the error vectors in Lattice Based schemes:

1. In some systems, the error vectors have to be chosen out of a very tightly-defined probability distribution (e.g. Discrete Gaussian), and become vulnerable if you accidentally select from a different probability distribution.
2. Other systems are secure with an easier-to-implement error distribution (e.g. binomial)
3. Still other systems use quite easy to implement error distributions (e.g. rounding, where you literally chop a few bits off)

The ‘avoid-foot-cannon’ criteria would prefer (3) over (2) and (2) over (1)."

...


D. J. Bernstein

Sep 3, 2019, 12:26:59 PM
to pqc-...@list.nist.gov
'Alperin-Sheriff, Jacob (Fed)' via pqc-forum writes:
> Um from a week ago (emphasis has been added here)

The distinction you're emphasizing, between fragile signature noise and
less fragile KEM noise, isn't relevant to my message. The Dilithium
software generated random noise so badly that KEMs using similar types
of random noise would also have been broken---consider, e.g., how much
damage could be done by an implementor similarly screwing up psi()
inside ThreeBears, especially in the CPA modes where interoperability
tests don't involve reencryption.

Regarding "ease of mis-implementation", Peikert claimed that "all of the
KEMs" could be misimplemented to use "always zero" noise. You claimed
that rounding is "plausibly as easy if not easier to screw up" compared
to random noise. As far as I can tell, these claims are based entirely
on asking the wrong question, namely how easily implementors can damage
security _when they aren't trying to implement the system correctly_.
(Wow, look at how easy it is to implement the null cipher!)

It's much more important to ask how easily implementors can damage
security when they _are_ trying to implement the system correctly.
There's a long history of crypto failures of this type, and the added
complexity of post-quantum systems provides even more opportunities to
screw up, as illustrated by the Dilithium example.

It's much harder for any accidental screwups in the noise to lose
security for the KEMs that generate noise deterministically by rounding,
and that compactly encode the rounded ring elements. Scott is clearly
correct to prefer rounding over random noise on this axis of comparison.

---Dan

Peter Schwabe

Sep 9, 2019, 4:52:00 AM
to pqc-...@list.nist.gov
"D. J. Bernstein" <d...@cr.yp.to> wrote:

Dear Dan, dear all,

> It's much harder for any accidental screwups in the noise to lose
> security for the KEMs that generate noise deterministically by rounding,
> and that compactly encode the rounded ring elements. Scott is clearly
> correct to prefer rounding over random noise on this axis of comparison.

I agree that it's much harder to screw up the noise in LWR schemes; the only thing I could think of is something like truncating bits instead of rounding (which affects the failure probability).

However, what the discussion ignores so far is that LWR schemes also need to sample a secret value in key generation and encryption, namely the value s in the LWR samples round(a*s). Most (?) LWE schemes use the same routine for sampling this value and for sampling the noise. LWR schemes obviously don't use the same routine, but the sampling of s is prone to very much the same mistakes as the sampling of e (and s) in LWE schemes.

All the best,

Peter

D. J. Bernstein

Sep 9, 2019, 8:17:22 AM
to pqc-...@list.nist.gov
Peter Schwabe writes:
> Most (?) LWE schemes use the same routine for sampling this value and
> for sampling the noise.

Out of the 36 KEMs listed in https://cr.yp.to/papers.html#latticeproofs,
I agree with this claim for exactly 2 KEMs: newhope512 and newhope1024.
To draw this conclusion, I simply juxtaposed Tables 8.7, 8.8, and 8.9 of
that paper, and looked for which KEMs have identical entries (short
secret, key noise, ciphertext noise):

* 7 KEMs are Quotient NTRU so there's a key numerator instead of
noise. These were outside the scope of the claim to begin with.

* 15 of the Product NTRU KEMs use rounding instead of noise. These
were outside the scope of the claim to begin with.

* 9 KEMs (frodo*, kyber*, threebears*) use matrices so the key noise
has more entries than the ciphertext noise. The claim fails here.

* Out of the remaining 5 KEMs (lac*, newhope*), 3 have weight
restrictions for the key noise and not for the ciphertext noise.
The claim also fails for those 3.

Beware that it's possible that I made mistakes in collating information
for this paper---the KEMs listed might not match the submitted KEMs. If
anyone spots discrepancies, please let me know.

In case anyone is thinking "Generating a vector of random values is just
as easy as generating one random value", let me point out that the very
low security level of the submitted round-1 Dilithium software came from
the software using a random value twice instead of using two separate
random values. It's very common for software bugs to be in "trivial"
steps; one has to ask not just how easy the right software is, but also
how easy the wrong software is.

Furthermore, even for newhope* (where I agree with the claim above), I
don't agree that the identical distributions end the analysis. The
process of adding random noise to a key or ciphertext isn't the same as
the process of generating a short element, and isn't tested in the same
way, especially in the CPA scenario. For example, what happens if
someone generates the newhope* noise but forgets to add it? A simple
test that checks for keys being different will still pass, and a simple
test that checks for ciphertexts being different will still pass, but of
course the security is a disaster. Deterministic rounding (enforced by
a compact encoding) would have stopped this bug from happening in the
first place.
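
Schematically (placeholder names, not the actual NewHope code), the bug looks like this, and a keys-differ test has no way to notice it:

  #include <stdint.h>

  typedef struct { int16_t coeffs[1024]; } poly;   /* placeholder ring element */

  /* assumed subroutines */
  extern void poly_uniform(poly *a, const uint8_t seed[32]);
  extern void poly_sample_noise(poly *x);
  extern void poly_mul(poly *r, const poly *a, const poly *b);
  extern void poly_add(poly *r, const poly *a, const poly *b);

  void keygen_buggy(poly *b, poly *s, const uint8_t seed[32])
  {
      poly a, e;
      poly_uniform(&a, seed);
      poly_sample_noise(s);
      poly_sample_noise(&e);      /* the noise is generated... */
      poly_mul(b, &a, s);
      /* BUG: poly_add(b, b, &e) is missing, so b = a*s and an attacker
         who knows a recovers s; yet b still differs from run to run. */
  }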

As another example, imagine a newhope512cpa implementation that again
shares the subroutine for generating small elements/noise, and that adds
the noise properly, but that has a moderate flaw inside the subroutine:
e.g., adding only 4 or 8 coins per entry instead of 16 coins. What
happens to the security of aG+e if both a and e are produced by this
flawed subroutine? In this case, moving over to Round(aG) doesn't
eliminate the bug, but the quantitative security loss should be less
severe since the bug can't affect the rounding. It would be interesting
to quantify this effect.

---Dan

Mike Hamburg

Sep 9, 2019, 3:41:47 PM
to D. J. Bernstein, pqc-...@list.nist.gov


> On Sep 9, 2019, at 5:17 AM, D. J. Bernstein <d...@cr.yp.to> wrote:
>
> Peter Schwabe writes:
>> Most (?) LWE schemes use the same routine for sampling this value and
>> for sampling the noise.
>
> Out of the 36 KEMs listed in https://cr.yp.to/papers.html#latticeproofs,
> I agree with this claim for exactly 2 KEMs: newhope512 and newhope1024.
> To draw this conclusion, I simply juxtaposed Tables 8.7, 8.8, and 8.9 of
> that paper, and looked for which KEMs have identical entries (short
> secret, key noise, ciphertext noise):
>
> * 7 KEMs are Quotient NTRU so there's a key numerator instead of
> noise. These were outside the scope of the claim to begin with.
>
> * 15 of the Product NTRU KEMs use rounding instead of noise. These
> were outside the scope of the claim to begin with.
>
> * 9 KEMs (frodo*, kyber*, threebears*) use matrices so the key noise
> has more entries than the ciphertext noise. The claim fails here.

I disagree with this. By my count, NewHope calls the sampler twice on
keygen and three times on encrypt, whereas Kyber768 and MamaBear
call it 6 times for keygen and 7 times for encrypt, or with vectors of
length 6 and 7. Sure, I can imagine ways to put a bug in one of these
that don’t apply to the other, but to say this makes Peter’s claim wrong
is cutting things a bit fine.

Peter’s claim is mathematically imprecise, particularly when applied to
specifications rather than implementations, but I don’t think that makes
it wrong for the likes of Frodo, Kyber and ThreeBears. Rather, this
discussion seems to be about:

* The probability of bugs in a straightforward (optimized or unoptimized)
implementation of a scheme.

* The probability that these bugs have security implications.

* The probability that these bugs are undetected by various levels of
testing. Reasonable levels would be:

** Testing with sender and receiver on the same impl.
** Interop testing for CPA, or with only the sender as a known-good impl.
** Interop testing for CCA in both directions.
** Testing against KATs.
** Negative testing. We should absolutely make this part of the KATs for
the final spec.
** Full formal verification.




The worst class of footgun is a security-relevant bug that can pass a
comprehensive test suite. For lattice KEMs, this is likely to include:

* Private keys from bad RNGs.

* Buffer overflows and the like. Fortunately none of the KEMs have
variable-length messages anymore.

* Bugs when the inputs and outputs alias. It’s hard to control these
from the spec level.

* Carry propagation, overflow or reduction bugs, either in imprecise
reductions or in ThreeBears’ carry chains. Mostly can’t happen in
power-of-2 schemes.

* Error correcting codes. You could just drop this in a ThreeBears
implementation, pass tests and still have a chosen-ciphertext attack,
though probably not an easy one.

* Rejection sampling that doesn’t reject the right things, eg weight
restrictions on private keys. Again, this results in a chosen-ciphertext
attack, but likely not an easy one or perhaps not even a practical one.

* Corner cases in complex routines, such as Gaussian sampling,
fixed-weight sampling, polynomial inversions, or error-correcting
codes (especially LAC’s complicated codes).

* High-bandwidth side channels like timing attacks. These are also
more likely in systems that use the above complex routines.




Some less-severe classes of errors would be ones that pass a
less-comprehensive test suite, or have smaller security implications.
These include at least:

* CCA rejection code. This passes all but negative tests.

* Sampling the private key blatantly wrong. This is catastrophic but
fails KATs, particularly in schemes that specify the private key as a seed.

* Sampling the noise blatantly wrong, or not adding it at all. This is
catastrophic but fails both KATs and CCA interop.

* Rounding (x*p+k)/q wrong, or likewise adding the wrong constant
when unrounding. This might pass tests depending on iteration count,
especially if you unround wrong. I’m not sure the resulting CCA
attacks are practical though.

* More invasive or lower-bandwidth side-channel attacks.




Within these categories of bugs, I agree with Peter that reusing the same
distribution does significantly reduce the probability of serious,
security-relevant bugs in noise sampling, but probably not to the same
level as rounding.

From this point of view, the safest scheme would be a power-of-2-modulus
“product NTRU” RLWR scheme with iid sampling, which expands the
private key from a seed and uses no error correction, and which uses
a U^bot variant for CCA security. No such scheme is in round 2, but
Saber is probably closest.

> * Out of the remaining 5 KEMs (lac*, newhope*), 3 have weight
> restrictions for the key noise and not for the ciphertext noise.
> The claim also fails for those 3.


To me, this seems like a bigger footgun than vectors of unequal lengths.

Regards,
— Mike

D. J. Bernstein

Sep 13, 2019, 4:06:12 AM
to pqc-...@list.nist.gov
Thanks to Mike for his effort to build a list of ways that post-quantum
implementations can fail. A more detailed analysis changes some of the
conclusions; see below. I think that further community effort to analyze
likely implementation failures will be tremendously useful, both to
guide implementors and to support risk assessments.

Mike Hamburg writes:
> I can imagine ways to put a bug in one of these that don’t apply
> to the other

Exactly. See, e.g., the Dilithium software disaster. There's a long
history before this of implementations copying randomness where they
should be using new randomness: e.g., DSA's weakness under repeated
signing randomness produced the Sony PS3 security failure in 2010 and
Bitcoin losses more recently. When Rivest was criticizing this
misfeature of DSA in 1992, he wrote "The poor user is given enough rope
with which to hang himself---something a standard should not do."

> but to say this makes Peter’s claim wrong is cutting things a bit fine.

The claim was "Most (?) LWE schemes use the same routine for sampling
this value and for sampling the noise." In fact, NewHope is the only
round-2 lattice candidate where {short elements} = {key-noise vectors} =
{ciphertext-noise vectors}. Furthermore, even for NewHope, I gave two
examples of implementation failures that would produce security damage
that would have been reduced or eliminated by a rounding design: one
example was failing to add the noise, and the other was using 4 or 8
coins instead of 16.

> * The probability of bugs in a straightforward (optimized or
> unoptimized) implementation of a scheme.

I think the original "ease of mis-implementation" phrasing is helpful in
emphasizing _how_ these bugs happen. It's not some uncontrollable random
process; it's an analyzable question regarding the software-engineering
(and hardware-engineering) process applied to these candidates.

> * The probability that these bugs have security implications.

Beyond the qualitative question of whether there's _any_ security loss,
there can be quantitative differences in _how much_ security is lost.
See, e.g., my example of lower noise doing more damage to aG+e than to
Round(aG).

> * The probability that these bugs are undetected by various levels of
> testing.

Here I think it's important to pin down exactly what tests to recommend,
and to analyze how hard the tests are.

For example, you list "full formal verification", but this actually
covers many different things at many different levels of difficulty,
with many interactions with the details of what's being verified. There
has been some verification of some post-quantum software components, but
much more needs to be done.

A closer look shows that code complexity can be a poor predictor of
verification difficulty. For example, the verification in

https://sorting.cr.yp.to/verif.html

has covered many thousands of lines of machine code, and the underlying
techniques don't really care how long or complicated the code is.
Meanwhile there are other much simpler post-quantum subroutines that
aren't verified yet.

> * Private keys from bad RNGs.

I guess here you're talking about the incoming random bytes (e.g.,
getrandom() in Linux or randombytes() in SUPERCOP), rather than the
higher level of randomness generation that failed for the Dilithium
software.

https://blog.cr.yp.to/20170723-random.html describes testable,
verifiable mechanisms to produce random bytes, so I don't see why you
claim that failures here would "pass a comprehensive test suite". This
again isn't some uncontrollable process; when I see RNG failures, I ask
how those failures could have been caught, and how the designs could
have been improved to make this easier.

This level of randomness generation is also covered by various existing
standards, and it's probably good to leave it as a separate topic unless
you see reasons to think that this interacts with post-quantum designs.
People who want to put an extra hashing layer around randombytes() can
do it independently of the post-quantum design; for software engineering
and security review it's clearly better to provide this extra layer as a
new randombytes() independent of the post-quantum system, and to have
the post-quantum systems all assume that randombytes() works properly,
rather than to take Kyber's approach of a more complicated post-quantum
spec built on top of unclear assumptions regarding the RNG.

> * Carry propagation, overflow or reduction bugs, either in imprecise
> reductions or in ThreeBears’ carry chains. Mostly can’t happen in
> power-of-2 schemes.

There's a cross-submission team working on software multipliers. I think
it's clear that the simplest way to get good speeds for everything is to
use NTTs everywhere. There are still some corner cases where larger
Karatsuba/Toom multipliers are producing better Haswell speeds, but

* this corner is narrowing (e.g., the NTRU Prime multiplier speedups
announced last month come from using an NTT everywhere internally);

* it's hard to see how the remaining Toom-is-faster cases matter when
communicating ciphertexts is clearly a much larger cost; and

* small platforms surely shouldn't be spending the space on these
large Toom multipliers.

Given this picture, I don't see how the need to verify NTTs will
disappear. Verifying NTTs means tracing ranges and algebraic properties
through chains of operations. Fortunately, ECC verification has given
the community lots of practice with range analysis etc., and the same
tools suffice for verifying NTT-based multipliers.

(Outside the multipliers, reductions generally need to be "freezing"
rather than just "squeezing", and this makes the verification task even
easier---each reduction is verified separately and involves very few
possible inputs.)
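
Concretely, the range facts in question look like the annotations below (a schematic forward-NTT layer with an assumed Montgomery reduction; this is illustrative, not any submission's exact code):

  #include <stdint.h>

  #define Q 3329   /* placeholder modulus */

  /* assumed: returns a*2^-16 mod Q with |result| < Q, for |a| < 2^15 * Q */
  extern int16_t montgomery_reduce(int32_t a);

  /* One butterfly layer.  The invariant a verifier has to carry through the
     chain: if every |a[i]| < k*Q on entry then every |a[i]| < (k+1)*Q on
     exit, since |t| < Q.  With int16_t coefficients this bounds how many
     layers may pass before a full ("freezing") reduction is required. */
  void ntt_layer(int16_t a[256], int len, const int16_t *zetas)
  {
      int k = 0;
      for (int start = 0; start < 256; start += 2 * len) {
          int16_t zeta = zetas[k++];            /* per-block twiddle factor */
          for (int j = start; j < start + len; j++) {
              int16_t t = montgomery_reduce((int32_t)zeta * a[j + len]);
              a[j + len] = a[j] - t;
              a[j]       = a[j] + t;
          }
      }
  }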

With this background in mind, I'm puzzled by your claim that power-of-2
systems have an advantage regarding implementation failures. Logically,
the only possible scenarios are the following:

* These systems end up with NTT-based implementations _and_
non-NTT-based implementations. This sounds like the maximum
verification work, more than schemes that have everyone use NTT.
(I'm not saying this verification work is infeasible.)

* These systems end up with only non-NTT-based implementations. Is
this plausible, given the pressure to support small platforms?

* These systems end up with only NTT-based implementations. This puts
them in the same bucket as everybody else.

I also don't understand why you say that overflow bugs etc. would "pass
a comprehensive test suite"; your list of tests includes "full formal
verification". Perhaps the answer is that your list is limited to a
type of verification that's too weak to catch these bugs, but this is
further reason to think that the list entries need to be pinned down in
more detail.

> * Rejection sampling that doesn’t reject the right things, eg weight
> restrictions on private keys. Again, this results in a chosen-ciphertext
> attack, but likely not an easy one or perhaps not even a practical one.

You seem to misunderstand how weight restrictions work in modern
designs. These designs _don't_ generate random vectors and then reject
vectors of the wrong weight; they use deterministic functions such as

static void Short_fromlist(small *out,const uint32 *in)
{
  uint32 L[p];
  int i;

  /* first w entries: clear the bottom bit, so (L[i]&3)-1 ends up in {-1,+1} */
  for (i = 0;i < w;++i) L[i] = in[i]&(uint32)-2;
  /* remaining p-w entries: force the bottom two bits to 01, so (L[i]&3)-1 = 0 */
  for (i = w;i < p;++i) L[i] = (in[i]&(uint32)-3)|1;
  /* constant-time sort; the random high bits put the w nonzero entries
     at random positions */
  crypto_sort_uint32(L,p);
  for (i = 0;i < p;++i) out[i] = (L[i]&3)-1;
}

to convert arrays of randomness into fixed-weight vectors.

I do think that rejection sampling has a nonzero cost. For example,
rejection sampling in most submissions is the primary reason that we
don't yet have reliable tests for constant-time behavior---I expect this
to be fixed but that's not the same as saying it's fixed today.

> * Corner cases in complex routines, such as Gaussian sampling,
> fixed-weight sampling, polynomial inversions, or error-correcting
> codes (especially LAC’s complicated codes).

I think that casually tossing around the word "complex" is a poor
substitute for analyzing how bugs will happen. For example, regarding
fixed-weight sampling, how exactly do you imagine that bugs will get
past (1) existing tests of the Short_fromlist() function shown above and
(2) existing verification of the underlying sorting code?

> * CCA rejection code. This passes all but negative tests.

How would "full formal verification" fail to catch this? The way that
positive tests fail here, namely by limiting the inputs to the set of
enc results, is the opposite of what verification projects aim for,
namely analyzing all possible inputs.

> * Sampling the private key blatantly wrong. This is catastrophic but
> fails KATs

You're taking an extreme case (high probability of failures, complete
security disaster---e.g., zero noise), but it's important to analyze a
much larger category of potential sampling failures. How could noise be
_occasionally_ much lower? What if there's still significant entropy,
but an unacceptable reduction in security level?

Another important question here is how often people avoid using KATs.
The whole concept of KAT files (as opposed to a stream cipher generating
inputs and a hash function producing a small checksum) creates an
unnecessary conflict between the size of those files and how many tests
are carried out. There are also quite a few cases where people have
broken KATs by changing, e.g., how randomness is used inside keygen;
maybe these changes are safe but obviously the KATs don't catch any
bugs caused by the changes. Given this testing difficulty, it would be
interesting to see whether designers can reduce the motivation for
breaking KATs. Why weren't people satisfied with how randomness was used
inside keygen in the first place?
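
As a sketch of the checksum approach (an illustrative harness, not SUPERCOP's actual code; the function under test and its sizes are placeholders), a deterministic generator drives thousands of inputs and every output byte is folded into one short value that gets compared against a single expected constant:

  #include <stdint.h>
  #include <stdio.h>

  #define IN_LEN  761           /* placeholder sizes for the function under test */
  #define OUT_LEN 1007
  #define QMOD    4591

  extern void encode_round(uint8_t out[OUT_LEN], const uint16_t in[IN_LEN]);

  static uint64_t state = 0x9e3779b97f4a7c15u;   /* tiny deterministic PRG (xorshift64*) */
  static uint64_t prg(void)
  {
      state ^= state >> 12; state ^= state << 25; state ^= state >> 27;
      return state * 0x2545f4914f6cdd1du;
  }

  int main(void)
  {
      uint64_t checksum = 0xcbf29ce484222325u;   /* FNV-1a accumulator */
      for (int t = 0; t < 4096; t++) {           /* thousands of inputs, no stored file */
          uint16_t in[IN_LEN];
          uint8_t out[OUT_LEN];
          for (int i = 0; i < IN_LEN; i++) in[i] = (uint16_t)(prg() % QMOD);
          encode_round(out, in);
          for (int i = 0; i < OUT_LEN; i++) {
              checksum ^= out[i];
              checksum *= 0x100000001b3u;
          }
      }
      printf("%016llx\n", (unsigned long long)checksum);   /* one line to compare */
      return 0;
  }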

> particularly in schemes that specify the private key as a seed.

I don't see how this affects the KAT reliability, or anything else
you're saying. Of course storing a seed saves secret-key space at the
expense of seed-expansion cost, but how does this discourage bugs?

> * Rounding (x*p+k)/q wrong, or likewise adding the wrong constant
> when unrounding. This might pass tests depending on iteration count,
> especially if you unround wrong.

Tweaking the constants would end up having a failure probability of at
least 1/q in each entry, and there's no reason that these probabilities
would be correlated across entries. If there are, say, 256 entries and
q=5167, then 100 KATs would miss the failing case with probability
(1 - 1/q)^(256*100), about 0.7%.

Obviously the probability rapidly drops with more KATs. SUPERCOP's
crypto_encode() checksums (used for, e.g., crypto_encode/761x1531round
inside NTRU Prime) run through 4096 inputs. If each input has 256
chances of hitting 1 failing case out of q, and q<=8192, then the
overall chance of missing the failing case is below 1/2^184.

(The recent NTRU Prime slides already mentioned that the "factored" NTRU
Prime software uses "modules with separate tests". I would encourage
implementors of other systems to aim for a similar structure.)

Of course the no-testing situation would allow screwed-up rounding, but
it's still hard to see how this would produce a serious security loss,
whereas similar key-generation failures and noise-generation failures
can trivially destroy all security. For designers thinking about
implementation failures, switching from noise to rounding

* replaces potentially disastrous noise-generation failures with
much less damaging rounding failures, and

* seems to lessen the impact of key-generation failures.

Again, I would be interested in quantification of how much security aG+e
loses if a,e are from narrower distributions, vs. how much security
Round(aG) loses.

> Within these categories of bugs, I agree with Peter that reusing the same
> distribution does significantly reduce the probability of serious,
> security-revelant bugs in noise sampling, but probably not to the same
> level as rounding.

Hmmm. I think what you're saying here is

* highest risk: noisy key, differently noisy ciphertext;
* less risk: noisy key, similarly noisy ciphertext;
* lowest risk: rounded key, rounded ciphertext.

But this doesn't look like what Peter was saying ("prone to very much
the same mistakes", as if LWR had as much risk as LWE). This is also
ignoring the extra implementation risks from matrix/module outer loops
compared to pure ring schemes.

> From this point of view, the safest scheme would be a power-of-2-modulus
> “product NTRU” RLWR scheme with iid sampling, which expands the
> private key from a seed and uses no error correction, and which uses
> a U^bot variant for CCA security.

I agree regarding rounding vs. noise. I'd guess the same regarding error
correction, but I think this needs a real analysis of failure modes in
each error-correction option. I don't agree regarding modulus choice,
and regarding fixed weight vs. variable weight; see above. I don't see
your arguments for seeds and U^bot. Finally, Product NTRU vs. Quotient
NTRU is a much more complicated question than what either of us has
covered here.

---Dan

Mike Hamburg

Sep 13, 2019, 12:44:11 PM
to D. J. Bernstein, 'Rainer Urian' via pqc-forum
Thanks for your insight, Dan.

To be clear, when I refer to a “comprehensive test suite”, I mean testing
the code by running it, not formal verification. I recognize that you could
have a partially formally verified system, or verify the wrong thing, but
nonetheless I am not as worried about Project Wycheproof having a
bug as I am about some random IoT or V2X or smart card system
having a bug.

So, in the below, please consider that I am focusing on systems which
do not attempt full formal verification, or indeed any formal verification
beyond (possibly) lack of compiler or maybe memory-safety warnings.

> On Sep 13, 2019, at 1:06 AM, D. J. Bernstein <d...@cr.yp.to> wrote:
>
> Thanks to Mike for his effort to build a list of ways that post-quantum
> implementations can fail. A more detailed analysis changes some of the
> conclusions; see below. I think that further community effort to analyze
> likely implementation failures will be tremendously useful, both to
> guide implementors and to support risk assessments.
>
> Mike Hamburg writes:
>> I can imagine ways to put a bug in one of these that don’t apply
>> to the other
>
> Exactly. See, e.g., the Dilithium software disaster. There's a long
> history before this of implementations copying randomness where they
> should be using new randomness...

Sure. And if you use the same randomness for “small values” and
“key noise”, it will also cause a disastrous failure. But doing either
of these things is unlikely to pass reasonable tests, unless everyone
makes the same mistake.

>> but to say this makes Peter’s claim wrong is cutting things a bit fine.
>
> The claim was "Most (?) LWE schemes use the same routine for sampling
> this value and for sampling the noise." In fact, NewHope is the only
> round-2 lattice candidate where {short elements} = {key-noise vectors} =
> {ciphertext-noise vectors}. Furthermore, even for NewHope, I gave two
> examples of implementation failures that would produce security damage
> that would have been reduced or eliminated by a rounding design: one
> example was failing to add the noise, and the other was using 4 or 8
> coins instead of 16.

Yes, but these failures are unlikely to pass even a single test vector. If
done during encryption, they also won’t interoperate. So while disastrous,
they would likely be found by minimal testing.

Also, you chose a particular reading of Peter’s claim that made it false.
Consider that in the specification of ThreeBears, and also in each of my
implementations of it, the short key elements and noise are sampled by
calling the function “noise”. Thus, at both the spec and implementation
level, this scheme uses the same routine for short elements, key-noise
vectors and ciphertext-noise vectors, just as Peter claimed.

I don’t know if the same is true for other LWE schemes, but Peter’s
claim is demonstrably true under the natural reading for a scheme other
than NewHope.

>> * The probability of bugs in a straightforward (optimized or
>> unoptimized) implementation of a scheme.
>
> I think the original "ease of mis-implementation" phrasing is helpful in
> emphasizing _how_ these bugs happen. It's not some uncontrollable random
> process; it's an analyzable question regarding the software-engineering
> (and hardware-engineering) process applied to these candidates.

As you like.

But this is inconsistent with the rest of your analysis, which appears
to focus on the existence of correct implementations or the ease of
producing and verifying them, more than on the “ease of
mis-implementation”.

>> * The probability that these bugs are undetected by various levels of
>> testing.
>
> Here I think it's important to pin down exactly what tests to recommend,
> and to analyze how hard the tests are.
>
> For example, you list "full formal verification", but this actually
> covers many different things at many different levels of difficulty,
> with many interactions with the details of what's being verified. There
> has been some verification of some post-quantum software components, but
> much more needs to be done.

Fair, but my post was mainly focused on implementations that are not
formally verified at all, since that is where I think most of the risk is. For
example, each hardware implementation and each embedded software
implementation of ECC is different, and very few of them are formally
checked (but they might use formally checked components, like formulas
from Hyperelliptic).

>> * Carry propagation, overflow or reduction bugs, either in imprecise
>> reductions or in ThreeBears’ carry chains. Mostly can’t happen in
>> power-of-2 schemes.
>
> There's a cross-submission team working on software multipliers. I think
> it's clear that the simplest way to get good speeds for everything is to
> use NTTs everywhere….
>
> With this background in mind, I'm puzzled by your claim that power-of-2
> systems have an advantage regarding implementation failures. Logically,
> the only possible scenarios are the following:
>
> * These systems end up with NTT-based implementations _and_
> non-NTT-based implementations. This sounds like the maximum
> verification work, more than schemes that have everyone use NTT.
> (I'm not saying this verification work is infeasible.)

This is a fair quibble, which might tilt the balance in favor of Kyber and
NewHope over Saber, since they must use NTT, and might therefore cause
more users to reuse existing verified code instead of implementing
multiplication themselves.

Otherwise, “maximum verification work” only applies if the verification work
is shared across implementations, which in the case of 1-off embedded
implementations it usually isn’t. If it isn’t shared, then I expect schoolbook
or Karatsuba implementations to be harder to screw up in the first place,
and easier to verify, especially for power-of-2 moduli.

This is consistent with your “ease of mis-implementation” metric. It’s
harder to mis-implement power-of-2 reduction than non-power-of-2
reduction, because you pretty much have to go beyond schoolbook
and even Karatsuba to do it.
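
To make this concrete, here is roughly the kind of code I have in mind: a
minimal schoolbook sketch over Z_q[x]/(x^N + 1) with made-up Saber-like
parameters (q = 2^13, N = 256), not any submission’s reference code. The
coefficient reduction is a single mask, so there is no reduction routine
to get subtly wrong:

  #include <stdint.h>

  #define N 256
  #define LOGQ 13
  #define QMASK ((1u << LOGQ) - 1)

  /* c = a*b in Z_q[x]/(x^N + 1) with q = 2^LOGQ; "mod q" is just "& QMASK". */
  void poly_mul_schoolbook(uint16_t c[N], const uint16_t a[N], const uint16_t b[N])
  {
    uint32_t r[2 * N] = {0};
    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        r[i + j] = (r[i + j] + (uint32_t)a[i] * b[j]) & QMASK;
    for (int i = 0; i < N; i++)
      c[i] = (uint16_t)((r[i] - r[i + N]) & QMASK);  /* x^N = -1: fold the top half */
  }

With a prime q, each of those reductions instead needs something like a
Barrett or Montgomery step, which is exactly where the subtle bugs tend
to hide.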

> * These systems end up with only non-NTT-based implementations. Is
> this plausible, given the pressure to support small platforms?

Schoolbook uses less memory and code than NTT for non-NTT-optimized
schemes. The same may even be true for 1-2 levels of Karatsuba. Since
small platforms care about size as much as speed, they will end up with
many non-NTT-based implementations.

> * These systems end up with only NTT-based implementations. This puts
> them in the same bucket as everybody else.

I’m pretty sure this won’t happen. Someone somewhere will base an
implementation on the reference code.

>> * Rejection sampling that doesn’t reject the right things, eg weight
>> restrictions on private keys. Again, this results in a chosen-ciphertext
>> attack, but likely not an easy one or perhaps not even a practical one.
>
> You seem to misunderstand how weight restrictions work in modern
> designs.

Yeah, I misunderstood you as saying that some of the schemes were
using rejection sampling, and I hadn’t read all their second-round
specs.

>> * Corner cases in complex routines, such as Gaussian sampling,
>> fixed-weight sampling, polynomial inversions, or error-correcting
>> codes (especially LAC’s complicated codes).
>
> I think that casually tossing around the word "complex" is a poor
> substitute for analyzing how bugs will happen. For example, regarding
> fixed-weight sampling, how exactly do you imagine that bugs will get
> past (1) existing tests of the Short_fromlist() function shown above and
> (2) existing verification of the underlying sorting code?

This is well and good if everyone copy-pastes djbsort. If they do not,
they will either be using qsort or mergesort (not constant-time) or
trying to write their own constant-time sort, which will probably not
be formally verified.

It is easier to mis-implement this than it is to mis-implement i.i.d.
sampling.
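
For contrast, here is a minimal sketch of the kind of i.i.d. sampling I
mean: a centered-binomial coefficient in the style of the Kyber/NewHope
noise samplers with eta = 2 (illustrative only, not anyone’s reference
code). There is no sorting, no rejection loop, and no weight invariant
to maintain:

  #include <stdint.h>

  /* One coefficient from 4 uniformly random bits: (b0 + b1) - (b2 + b3),
     a value in {-2,...,2} with a centered binomial distribution. */
  int8_t cbd2_coeff(uint8_t nibble)
  {
    int a = (nibble & 1) + ((nibble >> 1) & 1);
    int b = ((nibble >> 2) & 1) + ((nibble >> 3) & 1);
    return (int8_t)(a - b);
  }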

>> * Sampling the private key blatantly wrong. This is catastrophic but
>> fails KATs
>
> You're taking an extreme case (high probability of failures, complete
> security disaster---e.g., zero noise), but it's important to analyze a
> much larger category of potential sampling failures. How could noise be
> _occasionally_ much lower?

Perhaps you have a type of bug in mind that I don’t. At least for i.i.d.
schemes, the only way I can see this happening by accident is with
an RNG failure or a memory safety bug. On top of the verification
strategies you discussed, these issues are mostly orthogonal to the
scheme being implemented.

> What if there's still significant entropy,
> but an unacceptable reduction in security level?


This is why specifying the private key expansion from seed is useful.
Any other location where the sampling is significantly wrong will result
in interop test failures, and test vector failures even without a
controllable randombytes().

> Another important question here is how often people avoid using KATs.
> The whole concept of KAT files (as opposed to a stream cipher generating
> inputs and a hash function producing a small checksum) creates an
> unnecessary conflict between the size of those files and how many tests
> are carried out.

I agree that this is better than KATs.

>> particularly in schemes that specify the private key as a seed.
>
> I don't see how this affects the KAT reliability, or anything else
> you're saying. Of course storing a seed saves secret-key space at the
> expense of seed-expansion cost, but how does this discourage bugs?

The deterministic seed-expansion function is easier to test than a
non-deterministic keygen, especially on systems where it’s hard to
patch out the RNG in favor of a particular randombytes()
implementation.

Also, if private keys are moved between different implementations,
bugs in seed expansion will cause interop failures.

>> * Rounding (x*p+k)/q wrong, or likewise adding the wrong constant
>> when unrounding. This might pass tests depending on iteration count,
>> especially if you unround wrong.
>
> Tweaking the constants would end up having a failure probability of at
> least 1/q in each entry, and there's no reason that these probabilities
> would be correlated across entries. If there are, say, 256 entries and
> q=5167 then 100 KATs would miss the failing case with probability 0.7%.
>
> Obviously the probability rapidly drops with more KATs….

KATs are annoying enough that people don’t always run the whole file,
but instead paste 1-2 examples into their test suite. Or they test only
the KATs that made it into the RFC, or whatever. So even the probability
of failing a small number of KATs is important.
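
To put rough numbers on that, here is the arithmetic for Dan’s example
above, under his assumption that the entries fail independently (a
back-of-the-envelope sketch, not a model of any particular scheme):

  #include <stdio.h>
  #include <math.h>

  int main(void)
  {
    double q = 5167.0, entries = 256.0;
    double miss_one = pow(1.0 - 1.0 / q, entries);  /* one KAT fails to expose the bug */
    printf("bug survives 1 KAT:    %.3f\n", miss_one);             /* about 0.95 */
    printf("bug survives 2 KATs:   %.3f\n", pow(miss_one, 2.0));   /* about 0.91 */
    printf("bug survives 100 KATs: %.3f\n", pow(miss_one, 100.0)); /* about 0.007 */
    return 0;
  }

So the 0.7% figure is right if all 100 KATs are run, but with only a
couple of pasted vectors the bug survives roughly 90% of the time.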

>> Within these categories of bugs, I agree with Peter that reusing the same
>> distribution does significantly reduce the probability of serious,
>> security-relevant bugs in noise sampling, but probably not to the same
>> level as rounding.
>
> Hmmm. I think what you're saying here is
>
> * highest risk: noisy key, differently noisy ciphertext;
> * less risk: noisy key, similarly noisy ciphertext;
> * lowest risk: rounded key, rounded ciphertext.

> But this doesn't look like what Peter was saying ("prone to very much
> the same mistakes", as if LWR had as much risk as LWE).

I think it’s entirely compatible with what Peter was saying. Rounding
reduces the probability that you screw up the noise (thus moving from
“less” to “lowest”), but you could still screw up the key in just as many
ways. In systems where sampling the key and sampling the noise
share most of their code, the difference in screwup probability (or
ease of mis-implementation) is slight.

> This is also
> ignoring the extra implementation risks from matrix/module outer loops
> compared to pure ring schemes.

True, but again these bugs are unlikely to pass tests, at least once the
reference implementation is correct.

Regards,
— Mike

D. J. Bernstein

unread,
Sep 14, 2019, 5:09:16 PM9/14/19
to pqc-...@list.nist.gov
It's important to ask how a post-quantum implementor could screw up in a
way that would still pass NIST's KATs or SUPERCOP's checksums, and how
the resulting damage varies across designs. It's _also_ important to ask
how often implementors _don't_ use KATs (and why they don't), and again
how the resulting damage varies across designs.

Asking only the first question would be a mistake, missing many of the ways
that designers can help stop real-world crypto failures. For example:

* DSA/ECDSA repeated-nonce security disasters happened even though
all examples I've seen would have been caught by KATs. Designs that
hash the message into the nonce are safer---to screw these up,
implementors have to skip the KATs _and_ the hashing.

* ECDH invalid-point security disasters have been happening for
decades even though they would have been caught by KATs that use
invalid points. Designs that send only compressed points are safer,
especially with twist-secure curves.

For more discussion see https://cr.yp.to/talks.html#2015.01.07, which
gives the following advice: "Creating or evaluating a design? _Think
about the implementations_. What will the implementors do? What errors
are likely to appear in implementations? Can you compensate for this?"

Mike Hamburg writes:
> To be clear, when I refer to a “comprehensive test suite”, I mean
> testing the code by running it, not formal verification.

Thanks for the clarification.

I think you were right to include "full formal verification" on your
list of "levels of testing", and wrong to use "comprehensive" to refer
merely to tests of _some_ inputs (how many inputs?). We've seen

* examples of ECC bugs that are missed by millions of tests, and
* advances in verification already handling full implementations of
X25519 and various interesting chunks of post-quantum code.

As Firefox illustrates, verification overlaps modern real-world crypto.
I don't see the argument for leaving verification out of the picture.

A serious analysis of implementation failures has to cover the cost,
prevalence, effectiveness, etc. of various testing methods. Conclusions
regarding tests should be driven by this analysis, not by preconceived
notions of what's "comprehensive" (or "reasonable"; see below).

[ re "copying randomness where they should be using new randomness": ]
> unlikely to pass reasonable tests

The round-1 Dilithium software disaster was this type of bug. Are you
saying that the Dilithium tests---100 KATs!---weren't "reasonable"?

People asking about "ease of mis-implementation" are asking a serious
technical question about how the development process can go wrong---and
asking about the impact of variations in designs. You seem to think that
some failures aren't "reasonable"; but if typical development doesn't
qualify as "reasonable" then you're missing the core of the question.

> unless everyone makes the same mistake

All of the implementations in the round-1 Dilithium submission had the
same security flaw, because they were copying the same mistaken code.
Concretely, the "Additional Implementations" in the submission had some
performance-critical subroutines optimized for AVX2, but copied other
code from the reference implementation. This copying is an example of
"software reuse", a very common practice in software development.

Do you imagine that most post-quantum implementors will develop code
from the spec? It's almost always easier to reuse existing code---even
if some parts of the code need to be tweaked for a new environment.

Google wrote a new implementation of NTRU-HRSS for CECPQ2, but also
changed various details, so of course the KATs are different. I haven't
yet seen evidence of Google doing the extra code development that would
be required to use this new implementation as a serious double-check on
the original KATs.

NTRU Prime did a code rewrite from the round-2 spec (spec -> Sage -> C)
_and_ did extra work to generalize this code to also handle the round-1
spec, allowing the old and new implementations to check each other. But

* it's still possible that there was a mistake made by the round-1
code and again by the round-2 code, and

* it's also possible that there were mistakes in the part of the
round-2 code beyond the round-1 overlap, so the round-2 KATs could
be wrong even if the round-1 KATs are correct.

Furthermore, my impression is that most of the candidates haven't made
_any_ effort to do multiple implementations straight from the spec.

> > This is also
> > ignoring the extra implementation risks from matrix/module outer loops
> > compared to pure ring schemes.
> True, but again these bugs are unlikely to pass tests, at least once the
> reference implementation is correct.

You keep trying to dismiss critical Dilithium-type bugs by imagining a
universe where people are always using correct KATs.

There haven't been any announcements of errors in the KATs for round-2
Dilithium. Do you thus claim to be confident that the KATs are correct?
Presumably not. How do you imagine that the community is going to build
this confidence? Same question regarding other candidates.

Of course KATs go a long way in catching bugs, but it's important to
also analyze ways that bugs can get past KATs---how the KATs can be
wrong, reasons that implementors aren't using KATs, how bugs can be too
rare to be caught by KATs---and what this means for designers.

In other parts of your message, you change tune, advocating other design
features because you say they would help for implementors who _aren't_
using KATs. See below.

> Also, you chose a particular reading of Peter’s claim that made it false.
> Consider that in the specification of ThreeBears, and also in each of my
> implementations of it, the short key elements and noise are sampled by
> calling the function “noise”.

So, instead of asking whether the short element and the noise are
sampled by "the same routine" as he wrote, you're asking merely whether
these two routines have _some_ overlap of code? This is content-free:
e.g., the SABER code that creates noise by rounding uses some of the
same CPU instructions as the code that creates short elements.

If you meant something quantitative such as measuring _how much_ code is
shared, then you're obviously diverging even more from what Peter wrote.
(He said "LWR schemes obviously don't"---clearly a yes/no question, not
a quantitative question.) Furthermore, this metric clearly isn't a valid
input to risk analysis. If, e.g., a modified design works the same way
except for having this shared routine more complicated and harder to get
right, then saying "larger volume of shared code; good!" makes no sense.

As an analogy, "provably secure" cryptosystems sometimes fail because
proofs are wrong, but they also frequently fail because of the things
that _haven't_ been proven. Measuring designs by the volume of proofs is
counterproductive---it's missing most risks, and at the same time it's
encouraging designers to add unnecessary risks simply to have more
proofs. A proper risk analysis takes a different perspective, asking
what can go wrong; the answer sometimes depends on what has been proven
but it also depends on what has _not_ been proven. Similarly, a proper
analysis of code risks has to look at the whole code base, including the
parts that are shared but also the parts that are _not_ shared.

> this is inconsistent with the rest of your analysis, which appears
> to focus on the existence of correct implementations or the ease of
> producing and verifying them, more than on the “ease of
> mis-implementation”.

As I wrote before: "If insecure implementations are easier to write than
secure implementations but still seem to work then it's unsurprising
that insecure implementations are deployed ... If a design makes insecure
implementations _harder_ than secure implementations then the risks are
lower. (Note that the question here is about a whole cloud of potential
implementations surrounding each secure implementation, and a designer
who simply tries to minimize the cost of a secure implementation could
end up encouraging insecure implementations if those become too easy.)"

A proper analysis has to ask about the secure implementations _and_ the
insecure implementations. That's why I've been looking at both sides of
this in this thread.

> Otherwise, “maximum verification work” only applies if the
> verification work is shared across implementations

Huh? I'm talking about the work of verifying NTT-based implementations
_plus_ the work of verifying Toom-based implementations. I wrote that
this "sounds like the maximum verification work, more than schemes that
have everyone use NTT." I'm not sure what you're disagreeing with.

> which in the case of 1-off embedded implementations it usually isn’t.

You mean that 1-off embedded implementations usually aren't verified? Or
that the work in verifying implementations usually isn't shared?

The point I'm making is that catching multiplier bugs is harder if the
ecosystem has NTT-based implementations _and_ Toom-based implementations
rather than just NTT-based implementations.

> Schoolbook uses less memory and code than NTT for non-NTT-optimized
> schemes.

I agree that the ecosystem will also have schoolbook implementations,
and those also need verification. However, the work needed to get the
whole ecosystem right comes primarily from the fancier multipliers.

> using qsort or mergesort (not constant-time)

Indeed, existing variable-time libraries feature prominently in the
literature analyzing how people screw up constant-time crypto. However,
sorting is only one tiny corner of this, measured by the amount of code
in question and by the level of security damage that seems to be done.
The elephant in the room is variable-time arithmetic, such as

* variable-time bigints and
* variable-time polys and
* variable-time matrices.

GMP now supports constant-time bigints, but it'll be years before this
percolates up through higher-level libraries and applications. NTL
hasn't even started down an analogous path for polynomials.

One sensible response is to provide more constant-time implementations
for people to reuse. This is an example of making secure implementations
easier to write.
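
To illustrate the flavor of building block in question (a generic sketch,
not code from any particular library), here is a conditional move with no
branch on the secret selection bit:

  #include <stddef.h>
  #include <stdint.h>

  /* r = x if b is 1, r unchanged if b is 0; b never feeds a branch or an index. */
  void ct_cmov(uint8_t *r, const uint8_t *x, size_t len, uint8_t b)
  {
    uint8_t mask = (uint8_t)(0 - (b & 1));  /* 0x00 or 0xFF */
    for (size_t i = 0; i < len; i++)
      r[i] ^= mask & (r[i] ^ x[i]);
  }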

Another sensible response is developing better tools to systematically
enforce constant-time behavior. To the extent that the tools are used,
they make insecure implementations harder.

> or trying to write their own constant-time sort, which will probably
> not be formally verified.

Here I think you're forgetting the "ease" part of the question about
"ease of mis-implementation".

Knuth came up with a particularly clever version of Batcher's sorting
network, producing surprisingly short code, which is packaged (and
verified after compilation on various machines) as portable code in
djbsort. It's easy to copy this, and it's _much_ more work to figure it
out from scratch.
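
For concreteness, the core of such a sorting network is a branch-free
compare-exchange along these lines (a sketch of the idea, not the
verified djbsort code):

  #include <stdint.h>

  /* After the call, *a <= *b; no branch or memory index depends on the values. */
  static void minmax(int32_t *a, int32_t *b)
  {
    int64_t diff = (int64_t)*b - (int64_t)*a;
    int32_t swap = -(int32_t)((uint64_t)diff >> 63);  /* -1 iff *b < *a, else 0 */
    int32_t t = swap & (*a ^ *b);
    *a ^= t;
    *b ^= t;
  }

The network of compare-exchanges that sorts a whole array is the part
that is genuinely tricky to get right, which is why copying the verified
code is the easy path.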

> > You're taking an extreme case (high probability of failures, complete
> > security disaster---e.g., zero noise), but it's important to analyze a
> > much larger category of potential sampling failures. How could noise be
> > _occasionally_ much lower?
> Perhaps you have a type of bug in mind that I don’t.

There are clearly many ways to have the arithmetic _sometimes_ screw up:
e.g., an implementor could easily think that "hw()" in NewHope needs
only 3 bits of output (this type of optimization is normal in hardware
implementations and vectorized software implementations), or someone
trying to avoid storing the loop counter could use an all-0 "t" in Kyber
to mark the end (this would be a roughly one-in-a-million screwup, and a
roughly one-in-a-billion chance of zero noise).
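
As a toy illustration of the hw() case (assuming hw() returns the Hamming
weight of one random byte, 0 through 8, which is how I read the reference
sampler):

  #include <stdio.h>
  #include <stdint.h>

  static unsigned hw(uint8_t a)
  {
    unsigned w = 0;
    while (a) { w += a & 1; a >>= 1; }
    return w;
  }

  int main(void)
  {
    struct { unsigned w : 3; } out;  /* the imagined "3 bits are enough" shortcut */
    out.w = hw(0xFF);                /* 8 wraps to 0; this hits 1 byte value in 256 */
    printf("hw(0xFF) = %u, stored in 3 bits = %u\n", hw(0xFF), out.w);
    return 0;
  }

Only the byte 0xFF is affected, so the output still looks like plausible
noise at a glance.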

> This is why specifying the private key expansion from seed is useful.
> Any other location where the sampling is significantly wrong will result
> in interop test failures, and test vector failures even without a
> controllable randombytes().

I agree that it's important to consider implementors who are trying
enc-dec tests but _not_ checking KATs. However, I don't understand why
you think that specifying a compressed secret-key format makes
mis-implementation harder in this scenario.

For KATs there's a deterministic randombytes() that feeds a known seed
into a known expansion function. You're distinguishing two situations
for the structure of code on top of randombytes():

* Compressed secret keys: e.g., keypair() copies a seed from
randombytes() (there are other possibilities), and this seed is the
compressed secret key provided to dec(), which does the computation
to expand the seed.

* Uncompressed secret keys: e.g., keypair() obtains the seed from
randombytes() and does the computation to expand the seed,
obtaining the uncompressed secret key provided to dec().

In both cases, a new implementation that screws up the expansion---in
particular, generating too little noise for security---will be detected
by the KATs _if_ the screwup occurs frequently enough. In both cases,
this detection fails if the bad expansion is too rare to be caught by
the KATs, or if the KATs aren't being used. Why do you think that the
first case does a better job here?

> The deterministic seed-expansion function is easier to test than a
> non-deterministic keygen

Splitting the expansion into a separate function makes the expansion
easier to test. This is independent of whether the expansion is called
by keygen or by dec.

> Also, if private keys are moved between different implementations,
> bugs in seed expansion will cause interop failures.

If the randombytes() result is moved between different implementations,
bugs in seed expansion will cause exactly the same test failures with
exactly the same probability. Furthermore, this structure is obviously
more powerful in catching bugs, since it also catches (e.g.) keygen or
enc failing to copy enough bytes from randombytes().

In other words, for implementors who are testing the same inputs across
multiple implementations, compressed keys don't make bugs less likely.
For implementors who _aren't_ testing the same inputs across multiple
implementations, compressed keys also don't make bugs less likely.

You seem to be imagining someone doing tests but limiting the inputs to
catch only dec bugs and not keygen bugs---which would clearly reward
moving work from keygen to dec. But why is this supposed to be more
likely than someone limiting the inputs in the opposite way, which would
clearly reward moving work from dec to keygen?

> especially on systems where it’s hard to patch out the RNG in favor of
> a particular randombytes() implementation.

There's no need for the target platform to provide a swappable RNG. The
implementation is written using the randombytes() abstraction, which is
specialized one way for KATs and another way for the target platform.

I'm not saying that KATs are always used---on the contrary; again
consider Google changing the NTRU-HRSS hashing details. But obviously
interop tests also don't work in the Google case. One needs a deeper
form of robustness in the system design to prevent bugs here.

> KATs are annoying enough that people don’t always run the whole file,
> but instead paste 1-2 examples into their test suite. Or they test only
> the KATs that made it into the RFC, or whatever. So even the probability
> of failing a small number of KATs is important.

Sure, it's important to consider the 0-KAT and 1-KAT and 2-KAT cases.
Thinking through the possibilities and consequences makes clear that
rounding bugs are easier to catch and less damaging than noise bugs.
Also, the damage that noise bugs do to aG+e is reduced in systems that
replace aG+e with Round(aG+e), and further reduced in systems that
replace aG+e with Round(aG).
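
For readers keeping score, the "noise" in the rounded systems is produced
by a shift rather than by a sampler call; roughly like this sketch with
Saber-like parameters q = 2^13, p = 2^10 (parameters assumed for
illustration, not the SABER reference code):

  #include <stdint.h>

  /* Round a mod-q coefficient to mod-p: add q/(2p), drop log2(q/p) bits,
     keep log2(p) bits. No noise sampler, no second seed, no nonce bookkeeping. */
  static uint16_t round_q_to_p(uint16_t x)  /* x in [0, 2^13) */
  {
    return (uint16_t)(((x + 4u) >> 3) & 0x3FFu);
  }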

---Dan

Mike Hamburg

unread,
Sep 14, 2019, 6:46:25 PM9/14/19
to D. J. Bernstein, pqc-...@list.nist.gov
Hi Dan,

This is turning into another debate where we agree on the basic points,
and that in some axis A > B > C, but you write volumes of text on why
B is very far from A (which happens to be your design) whereas I think
that A and B are pretty close. Then when the same thing happens with
something that’s not your design, you assert that A = B exactly.

Furthermore, as usual you are replying to each one of my points with
two or three of your own, so I will reply to only a few points, where I
think we haven’t been entirely clear.

For the rest, the evidence and arguments are out there, and we can
agree to disagree.

> On Sep 14, 2019, at 2:08 PM, D. J. Bernstein <d...@cr.yp.to> wrote:
> Do you imagine that most post-quantum implementors will develop code
> from the spec? It's almost always easier to reuse existing code---even
> if some parts of the code need to be tweaked for a new environment.

I frequently develop code from specs, and in fact I wrote the
ThreeBears Python and C implementations from the spec rather
than from each other for exactly this reason.

But you’re right that it’s important to reduce the probability of bugs
even if there’s only one implementation.

>> Also, you chose a particular reading of Peter’s claim that made it false.
>> Consider that in the specification of ThreeBears, and also in each of my
>> implementations of it, the short key elements and noise are sampled by
>> calling the function “noise”.
>
> So, instead of asking whether the short element and the noise are
> sampled by "the same routine" as he wrote, you're asking merely whether
> these two routines have _some_ overlap of code? This is content-free:
> e.g., the SABER code that creates noise by rounding uses some of the
> same CPU instructions as the code that creates short elements.

This is ridiculous and you know it.

First of all, the full code used to sample the key and noise can’t be
exactly the same, because at least during encapsulation it’s
deterministic (and must be for FO), so it must at least take
different seeds or indices.

When it comes down to calling the same function but with different
indices and/or lengths, or calling the same function in a loop that’s
3-long on one side and 4-long on the other side, it makes sense to
say that these functions are “us[ing] the same routine for sampling
[the secret] value and for sampling the noise”. This isn’t content-
free, and it’s not the same as saying that any two routines are the
same because they share some CPU instructions.

The fact that the same routine is used will decrease the probability
of bugs, but it is still higher than for rounding because there are
more call points. This is what Peter was saying, and it’s what I’ve
been saying, and I don’t understand what part of this you have a
problem with.

>> This is why specifying the private key expansion from seed is useful.
>> Any other location where the sampling is significantly wrong will result
>> in interop test failures, and test vector failures even without a
>> controllable randombytes().
>
> I agree that it's important to consider implementors who are trying
> enc-dec tests but _not_ checking KATs. However, I don't understand why
> you think that specifying a compressed secret-key format makes
> mis-implementation harder in this scenario.
>
> For KATs there's a deterministic randombytes() that feeds a known seed
> into a known expansion function. You're distinguishing two situations
> for the structure of code on top of randombytes():
>
> * Compressed secret keys: e.g., keypair() copies a seed from
> randombytes() (there are other possibilities), and this seed is the
> compressed secret key provided to dec(), which does the computation
> to expand the seed.
>
> * Uncompressed secret keys: e.g., keypair() obtains the seed from
> randombytes() and does the computation to expand the seed,
> obtaining the uncompressed secret key provided to dec().
>
> In both cases, a new implementation that screws up the expansion---in
> particular, generating too little noise for security---will be detected
> by the KATs _if_ the screwup occurs frequently enough. In both cases,
> this detection fails if the bad expansion is too rare to be caught by
> the KATs, or if the KATs aren't being used. Why do you think that the
> first case does a better job here?

I think I’m biased because I often develop for hardware, and we don’t
necessarily have randombytes() available within the simulator or
FPGA. But it’s easy to run test cases through, so long as you’re
checking input -> output mappings. It’s harder to hook the RNG to
match the one in the reference platform, or to port SUPERCOP’s
randombytes() to hardware just for the test.

Furthermore, I’ve seen and even written a fair number of projects
which, rather than testing with randombytes(), contain pasted (input,
output) pairs from an RFC or other spec, or just from running a
reference implementation on some other platform. Yes, it’s better
to test with randombytes() and a big loop, but people don’t always
do that. Sometimes it’s also painfully slow, depending on platform.

It’s not necessary to actually *store* the private key in compressed
form to test this way, as long as it’s expanded deterministically
from a short-ish seed. Storing it in compressed form only helps
catch bugs if you move private keys between implementations,
in which case many classes of error (but surely not all) will break
interop and not just KATs.

Cheers,
— Mike

Peter Schwabe

unread,
Sep 15, 2019, 11:35:11 AM9/15/19
to pqc-...@list.nist.gov
"D. J. Bernstein" <d...@cr.yp.to> wrote:

Dear Dan, dear all,

> The round-1 Dilithium software disaster was this type of bug. Are you
> saying that the Dilithium tests---100 KATs!---weren't "reasonable"?

At the time, there were no KATs of correct software to compare to. Of
course, this raises the question of how we can be sure that any software is
correct, and this brings us back to formal verification.

However, the bug was found through auditing, which is still probably the
most common way to obtain crypto implementations that are eventually
believed to be correct.

I agree with Mike that I am much more worried about bugs that are not
caught by comparing, for example, 100 KATs generated from one
implementation with the bug and one implementation without it.


Cheers,

Peter

Elsa Velazquez

unread,
Sep 16, 2019, 10:45:03 AM9/16/19
to pqc-forum, d...@cr.yp.to
It can't be assumed that anyone implementing PQC would necessarily use KEMs; use cases include using strictly KEMs or strictly signatures in isolation.  There isn't a standard forcing both, right?
That said, I purposefully changed around API parameters and got a false PASS on the algo I was experimenting on, and it was confirmed this could cause serious leaks.  Were I a bad actor, not necessarily as a screw-up, I could have convinced lay people/organizations they had a valid PQC implementation and then robbed them blind with a classical computer attack.
KATs and checksums weren't relevant in that use case.

D. J. Bernstein

unread,
Oct 26, 2019, 3:16:09 PM10/26/19
to pqc-...@list.nist.gov
Here's a new footgun data point announced this month, many current ECDSA
implementations leaking enough secret data through timing to be broken
by the 1999 Howgrave-Graham--Smart attack:

https://minerva.crocs.fi.muni.cz/

As another data point, back in June, Google announced that it had used
only 64 bits of randomness in Chromebook ECDSA signing, an exploitable
security flaw that evidently had passed all of Google's tests:

https://www.chromium.org/chromium-os/u2f-ecdsa-vulnerability

How as a community do we plan to respond to bugs similarly creating
security holes in implementations of post-quantum cryptography? Surely
we'll see more of these. After this thread began, round-1 Dilithium was
joined by round-2 Falcon as a known example of a NISTPQC candidate where
_all_ implementations were wrong in a way that destroyed security.

Here are two different answers:

* Categorically blame the implementors for any deviations from the
spec: e.g., the ECDSA spec required Google to generate 256 bits of
fresh randomness for the nonce; Google didn't; end of analysis.

* Ask what designers can do to predict and prevent these security
holes. This means broadening the scope of security analysis to also
account for errors in implementation processes.

The benefit of the second answer is clear, producing designs that are
less likely to fail in the real world, as illustrated by
https://blog.cr.yp.to/20191024-eddsa.html.

It's important to realize that this second answer makes security analysis
even harder, perhaps _much_ harder. On top of all the cryptographic
security issues, there's a massive literature on software errors, and
we've seen many examples of interactions between the details of the
cryptography and the details of how implementations are going to fail.

These difficulties don't make me think that the second answer is the
wrong answer. They _do_ make me think that an exceptional level of care
is required in the analysis. Even for the simple case of _correct_
implementations, there's already a clear risk that inadequate care in
security review will lead to post-standardization security disasters;
surely this risk increases when the scope is broadened to also cover
implementation errors.

Mike Hamburg writes:
> When it comes down to calling the same function but with different
> indices and/or lengths, or calling the same function in a loop that’s
> 3-long on one side and 4-long on the other side

Let's try an example. The Kyber software has the following
noise-generation calls in indcpa.c:

  for(i=0;i<KYBER_K;i++)
    poly_getnoise(skpv.vec+i, noiseseed, nonce++);
  for(i=0;i<KYBER_K;i++)
    poly_getnoise(e.vec+i, noiseseed, nonce++);
  ...
  for(i=0;i<KYBER_K;i++)
    poly_getnoise(sp.vec+i, coins, nonce++);
  for(i=0;i<KYBER_K;i++)
    poly_getnoise(ep.vec+i, coins, nonce++);
  poly_getnoise(&epp, coins, nonce++);

Each loop here is another chance for Dilithium-type security disasters.
It's easy to imagine, e.g., an implementor unrolling the loops for a
specific K but miscalculating the nonce offsets, or implicitly undoing
the nonce adjustments by putting the loops inside a subroutine. This is
beyond the chances for disasters in the "for(i=0;i<KYBER_N/8;i++)" loop
in cbd.c inside the polynomial layer.

For comparison, here are the analogous lines of code in cpapke.c in the
NewHope software:

poly_sample(&shat, noiseseed, 0);
...
poly_sample(&ehat, noiseseed, 1);
...
poly_sample(&sprime, coin, 0);
poly_sample(&eprime, coin, 1);
poly_sample(&eprimeprime, coin, 2);

No loops here! Fewer opportunities for an implementor to screw up. Again
there are chances for disasters in the "for(i=0;i<NEWHOPE_N/64;i++)"
loop inside the polynomial layer, but all of Kyber's matrix/vector loops
on top of this are gone---NewHope uses simply polynomials, not matrices
of polynomials.

In NewHope, eprime and eprimeprime are being sampled by the same
poly_sample() code from inputs (coin,1) and (coin,2). In Kyber, it is
simply not true that ep and epp are being sampled by the same code from
different inputs---the code for ep has an extra loop. Saying that this
code _could_ have been rewritten as, e.g.,

poly_getnoise_vec(ep.vec, KYBER_K, coins, nonce);
poly_getnoise_vec(&epp, 1, coins, nonce+KYBER_K);

is ignoring the reality that the code _wasn't_ written this way. The
bigger picture is that implementors often unroll short fixed-length
loops---especially loops of length 1.

(As a side note, I wonder what fraction of readers noticed that I didn't
add KYBER_K+1 to the nonce at the end of this rewrite, and what fraction
of readers checked whether this affects the correctness of the code.)

Given the code as it actually exists (and thinking through possibilities
for further implementations), one can reasonably say that a _subroutine_
is shared, but this sharing does _not_ imply that the overall chance of
error is the same, or even that it's close. One has to analyze the
chance of error inside the subroutine _and_ the chance of error outside
the subroutine.

> I think I’m biased because I often develop for hardware, and we don’t
> necessarily have randombytes() available within the simulator or
> FPGA. But it’s easy to run test cases through, so long as you’re
> checking input -> output mappings.

Build an implementation using randombytes(). Plug in a randombytes()
implementation that does "read this input". This converts checking KATs
into an example of "checking input -> output mappings". This has nothing
to do with what RNG APIs are "available within the simulator or FPGA";
it's another type of test for the original implementation.
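
A minimal sketch of that plumbing, with made-up names (the official KAT
harness does the same thing with a DRBG-based randombytes()):

  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  static const uint8_t *test_stream;
  static size_t test_stream_len, test_stream_pos;

  void randombytes_test_init(const uint8_t *stream, size_t len)
  {
    test_stream = stream; test_stream_len = len; test_stream_pos = 0;
  }

  /* In a KAT/simulator build, "randomness" is just the next bytes of the test input. */
  void randombytes(uint8_t *out, size_t outlen)
  {
    assert(test_stream_pos + outlen <= test_stream_len);
    memcpy(out, test_stream + test_stream_pos, outlen);
    test_stream_pos += outlen;
  }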

Of course there are ways that KATs can fail. For example, suppose an
implementor tries running a series of tests as follows:

  ( echo keygen; echo enc; echo dec ) \
  | while read operation
  do
    ssh masterdevice ./runtest $operation \
    | cmp - $operation.kat
  done

Will the implementor notice that this is actually testing only keygen,
and skipping the tests of enc and dec?

In exactly this situation, say there's a bug in seed expansion. With the
usual design choice of uncompressed seeds, the tests will catch these
bugs, because the bugs will affect the keygen output. With your design
choice of compressed seeds, the tests will _fail_ to catch these bugs,
because the bugs won't affect the keygen output.

Evidently _uncompressed_ seeds are safer in this situation. How do you
justify your claim that _compressed_ seeds are safer without arguing
that this situation is outweighed by something else? Structurally, your
argument imagines keygen being tested worse than dec, but doesn't even
consider the possibility of dec being tested worse than keygen, never
mind evaluating the probabilities of the different situations happening.

To be clear, I'm _not_ saying that everyone will _try_ to check KATs. On
the contrary, I think we have to consider implementors who are just
checking whether their software seems to work, and dealing with any
observed bugs, _without_ any known-answer tests. What these implementors
see is simply that random keypair-enc-dec returns the original message.

I also don't think it's safe to assume that one side of the enc-dec is
always a preexisting implementation. Often implementors are writing
applications where they simply have to interoperate with themselves---
consider, e.g., Apple's initial Curve25519 deployment in 2010---and then
why would they think that they need to test anything else? Apple
sensibly reused Adam Langley's public implementation of Curve25519, but
everything I'm saying about testing also applies to new implementations.

> This is turning into another debate where we agree on the basic points,
> and that in some axis A > B > C, but you write volumes of text on why
> B is very far from A

No. Beyond the concrete data points, you've made a pile of claims
regarding whether various bugs

* have a "slight" probability of appearing (how much is "slight"?),
* "mostly" can't happen (how often is "mostly"?),
* are caught by "minimal testing" (how much is "minimal"?),
* are caught by "reasonable tests" (which tests are "reasonable"?),
* are caught by "comprehensive tests" (how much is "comprehensive"?),
* come from "complex" subroutines (how do we judge "complex"?),
* are "hard to control" in specs (how do we judge "hard"?),

etc. The pervasive lack of clarity makes it unnecessarily difficult to
figure out what exactly you're claiming and what your evidence is, and
makes any underlying errors unnecessarily difficult to correct.

For example, in reference to "copying randomness where they should be
using new randomness", you claimed that this would be "unlikely to pass
reasonable tests". I asked whether the round-1 Dilithium tests were
"reasonable". You didn't answer. I can't figure out whether

* you think the round-1 Dilithium tests weren't "reasonable", or

* you think the tests were "reasonable" but the round-1 Dilithium
software security disaster was one of the supposedly "unlikely"
cases not caught by such tests. (How often is "unlikely"?)

Meanwhile you seem convinced that matrices of polynomials aren't any
more error-prone than single polynomials. But anyone who looks at the
code sees (unsurprisingly, given the specs) a matrix layer _and_ a
polynomial layer. The same mistake that destroyed the security of the
round-1 Dilithium software can happen in the polynomial layer of the
code _and_ can happen in the matrix layer of the code---so eliminating
the matrix layer means eliminating a chance for implementors to make
this mistake. I don't understand how you draw a different conclusion.

This isn't an "agree to disagree" situation. It's a "one side asking
for clarification and the other side refusing to clarify" situation.

> (which happens to be your design)

If these issues "happen" to look good for a design that I'm involved in,
perhaps that's because I've put quite a lot of thought into these issues
over the past twenty years, as documented in various talks and papers
starting long before NISTPQC. Perhaps this allocation of time, in turn,
comes from my seeing these issues as critical for the end users and
worthy of careful analysis.

Readers might also observe that there aren't many candidates using
matrices of polynomials; that you have one of these, modulo polynomials
vs. bigints; that your main reason for jumping into this discussion
seemed to be to deny that the added matrix layer increases the chance of
bugs; and that, when your position was challenged, you evaded questions
and resorted to issuing accusations of bias, dragging this thread into
the mud, which might discourage discussion here but in the end isn't
going to make security holes _less_ likely to happen.

> Furthermore, as usual you are replying to each one of my points with
> two or three of your own

Have you considered the possibility that the shorter analysis is
oversimplified?

As an analogy, recall that at one point you painted a simple picture of
bandwidth efficiency and tried to dismiss smooth parameter spaces as
"fine-tuning". Analyzing the actual quantitative information took much
more work (https://cr.yp.to/papers.html#paretoviz), showing among other
things that your picture was oversimplified and that smooth parameter
spaces have a _larger_ effect than various issues that you highlighted.

I would prefer to see you skip the ad-hominem attacks and respond to the
technical points. If you don't have time for a detailed analysis, simply
say so, rather than trying to make it sound as if people carrying out
more detailed analyses are doing something wrong.

> > > Also, you chose a particular reading of Peter’s claim that made it false.
> > > Consider that in the specification of ThreeBears, and also in each of my
> > > implementations of it, the short key elements and noise are sampled by
> > > calling the function “noise”.
> > So, instead of asking whether the short element and the noise are
> > sampled by "the same routine" as he wrote, you're asking merely whether
> > these two routines have _some_ overlap of code? This is content-free:
> > e.g., the SABER code that creates noise by rounding uses some of the
> > same CPU instructions as the code that creates short elements.
> This is ridiculous and you know it.

Peter wrote that "Most (?) LWE schemes use the same routine for sampling
this value and for sampling the noise." You're the one claiming that
this criterion means something different from

{short elements} = {key-noise vectors} = {ciphertext-noise vectors}

but then what exactly do you think it _does_ mean? Again the lack of
clarity makes analysis unnecessarily difficult.

> First of all, the full code used to sample the key and noise can’t be
> exactly the same, because at least during encapsulation it’s
> deterministic (and must be for FO), so it must at least take
> different seeds or indices.

Samplers are normally defined as taking random input. FO uses the same
samplers and feeds non-random inputs to them, but this doesn't mean FO
is part of the sampler.

I agree with the broader point that there's code around the sampling. I
already said as much, pointing to the _addition_ of noise as an example
of an extra step that can go wrong and that's eliminated by rounding.

> The fact that the same routine is used will decrease the probability
> of bugs, but it is still higher than for rounding because there are
> more call points.

We seem to agree that rounding designs are safer than noise designs from
the perspective of implementation bugs.

> This is what Peter was saying

Really? I don't see where Peter said, in this message or elsewhere, that
rounding designs are safer. More broadly, someone skimming this thread
will find

* 3 people seeming to say that rounding is safer,
* 2 people seeming to say that rounding is equally safe, and
* 1 person seeming to say that rounding is less safe

as if there were some sort of controversy about the results of the
existing analysis.

---Dan

Mike Hamburg

unread,
Oct 26, 2019, 9:53:58 PM10/26/19
to D. J. Bernstein, pqc-...@list.nist.gov
Hi Dan,

I don’t have time to reply to all your points, but here are responses
to some of them.

On Oct 26, 2019, at 12:15 PM, D. J. Bernstein <d...@cr.yp.to> wrote:

...

> Given the code as it actually exists (and thinking through possibilities
> for further implementations), one can reasonably say that a _subroutine_
> is shared, but this sharing does _not_ imply that the overall chance of
> error is the same, or even that it's close. One has to analyze the
> chance of error inside the subroutine _and_ the chance of error outside
> the subroutine.

Yes, this is what I said before.
“””
The fact that the same routine is used will decrease the probability
of bugs, but it is still higher than for rounding because there are
more call points. 
“”"


>> I think I’m biased because I often develop for hardware, and we don’t
>> necessarily have randombytes() available within the simulator or
>> FPGA. But it’s easy to run test cases through, so long as you’re
>> checking input -> output mappings.

> Build an implementation using randombytes(). Plug in a randombytes()
> implementation that does "read this input". This converts checking KATs
> into an example of "checking input -> output mappings". This has nothing
> to do with what RNG APIs are "available within the simulator or FPGA";
> it's another type of test for the original implementation.
>
> Of course there are ways that KATs can fail. For example, suppose an
> implementor tries running a series of tests as follows:
>
>   ( echo keygen; echo enc; echo dec ) \
>   | while read operation
>   do
>     ssh masterdevice ./runtest $operation \
>     | cmp - $operation.kat
>   done
>
> Will the implementor notice that this is actually testing only keygen,
> and skipping the tests of enc and dec?
>
> In exactly this situation, say there's a bug in seed expansion. With the
> usual design choice of uncompressed seeds, the tests will catch these
> bugs, because the bugs will affect the keygen output. With your design
> choice of compressed seeds, the tests will _fail_ to catch these bugs,
> because the bugs won't affect the keygen output.

What?  Yes it will.  A bug in seed expansion will cause the wrong public
key to be generated.

That said, obviously it is possible to engineer a scenario where an
otherwise safer solution leads to or masks a bug.  I am arguing that
in this case, seed expansion leads to better testability.  However, I
do not have time to continue to argue this at length.

> I also don't think it's safe to assume that one side of the enc-dec is
> always a preexisting implementation.

I did not assume this.  I explicitly left “testing with sender and receiver
on the same implementation” as the first (least effective) level of testing.

>> This is turning into another debate where we agree on the basic points,
>> and that in some axis A > B > C, but you write volumes of text on why
>> B is very far from A

> No. Beyond the concrete data points, you've made a pile of claims
> regarding whether various bugs
>
>   * have a "slight" probability of appearing (how much is "slight"?),
>   * "mostly" can't happen (how often is "mostly"?),
>   * are caught by "minimal testing" (how much is "minimal"?),
>   * are caught by "reasonable tests" (which tests are "reasonable"?),
>   * are caught by "comprehensive tests" (how much is "comprehensive"?),
>   * come from "complex" subroutines (how do we judge "complex"?),
>   * are "hard to control" in specs (how do we judge "hard"?),
>
> etc. The pervasive lack of clarity makes it unnecessarily difficult to
> figure out what exactly you're claiming and what your evidence is, and
> makes any underlying errors unnecessarily difficult to correct.

In this section of the discussion, a month and a half ago, I was
proposing a guideline for how one might estimate the likelihood
and impact of a bug, and spitballing how various bugs might fall
on that axis.

I am not claiming to have carefully evaluated any of those bugs,
and I have little to no evidence for my estimates.  I was merely
trying to provide a starting point for discussion about how bad
any particular “footgun” is.

> For example, in reference to "copying randomness where they should be
> using new randomness", you claimed that this would be "unlikely to pass
> reasonable tests". I asked whether the round-1 Dilithium tests were
> "reasonable". You didn't answer. I can't figure out whether
>
>   * you think the round-1 Dilithium tests weren't "reasonable", or
>
>   * you think the tests were "reasonable" but the round-1 Dilithium
>     software security disaster was one of the supposedly "unlikely"
>     cases not caught by such tests. (How often is "unlikely"?)

The round-1 Dilithium tests are “reasonable” at this point, but
they would not be “reasonable” for a scheme approaching
widespread deployment.

> Meanwhile you seem convinced that matrices of polynomials aren't any
> more error-prone than single polynomials.

On the contrary, I've been agreeing with you that matrices of polynomials
are riskier than single polynomials from a footgun point of view.  That’s
why I said that having more call points increases the risk of bugs, and why I
said that RLWE is safer than MLWE.  I just don’t think the difference is
very big.

> This isn't an "agree to disagree" situation. It's a "one side asking
> for clarification and the other side refusing to clarify" situation.

This isn’t our first argument where I have clearly agreed that your
point is valid, and you have construed this as vague, obstinate or
otherwise insufficient.

> Readers might also observe that there aren't many candidates using
> matrices of polynomials; that you have one of these, modulo polynomials
> vs. bigints; that your main reason for jumping into this discussion
> seemed to be to deny that the added matrix layer increases the chance of
> bugs; and that, when your position was challenged, you evaded questions
> and resorted to issuing accusations of bias, dragging this thread into
> the mud, which might discourage discussion here but in the end isn't
> going to make security holes _less_ likely to happen.

That isn’t what happened at all, Dan.  I joined the thread because I
thought that you were being unfair to Peter.  I then spitballed a “footgun”
risk assessment of various features that was *unfavorable* to my own
design in several axes, *including this one*.  Nor have I been evasive.

I’m sorry to have accused you of bias, but it’s just really frustrating,
OK?  You seem never to be satisfied until every disadvantage of
others’ schemes is admitted to be a severe danger, and every
disadvantage of your scheme is admitted to be actually an advantage.
Any attempt to argue otherwise is met by twice the volume of
aggressive argument, and any attempt to pare down the discussion
you treat as evasion.  This is a repeated pattern here, and I think
it discourages participation in this forum.  It certainly makes me more
hesitant to reply on any thread where you have commented.

I don’t mean to use the above paragraph to establish that your points
are wrong.

>> Furthermore, as usual you are replying to each one of my points with
>> two or three of your own

> Have you considered the possibility that the shorter analysis is
> oversimplified?
>
> As an analogy, recall that at one point you painted a simple picture of
> bandwidth efficiency and tried to dismiss smooth parameter spaces as
> "fine-tuning".

That’s not what happened.  I wrote:
“””
… Fine-tuning is a useful feature, especially when you have
sharp bandwidth and security requirements.  But ThreeBears and
Saber and especially LAC are also inherently more bandwidth-
efficient than much of the field (using current sieve or enumeration
or hybrid estimates).  So use cases with sharp bandwidth constraints
are also likely to be favorable to these systems, even more favorable
to Round5, and unfavorable for NewHope regardless of CPA vs CCA
or tuning gap.
“”
That’s not being dismissive.

> Analyzing the actual quantitative information took much
> more work (https://cr.yp.to/papers.html#paretoviz), showing among other
> things that your picture was oversimplified and that smooth parameter
> spaces have a _larger_ effect than various issues that you highlighted.

That paper doesn’t analyze smooth parameter spaces.  Furthermore,
it shows that NTRU LPRime has an advantage against ThreeBears,
by *one bit* of estimated core-sieve security, for 36% of the
parameters in the range [BabyBear … MamaBear).  This range is
favorable to NTRU LPRime: cutting off at MamaBear’s 1307-byte
ciphertext size maximises the ratio.


This comparison would not be improved — except for the “by one bit”
part — by further exploring the smooth parameter space of NTRU
LPRime, because it is in some sense already tuned optimally for this
comparison (where it wins, it wins by the minimum amount).  It would
be made worse if ThreeBears were allowed to be tuned as well, because
a parameter set with D=260, d=3 would have around 1103 bytes of
ciphertext and likely more core-sieve security (my old spreadsheets
suggest 197 bits, but that was with an old version of the estimator)
than any NTRU LPRime instance in that graph.

The comparisons against LAC and SABER are even less favorable to
NTRU LPRime.

So, I don’t see how that paper supports your point.

> I would prefer to see you skip the ad-hominem attacks and respond to the
> technical points. If you don't have time for a detailed analysis, simply
> say so, rather than trying to make it sound as if people carrying out
> more detailed analyses are doing something wrong.

I apologize for the ad-hominem attack.  I know that you propose
designs because you think (and you are an expert) that those
design principles are correct.

However, please understand that the way you argue is very irritating.
It’s one thing not to admit when you’re wrong, but it’s even more
annoying to press your opponents on every minor point in favor of
your scheme, even when they mostly agree with you on those points.

As another example, you are beating up on “matrices of polynomials”
and using the Dilithium and Falcon bugs to push your points.  But those
bugs are not related to matrices of coefficients, nor would they be fixed
by rounding.  Furthermore, you have brushed off comments about
fixed-weight sampling as being risky, but since then I found timing leaks
in two schemes’ fixed-weight sampling implementations; only one of
them is fixed, and only in an “alternative” implementation.

> > > > Also, you chose a particular reading of Peter’s claim that made it false.
> > > > Consider that in the specification of ThreeBears, and also in each of my
> > > > implementations of it, the short key elements and noise are sampled by
> > > > calling the function “noise”.
> > > So, instead of asking whether the short element and the noise are
> > > sampled by "the same routine" as he wrote, you're asking merely whether
> > > these two routines have _some_ overlap of code?  This is content-free:
> > > e.g., the SABER code that creates noise by rounding uses some of the
> > > same CPU instructions as the code that creates short elements.
> > This is ridiculous and you know it.

> Peter wrote that "Most (?) LWE schemes use the same routine for sampling
> this value and for sampling the noise." You're the one claiming that
> this criterion means something different from
>
>   {short elements} = {key-noise vectors} = {ciphertext-noise vectors}
>
> but then what exactly do you think it _does_ mean? Again the lack of
> clarity makes analysis unnecessarily difficult.

I simply didn’t read Peter’s claim as being quite so mathematically precise.
Rather, I read him as saying that in typical implementations of these schemes,
the bulk of the work will be done by calling — or one might say, “using” — a
common subroutine.

I believe that because of this, the difference in probability of bugs in the noise
sampling calls between MLWE and RLWE is not very large.  It is possible that
“not very large” is so vague as to be “content-free”, but the statement that it
matters that the bulk of the code is shared is not content-free.

But what’s even the point of this?  Why is it so important, at this point in the
discussion, whether Peter’s mathematically imprecise statement from a
month ago is technically incorrect?

— Mike

Vadim Lyubashevsky

unread,
Oct 27, 2019, 12:06:53 AM10/27/19
to pqc-forum

> If these issues "happen" to look good for a design that I'm involved in,
> perhaps that's because I've put quite a lot of thought into these issues
> over the past twenty years, as documented in various talks and papers
> starting long before NISTPQC. Perhaps this allocation of time, in turn,
> comes from my seeing these issues as critical for the end users and
> worthy of careful analysis.

We should be careful not to put the cart before the horse here.  The discussion about adding noise vs. rounding seems to have completely ignored the fact that rounding is a newer (and arguably less-natural) assumption that does not directly follow from (Generalized)-LWE.  I am not claiming that there is an attack just waiting to be found, but I have seen enough deviations from "classic" (Generalized)-LWE assumptions to be wary of making changes for small gains. Since you're mentioning the twenty years that you've spent thinking about secure implementations, you should also allow that others (not just me) have spent a similar amount of time thinking about secure designs of lattice primitives.  And if the implementation security advantage of LWR over LWE is that the former requires generating one set of random numbers while the latter requires generating two, it seems like a very small epsilon gain compared to the fact that one is making a newer, less-natural assumption.


> Each loop here is another chance for Dilithium-type security disasters.

This is at least the fifth email in which you refer to the Dilithium "disaster". To quote Inigo Montoya: "You keep using that word.  I do not think it means what you think it means."  The definition of "disaster", according to Lexico, is either a "sudden accident or a natural catastrophe that causes great damage or loss of life" or "an event or fact that has unfortunate consequences".  I will give you the benefit of the doubt and assume that you were going with the second, less dramatic, definition.

But I am quite sure that the bug in Dilithium (where a variable was overwritten before being used) or, for that matter, a bug in any Round 1 submission (and there were bugs in many submissions) did not beget any unfortunate consequences since all the schemes were at least 2 rounds and 7 years away from being used (alone) in any application.   It would have of course been nice if everyone had their reference implementations written three months before the deadline and then did rigorous testing, but I imagine that because NIST implied that they didn't really care about implementation bugs, many teams were optimizing their code until the very end.  I am not making excuses for our bug, but simply stating that one should probably not be drawing any general conclusions about security engineering based on code which wasn't meant to be used in secure applications.   
 
Best,

Vadim
