Really fast NTRU

602 views
Skip to first unread message

Gregor Seiler

unread,
Jan 17, 2019, 6:09:22 AM1/17/19
to pqc-...@list.nist.gov, Vadim Lyubashevsky
Dear all,

To the best of our knowledge, there are no lattice-based KEM submissions
that use NTTs over rings Z[X]/(f) where the degree of f is not a power
of 2. Since it has sort of been accepted by now that for security we
want the degree of f to be between 700 and 800, this rules out using NTT
for “optimally-set” parameters. Schemes based on generalized-LWE (like
Kyber) get around this issue by working with a random matrix of
256-degree polynomials, but NTRU schemes cannot do this without
increasing their public key size. Since NTRU-like schemes don’t have
many parts to them other than polynomial multiplication / division
(unlike LWE-based schemes that also require sampling uniform polynomials
in the ring which takes more time than actual NTT multiplication),
optimizing this part - especially the division - could lead to a really
efficient KEM.

So to test the potential of an NTT-based NTRU, we have implemented an
NTT in dimension 768 and built a complete CCA-secure constant-time KEM
around it. Our AVX2-optimized KEM (which we called NTTRU) is faster than
the AVX2-optimized NTRU-HRSS implementation by a factor of 30 in key
generation (incorporating the newest speed-ups announced by Dan), 5 in
encapsulation, and 8 in decapsulation. Concretely it only needs about
6.4K cycles for key generation, 6.1K cycles for encapsulation and 7.9K
cycles for decapsulation. All numbers were measured on a Skylake laptop
(decapsulation is not yet as optimized as the rest). The NTT itself over
the ring Z_q[X]/(X^768-X^384+1) for q=7681 only needs 810 cycles.

Our NTT in dimension 768 can also easily be used do define versions of
other NTT-based Ring-LWE encryption schemes (e.g. NewHope, LIMA, etc.)
to work over non-power-of-2 rings, and some of our internal
optimizations should also be applicable to schemes that only need to use
power-of-2 NTTs (e.g. Kyber).

So our conclusion is that if you’re OK with using cyclotomic rings with
splitting moduli (e.g. NTRU Prime purposefully avoids them, while
NTRU-HRSS is over such a cyclotomic ring), then we believe that doing
multiplication via NTT is absolutely the way to go – especially for
NTRU-like schemes because they don’t have many other expensive parts and
their key generation requires division, which is much more expensive
without NTT.

For more details, the source code is at
https://github.com/gregorseiler/NTTRU and the paper should be on eprint
in the coming days.

Regards,
Gregor and Vadim

Nigel Smart

unread,
Jan 17, 2019, 7:09:43 AM1/17/19
to pqc-...@list.nist.gov
See the "safe-prime" version of LIMA...

Nigel
signature.asc

Leo Ducas

unread,
Jan 17, 2019, 8:18:33 AM1/17/19
to pqc-forum, vadim....@gmail.com, gse...@inf.ethz.ch
Isn't there a similar implementation of an NTT in dimension 768 in Falcon ?
-- L

Thomas Prest

unread,
Jan 17, 2019, 9:04:11 AM1/17/19
to Leo Ducas, pqc-forum, vadim....@gmail.com, gse...@inf.ethz.ch

Yes, there is an implementation of the NTT in almost the same ring as mentioned by Gregor (see https://falcon-sign.info/impl/falcon-vrfy.c.html, function mq_NTT_ternary).

Thomas

--
You received this message because you are subscribed to the Google Groups "pqc-forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pqc-forum+...@list.nist.gov.
Visit this group at https://groups.google.com/a/list.nist.gov/group/pqc-forum/.

Gregor Seiler

unread,
Jan 17, 2019, 9:54:35 AM1/17/19
to pqc-forum
Hi all,

@Leo, Thomas: Yes, our Falcon signature scheme (not a KEM) uses a very
similar NTT over a larger prime, but there is no optimized AVX2
implementation (yet). In fact, the NTT implementation in NTTRU can now
be almost directly ported to FALCON.

@Nigel: Sorry for the oversight of not mentioning the LIMA safe-prime
version. But the algorithms are different. When we say NTT, we mean an
algorithm that is in a sense "native" to the ring. Our NTT is different
from the Bluestein algorithm in LIMA in that it really does not do more
work than you would do in an optimal power-of-two NTT, apart from a few
additions. We do exactly the same 8x384 multiplications to reduce a
polynomial of degree under 768 to polynomials of degree under 3 as you
would do if 768 were a power of two.

And just to avoid any misunderstandings about our claims -- we are not
saying that we invented a new NTT technique. Our main objective was to
create an as-fast-as-possible KEM using AVX2 optimizations, and a few
small tricks particular to the kind of ring we use. And the result is
that NTRU can be made very fast when it is instantiated over an
NTT-friendly ring.

Regards,
Gregor and Vadim
>> <mailto:pqc-forum+...@list.nist.gov>.

Leo Ducas

unread,
Jan 17, 2019, 11:05:19 PM1/17/19
to pqc-forum, gse...@inf.ethz.ch

And just to avoid any misunderstandings about our claims -- we are not
saying that we invented a new NTT technique.

That's not how it read, I was merely pointing at a point of reference to compare to, and a potential application of you very fast code.

Best
-- Leo

D. J. Bernstein

unread,
Jan 19, 2019, 9:57:31 AM1/19/19
to pqc-...@list.nist.gov
Gregor Seiler writes:
> The NTT itself over the ring Z_q[X]/(X^768-X^384+1)
> for q=7681 only needs 810 cycles.

This is fast enough that it should also save time in implementations of
NIST submissions that use more conservative rings. Multiplication uses 3
FFTs (and fewer if some FFTs can be cached); the usual reduction from
arbitrary polynomials to FFT-friendly polynomials sacrifices a factor of
2 (e.g., multiply modulo x^761-x-1 by multiplying modulo an FFT-friendly
polynomial of degree 1536); the usual reduction from arbitrary primes to
FFT-friendly primes sacrifices a factor of 2 when one input is small (as
it is in NTRU), or 3 in general. The FFT-friendly polynomials don't have
to be irreducible in this context: e.g., one can use the "truncated FFT"
polynomial (x^1024+1)(x^512-1) modulo 12289 and 18433. This will all
adapt straightforwardly to other polynomial lengths and coefficient
sizes, and to big-integer multiplication.

Comments below are regarding your new cryptosystem.

> a really efficient KEM

But your paper says 1248-byte ciphertexts, compared to (e.g.) 1152 bytes
for kyber768, 1088 bytes for saber, 1047 bytes for sntrup4591761.

Typically sending or receiving a byte costs at least three orders of
magnitude more than a clock cycle. Taking bytes+cycles/1000 for
sntrup4591761 gives 1047+45 = 1092 for the sender, 1047+94 = 1141 for
the receiver, which is better than 1248 no matter how few cycles you
use. Of course a full analysis would also have to study the security
level, which leads to many open questions, but presumably your security
loss from a larger q outweighs the slight difference in dimension etc.

Have you considered Rounded NTRU as a smaller alternative to Noisy NTRU?
How about a smaller q, and fixed weight instead of variable weight?

One of the reasons for upwards pressure on your q is that you're using
the old idea of taking f as 1+3*small instead of small. Your paper says
that (\Z/3)[x]/(x^768-x^384+1) doesn't support an NTT, but there are
many fast multiplication methods---for example, as in the HRSS software,
you could simply reuse your (\Z/q)[x]/(...) multiplier. As for division,
that's only an issue for keygen, and it's going to be quite fast anyway
since x^768-x^384+1 factors mod 3 as (x^64+x^32-1)^6 (x^64-x^32-1)^6.

> if you’re OK with using cyclotomic rings with
> splitting moduli (e.g. NTRU Prime purposefully avoids them, while NTRU-HRSS
> is over such a cyclotomic ring), then we believe that doing multiplication
> via NTT is absolutely the way to go

Your polynomial x^768-x^384+1 is irreducible over \Z but has 2^256 monic
divisors (and more non-monic divisors) modulo the prime 7681 that you're
using. Have you at least analyzed the impact that this has upon security
against _known_ NTRU attacks? For example:

* Do you know how small the coefficients can be in divisors of
degree, say, 384? Sufficiently small coefficients would seem to
open up the ancient NTRU attack strategy from, e.g., 2001 Gentry.

* A multi-target attack can scan for keys that reduce to 0 modulo any
of your cubic factors. Doesn't this happen more than once in a
billion keys, allowing a slightly faster lattice attack since g is
then constrained to a lattice of slightly smaller dimension?

You draw an analogy to NTRU-HRSS, but NTRU-HRSS uses a polynomial
(x^701-1)/(x-1) that's irreducible mod 2. The actual modulus there is a
larger power of 2, so there are _some_ quotient rings to worry about,
but not your exponential collection of quotient rings. (Of course you
can point to, e.g., New Hope also having an exponential collection of
quotient rings, but this doesn't answer the security-analysis question.)

I agree that NTTRU is analogous to NTRU-HRSS in the use of cyclotomics:
there are many automorphisms (and quite a few subfields) to worry about.
Here's a larger NTTRU/HRSS/Prime comparison table regarding five NTRU
Prime features designed to reduce attack surface:

NTTRU HRSS Prime feature
----- ----- ----- -------------------------------------------
no yes yes avoid decryption failures
yes yes yes irreducible polynomial (over \Z)
no no yes avoid nontrivial number-field automorphisms
yes no yes prime modulus q
no no yes the ring is a field

The last two "no" entries for NTTRU are inherent in this use of NTT,
but the first "no" would be very easy to fix.

---Dan
signature.asc

Vadim Lyubashevsky

unread,
Jan 19, 2019, 5:30:45 PM1/19/19
to pqc-...@list.nist.gov

Hi Dan, all, 

 Multiplication uses 3
FFTs (and fewer if some FFTs can be cached); the usual reduction from
arbitrary polynomials to FFT-friendly polynomials sacrifices a factor of
2 (e.g., multiply modulo x^761-x-1 by multiplying modulo an FFT-friendly
polynomial of degree 1536); the usual reduction from arbitrary primes to
FFT-friendly primes sacrifices a factor of 2 when one input is small (as
it is in NTRU), or 3 in general.

When you say "the usual reduction", do you mean setting the FFT-friendly prime q large enough so that no reduction modulo q takes place?  If that's the case and your original prime is p and your original degree is d, then you need q to be at least > pd (and that's if you're multiplying by -1/0/1 polynomials).  Gregor can mention much more about this, but I believe that for such large q ( > 16 bits), some of his AVX2 tricks used in NTTRU don't directly carry over.  But anyway, it would certainly be nice if this helps others.          

> a really efficient KEM
 
Of course a full analysis would also have to study the security
level, which leads to many open questions, but presumably your security
loss from a larger q outweighs the slight difference in dimension etc.

We were just comparing to NTRU-HRSS.  Ignoring the ring structure, our dimension is higher and modulus is lower, so it should be at least as hard.
 
Have you considered Rounded NTRU as a smaller alternative to Noisy NTRU?

We wanted to include as few extra assumptions as possible and make things as close as possible to NTRU-HRSS.  But yes, that should reduce the ciphertext size.       
 
How about a smaller q, and fixed weight instead of variable weight?

In order to be able to do NTT, the only smaller compatible q that I see below 2^12 is 3457 ... it's certainly an option.  Using fixed weight is harder to sample and harder to calculate the decryption error.

One of the reasons for upwards pressure on your q is that you're using
the old idea of taking f as 1+3*small instead of small. Your paper says
that (\Z/3)[x]/(x^768-x^384+1) doesn't support an NTT, but there are
many fast multiplication methods---for example, as in the HRSS software,
you could simply reuse your (\Z/q)[x]/(...) multiplier. As for division,
that's only an issue for keygen, and it's going to be quite fast anyway
since x^768-x^384+1 factors mod 3 as (x^64+x^32-1)^6 (x^64-x^32-1)^6.

This would slow things down, but it is something that could be a viable trade-off if we reduce q to 3457 and thus save 768 bits in the public key / ciphertext.  

Now is a good place to discuss decryption errors (since you bring it up at the very end as a way to differentiate schemes).  I am really confused as to why you draw this solid line between 0 decryption error and >0 decryption error.  There are actual reductions e.g. [HHK17] that take small decryption errors into account (though admittedly, [HHK17] is not too convenient to accurately use with NTRU - but is perfectly fine for Regev-type schemes). And if your decryption error e is such that e<< 1/|M|, where M is the message space, then information theoretically, only a negligible fraction of keys will have any decryption errors in your CCA scheme (assuming you believe that SHA acts as a random function).  So you are just as secure as without decryption errors.  

In the NTTRU paper, we also give a simple transformation that reduces the message space of the scheme (since in NTRU schemes, the message space is larger than 256 bits) so you can apply the information theoretic argument.  You mentioned in a previous email that decryption errors may be hard to calculate as a reason to completely avoid them -- but in NTRU schemes where each coefficient is distributed independently, this is not the case at all (and if we made a mistake, it should be easy to catch, so this is not a reason to avoid more efficient schemes).  So I would even suggest that other schemes, like NTRU-HRSS, consider reducing their modulus to get decryption errors and (possibly) use the transformation in NTTRU to provably handle decryption errors.  This both reduces the ciphertext / public key and increases security due to the smaller modulus.  I can sort of understand if you complain if schemes having 2^-80 decryption error claim 128-bit security, but I don't understand why the mere presence of any decryption error is automatically a strike against the scheme. 
 
Back to q=3457 ... I calculate that if we use f=3f'+1, then the decryption error will be around 2^-260.  I could personally live with that, but our security proof doesn't work anymore.  We mention as an open problem that there might be a way to get a better bound for our transformed scheme using ideas from [HHK17] - but I am not completely sure about this.  If it's true, then one should almost definitely go for the smaller prime.  Otherwise, an option would be, as you said, to not use f=3f'+1 and do divisions as in NTRU-HRSS.  Without the extra multiplication by 3, the decryption error goes down to around 2^-460, so would be provably fine to use.


Your polynomial x^768-x^384+1 is irreducible over \Z but has 2^256 monic
divisors (and more non-monic divisors) modulo the prime 7681 that you're
using. Have you at least analyzed the impact that this has upon security
against _known_ NTRU attacks? For example:

   * Do you know how small the coefficients can be in divisors of
     degree, say, 384? Sufficiently small coefficients would seem to
     open up the ancient NTRU attack strategy from, e.g., 2001 Gentry.

I am not discounting your concerns, but I do not want to get into a discussion about attacks that don't exist.  My personal view, for what it's worth, is that all splittable cyclotomic rings are fine, if one uses a large-enough secret / noise.  If the noise is small (i.e. 0/1 as in all NTRU schemes), then I would feel better if more people studied the particular rings -- but this is both a function of the small noise and the extra algebraic structure, and I am not sure which part individually should be of more concern (this is one of the reasons why we use larger errors in Kyber).  The reason that I am a fan of rings with algebraic structure is that, in addition to added efficiency and elegance, they allow more interesting advanced primitives.  You can't do useful FHE in some rings and we're working on much improved zero-knowledge proof in which complete splitting is crucial.  So it would be a huge shame if such rings are actually fine to use, but are discounted without much further study because of "paranoia". 

To have larger secret / error in NTRU schemes, I would even propose to consider working with a noise that is larger than -1,0,1, but still reduce mod 3.  (So the problem is to recover the message mod 3 rather than the message itself).  If you generate the errors as a binomial, then this new problem is provably at least as hard as the former, but there is almost certainly more practical security since the lattice problem is now harder. 

   * A multi-target attack can scan for keys that reduce to 0 modulo any
     of your cubic factors. Doesn't this happen more than once in a
     billion keys, allowing a slightly faster lattice attack since g is
     then constrained to a lattice of slightly smaller dimension?

I don't see the improved attack.  If f is 0 in some NTT coefficient, then you only get knowledge of one *NTT coefficient* of g.  Unless you're doing exhaustive search, I don't know of any way that knowing an NTT coefficient can help in lattice reductions.
 
   NTTRU HRSS  Prime feature
   ----- ----- ----- -------------------------------------------
   no    yes   yes   avoid decryption failures
   yes   yes   yes   irreducible polynomial (over \Z)
   no    no    yes   avoid nontrivial number-field automorphisms
   yes   no    yes   prime modulus q
   no    no    yes   the ring is a field

The last two "no" entries for NTTRU are inherent in this use of NTT,
but the first "no" would be very easy to fix.

No fix needed :)

-Vadim
 

---Dan


--
You received this message because you are subscribed to the Google Groups "pqc-forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pqc-forum+...@list.nist.gov.

Gregor Seiler

unread,
Jan 20, 2019, 1:08:16 AM1/20/19
to pqc-...@list.nist.gov
Hi all,

On 19.01.19 23:30, Vadim Lyubashevsky wrote:
>
> Hi Dan, all,
>
>  Multiplication uses 3
> FFTs (and fewer if some FFTs can be cached); the usual reduction from
> arbitrary polynomials to FFT-friendly polynomials sacrifices a factor of
> 2 (e.g., multiply modulo x^761-x-1 by multiplying modulo an FFT-friendly
> polynomial of degree 1536); the usual reduction from arbitrary primes to
> FFT-friendly primes sacrifices a factor of 2 when one input is small (as
> it is in NTRU), or 3 in general.
>
>
> When you say "the usual reduction", do you mean setting the FFT-friendly
> prime q large enough so that no reduction modulo q takes place?  If
> that's the case and your original prime is p and your original degree is
> d, then you need q to be at least > pd (and that's if you're multiplying
> by -1/0/1 polynomials).  Gregor can mention much more about this, but I
> believe that for such large q ( > 16 bits), some of his AVX2 tricks used
> in NTTRU don't directly carry over.  But anyway, it would certainly be
> nice if this helps others.

I think what Dan means is doing Chinese remaindering modulo 2 or 3
different primes.

Cheers,
Gregor

Vadim Lyubashevsky

unread,
Jan 20, 2019, 1:28:46 AM1/20/19
to Gregor Seiler, pqc-...@list.nist.gov

I think what Dan means is doing Chinese remaindering modulo 2 or 3
different primes.

Ah, that makes sense.  

And about Dan's "Rounded NTRU" idea to reduce ciphertext size, I guess one could incorporate it even without making the extra assumption - i.e. one could add a random message as per the usual NTRU and also drop bits from the ciphertext.  The hard problem is then to recover the message + extra dropped bits mod 3, but that should be ok security-wise and should reduce the ciphertext a bit. 

Cheers,
Vadim



Cheers,
Gregor

D. J. Bernstein

unread,
Jan 20, 2019, 1:06:26 PM1/20/19
to pqc-...@list.nist.gov
Vadim Lyubashevsky writes:
> When you say "the usual reduction", do you mean setting the FFT-friendly
> prime q large enough so that no reduction modulo q takes place?

https://www.ams.org/journals/mcom/1971-25-114/S0025-5718-1971-0301966-0/S0025-5718-1971-0301966-0.pdf

The multiplication method needs a large enough modulus but doesn't need
the modulus to be irreducible. Pollard writes "It is possible to replace
the one possibly very large prime p required by a number of smaller
primes", and gives an example where he chooses three word-size primes.

Two word-size primes are enough in the typical NTRU context. I already
gave the example of M = (x^1024+1)(x^512-1) modulo 12289 and 18433. Each
multiplication in (\Z/4591)[x]/(x^761-x-1) reduces efficiently to an
NTT-friendly multiplication in (\Z/12289)[x]/M and an NTT-friendly
multiplication in (\Z/18433)[x]/M, since one input is small.

Consequently, I expect your work to end up saving more cycles for
sntrup4591761 and ntrulpr4591761 than it has saved for similarly sized
NTT-friendly rings. This should make the NTRU-Prime-vs.-NTT cycle gap
even less noticeable than it was before, compared to the primary cost
incurred by all these systems, namely bytes transmitted.

> Using fixed weight is harder to sample and harder to calculate the
> decryption error.

Sorting int32[768] takes 6512 cycles with verified code, and it's
trivial to choose the weight so that there are no decryption failures.

> I am really confused as to why you draw this solid line between
> 0 decryption error and >0 decryption error. 

eprint 2019/043 appeared a few days ago, saying that ss-ntru-pke "is
attacked with complexity below the claimed security level". Recall that
ss-ntru-pke is part of the original NTRUEncrypt submission; it was
claimed to be "extremely conservative" and provably secure. The attack
exploits decryption failures.

How many lines of failure-probability calculations has this competition
been throwing at security reviewers? And how many lines of alleged
proofs that the failures are too rare to cause trouble? How many
mistakes and security losses do we need to see before deciding that this
minor optimization within lattice-based cryptography is a bad idea?

The important metric here is the complexity of a comprehensive review of
security. This varies considerably from one system to another, and it's
directly connected to the risk of a breakable system being standardized
and/or deployed. See my email dated 8 Dec 2018 04:17:14 -0000 for
analysis and more examples, such as OCB2.

[ quotient rings ]
> I am not discounting your concerns, but I do not want to get into a
> discussion about attacks that don't exist.

Reducing the NTRU key equation modulo a divisor of the polynomial, and
then applying lattice attacks to the reduced equation, is a well-known
attack strategy that breaks some rings. See, e.g., 2001 Gentry.

The quantitative impact depends on the shape of the divisor (at least
the positions and sizes of nonzero coefficients). I don't see how you
can claim security against _known_ attacks if you haven't analyzed the
shape of divisors of x^768-x^384+1 modulo 7681.

How effective are lattice attacks after reduction mod 45x^384+56, for
example? The coefficients here look projectively quite a bit smaller
(compared to 7681) than the coefficients that Jacob was trying for LAC
(compared to 251). How about reduction mod 5x^192+13x^96-5?

Of course there's also a much more recent line of attacks exploiting
automorphisms, subfields, etc., most recently (2018) breaking through
the 2^sqrt approximation-factor barrier. It would hardly be a surprise
if the next few years of developments also exploit divisors. I don't
understand why anyone would recommend giving the attacker an arsenal of
attack tools that have produced attacks at SODA 2016, Eurocrypt 2017,
etc., and that we know how to eliminate at low cost. But, even if you
don't agree, surely you have to analyze whether your system survives the
specific attack strategies that have broken some earlier systems.

> You can't do useful FHE in some rings and we're working on much
> improved zero-knowledge proof in which complete splitting is crucial. 

Assume that performance constraints force NIST to standardize a small
lattice-based system. How do you imagine that we're going to get from
the current situation, with security flaws being announced in 2019 in
"provably secure" lattice systems, to a situation where we can actually
be confident in the security of what NIST standardizes? Do you think
that adding extra desiderata such as FHE is going to help this process?

---Dan
signature.asc

Michael Hamburg

unread,
Jan 20, 2019, 3:02:41 PM1/20/19
to pqc-...@list.nist.gov

On 1/20/19 10:06 AM, D. J. Bernstein wrote:
> Vadim Lyubashevsky writes:
>> I am really confused as to why you draw this solid line between
>> 0 decryption error and >0 decryption error.
> eprint 2019/043 appeared a few days ago, saying that ss-ntru-pke "is
> attacked with complexity below the claimed security level". Recall that
> ss-ntru-pke is part of the original NTRUEncrypt submission; it was
> claimed to be "extremely conservative" and provably secure. The attack
> exploits decryption failures.
>
> How many lines of failure-probability calculations has this competition
> been throwing at security reviewers? And how many lines of alleged
> proofs that the failures are too rare to cause trouble? How many
> mistakes and security losses do we need to see before deciding that this
> minor optimization within lattice-based cryptography is a bad idea?
>
> The important metric here is the complexity of a comprehensive review of
> security. This varies considerably from one system to another, and it's
> directly connected to the risk of a breakable system being standardized
> and/or deployed. See my email dated 8 Dec 2018 04:17:14 -0000 for
> analysis and more examples, such as OCB2.

Hi Dan,

I'm not sure how ePrint 2019/043 supports your point, except in that
writing and reading such a paper takes work.  Perhaps I am misreading
this paper, but as I understand it, it proposes using 2^127
chosen-ciphertext queries (i.e. almost the square of NIST's evaluation
point) to 2^64 public keys, plus 2^217 work, to recover one key in a
system that has 2^-80 failure probability.  While the paper proposes a
concrete attack, most authors have been assuming that attacks of this
form can be much stronger, and have set their failure probabilities much
lower than 2^-80.  So the actual content of the paper does not detract
from Vadim's point that sufficiently smaller failure probabilities are safe.


Overall, my position is somewhere between Dan's and Vadim's.  In sum:

* If conservative parameters are chosen (say 2^-160 and proper domain
separation, a la Kyber), known CCA reaction attacks plus likely
improvements will not lead to a certificational weakness in the form of
an attack that works with high probability.

* There is some question on whether an attack counts as a
certificational weakness if it takes little effort but is extremely
unlikely to work.  Such attacks are harder to rule out.

* If conservative parameters are chosen, even significant advances in
CCA reaction attacks are unlikely to make those attacks feasible.

* For a given scheme without error correction, it is not very difficult
to quantify the above factors, but it is time-consuming to do it for all
schemes.

* Exactly how conservative the parameters should be is still a matter of
debate, and that debate is a time sink.  The most conservative failure
probability, and the easiest to evaluate, is zero.  It may or may not be
worth the efficiency cost.

* Schemes with error correction, such as my own, are more efficient but
compound the difficulty of analysis.



Details:

As far as I am aware, all the proposed reaction attacks on FO-protected
systems work as follows:

* Compute lots of ciphertexts, probably encrypted to many different keys.

** If no domain separation, you can compute encryptions to all keys at once.

* Evaluate, somehow, those which are most failure-prone. Send them to
would-be victims.

** No attack has been proposed which can detect weak keys, or which can
use information from the public key to estimate failure probability of
the ciphertext.  Rather they estimate failure against an unknown secret
drawn from the distribution of private key noise.

* If you see a decryption failure, somehow use that information to
attack that private key.

** Probably this involves repeating the above steps, but you have new
information about the key, and it may be a weak key, so the next
iterations may be faster.

While recent papers have attempted to quantify the practical complexity
of an attack of this form, none of them has improved on this procedure
for the effort required to get the first failure. For a scheme without
error correction, it is not difficult to rigorously compute both the
failure probability and the amount of work required to find the first
failure with a given query limit. However, the first failure can happen
in a few queries with tiny probability, which makes it hard to rule out
a certificational weakness if you consider attacks that usually don't
work.  Also it does take appreciably more work than proving a lack of
decryption failures, so I agree that perfect correctness is a useful
analysis feature.


There is also a risk that someone will discover an advanced attack that
detects weak keys, and/or can estimate failure probability using the
public key.  In the ROM, such an attack still cannot be more efficient
than a search for a failure, which gives advantage at most:

Expectation over collections of K honestly-generated keys of [max over
all messages m of [pr(fail given K,m)]] * (attacker's work)

where the inner "max over all messages m" is generally irrelevant, and
the inner pr(fail) is taken over the coins of the ciphertext.  Again,
this value is not difficult to compute or at least to bound for most
schemes, but it is significant work to examine every scheme to compute
it.  If the probability of failure is set at, eg, 2^-160, then such an
advanced attack could become a certificational weakness if the scheme is
designed for 2^256 security, especially with Grover's algorithm.  The
weakest key of a collection of 2^64 keys might have 2^-140 failure
probability in expectation, which would then entail 2^76 quantum work
with maxdepth 2^64 to find a failure.  Note however that this assumes
large cryptanalytic advances and is still probably infeasible for at
least 40 years, since maxdepth 2^64 is highly optimistic and there is a
large constant out front for reversibility, refrigeration etc.

Cheers,

-- Mike Hamburg

Vadim Lyubashevsky

unread,
Jan 20, 2019, 3:46:08 PM1/20/19
to pqc-...@list.nist.gov
Consequently, I expect your work to end up saving more cycles for
sntrup4591761 and ntrulpr4591761 than it has saved for similarly sized
NTT-friendly rings. This should make the NTRU-Prime-vs.-NTT cycle gap
even less noticeable than it was before, compared to the primary cost
incurred by all these systems, namely bytes transmitted.

Also, maybe rings that don't require inert primes that NTRU Prime does, but still use a non-NTT-friendly cyclotomic of the form (X^p -1)/(x-1), can natively use the NTT-friendly modulus that matches the one for the bigger NTT-friendly ring in which they want to do the multiplication?  This way, they can just do multiplication in the larger ring without worrying about the modulus.  Basically, as far as I understand, the only reason that NTRU-HRSS uses a power-of-2 modulus is to speed up multiplication.  But if multiplication can be made even faster with this NTT over a larger ring when the same prime is used, then maybe there is no reason to use a power-of-2 (except maybe key generation could be affected ... but I have no idea here).
 
Sorting int32[768] takes 6512 cycles with verified code, and it's
trivial to choose the weight so that there are no decryption failures.

That's the number of cycles in the entire key generation / encapsulation algorithms of NTTRU ... and I don't see any advantage of doing this.
 
> I am really confused as to why you draw this solid line between
> 0 decryption error and >0 decryption error. 

eprint 2019/043 appeared a few days ago, saying that ss-ntru-pke "is
attacked with complexity below the claimed security level". Recall that 
ss-ntru-pke is part of the original NTRUEncrypt submission; it was
claimed to be "extremely conservative" and provably secure. The attack
exploits decryption failures.

(My writing of this email crossed with Mike's and our conclusion about this eprint paper appears to be the same). The paper states  "A result of the attack is that it is questionable whether it is possible to achieve high security (say 256 bits in the classic sense) for proposals with midlevel decryption error probability (say 2 −100)".  I am not sure if the NTRUEncrypt paper actually claimed this (or why anyone would need to claim more than 128 bits of security), but this paper is not contradicting any sort of proof. 

There is a big difference between statements of the form "Our error is 2^-100 and we don't think it can be exploited" vs. "here is the security proof that formally takes decryption errors into account".  I understand if someone doesn't like the former, but I cannot understand how anyone could object to the latter.... and yet in the next paragraph you do.  

How many lines of failure-probability calculations has this competition
been throwing at security reviewers? And how many lines of alleged
proofs that the failures are too rare to cause trouble? How many
mistakes and security losses do we need to see before deciding that this
minor optimization within lattice-based cryptography is a bad idea?

For schemes like NTTRU, Kyber, etc, it's very few lines.  All the errors are independent and the error probability is easy to calculate and verify via simple polynomial multiplication.  If there is a mistake, it will be caught (especially if the scheme becomes some sort of a standard).  So maybe you're complaining about not trusting [HHK '17] because it's a long paper.  I obviously object to that, but ok ... how about the 1 line information theory argument in my previous email?  (And regarding Mike's point about weak keys -- I am even saying that you can set your decryption error so that there are no weak keys at all with very high probability and it would still be beneficial over having no decryption errors.)  
 
What bothers me about this "0 decryption errors" movement is that it's actually making schemes less secure and less efficient and getting *nothing* in return.  For example, just extrapolating the numbers from NTTRU, I am reasonably sure that NTRU-HRSS could have divided their modulus by 2 and still had something like 2^-400 decryption error.  I haven't done the calculation, but supposing that this is true, then this would be *information theoretically* secure if they go through the transformation in NTTRU.  This leads to a more secure scheme (modulus is halved, which is especially important for NTRU due to previously shown overstretched modulus attacks) and saves around 87 bytes in the pk and 87-32 bytes in the ciphertext.  And there is 0 penalty.  Or alternatively, they could use secrets that are not restricted to -1/0/1 and again be more secure.  

Or are you making a general statement of: "optimization can lead to errors, so let's not optimize?"  I don't think that you actually believe this; and decryption errors is really one of the simplest things that one can verify in these schemes.  And if you are going to use papers that claim something that later turns out to be not true as evidence that anything in this line of research should be abandoned, then we would not have much crypto left -- and I think that this is basically the QKD marketing argument. 


The important metric here is the complexity of a comprehensive review of
security. This varies considerably from one system to another, and it's
directly connected to the risk of a breakable system being standardized
and/or deployed.

 Then I think that everything should be deployed in MATLAB -- much more readable code that minimizes implementation errors.
 
Of course there's also a much more recent line of attacks exploiting
automorphisms, subfields, etc., most recently (2018) breaking through
the 2^sqrt approximation-factor barrier. It would hardly be a surprise
if the next few years of developments also exploit divisors. I don't
understand why anyone would recommend giving the attacker an arsenal of
attack tools that have produced attacks at SODA 2016, Eurocrypt 2017,
etc., and that we know how to eliminate at low cost. But, even if you
don't agree, surely you have to analyze whether your system survives the
specific attack strategies that have broken some earlier systems.

We don't disagree as much as you think on this issue.  I think I made the statement at the 1st NIST conference that we should pick one ring and concentrate our analysis on it.  This was also the point I was making about more advanced crypto - no matter what gets standardized for basic encryption / signatures, people are most likely going to still use structured rings there ... so it would be better if we really throw all our effort onto these useful rings right now.

As far as the specific cryptanalysis NTTRU, it's not a NIST entrant so we did not think a whole lot about maximizing security.  But if some high frequency trader said to me that he really really needs the absolutely fastest scheme, then I would suggest that they still use NTTRU, but take as wide a distribution as possible (i.e. not -1/0/1)  that still results in ~ 2^-350 decryption error.  

One (additional) reason that the attack of reducing modulo small factors is concerning is because reducing a -1/0/1 polynomial modulo some smaller-degree factor doesn't necessarily create a polynomial with many different coefficients. e.g. say the d-degree f(x) has a factor (x^d/2-r), for a large r. Reducing any -1/0/1 polynomial modulo x^d/2-r will just result in coefficients close to 0, r, and -r.  If there is some element somewhat close to r whose inverse is small, then we multiply by this inverse and get a polynomial with small coefficients.  Not sure if this is possible, but the -1/0/1 coefficients are a concern to me.  This is why I am not a fan of proposals which restrict to such small errors ... (incidentally, the NTRU schemes don't have to have small errors if they permit some decryption error :).)

Cheers,
-Vadim    

             
 

> You can't do useful FHE in some rings and we're working on much
> improved zero-knowledge proof in which complete splitting is crucial. 

Assume that performance constraints force NIST to standardize a small
lattice-based system. How do you imagine that we're going to get from
the current situation, with security flaws being announced in 2019 in
"provably secure" lattice systems, to a situation where we can actually
be confident in the security of what NIST standardizes? Do you think
that adding extra desiderata such as FHE is going to help this process?

---Dan

Donald Costello

unread,
Jan 20, 2019, 7:18:56 PM1/20/19
to Vadim Lyubashevsky, Gregor Seiler, pqc-...@list.nist.gov
You know of course how famous the  Lyubashevsky name is, don’t you? Are you related to the famous fellow?

 

From: Vadim Lyubashevsky <vadim....@gmail.com>
Sent: Sunday, January 20, 2019 12:28 AM
To: Gregor Seiler
Cc: pqc-...@list.nist.gov
Subject: Re: [pqc-forum] Really fast NTRU
 

D. J. Bernstein

unread,
Jan 22, 2019, 9:55:21 AM1/22/19
to pqc-...@list.nist.gov
Vadim Lyubashevsky writes:
> Also, maybe rings that don't require inert primes that NTRU Prime
> does, but still use a non-NTT-friendly cyclotomic of the form
> (X^p -1)/(x-1), can natively use the NTT-friendly modulus that matches
> the one for the bigger NTT-friendly ring in which they want to do the
> multiplication?

Right. Sometimes the standard reduction across polynomials is enough and
it isn't necessary to also pay for the standard reduction across primes.
Whether the prime was inert at the beginning isn't what matters for this
implementation strategy; e.g., Streamlined NTRU Prime 4591^761 could use
size-54 FFTs and even size-270 FFTs, although back-of-the-envelope
calculations suggest that these still won't be competitive with the
reduction from 4591 to primes that allow power-of-2 FFTs.

> I am reasonably sure that NTRU-HRSS could have divided their modulus
> by 2 and still had something like 2^-400 decryption error.

https://cr.yp.to/talks/2018.06.27-2/slides-djb-20180627-lattice-a4.pdf
explains an easy way to divide the HRSS modulus by 2 in the original
dimension while guaranteeing _zero_ decryption failures. Of course
sntrup4591761 follows the same curve in dimension closer to 768.

> What bothers me about this "0 decryption errors" movement is that it's actually
> making schemes less secure and less efficient and getting *nothing* in return. 

There's a long tradition of unquantified hyping of the supposed costs of
eliminating decryption failures in NTRU etc. This goes all the way back
to https://web.securityinnovation.com/hubfs/files/ntru-orig.pdf, page
19, "drawbacks". There's an even longer tradition of overconfidence in
security analyses. A proper comparison has to quantify the costs, and
has to analyze the load placed upon security reviewers.

> So maybe you're complaining about not trusting [HHK '17] because it's
> a long paper.

Persichetti and I found counterexamples to two of the six core theorems
claimed in HHK17. See https://cr.yp.to/papers.html#tightkem, Appendix A.

Even by now, has anyone carefully checked the proofs? We know how to add
an extra assumption to rule out our counterexamples, but this doesn't
mean that these assumptions are enough to rescue the theorems.

Computer verification would drastically reduce the risks of proof
errors. With verification in mind, Persichetti and I wrote up a very
carefully modularized proof of a particularly attractive ROM IND-CCA2
conversion, applicable to _some_ NIST submissions without decryption
failures. But verification is still a big research project beyond this.

Even when we have correct proofs, there's a critical question of whether
the assumptions in the proofs are correct. Here are three aspects of
this:

* Many NIST submissions (I'm not saying all) use proofs that start
from IND-CPA and that lose tightness starting from OW-CPA. Why
should we think that IND-CPA reaches the claimed security levels?
There's much less evidence of cryptanalysts studying IND-CPA than
studying OW-CPA.

* For the QROM, we don't have _any_ tight proofs starting from
OW-CPA. There's a DS assumption that doesn't _sound_ much stronger
than OW-CPA, but how much have cryptanalysts looked at DS?

Persichetti and I have identified a QROM IND-Hash assumption that
seems strictly closer to OW-CPA than DS is, and I hope we'll have
QROM IND-CCA2 from QROM IND-Hash fully written up soon, but this
still needs cryptanalysis.

* By now quite a few people, including you, have noticed that the
attacker is allowed to inspect the secret key in the definition of
failure probability in HHK17. (I mean the final version of HHK17;
the initial version claimed theorems using a different definition,
and as far as I know nobody has proven any of those theorems.)

Often being able to see the secret key makes failing messages
easier to find. How much have cryptanalysts looked at this?

Even if NIST submissions are calculating failure probabilities
correctly, they're usually (I'm not saying always) doing it with
the wrong definition, so we usually don't have IND-CCA2 security
proofs, even assuming IND-CPA.

OCB2 had a security proof at Asiacrypt, was standardized, and 14 years
later turned out to be horribly broken. In retrospect, one piece of the
OCB2 proof was a probability calculation under a definition that turned
out to be inadequate for another piece of the proof. Do we dismiss this
as an isolated mistake? Or do we start explicitly recognizing that
security reviewers are overloaded, and that we have to simplify the task
of security review to get cryptographic risks under control?

---Dan
signature.asc

mi...@shiftleft.org

unread,
Jan 22, 2019, 1:27:32 PM1/22/19
to D. J. Bernstein, pqc-...@list.nist.gov


On Jan 22, 2019 6:55 AM, "D. J. Bernstein" <d...@cr.yp.to> wrote:

   * By now quite a few people, including you, have noticed that the
     attacker is allowed to inspect the secret key in the definition of
     failure probability in HHK17. (I mean the final version of HHK17;
     the initial version claimed theorems using a different definition,
     and as far as I know nobody has proven any of those theorems.)

     Often being able to see the secret key makes failing messages
     easier to find. How much have cryptanalysts looked at this?
    
     Even if NIST submissions are calculating failure probabilities
     correctly, they're usually (I'm not saying always) doing it with
     the wrong definition, so we usually don't have IND-CCA2 security
     proofs, even assuming IND-CPA.

Hi Dan,

I'm a little confused on this point.  Is this an NTRU issue?  For LPR10 and related schemes, the message has a small (< 1 bit) or zero effect on whether decryption fails.  It's all in the coins, and if you're FO transforming, those are chosen by the hash.  So in the ROM, the adversary requires 1/pfail queries (or at most one bit less) in expectation to find a failure even with the private key.  Therefore pfail is the right thing to measure for provable security, and this is what most submissions calculated.

The recent "impact of decryption failures" papers have studied the other problem: how hard is it to find a decryption failure with limited decryption queries, without the private key?  This is probably much harder but I don't think anyone has used this in a "provable security" claim.

Or is your objection about weak keys and multi-target security?  That could multiply the adversary's advantage by up to numkeys, and though I expect in reality the factor is less than sqrt(numkeys), it is probably not negligible.

-- Mike

Mike Hamburg

unread,
Jan 22, 2019, 2:23:28 PM1/22/19
to D. J. Bernstein, pqc-...@list.nist.gov
Once more in plain text instead of HTML-only, sorry for the spam.

Mike Hamburg

unread,
Jan 22, 2019, 7:09:33 PM1/22/19
to D. J. Bernstein, pqc-...@list.nist.gov


> On Jan 22, 2019, at 6:55 AM, D. J. Bernstein <d...@cr.yp.to> wrote:
>
> Vadim Lyubashevsky writes:
>> What bothers me about this "0 decryption errors" movement is that it's actually
>> making schemes less secure and less efficient and getting *nothing* in return.
>
> There's a long tradition of unquantified hyping of the supposed costs of
> eliminating decryption failures in NTRU etc. This goes all the way back
> to https://web.securityinnovation.com/hubfs/files/ntru-orig.pdf, page
> 19, "drawbacks". There's an even longer tradition of overconfidence in
> security analyses. A proper comparison has to quantify the costs, and
> has to analyze the load placed upon security reviewers.

I made a rough estimate by comparing existing systems (ThreeBears, Saber, Kyber, NTRU LPrime) of the cost/benefit of security features. By “efficiency” I mean min(core-sieve, hybrid) security / bytes transmitted, which is an imperfect measure but good enough for this purpose. I used a combination of Schanck’s script and “estimate all the LWE”. I mainly considered class-(1-2) systems under these estimates, including LightSaber, the Round2 candidates near LightSaber, NTRU HRSS 701, KyberLight, BabyBear, and NTRU Prime. I estimated improvements in % more (bytes sent per bit of core-sieve security), abbreviated “costs about n% more”. For MLW(E,R) you can’t tune the parameters arbitrarily, so the efficiency metric might not exactly reflect the cost of the change, but again, maybe good enough for a first pass. So all these are very rough numbers and subject to exact tradeoffs and optimization.



Error correction:
Without error correction, ThreeBears costs 6-9% more.

Very low failure probability:
For ThreeBears, generic attacks (of the type discussed in my previous email) probably won't find any decryption failures using <= 2^64 queries and <= (NIST class level) work. Let’s call this failure level “probably low enough”. If ThreeBears’ error probability is decreased to be 2^-(lattice security level + 32 bits), so that its security impact is likely to be negligible even in a weak key + improved cryptanalysis scenario — call this “very low" failure probability — then this costs 3-6%.

Ring shape:
ThreeBears’ ring shape costs about 6% more than cyclotomic due to noise amplification. NTRU LPrime’s ring shape costs about 11% more than cyclotomic due to noise amplification.

Rounding:
With errors instead of rounding, NTRU LPRime would cost about 15% more. Kyber costs 21% more than Saber, due in large part to more rounding. ThreeBears costs about 9% more than it would if half its noise came from rounding (could use full rounding to save more but the parameters are more annoying).

Perfect correctness:
With perfect correctness, ThreeBears costs 44% more, which is to say 35% more than it would with no error correction. NTRU LPrime costs 33% more than ThreeBears, which squares pretty well the above feature costs (*1.44 for perfect correctness, /1.15 for rounding, *1.05 for ring shape). Likewise NTRU-HRSS estimates 175 bytes cost each in ciphertext and public key, plus 20 bits of security cost, which works out to about 34% more.

Binomial vs +-1 noise: I didn’t consider this too carefully, but LightSaber costs about 15% more than Round2. There are probably other important differences though.



Overall, it appears to me that perfect correctness costs about 33% more than “probably low enough" failure probability, and maybe 27% more than “very low" failure probability. This makes it a significantly greater cost than LWE vs LWR, than no-ECC vs 2-ECC, or than an inert modulus. Perhaps it is still worthwhile, but this doesn’t seem as obvious to me as it does to Dan.

I did this work pretty roughly, but a more detailed study like this could be useful guidance for second round tweaks. For example, it might well be worthwhile for ThreeBears to remove error correction and/or set “very low" failure probability.

Cheers,
— Mike

D. J. Bernstein

unread,
Jan 23, 2019, 8:11:10 AM1/23/19
to pqc-...@list.nist.gov
Let's take simple rounding with no decryption failures as the baseline.
Mike estimates that aggressive size optimization---take noise instead of
rounding, use an error-correcting code, and aim for "probably low
enough" failure probability---reduces total key+ciphertext size by a
factor between 1.19 and 1.25 compared to the baseline. (See below.)

Nobody has pointed to an application where this minor change in size
suddenly has users saying "Okay, _now_ that's small enough to deploy."

Meanwhile these failures are a pain for security reviewers. There's a
huge pile of complicated decryption-failure analyses in lattice
submissions (not to mention code-based submissions with decryption
failures), and now we're seeing various claims that decryption failures
are actually more frequent than previously claimed, that attackers can
pump up the probability further, and that exploiting decryption failures
disproves the security levels claimed for some lattice systems.

Two years from now, are we going to be able to look at the remaining
lattice submissions and confidently say that any nonzero failure rates
aren't a problem? How much more research is this going to take? Wouldn't
it be smarter to simply eliminate decryption failures and focus the same
research effort on trying to reduce other worrisome lattice risks?

For comparison, the Ed25519 design team opted for 64-byte signatures
rather than 48-byte signatures---a 1.33x ratio in signature size, and a
1.20x ratio in key+signature size---to simplify the security analysis.
Of course, years later, someone overconfidently claimed that one could
use 48-byte signatures "without losing security", and of course exposing
errors in that claim required a more complicated security analysis,
which was time taken away from more important issues.

The rest of this message has some comments on Mike's size estimates.

The size changes we're talking about generally accommodate changes in
security level, so the problem of reviewing size comparisons includes
all the problems of reviewing security comparisons. Recall

* one estimate saying that EMBLEM-611 is hundreds of times easier
to break than uRound2.KEM-500 (2^76 vs. 2^84) and

* another estimate saying that, no, uRound2.KEM-500 is thousands of
times easier to break than EMBLEM-611 (2^126 vs. 2^142),

where all four of these wildly inconsistent estimates appear on the same
"Estimate all the LWE/NTRU schemes" page. This is a 1.25x swing in
relative security level. This could be the aggregate result of even
larger swings that the estimates end up assigning to specific design
elements. How are we supposed to extract anything meaningful from this?
What's the scientific justification for the claim that any of these
estimates are "good enough for this purpose"?

One answer would be to say that if a design element has a noticeable
benefit in _every_ estimate then surely it's beneficial. But what if
we're simply aggregating a huge pile of sloppy estimates that are all
missing exactly the point that we're trying to observe?

Here's a concrete example. Attack papers such as

https://www.iacr.org/archive/crypto2007/46220150/46220150.pdf
https://eprint.iacr.org/2011/458.pdf
https://eprint.iacr.org/2014/880.pdf
https://eprint.iacr.org/2015/823.pdf
https://eprint.iacr.org/2016/177.pdf

show how to exploit rotation (multiplication by x) to speed up various
attacks---but the same speedup idea seems somewhat less effective
against NTRU Prime. Fully quantifying this requires more algorithm
analysis. Qualitatively, this effect is in the opposite direction of the
more obvious effect that NTRU Prime has to choose slightly fewer error
positions to handle the extra coefficient of x^n-x-1.

As far as I know, the rotation impact is taken into account in _none_ of
the estimates from the "Estimate all the LWE/NTRU schemes" page. This
isn't surprising given the history of these estimates, pursuing simple
formulas at the expense of accuracy. Presumably Mike's new "combination"
of estimates, producing his conclusion that switching to the NTRU Prime
ring produces a 1.11x size increase, also fails to take the rotation
impact into account.

Another source of noise is the switch from a proper two-dimensional
size-and-security graph (as in eprint 2018/1174) to a one-dimensional
size-per-bit-of-security ratio. Size shouldn't be _exactly_ linear in
bits of security, and we have no way to see what effect this has on tiny
claimed ratios such as 1.06 (Mike's "costs about 6% more") since we
haven't been told any of the security numbers.

Finally, let me comment on the "factor between 1.19 and 1.25" mentioned
above. This comes from dividing the following two estimates provided by
Mike, disregarding all of the above caveats:

* Starting from the baseline, switching from rounding to noise is
estimated to cost something between 1.15x and 1.21x. ("With errors
instead of rounding, NTRU LPRime would cost about 15% more. Kyber
costs 21% more than Saber, due in large part to more rounding.")

* Using an error-correcting code to aim for "probably low enough"
failure probability is then estimated to save 1.44x. ("With perfect
correctness, ThreeBears costs 44% more".)

I'm unable to figure out why Mike claims 1.33x "overall", rather than
something between 1.19x and 1.25x, as the cost for "perfect
correctness". He estimates 1.33x for NTRU LPRime vs. ThreeBears, but
this includes a cost for the ring switch. He also estimates 1.34x for
NTRU-HRSS vs. ThreeBears, but

https://cr.yp.to/talks/2018.06.27-2/slides-djb-20180627-lattice-a4.pdf

and eprint 2018/1174 already showed NTRU-HRSS parameter improvements
that maintain perfect correctness.

---Dan
signature.asc

Vadim Lyubashevsky

unread,
Jan 23, 2019, 9:15:48 AM1/23/19
to pqc-...@list.nist.gov
On Wed, Jan 23, 2019 at 2:11 PM D. J. Bernstein <d...@cr.yp.to> wrote:
Let's take simple rounding with no decryption failures as the baseline.
Mike estimates that aggressive size optimization---take noise instead of
rounding, use an error-correcting code, and aim for "probably low
enough" failure probability---reduces total key+ciphertext size by a
factor between 1.19 and 1.25 compared to the baseline. (See below.)

Nobody has pointed to an application where this minor change in size
suddenly has users saying "Okay, _now_ that's small enough to deploy."

In a previous email, you stated that "Typically sending or receiving a byte costs at least three orders of
magnitude more than a clock cycle."  Your djbsort saves about 25000 cycles over a sorting function from 2001 -- https://sorting.cr.yp.to/refs.html.  Let's presume that the correctness of the 2001 version has already been verified many times and people are comfortable implementing it correctly (I don't know anything about this topic, but I presume there must be some version of relatively efficient sorting that people are comfortable with.)  While I am sure that you implemented your sorting algorithm correctly, it is not simple, fairly new, and will probably lead to mistakes when people try to implemented themselves in various environments.  According to your metric, saving 25000 cycles could also be done by reducing the ciphertext by 25 bytes.  So now, what will lead to fewer mistakes in, for example, NTRU Prime?  Here are the two options:

1. Reducing the size of the ciphertext in a *provable* way by introducing decryption errors.  The proof needs to be verified, but *only done once*.  I am sure that for a NIST standard, we can get the attention of enough top people that we can be quite confident that the math is correct.  There are no discussions about security models here or anything -- it's just pure math.  I hope you trust that people can verify math proofs.   
2. Implement djbsort.  People still need to verify that it's correct and constant time (maybe there are automatic ways to do this ... but then someone has to verify the tools are written properly).  Then every time it's implemented, one has to verify the implementation itself.  Mistakes will surely be made there because it has to be tested in every implementation.

So which approach is more likely to lead to mistakes?  

Incidentally, I am not arguing against djbsort, I am just pointing out that the argument that calculating decryption errors correctly is the most likely point of failure in lattice-based schemes is pretty absurd.  

-Vadim         


 

Mike Hamburg

unread,
Jan 23, 2019, 1:05:55 PM1/23/19
to D. J. Bernstein, pqc-...@list.nist.gov
Hi Dan,

You seem to have misunderstood my writeup.  Sorry if it was too sloppy.

On Jan 23, 2019, at 5:10 AM, D. J. Bernstein <d...@cr.yp.to> wrote:

Let's take simple rounding with no decryption failures as the baseline.
Mike estimates that aggressive size optimization---take noise instead of
rounding, use an error-correcting code, and aim for "probably low
enough" failure probability---reduces total key+ciphertext size by a
factor between 1.19 and 1.25 compared to the baseline. (See below.)

Using noise instead of rounding is the second-largest size
*pessimization*, after perfect correctness.  Calling it “aggressive size
optimization” is simply dishonest.

Two years from now, are we going to be able to look at the remaining
lattice submissions and confidently say that any nonzero failure rates
aren't a problem? How much more research is this going to take? Wouldn't
it be smarter to simply eliminate decryption failures and focus the same
research effort on trying to reduce other worrisome lattice risks?

Indeed, that’s the topic of this thread.

Also, you seem to be ignoring Vadim’s and my point in this thread.  Yes,
if the failure rate is 2^-120, it will take significantly more research to be
confident that no attack is feasible.  To a lesser extent this is true for 2^-160.
If the failure rate is 2^-224 (for say a class III system), then even knowing the
private key, and even if it’s 2^32 times weaker than normal, you’d need a
192-bit brute-force search to find a failing message.  If the failure rate is
2^-300, then additionally for almost all keys that will ever be honestly
generated, there are no failures.

That’s not a CCA2 QROM proof by itself, and I’m aware that writing one
takes significant effort, but your hand-wringing is a little excessive.

The rest of this message has some comments on Mike's size estimates.

The size changes we're talking about generally accommodate changes in
security level, so the problem of reviewing size comparisons includes
all the problems of reviewing security comparisons. Recall

  * one estimate saying that EMBLEM-611 is hundreds of times easier
    to break than uRound2.KEM-500 (2^76 vs. 2^84) and

  * another estimate saying that, no, uRound2.KEM-500 is thousands of
    times easier to break than EMBLEM-611 (2^126 vs. 2^142),

where all four of these wildly inconsistent estimates appear on the same
"Estimate all the LWE/NTRU schemes" page. This is a 1.25x swing in
relative security level. This could be the aggregate result of even
larger swings that the estimates end up assigning to specific design
elements. How are we supposed to extract anything meaningful from this?
What's the scientific justification for the claim that any of these
estimates are "good enough for this purpose”?

Take a look at the graphs of security vs security at

(These are with an outdated version of “estimate all the LWE”; sorry.)

You will see that Round2 and RLizard are the only off-curve systems.
You will also see that Round2 is the estimate I qualified with “I didn’t
consider this too carefully”.

Here's a concrete example. Attack papers such as

  https://www.iacr.org/archive/crypto2007/46220150/46220150.pdf
  https://eprint.iacr.org/2011/458.pdf
  https://eprint.iacr.org/2014/880.pdf
  https://eprint.iacr.org/2015/823.pdf
  https://eprint.iacr.org/2016/177.pdf

show how to exploit rotation (multiplication by x) to speed up various
attacks---but the same speedup idea seems somewhat less effective
against NTRU Prime. Fully quantifying this requires more algorithm
analysis. Qualitatively, this effect is in the opposite direction of the
more obvious effect that NTRU Prime has to choose slightly fewer error
positions to handle the extra coefficient of x^n-x-1.

As far as I know, the rotation impact is taken into account in _none_ of
the estimates from the "Estimate all the LWE/NTRU schemes" page. This
isn't surprising given the history of these estimates, pursuing simple
formulas at the expense of accuracy. Presumably Mike's new "combination"
of estimates, producing his conclusion that switching to the NTRU Prime
ring produces a 1.11x size increase, also fails to take the rotation
impact into account.

Sorry, I haven’t read that section of those five papers yet.  Can you quantify
the effect this has on, eg, NTRU-HRSS?

Another source of noise is the switch from a proper two-dimensional
size-and-security graph (as in eprint 2018/1174) to a one-dimensional
size-per-bit-of-security ratio. Size shouldn't be _exactly_ linear in
bits of security, and we have no way to see what effect this has on tiny
claimed ratios such as 1.06 (Mike's "costs about 6% more") since we
haven't been told any of the security numbers.

Fair.  But this effect probably changes things by at most a couple %.  It is good
enough to roughly quantify the costs at play.  I don’t think the cost of perfect
correctness actually being 31% or 37% really changes the discussion here.

Finally, let me comment on the "factor between 1.19 and 1.25" mentioned
above. This comes from dividing the following two estimates provided by
Mike, disregarding all of the above caveats:

  * Starting from the baseline, switching from rounding to noise is
    estimated to cost something between 1.15x and 1.21x. ("With errors
    instead of rounding, NTRU LPRime would cost about 15% more. Kyber
    costs 21% more than Saber, due in large part to more rounding.")

  * Using an error-correcting code to aim for "probably low enough"
    failure probability is then estimated to save 1.44x. ("With perfect
    correctness, ThreeBears costs 44% more".)

I'm unable to figure out why Mike claims 1.33x "overall", rather than
something between 1.19x and 1.25x, as the cost for "perfect
correctness". He estimates 1.33x for NTRU LPRime vs. ThreeBears, but
this includes a cost for the ring switch. He also estimates 1.34x for
NTRU-HRSS vs. ThreeBears, but

  https://cr.yp.to/talks/2018.06.27-2/slides-djb-20180627-lattice-a4.pdf

and eprint 2018/1174 already showed NTRU-HRSS parameter improvements
that maintain perfect correctness.

—Dan

Sorry to be unclear.

For ring shapes, I estimated ThreeBears’ security with additional noise, which
would be possible if it were not amplified by the ring shape.  I also estimated
ThreeBears’ security with NTRU LPRime’s ring shape.  Likewise for “very low”
failure probability.  I did this using the same script (Schanck) that I had used
to estimate security in the first place, since it’s faster than “Estimate all the
LWE” and I’m lazy, and compared the results from that script to each other.

For perfect correctness:
For ThreeBears, since I am most familiar with that system, I calculated
several parameter sets sufficient for perfect correctness.  I used the same
script that I had used to estimate security in the first place.  For the same
core-sieve security level, the best new parameters cost about 44% more
than ThreeBears as submitted.  (It happens that such parameters existed that
put the security within a few bits of BabyBear; this makes the “costs more”
a more accurate representation of size ratio in that case.)  They also cost
about 35% more than ThreeBears with no error correction, which I had already
calculated parameters for earlier, to decide whether to use error correction at
all. I used this same technique for “very low failure probability”.

For NTRU-HRSS, I went by the submitting team’s estimates of the cost of
page 24.

NTRU LPRime has a similar security level to BabyBear (as rated by
Estimate All the LWE), so also I compared those two, and found 1.33 cost
ratio.  The chief differences are IMLW[ER] vs RLW[ER] which has only a
small effect on size, plus error correction, perfect correctness, ring shape, and
rounding.  We already have 1.44 for error correction vs perfect correctness
in ThreeBears; 1.05 ~ 1.11/1.05 for the relative cost of their ring shapes;
and 1.15 as an estimate for NTRU LPRime with noise vs rounding.  You
see that 1.44/1.15*1.05 ~ 1.31, which is pretty close to 1.33.

I take this as very weak validation that my costing methodology wasn’t terrible.
I simply don’t have time to write a 20-page survey paper on the topic complete
with charts and figures, but I think the above technique contributes to the
discussion.

If you have time, what was the cost of perfect correctness in NTRU LPRime,
vs some other possibilities (eg 2^-133 (BabyBear claimed), 2^-161 (Kyber),
or 2^-288 (my “very low” for your class V estimate))?

Cheers,
— Mike

D. J. Bernstein

unread,
Jan 24, 2019, 4:56:07 AM1/24/19
to pqc-...@list.nist.gov
Vadim Lyubashevsky writes:
> I hope you trust that people can verify math proofs.   

Isn't this the same thread where, a few messages ago, I mentioned that
two of the six core "theorems" claimed in HHK17 are incorrect? And
surely you've heard about

https://eprint.iacr.org/2018/1040
https://eprint.iacr.org/2018/1087
https://eprint.iacr.org/2018/1090

completely breaking the OCB2 standard, which had a security "proof" from
Rogaway. Perhaps you mean something special by "math proofs", but the
literature on pure mathematics has many bogus "proofs", as the following
sample illustrates:

https://en.wikipedia.org/wiki/List_of_incomplete_proofs

Of course formal logic has a long tradition of defining proofs in a
rigid way that allows trivial verification. In principle, converting
today's "proofs" into formal proofs will rule out the bogus "proofs".
There are serious efforts in this direction; see, e.g., mizar.org. But
you're kidding yourself if you think that all of the "proofs" cited in
lattice-based submissions will be converted into formal proofs within
the timeframe of the competition.

> the argument that calculating decryption errors correctly is the most
> likely point of failure in lattice-based schemes is pretty absurd.  

Recent literature claims security losses via decryption failures in some
systems submitted to NIST. The August 2018 Round5 proposal claimed NIST
security level 5, but turned out to be breakable via decryption failures
(under reasonable assumptions) with <2^64 ciphertexts and a feasible
calculation. The analysis of these issues is complicated, but clearly
one of the big sources of screwups is people thinking that calculating
decryption-failure probabilities is simpler than it actually is.

I don't understand how you can be so cavalier in dismissing the security
risks in this area. It also isn't helpful when you mischaracterize
attention to a risk as an irrelevant claim that the risk is the single
"most likely point of failure".

> Your djbsort saves about 25000 cycles over a sorting function from
> 2001 -- https://sorting.cr.yp.to/refs.html.  Let's presume that the
> correctness of the 2001 version has already been verified many times
> and people are comfortable implementing it correctly

It's fascinating for me to read your comments about sorting. Despite
your 100% confidence in "math proofs", evidently you have in mind a
principle that speedups increase risks of software bugs. You're so
convinced of the universal applicability of this principle that you post
a message applying this principle to sorting software, where your
conclusions are simply wrong.

The actual situation is the following.

djbsort has been computer-verified to correctly sort all inputs for
every array size that we use in cryptography. (See the documentation for
discussion of the remaining risks.) It's also constant-time.

The 2001 software you mention doesn't have either of these features. If
I were forced to guess, I'd guess that the 2001 software is bug-free;
but guesses like this are sometimes disastrously wrong. Anyway, since
the 2001 software isn't constant-time, it can't even be considered for
these applications.

There was some work before djbsort on verifying sorting software (see,
e.g., https://link.springer.com/article/10.1007/s10817-017-9426-4), but
I haven't found any previous verified _constant-time_ software. Such
software wouldn't have been impossible to create with previous
techniques, but it would have been intolerably slow.

You claim that "mistakes will surely be made" in every implementation
for every new environment. The automated djbsort verification tools are
general-purpose tools working directly with machine code, precisely to
resolve the conventional tension between optimization and verification.

You say that "maybe there are automatic ways" to verify correctness of
sorting software. Well, yes, there are, and if you're going to comment
on risks of bugs then maybe you should first learn what's been done in
the area to control these risks, rather than spreading generic FUD.

> someone has to verify the tools are written properly

Of course. This also applies to proof-verification tools, as nicely
illustrated by the following "proof": https://github.com/clarus/falso

> According to your metric, saving 25000 cycles could also be done by
> reducing the ciphertext by 25 bytes. So now, what will lead to fewer
> mistakes in, for example, NTRU Prime?

Beyond my comments on the details of these risk assessments (see above),
let me briefly emphasize what this question is missing about NTRU Prime.

Within small lattice systems, the primary design goal of NTRU Prime is
to minimize attack surface, making the security reviewer's job as easy
as possible. If there's a way to save 25 bytes that makes the security
reviewer's job harder, NTRU Prime says no. If there's a way to save
25000 cycles that makes the security reviewer's job harder, NTRU Prime
says no. Asking whether "yes no" is more risky than "no yes" is
irrelevant; NTRU Prime says "no no" to both risks.

---Dan
signature.asc

Vadim Lyubashevsky

unread,
Jan 24, 2019, 5:55:35 AM1/24/19
to pqc-...@list.nist.gov

Perhaps you mean something special by "math proofs", but the
literature on pure mathematics has many bogus "proofs", as the following
sample illustrates:

   https://en.wikipedia.org/wiki/List_of_incomplete_proofs

You are now actually arguing for minimizing the reliance on math in real life.  Let's apply this to other situations: do you fly on airplanes that rely on the least amount of math? And in the more fuel-efficient planes, is the trade-off between fuel economy and the amount of math worth it? Should we have (tall) buildings -- their design requires math and how many people really checked the calculations?  I think that this argument has run its course, so this is my last response on this topic. 

-Vadim
Reply all
Reply to author
Forward
0 new messages