OFFICIAL COMMENT: NTRU Prime


4akolz...@gmail.com

unread,
Mar 14, 2018, 7:42:49 AM3/14/18
to pqc-forum

Dear NTRU Prime submitters!

In this comment we suggest an improvement to the multiplication operation.


The NTRU Prime algorithm uses a combined multiplication method (Toom and Karatsuba) to multiply polynomials, which does not take into account the special form of the small polynomial (coefficients in {1, -1, 0}). This method is implemented in the function rq_mult.
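For readers unfamiliar with rq_mult's strategy, here is a minimal scalar sketch of a Karatsuba step; the function name, the int64_t coefficient type, and the power-of-two size assumption are ours for illustration, while the real rq_mult mixes Toom with Karatsuba and is vectorized with AVX2:

#include <stdint.h>
#include <string.h>

/* One Karatsuba step: r[0..2n-2] = a[0..n-1] * b[0..n-1].
   Assumes n is a power of two; falls back to schoolbook below a
   small threshold.  Illustrative sketch only. */
static void karatsuba(int64_t *r, const int64_t *a, const int64_t *b, int n)
{
    if (n <= 16) {                      /* schoolbook base case */
        memset(r, 0, (2 * n - 1) * sizeof *r);
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                r[i + j] += a[i] * b[j];
        return;
    }
    int h = n / 2;
    int64_t asum[h], bsum[h];           /* C99 variable-length arrays */
    int64_t lo[2 * h - 1], hi[2 * h - 1], mid[2 * h - 1];
    for (int i = 0; i < h; i++) {
        asum[i] = a[i] + a[h + i];
        bsum[i] = b[i] + b[h + i];
    }
    karatsuba(lo, a, b, h);             /* a_lo * b_lo */
    karatsuba(hi, a + h, b + h, h);     /* a_hi * b_hi */
    karatsuba(mid, asum, bsum, h);      /* (a_lo + a_hi)(b_lo + b_hi) */
    memset(r, 0, (2 * n - 1) * sizeof *r);
    for (int i = 0; i < 2 * h - 1; i++) {
        r[i]         += lo[i];                  /* x^0 part */
        r[i + h]     += mid[i] - lo[i] - hi[i]; /* x^h part */
        r[i + 2 * h] += hi[i];                  /* x^2h part */
    }
}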

As we have investigated, the special form of the polynomial can be exploited for multiplication. Two arrays are defined for the small polynomial: one with the indices of the positive coefficients and one with the indices of the negative coefficients. The processing of the two arrays is parallelized.
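As a concrete (and simplified) reading of this description, a scalar sketch without the AVX2/assembler parallelization might look as follows; all names are ours. Note that the loop counts depend only on the number of nonzero coefficients, while the memory addresses depend on the secret indices, a point that comes up later in this thread:

#include <stdint.h>
#include <string.h>

/* c = f * g in Z[x]/(x^p - x - 1), with the small polynomial f given
   by its index sets: pos[0..npos-1] holds the positions of the +1
   coefficients and neg[0..nneg-1] the positions of the -1
   coefficients.  g and c each have p coefficients.  Scalar sketch of
   the idea; the AVX2 version processes the index arrays in parallel. */
static void sparse_mult(int32_t *c, const int32_t *g,
                        const int16_t *pos, int npos,
                        const int16_t *neg, int nneg, int p)
{
    int64_t t[2 * p - 1];
    memset(t, 0, sizeof t);
    for (int k = 0; k < npos; k++)          /* add g shifted by each +1 index */
        for (int j = 0; j < p; j++)
            t[pos[k] + j] += g[j];
    for (int k = 0; k < nneg; k++)          /* subtract g shifted by each -1 index */
        for (int j = 0; j < p; j++)
            t[neg[k] + j] -= g[j];
    for (int i = 2 * p - 2; i >= p; i--) {  /* reduce: x^p = x + 1 */
        t[i - p + 1] += t[i];
        t[i - p]     += t[i];
    }
    for (int i = 0; i < p; i++)
        c[i] = (int32_t)(t[i] % 4591);      /* q = 4591 here; not normalized */
}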

Our method, like NTRU Prime's, uses AVX2 instructions; the critical code is written in assembler.


For parameters and keys generated by the NTRU Prime algorithm (q = 4591; p = 761; t = 125) on Linux, we obtained a speedup of about 1.5x in the polynomial multiplication operation compared to the function rq_mult.

Processor: Intel(R) Core(TM) i5-4440 CPU @ 3.1 GHz, Memory: 8 GB


Best regards,

I. Gorbenko, E. Kachko, M. Yesina, O. Akolzina

Ukraine

William Whyte

unread,
Mar 14, 2018, 8:11:34 AM3/14/18
to 4akolz...@gmail.com, pqc-forum
Hi researchers,

Is your implementation constant time? We've found (in the context of "original" NTRU) that multiplication methods that use the trinary form of the private key are hard to make constant time and in general violate the principle that there should be no control flow from secret information to operations. 
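For illustration, the standard constant-time alternative replaces secret-indexed accesses with a full scan plus arithmetic masking; the sketch below shows the generic technique, not code from any NTRU implementation:

#include <stdint.h>

/* Constant-time load of g[secret_idx]: scan the entire array and
   select with a mask, so that neither branches nor memory addresses
   depend on the secret index. */
static int32_t ct_load(const int32_t *g, int32_t len, int32_t secret_idx)
{
    int32_t r = 0;
    for (int32_t i = 0; i < len; i++) {
        /* mask is all-ones iff i == secret_idx, else all-zeros */
        int32_t d = i ^ secret_idx;
        int32_t mask = (int32_t)(((uint32_t)(d | -d) >> 31) - 1);
        r |= g[i] & mask;
    }
    return r;
}

The full scan per secret index is precisely the cost that tends to erase the speed advantage of the sparse representation.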

If you've found a way to make this constant time, then there are additional speedups to be had from taking f (and r) to be of the form (f1*f2) + f3, as described in:

J. Hoffstein and J. H. Silverman, "Random Small Hamming Weight Products with Applications to Cryptography", Discrete Appl. Math. 130(1) (2003), pp. 37-49

... though you have to be a little careful with the parameters.
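To make the shape of that speedup concrete: since a*(f1*f2 + f3) = (a*f1)*f2 + a*f3, three sparse passes of small weights d1, d2, d3 replace one pass whose cost grows with the (much larger) weight of f1*f2 + f3. A sketch, assuming an index-based routine like the sparse_mult sketched earlier in the thread; the sparse_poly type and all names here are illustrative:

#include <stdint.h>

typedef struct {
    const int16_t *pos; int npos;   /* indices of +1 coefficients */
    const int16_t *neg; int nneg;   /* indices of -1 coefficients */
} sparse_poly;

/* as sketched earlier in the thread; reduces mod x^p - x - 1 and q = 4591 */
void sparse_mult(int32_t *c, const int32_t *g,
                 const int16_t *pos, int npos,
                 const int16_t *neg, int nneg, int p);

/* c = a * (f1*f2 + f3) mod (x^p - x - 1), computed as (a*f1)*f2 + a*f3 */
static void product_form_mult(int32_t *c, const int32_t *a,
                              const sparse_poly *f1, const sparse_poly *f2,
                              const sparse_poly *f3, int p)
{
    int32_t t1[p], t2[p], t3[p];    /* C99 variable-length arrays */
    sparse_mult(t1, a,  f1->pos, f1->npos, f1->neg, f1->nneg, p);   /* a*f1 */
    sparse_mult(t2, t1, f2->pos, f2->npos, f2->neg, f2->nneg, p);   /* (a*f1)*f2 */
    sparse_mult(t3, a,  f3->pos, f3->npos, f3->neg, f3->nneg, p);   /* a*f3 */
    for (int i = 0; i < p; i++)
        c[i] = (t2[i] + t3[i]) % 4591;                              /* combine mod q */
}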

Cheers,

William


William Whyte

unread,
Mar 14, 2018, 10:51:19 AM3/14/18
to Olga Akolzina, pqc-forum
Hi Olga,

>> The execution time is determined by the total number of non-zero elements; this quantity is set by the algorithm (the t parameter).

This is true to a first order, but in our experience there was additional variation.

>> The time doesn't depend on the coefficient indices.

We observed some dependency in our experiments. Do you have experimental results showing that there's no dependency? Can you share those?

Cheers,

William



On Wed, Mar 14, 2018 at 9:34 AM, Olga Akolzina <4akolz...@gmail.com> wrote:

The execution time is determined by the total number of non-zero elements; this quantity is set by the algorithm (the t parameter). The time doesn't depend on the coefficient indices.

We didn't investigate the form (f1*f2)+f3.


D. J. Bernstein

unread,
Mar 14, 2018, 11:13:05 AM3/14/18
to pqc-...@list.nist.gov
The central claim from Gorbenko, Kachko, Yesina, and Akolzina is that it
"makes no sense" to switch from the traditional NTRU multiplication
algorithms (using the sparsity of one input) to "complex" multiplication
algorithms (Karatsuba, Toom, etc.). From a performance perspective, the
evidence presented for this claim has at least two serious flaws:

* The speeds claimed here for "complex" multiplication algorithms are
worse than previously published software. My understanding is that
the authors measured their own implementation of these algorithms,
not using state-of-the-art implementation techniques for this CPU.

* My understanding is that the claimed bottom line, "acceleration of
about 1.5 times", is actually comparing a 4-core implementation to
a 1-core implementation. This use of 4 cores has worse throughput,
energy consumption, etc. than simply running separate computations
on separate cores.

More importantly, from a security perspective, we require constant-time
algorithms. Even if sparse techniques can be competitive in speed (which
is unproven), I agree with William's assessment that those techniques
are hard to make constant time.

The bigger picture is that the constant-time cycle counts reported on
https://ntruprime.cr.yp.to are already so fast that it's hard to find
any applications that can't afford them. Obviously even more speed is
nice if we can get it (which is why new sorting code is coming soon!),
but speed should be measured properly, and it shouldn't come at the
expense of security.

---Dan

P.S. To be clear: I'm _not_ saying that applications can always handle
the sizes of lattice-based ciphertexts, typically around a kilobyte.

Olga Akolzina

unread,
Mar 15, 2018, 9:26:18 AM3/15/18
to pqc-...@list.nist.gov
Hello!

>> My understanding is that the authors measured their own implementation of these algorithms, not using state-of-the-art implementation techniques for this CPU.

We used the optimized NTRU Prime code (AVX + assembler inserts) for the speed measurements.

>> My understanding is that the claimed bottom line, "acceleration of about 1.5 times", is actually comparing a 4-core implementation to a 1-core implementation. This use of 4 cores has worse throughput, energy consumption, etc. than simply running separate computations on separate cores.

Yes, but both functions were run on the same 4-core processor. We do not see a way to vectorize your algorithm effectively, since the size of the operands being multiplied gradually decreases.

Thank you for the recommendation; we'll run an experiment on the independence of the execution time from the indices.


Best regards,

I. Gorbenko, E. Kachko, M. Yesina, O. Akolzina


Olga Akolzina

unread,
Mar 15, 2018, 10:22:02 AM3/15/18
to pqc-...@list.nist.gov
We've run 2000 tests with different keys; the time dispersion doesn't exceed 12%.


William Whyte

unread,
Mar 15, 2018, 10:29:50 AM3/15/18
to Olga Akolzina, pqc-forum
Right, the time dispersion isn't huge, but it is observable. It would be great if there were a way to make it go away: naively, it seems the index-based version should be faster, especially with the f = f1*f2 + f3 trick, but we couldn't find a way to get rid of the variation.

Cheers,

William


4akolz...@gmail.com

unread,
Mar 28, 2018, 6:25:57 AM3/28/18
to pqc-forum


More precise research results:

1. For 10,000 keys, the maximum dispersion over all measurements is -5.19676% <= e <= 6.62797%. When the measurements were repeated, the key numbers at which the minimum and maximum values occurred did not match.

2. For 100 keys, the maximum dispersion is -4.142% <= e <= 6.06054%.

3. For 100 keys measured 100 times each, the maximum dispersion of the minima is -0.745778% <= e <= 0.732426%, and the dispersion of the maxima from the average maximum is -0.891563% <= e <= 0.977177%. Thus, the measurement error can be up to 3%.

4. At the moment, the function rand (standard C library) was used to generate the indices of the polynomial coefficients with nonzero values. In the future we intend to use a more reliable generator for this purpose and to perform a full statistical analysis of the results, in order to determine whether a key could be recovered from the timing differences between measurements.
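For reference, a minimal sketch of a cycle-count harness that produces dispersion figures of this kind; mult_under_test, NKEYS, and the placeholder body are assumptions, not the actual test code:

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>                  /* __rdtscp */

#define NKEYS 10000

/* placeholder for the multiplication routine under test */
static void mult_under_test(int key_index) { (void)key_index; }

int main(void)
{
    static uint64_t cycles[NKEYS];
    unsigned aux;
    for (int k = 0; k < NKEYS; k++) {
        uint64_t start = __rdtscp(&aux);
        mult_under_test(k);
        cycles[k] = __rdtscp(&aux) - start;
    }
    double mean = 0.0;
    for (int k = 0; k < NKEYS; k++) mean += (double)cycles[k];
    mean /= NKEYS;
    double lo = 0.0, hi = 0.0;          /* min/max percent deviation e */
    for (int k = 0; k < NKEYS; k++) {
        double e = 100.0 * ((double)cycles[k] - mean) / mean;
        if (e < lo) lo = e;
        if (e > hi) hi = e;
    }
    printf("dispersion: %g%% <= e <= %g%%\n", lo, hi);
    return 0;
}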

Thank You!

Olga Akolzina

unread,
Jun 20, 2018, 9:09:43 AM6/20/18
to pqc-forum

Dear William Whyte!

Thank you very much for your positive evaluation of our work!

In accordance with your recommendations, we have continued our work on a polynomial multiplication function for NTRU-like algorithms that exploits the properties of the small polynomial while keeping the computation time independent of that polynomial.

The following results were obtained.

Processor: Intel(R) Core(TM) i5-4440 CPU @ 3.1 GHz.

Programming languages: C and assembler (64-bit).

Small polynomial represented as a set of nonzero coefficient indices:

NTRU Prime parameters: N = 761, q = 4591, d = 125.

Precomputation takes 1540 cycles and depends only on the second (public) factor, so most often it can be done in advance.

Multiplication takes 12856 cycles.

The deviation from the mean doesn't exceed 2.12% and is no more than the measurement error.

Small polynomial represented in the product form A1 * A2 + A3:

Parameters: N = 743, D1 = 11, D2 = 11, D3 = 15, q = 2048.

Multiplication takes 10809 cycles; there is no precomputation.

The execution time deviation from the mean is ±1.8559 and is no more than the measurement error.

Best regards,

I. Gorbenko, E. Kachko, M. Yesina, O. Akolzina



Kenny Herold

unread,
Jun 27, 2018, 1:30:15 PM6/27/18
to Olga Akolzina, pqc-forum

Hello,

I have a question about the performance information on these chips for the tests being conducted. Are these chips on machines that have been patched against the recently discovered CPU vulnerabilities? Those performance hits can be very significant.

I am speaking in regard to Spectre, Meltdown, and the variants of these side-channel attacks discovered since the initial findings, as well as TLBleed.

Pertinent information would be whether it is a newer CPU that was designed to be protected from such attacks, or one patched after the discovery.

Thanks,

Kenny Herold, GWAPT

Principal Security Consultant
Odin's Eye, LLC

Telephone: +1 (612) 432 0479 | Email: ke...@odinseye.net | Web: www.odinseye.net

Address: P.O. Box 1440,  Minnetonka, MN 55345 USA
