KpqC SUPERCOP integration status


D. J. Bernstein

Jun 26, 2024, 5:39:51 AM
to kpqc-b...@googlegroups.com
supercop-20240625 includes HAETAE (crypto_sign/haetae*), MQ-Sign
(crypto_sign/mqsign*), and NTRU+ (crypto_kem/ntruplus*). Benchmarks are
starting on many machines, and many results will appear within days. One
128-core machine already has results online:

https://bench.cr.yp.to/results-sign.html#amd64-rome0
https://bench.cr.yp.to/results-kem.html#amd64-rome0

In the case of crypto_sign/mqsign*, there's a tradeoff between speed and
security. I marked the slower ref software as goal-const{branch,index},
meaning that it's designed to avoid secret branch conditions and secret
array indices. The documentation also seems to indicate that the faster
avx2 software is designed to be constant-time; however, the software has
secret array indices, so I didn't mark goal-const{branch,index}. For
cases like this, SUPERCOP shows benchmarks for both, with a red "T:" for
the faster software. I wouldn't be surprised if the faster software is
exploitable; I recommend changing the multiplier to avoid using secret
array indices.
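For illustration only (this is not the MQ-Sign code): one standard way to eliminate a secret array index is to read every table entry and select the wanted one with a branch-free mask. A minimal sketch, with hypothetical names:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch: read table[secret] without using secret as an
   array index. Every entry is touched, so the memory-access pattern is
   independent of secret. */
static uint8_t ct_lookup(const uint8_t *table, size_t len, size_t secret)
{
    uint8_t r = 0;
    for (size_t i = 0; i < len; i++) {
        size_t d = i ^ secret;
        /* ne = 1 iff i != secret: (d | -d) has its top bit set iff d != 0 */
        size_t ne = (d | (0 - d)) >> (8 * sizeof(size_t) - 1);
        uint8_t mask = (uint8_t)(ne - 1); /* 0xff iff i == secret, else 0 */
        r |= mask & table[i];
    }
    return r;
}
```

The loop costs O(len) reads instead of one, which is the usual price of removing the secret index.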

For crypto_kem/ntruplus*, the page

https://bench.cr.yp.to/web-impl/amd64-rome0-crypto_kem-ntruplus864.html

shows "Test failure" for the avx2 implementation with some compilers, an
issue that I mentioned before and that should be investigated. This type
of compiler variation is often easy to diagnose with -fsanitize=address.

The same page shows "Compiler output" with clang -mcpu=native, but that
can be safely ignored in this case: it's simply reflecting the fact that
clang -mcpu=native doesn't support AVX2.

Status of SUPERCOP integration of the other submissions:

* AIMer and SMAUG-T: My understanding is that software updates (and
spec updates for SMAUG-T) are in progress. These should be easy to
integrate into SUPERCOP once the updates are done.

* NCC-Sign: As mentioned before, the original reference code and
optimized code produce different results; this means that they
can't simultaneously pass SUPERCOP's checksums. There are also more
timing variations that need to be handled.

* Paloma: As mentioned before, I recommend rewriting the software to
use the techniques of https://eprint.iacr.org/2017/793. I think
this will give a big speedup while also eliminating some timing
variations, so I don't think the current speeds are reflecting the
speeds that users will see.

* REDOG: My understanding is that the submission team is still
working on their initial C code.

---D. J. Bernstein

D. J. Bernstein

Jul 25, 2024, 5:11:59 PM
to kpqc-b...@googlegroups.com
Just a quick update on what's happening with KpqC in SUPERCOP:

* AIMer: crypto_sign/aimer* was added to supercop-20240716.

* HAETAE: crypto_sign/haetae* was added to supercop-20240625, and was
updated in supercop-20240716 to avoid some timing variations caught
by TIMECOP. I'm updating this for the HAETAE software that was
announced 20240717 (which produces different output).

* MQ-Sign: crypto_sign/mqsign* was added to supercop-20240625. I'm
updating this for the MQ-Sign software that was announced 20240717.
There is still timing variation in the avx2 implementation.

* NCC-Sign: I'm adding the NCC-Sign software that was announced
20240717. My understanding is that the six options supported by
this software are nccsign{1,3,5} and nccsign{1,3,5}aes, in each
case using trinomials. I've fixed most of the code that TIMECOP
complains about, but (depending on compiler options) there are
still variable-time divisions.

* NTRU+: crypto_kem/ntruplus* was added to supercop-20240625. I'm
updating this for the NTRU+ software that was announced 20240725
(which produces the same output).
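The variable-time divisions mentioned under NCC-Sign can typically be removed by replacing x % q, for a fixed public modulus q, with a multiply-and-shift by a precomputed reciprocal. A hedged sketch, not the NCC-Sign code, assuming q is odd, below 2^16, and that the compiler emits a branch-free comparison:

```c
#include <stdint.h>

/* Hypothetical sketch: reduce x mod q without a division instruction
   touching secret data. m = floor(2^32 / q) is public and precomputed;
   for odd q this equals 0xFFFFFFFF / q. The quotient estimate is low by
   at most 1, fixed by one masked subtraction. */
static uint32_t ct_mod(uint32_t x, uint32_t q, uint32_t m)
{
    uint32_t quot = (uint32_t)(((uint64_t)x * m) >> 32);
    uint32_t r = x - quot * q;           /* r is in [0, 2q) */
    uint32_t ge = 0 - (uint32_t)(r >= q); /* all-ones iff r >= q */
    return r - (q & ge);
}
```

The reciprocal m depends only on q, so it is computed once (with an ordinary division on public data) and reused for every secret x.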

For Paloma, REDOG, and SMAUG-T, my understanding is that updates are in
progress from the submission teams.

---D. J. Bernstein

D. J. Bernstein

Sep 3, 2024, 10:01:08 PM
to kpqc-b...@googlegroups.com
I had written:
> * MQ-Sign: crypto_sign/mqsign* was added to supercop-20240625. I'm
> updating this for the MQ-Sign software that was announced 20240717.
> There is still timing variation in the avx2 implementation.

One way to remove the timing variation is to change gf256v_madd_avx2 to
gf256v_madd_avx2_ct in blas_matrix_avx2.c (both times), where
gf256v_madd_avx2_ct is a new function shown below. This makes signing
10-20% slower compared to the variable-time avx2 code, but it's much
faster than the reference code. Probably the m_tab computation here can
be sped up.

Next SUPERCOP release is coming soon and will include this as avx2ct.

---D. J. Bernstein


#include <immintrin.h>
#include <stdint.h>
#include "crypto_uint32.h"

/* __gf256_mul, __mask_low, and linearmap_8x8_accu_ymm are defined
   elsewhere in blas_matrix_avx2.c. */
static inline
void gf256v_madd_avx2_ct( uint8_t * accu_c, const uint8_t * a, uint8_t _b, unsigned _num_byte ) {
  crypto_uint32 b = _b;
  /* build the multiplication table for b bit by bit, selecting each
     precomputed row with a constant-time mask instead of indexing
     __gf256_mul by the secret b */
  __m256i m_tab  = _mm256_set1_epi32(crypto_uint32_bottombit_mask(b)) & _mm256_load_si256( (__m256i*) (__gf256_mul + 32*1) );
  __m256i m_tab1 = _mm256_set1_epi32(crypto_uint32_bitmod_mask(b,1)) & _mm256_load_si256( (__m256i*) (__gf256_mul + 32*2) );
  m_tab  ^= _mm256_set1_epi32(crypto_uint32_bitmod_mask(b,2)) & _mm256_load_si256( (__m256i*) (__gf256_mul + 32*4) );
  m_tab1 ^= _mm256_set1_epi32(crypto_uint32_bitmod_mask(b,3)) & _mm256_load_si256( (__m256i*) (__gf256_mul + 32*8) );
  m_tab  ^= _mm256_set1_epi32(crypto_uint32_bitmod_mask(b,4)) & _mm256_load_si256( (__m256i*) (__gf256_mul + 32*16) );
  m_tab1 ^= _mm256_set1_epi32(crypto_uint32_bitmod_mask(b,5)) & _mm256_load_si256( (__m256i*) (__gf256_mul + 32*32) );
  m_tab  ^= _mm256_set1_epi32(crypto_uint32_bitmod_mask(b,6)) & _mm256_load_si256( (__m256i*) (__gf256_mul + 32*64) );
  m_tab1 ^= _mm256_set1_epi32(crypto_uint32_bitmod_mask(b,7)) & _mm256_load_si256( (__m256i*) (__gf256_mul + 32*128) );
  m_tab ^= m_tab1;
  __m256i ml = _mm256_permute2x128_si256( m_tab, m_tab, 0 );
  __m256i mh = _mm256_permute2x128_si256( m_tab, m_tab, 0x11 );
  __m256i mask = _mm256_load_si256( (__m256i*) __mask_low );

  linearmap_8x8_accu_ymm( accu_c, a, ml, mh, mask, _num_byte );
}

D. J. Bernstein

Oct 9, 2024, 12:02:06 PM
to kpqc-b...@googlegroups.com
The next SUPERCOP release will include smaugt{1,3,5} and the updated
ntruplus{576,768,864,1152}. In both cases, TIMECOP passes after some
variables are declassified.

(I've tested with only one compiler option. It won't be surprising if
testing more compilers shows TIMECOP failures. I'll rewrite things such
as "return (-(uint64_t)r) >> 63;" later using cryptoint.)
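For readers unfamiliar with the idiom: "(-(uint64_t)r) >> 63" computes a 0/1 flag that is 1 when r is nonzero (for r up to 2^63). The concern is that a compiler is free to recognize such an expression and compile it into a branch; cryptoint-style code pins the value down so that cannot happen. A hedged sketch of the same flag, not the actual cryptoint implementation:

```c
#include <stdint.h>

/* Hypothetical sketch, not cryptoint itself: a 0/1 nonzero flag.
   (r | -r) has its top bit set exactly when r != 0, for all r, unlike
   -r alone, which misses r above 2^63. The volatile round trip is one
   blunt way to discourage the compiler from rewriting this as a branch;
   cryptoint is more systematic about it. */
static uint64_t ct_nonzero(uint64_t r)
{
    volatile uint64_t v = r;
    uint64_t x = v;
    return (x | (0 - x)) >> 63;
}
```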

I used TIMECOP 2's crypto_declassify(x,xlen) function for most of the
declassification. However, I decided to take a different approach for
the ntruplus AVX2 code. In that code, the early aborts are tested in
assembly, specifically tests of the form

test %r10, %r10
jnz _loopend

in baseinv.s. It's possible to call crypto_declassify() from assembly,
but I decided to eliminate the function-call overhead in favor of
directly declassifying r10.

The rest of this message explains how to declassify a register, in this
case r10, in assembly. Similar comments apply to declassifying other
objects in assembly.

crypto_declassify(x,xlen) is implemented inside TIMECOP as
VALGRIND_MAKE_MEM_DEFINED(x,xlen), which internally calls a
general-purpose valgrind "client request", which, in the case of amd64,
internally works as follows:

* Set up rax pointing to a 48-byte valgrind request structure.

* Run "rolq $3,%rdi; rolq $13,%rdi; rolq $61,%rdi; rolq $51,%rdi;
xchgq %rbx,%rbx". Outside valgrind, this has no effect and never
happens in normal code (or at least valgrind reasonably hopes it
never happens); when it does happen, valgrind runs the request.

* The result of the request is in rdx.

For VALGRIND_MAKE_MEM_DEFINED, the 48-byte request structure contains
the following six 64-bit integers: 1296236546, x, xlen, 0, 0, 0.
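That 48-byte block can be written down as a C struct; memcheck.h's VALGRIND_MAKE_MEM_DEFINED macro builds the same layout before running the magic instruction sequence. The field names below are illustrative, but the request code 1296236546 checks out as ('M'<<24)|('C'<<16)|2, i.e. memcheck's tool base plus the MAKE_MEM_DEFINED request number:

```c
#include <stdint.h>

/* The six 64-bit words of the valgrind client-request block described
   above, with illustrative field names. */
struct vg_request {
    uint64_t code;             /* 1296236546 = ('M'<<24)|('C'<<16)|2 */
    uint64_t addr;             /* x */
    uint64_t len;              /* xlen */
    uint64_t arg3, arg4, arg5; /* unused: 0, 0, 0 */
};
```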

In particular, declassifying an 8-byte value needs 56 bytes of memory:
the 48-byte request plus an 8-byte array to hold the value. I didn't
want to check whether 56 bytes are available within the outputs, and
baseinv.s isn't using the stack, so I allocated 56 bytes on the stack by
doing

subq $56,%rsp

at the top of each function and

addq $56,%rsp

before each ret. On Linux, this rsp adjustment can even be skipped,
since functions are free to use a 128-byte "red zone" under rsp.

I decided to put the 8 bytes to declassify at 0(%rsp), meaning that the
56 bytes are (value, 1296236546, rsp, 8, 0, 0, 0). The request structure
is unchanged throughout the function, so I created it at the top of the
function, right after adjusting rsp:

movq $1296236546,8(%rsp)
movq %rsp,16(%rsp)
movq $8,24(%rsp)
movq $0,32(%rsp)
movq $0,40(%rsp)
movq $0,48(%rsp)
leaq 8(%rsp),%rax

I replaced other uses of rax and rdx with r11 and r8, which weren't
being used otherwise. Then I inserted

movq %r10,0(%rsp)

# VALGRIND_MAKE_MEM_DEFINED(rsp,8)
rolq $3,%rdi
rolq $13,%rdi
rolq $61,%rdi
rolq $51,%rdi
xchgq %rbx,%rbx

movq 0(%rsp),%r10

before each r10 test. That's it.

This is actually declassifying more than just what's public: it's
declassifying the entire value in r10, whereas what's public is merely
whether r10 is zero. So an auditor reading this code has to check that
the declassified value isn't used in any other ways. An alternative is
to use constant-time code to flag whether r10 is zero, and then just
declassify that flag.

For comparison, calling crypto_declassify() would involve saving various
caller-save registers to comply with the C ABI, even if the code ends up
being linked to a non-valgrind crypto_declassify() that returns without
doing anything.

---D. J. Bernstein