KpqC2 submissions and SUPERCOP

Skip to first unread message

D. J. Bernstein

Apr 14, 2024, 9:58:03 AMApr 14
to is a script that
downloads SUPERCOP, downloads the submissions, patches SUPERCOP for a
preliminary integration of software from the submissions (see below for
details), and runs various SUPERCOP tests. I ran this script (modulo
tweaks of comments) on a 3GHz Skylake CPU core, where it used 8GB of
disk space and finished in 8 hours. The results are linked here:

After any feedback based on this preliminary integration, I'm planning
to include similar integrations in an upcoming SUPERCOP release, and
then will collect results on many CPUs.

As a general note, SUPERCOP's built-in tests cover much more than NIST's
KATs do. For example, SUPERCOP automatically tries multiple compilers
and multiple compiler options; includes a TIMECOP option to check for
data flow from secrets to array indices and branch conditions; tries
modified ciphertexts, for example noticing PALOMA crashing on those
ciphertexts; and compares checksums (which are hashes; no need to
distribute large KAT files) across implementations.

In some cases, the integrated code covers fewer options than the
original code (and there might be further options that I missed):

* For NCC-Sign, the space of options in the original code wasn't
immediately clear to me. The integration uses specifically

* For NTRU+, the integration covers only the KEM, not the PKE.

* For PALOMA, the integration uses only w_openssl, not wo_openssl.

* For REDOG, my understanding is that the original C code is still in
progress; e.g., rbc_97_vec_clear varies in the number of arguments.
The integration of REDOG doesn't currently work.

* For SMAUG-T, I didn't try to figure out whether the additional
implementation is usable (and whether it's the TiMER mentioned in
the documentation).

The script changes various API details to meet SUPERCOP's requirements,
removes various files that aren't needed for SUPERCOP, and makes various
further changes described below.

For the options that are included, the results of the script include the

* "report": a short summary;
* "bench": SUPERCOP output, including compiler output and benchmarks;
* "timecop": TIMECOP output.

The short summary includes the following:

* Sizes.

* Cycle counts (quartiles and medians) for the main operations such
as "keypair".

* Checksums.

* A ranking of each (implementation,compiler) pair by sign+open time
for signatures or dec+enc time for KEMs.

I encourage submitters to check that the timings are as expected. Some
submissions mentioned Comet Lake, which should generally have the same
cycle counts as Skylake except when DRAM speed is an issue. I did a few
spot-checks and noticed some cases where the integration is faster than
the documentation, perhaps because SUPERCOP is trying more compiler
options, but I haven't looked systematically, and there could be cases
where the integration is slower.

I did check all sizes against the submission documents. All sizes match
for AIMer, MQ-Sign, NTRU+, and PALOMA. The private-key sizes in the
integration are longer than the private-key sizes in the documentation
for HAETAE, NCC-Sign, and SMAUG-T. The longest difference is for
SMAUG-T: the integration extends the secret key to include a copy of the
public key so that it can use the standard dec API; the original code
changed the API to take the public key as a separate argument to dec.
The shortest difference is for HAETAE, 32 bytes.

One of SUPERCOP's signature tests is whether sm can overlap m. For
MQ-Sign, the original code doesn't allow this; the integration changes
that. Other SUPERCOP tests detect failures for MQLR that the integration
eliminates by 0-initializing pk and sk; I encourage the submitters to
investigate this. Variable declarations after labels are allowed by some
compilers but not others; the integration changes "label: int foo;" to
"label: ; int foo;" to fix that. I also noticed that the original code
assumes message lengths fit into 32 bits; I didn't try to fix this.

For NCC-Sign, SUPERCOP reports different checksums, indicating that
different (implementation,compiler) pairs are producing different
results even when randombytes() is returning the same results. I
encourage the NCC-Sign submitters to investigate this.

For PALOMA, the original code has uninitialized hash output buffers in
utility.c. The integration changes this in a way that seems to produce
consistent output, but I don't know if this matches the specification.
The integration also removes the exit-on-weight-0 test; it's not
immediately obvious to me what the specification asks for here, but I
would guess that weights different from t should trigger implicit
rejection. I encourage the submitters to investigate both of these

The "bench" files often show "fromcompiler" lines that I encourage
submitters to investigate, although it's safe to ignore lines saying
that clang -mcpu=native doesn't support AVX2. Developers can often catch
bugs by running gcc or clang with options -g -fsanitize=address -Wall
-Wextra, although -Wall and -Wextra very often have false positives. One
way to add more confidence in C implementations is to write a Python
implementation from the spec and check that the Python implementation
produces the same results as the C implementations.

For constant-time code, the "timecop" files should show "timecop_pass"
lines with no "timecop_error"/"timecop_fail" lines. Currently these
files show many "timecop_error"/"timecop_fail" lines that I recommend
changing code to eliminate. If data is actually public then it's safe
to call crypto_declassify() on the data before it's used for a branch or
array index, and this will remove the TIMECOP complaint. However, each
crypto_declassify() is something reviewers have to check for safety, so
it's better to avoid it if possible. Meanwhile there are many cases
where data is secret and using it for a branch or array index can be a
security problem. In the case of AIMer, my impression is that the data
is public, but that it would also be cheap to use the relevant bits as
masks to avoid the issue.

---D. J. Bernstein

D. J. Bernstein

Apr 16, 2024, 5:29:24 PMApr 16
Here's an example of how to modify code to convert variable array
indices into constant array indices.

In the timecop-aimer128f file produced by the script mentioned in my
last message, there are "timecop_fail" lines saying

Use of uninitialised value of size 8
at 0x10F4F3: reveal_all_but (in /home/kpqc2/kpqc-supercop-build/supercop-20240107/bench/samba/work/compile/try-timecop)

coming from, e.g., the reference software having data flow from secrets
to the index used in the following line in the reveal_all_but()

memcpy(reveal_path[depth], nodes[index ^ 1], AIMER_SEED_SIZE);

One way to stop TIMECOP from complaining about this would be to add
something like crypto_declassify(sig->proofs,sizeof(sig->proofs)) in
aimer.c (after including "crypto_declassify.h") before the loop that
calls reveal_all_but(). But then a reviewer has to check that it's safe
to declare this data to be public---which seems likely given that "sig"
sounds like a signature being created, but sometimes signing code puts
secret data into signature buffers and overwrites the data later.

An alternative is to replace the nodes[index ^ 1] with arithmetic that
has the same effect. Here's a helper function that does arithmetic on
in[0],...,in[inlen-1] to extract in[inpos] without using inpos as an
array index:

#include "crypto_int64.h"

static void lookup(uint8_t out[AIMER_SEED_SIZE],
const uint8_t (*in)[AIMER_SEED_SIZE],
crypto_int64 inlen,
crypto_int64 inpos)
while (inlen > 0) {
size_t i;
crypto_int64 mask = crypto_int64_zero_mask(inpos);
for (i = 0;i < AIMER_SEED_SIZE;++i)
out[i] |= mask & (*in)[i];

The crypto_int64_zero_mask() function used here returns -1 if its input
is 0, and returns 0 otherwise, using bit operations rather than a
branch. ( has a saferewrite tool
that uses SMT solvers to efficiently test that the bit operations work
correctly for all 64-bit inputs; also, there are only a few possible
inputs anyway inside this lookup() function, and all inputs are tested
on each run.)

The memcpy() line can then be replaced with

lookup(reveal_path[depth], nodes, 2 * AIMER_NUM_MPC_PARTIES, index ^ 1);

and I would presume that this is just fine in speed since the nodes
array is small. An alternative is to replace the original function

void reveal_all_but(const uint8_t nodes[2 * AIMER_NUM_MPC_PARTIES][AIMER_SEED_SIZE],
const size_t cover_index,
size_t index = cover_index + AIMER_NUM_MPC_PARTIES;
for (size_t depth = 0; depth < AIMER_NUM_MPC_PARTIES_LOG2; depth++)
// index ^ 1 is sibling index
memcpy(reveal_path[depth], nodes[index ^ 1], AIMER_SEED_SIZE);

// go to parent node
index >>= 1;


void reveal_all_but(const uint8_t nodes[2 * AIMER_NUM_MPC_PARTIES][AIMER_SEED_SIZE],
size_t index,
size_t indexrange = AIMER_NUM_MPC_PARTIES;
for (size_t depth = 0; depth < AIMER_NUM_MPC_PARTIES_LOG2; depth++)
lookup(reveal_path[depth], nodes + indexrange, indexrange, index ^ 1);
// go to parent node
index >>= 1;
indexrange >>= 1;

so that lookup() is given the actual range of nodes used on each loop.
I tried both ways, and they produced the same SUPERCOP checksums as the
original software.

---D. J. Bernstein

D. J. Bernstein

Apr 18, 2024, 1:46:11 PMApr 18
Here's another example of converting variable array indices into
constant array indices. This example also shows how to eliminate a
variable branch, shows how to eliminate a variable division, and points
out an implementation bug beyond timing issues.

For this example, unlike the example in my previous message, I would
guess that the original code is exploitable by timing attacks similar to
the attacks from

The topic here is SMAUG-T, specifically the "HWT" subroutine, which
converts a secret seed into a secret ternary vector of weight h and
length n. Weight h means that there are exactly h nonzero coefficients.
Ternary means that the nonzero coefficients are +-1.

It's easy to choose secret signs, so the main problem is to compute h
secret positions for the n nonzero coefficients. Some algorithms that
naturally handle this without variable array indices are explained in

but SMAUG-T specifies a different algorithm based on Fisher--Yates
shuffling. My understanding is that the specified algorithm has to be
used for interoperability between enc and dec.

The timecop-smaugt1 file shows "timecop_fail" lines saying, e.g.,

Conditional jump or move depends on uninitialised value(s)
at 0x115091: cryptolab_smaug1_hwt (in /home/kpqc2/kpqc-supercop-build/supercop-20240107/bench/samba/work/compile/try-timecop)

where cryptolab_smaug1_hwt implements HWT. TIMECOP doesn't look for
divisions, and doesn't notice that in this subroutine the line

remain = 0xfffffffff / (DIMENSION - hmwt + pos);

has a secret-dependent denominator and will take variable time on many
CPUs; I'll deal with that below. TIMECOP _does_ catch data flow from
secrets to branches and indices, in particular in the following code:

if (((0xffffffff - div) > deg) && (pos < hmwt)) {
res[DIMENSION - hmwt + pos] = res[div];
res[div] =
((buf[(xof_block * 32 + (i >> 4))] >> (i & 0x0f)) & 0x02) - 1;
} else {
garbage = res[div];
garbage =
((buf[(xof_block * 32 + (i >> 4))] >> (i & 0x0f)) & 0x02) - 1;

Presumably the "garbage" code here is intended to balance timings
between the two branches, but there are three reasons this isn't safe:
(1) the compiler will probably remove that code; (2) even if the code
isn't removed, microarchitectural effects mean that the branch timing
almost certainly won't be balanced; (3) even if timings are balanced,
the code path will be exposed through, e.g., branch-prediction attacks.
Also, the variable indices into res[] will influence cache timing.

So let's remove the variable indices and the variable branch. First,
here's a generic function to read in[inpos] from in[0...inlen-1],
similar to the lookup() function in my previous message but slightly
simpler since the array entries are just bytes:

#include "crypto_int64.h"

uint8_t loadbyte(const uint8_t *in,crypto_int64 inlen,crypto_int64 inpos)
uint8_t result = 0;
while (inlen > 0) {
result |= *in & crypto_int64_zero_mask(inpos);
return result;

I've noticed that, unfortunately, ARM patched various compilers in 2021
in a way that can turn crypto_int64_zero_mask into variable-time code.
(Assembly language is safer, of course.) One way to fight against that
compiler problem is the following slightly different version of

uint8_t loadbyte(const uint8_t *in,crypto_int64 inlen,crypto_int64 inpos)
uint8_t result = 0;
while (inlen > 0) {
result |= *in & ~((inpos >> 62) | (-inpos >> 62));
return result;

I also wrote a similar function storebyte(), and then replaced the line

res[DIMENSION - hmwt + pos] = res[div];

with the following:

storebyte(res,DIMENSION,DIMENSION - hmwt + pos,loadbyte(res,DIMENSION,div));

To test this so far, I copied smaugt1/opt to smaugt1/opt2, made this
change in smaugt1/opt2, and then checked that the SUPERCOP checksums
still matched (by running ./do-part crypto_kem smaugt1 in the supercop
directory and looking at the "try" lines in bench/*/data).

I then similarly changed the "res[div] = ..." line to use storebyte(),
and removed the "garbage" lines. This still left the branch:

if (((0xffffffff - div) > deg) && (pos < hmwt)) { ... }

As a first step in eliminating this branch, I changed the variables from
unsigned (specifically uint32_t) to signed (specifically crypto_int64),
and computed a mask as -1 if the branch condition holds, else 0:

crypto_int64 mask = ((pos - hmwt) >> 62) & ((deg - (0xffffffff - div)) >> 62);

With that setup, the following lines compute DIMENSION - hmwt + pos if
the branch condition holds, else -1:

crypto_int64 storepos1 = -1;
storepos1 ^= mask & (storepos1 ^ (DIMENSION - hmwt + pos));

The point here is that storebyte() doesn't do anything if the position
is -1, so it's safe to do


which won't change any bytes unless the branch condition holds.

After similarly adjusting the second storebyte(), I looked at the
division. There's a bug in the original code at this point: the line

remain = 0xfffffffff / (DIMENSION - hmwt + pos);

that I quoted above has 9 f's instead of the 8 listed in the spec. I
assume this bug produces an incorrect distribution of outputs, although
maybe not something exploitable.

I changed 0xfffffffff to 0xffffffff in opt/hwt.c and ref/hwt.c; as
expected, this changed the SUPERCOP checksums. For the constant-time
code in progress in opt2/hwt.c, since the goal of

uint32_t remain;
remain = 0xfffffffff / (DIMENSION - hmwt + pos);
div = 0xffffffff - remain * (DIMENSION - hmwt + pos);

was actually to compute 0xffffffff modulo DIMENSION-hwmt+pos, I wrote a
function two321mod() shown below to compute mod in constant time.

This left two more obvious timing variations in the hwt() code. One was

if (pos != hmwt)
fprintf(stderr, "hwt sampling error\n");

which I simply removed: the code should be verified to work in the first
place, and shouldn't be generating debugging messages. The other was

size_t cnt_arr_idx = 0;
for (i = 0; i < DIMENSION; ++i) {
cnt_arr_idx = ((i & 0x700) >> 8) & (-(res[i] & 0x01));
cnt_arr[cnt_arr_idx] += (res[i] & 0x01);

using something based on res[i] as an array index; I also changed this
to use loadbyte() and storebyte().

At this point running TIMECOP---

env TIMECOP=1 ./do-part crypto_kem smaugt1

---showed that clang no longer produced "timecop_fail" lines for hwt,
but gcc still did. I looked at the assembly that gcc was generating for
loadbyte() and saw that, before the loop, gcc was doing, basically,

if (inlen <= 0) return 0;
counter = -inpos;
inlen -= inpos;
result = 0;
in += inpos;

indicating that, unfortunately, the gcc optimizer was figuring out that
nothing was happening until inpos moved down to 0. (This is gcc 11.4.) I
fixed that by changing "--inpos" as shown in the code below.

With these changes, I don't see further TIMECOP complaints about the
hwt() function, although there are still TIMECOP complaints about other
parts of the SMAUG-T code.

---D. J. Bernstein

#include "hwt.h"
#include "crypto_int64.h"

uint8_t loadbyte(const uint8_t *in,crypto_int64 inlen,crypto_int64 inpos)
uint8_t result = 0;
while (inlen > 0) {
result |= *in & ~((inpos >> 62) | (-inpos >> 62));
inpos += (inpos >> 62) | ((-1-inpos) >> 62);
return result;

void storebyte(uint8_t *out,crypto_int64 outlen,
crypto_int64 outpos,uint8_t byte)
while (outlen > 0) {
uint8_t olddata = *out;
uint8_t change = olddata ^ byte;
change &= ~((outpos >> 62) | (-outpos >> 62));
*out = olddata ^ change;

crypto_int64 two321mod(crypto_int64 m)
crypto_int64 result = 0xffffffff;
crypto_int64 sub = m;
sub <<= 31;
int j;
for (j = 0;j < 32;++j) {
crypto_int64 resultminus = result - sub;
crypto_int64 mask = resultminus >> 62;
result = resultminus ^ (mask & (resultminus ^ result));
sub >>= 1;
return result;

* Name: hwt
* Description: Hamming weight sampling deterministically to generate sparse
* polynomial r(x) from a seed. shake256 is the Extendable-Output
* Function from the SHA-3 family.
* Arguments: - uint8_t *res: pointer to ouptput polynomial r(x)
* (of length LWE), assumed to be already initialized
* - uint8_t *input: pointer to input seed (of length input_size)
* - size_t input_size: length of input seed
* - uint16_t hmwt: hamming weight to sample
void hwt(uint8_t *res, uint8_t *cnt_arr, const uint8_t *input,
const size_t input_size, const uint16_t hmwt) {

crypto_int64 i = 0, pos = 0;
uint32_t buf[SHAKE256_RATE * 2] = {0};

uint8_t xof_block = (hmwt == HS) ? HS_XOF : HR_XOF;
shake256((uint8_t *)buf, xof_block * SHAKE256_RATE, input, input_size);

for (i = 0; i < xof_block * 32; i++) {
crypto_int64 deg = buf[i];
crypto_int64 div = two321mod(DIMENSION - hmwt + pos) + 1;
crypto_int64 mask = ((pos - hmwt) >> 62) & ((deg - (0xffffffff - div)) >> 62);
crypto_int64 storepos1 = -1;
crypto_int64 storepos2 = -1;
storepos1 ^= mask & (storepos1 ^ (DIMENSION - hmwt + pos));
storepos2 ^= mask & (storepos2 ^ div);
storebyte(res,DIMENSION,storepos2,((buf[(xof_block * 32 + (i >> 4))] >> (i & 0x0f)) & 0x02) - 1);
pos -= mask;

for (i = 0; i < DIMENSION; ++i) {
uint8_t bit = res[i] & 0x01;
size_t cnt_arr_idx = ((i & 0x700) >> 8) & -bit;
uint8_t old = loadbyte(cnt_arr,DIMENSION,cnt_arr_idx);
storebyte(cnt_arr,DIMENSION,cnt_arr_idx,old + bit);

D. J. Bernstein

Apr 20, 2024, 2:04:25 PMApr 20
Here's an example of simpler code changes to eliminate TIMECOP warnings
for rejection sampling.

In HAETAE, a buffer generated from a random (presumably secret) seed is
passed to a rej_uniform() function with a loop around the following:

t = buf[pos++];
t |= (uint32_t)buf[pos++] << 8;

if (t < Q)
a[ctr++] = t;

Obviously t < Q is branching based on the input data, and the array
index ctr also ends up dependent on the input data. This triggers
warnings from TIMECOP.

In rejection sampling, typically it's safe for the rejection condition
to be public. What I'll do in this message is illustrate how to use
crypto_declassify() to tell TIMECOP that the rejection condition is
public. Of course, a reviewer then has to check whether this public
information is leaking anything about the secrets that are being used.

(It isn't absolutely necessary to make rejection conditions public. An
alternative strategy is to convincingly analyze the amount of data
needed so that rejection sampling will fail with probability below, say,
1/2^256. Then the code can change to always generate that amount of
data, and to perform the necessary computations on that data in constant
time. But in this message I'll focus on using crypto_declassify().)

I added

#include "crypto_uint32.h"
#include "crypto_declassify.h"

at the top of the file and changed

if (t < Q)
a[ctr++] = t;

to the following:

crypto_uint32 tlessQ = crypto_uint32_smaller_mask(t,Q);
crypto_declassify(&tlessQ,sizeof tlessQ);

if (tlessQ)
a[ctr++] = t;

Here crypto_uint32_smaller_mask(t,Q) computes -1 if t < Q, otherwise 0;
and crypto_declassify(&tlessQ,sizeof tlessQ) informs TIMECOP to stop
tracking data flow from the tlessQ variable. This stops TIMECOP from
complaining about the branch based on tlessQ, the effect of tlessQ on
ctr, etc.; meanwhile TIMECOP continues to track data flow from, e.g.,
the t variable. The reviewer has to look for crypto_declassify() lines
and analyze whether it's safe for those variables to be public.

This change eliminated all TIMECOP warnings for rej_uniform(), while
SUPERCOP checksums remained the same. I then did the same

* in rej_eta(), with new variables tless243, t0less15, t1less15;

* in crypto_sign_keypair(), with a new rejectmask variable (used at
two positions);

* in crypto_sign_signature(), with a new rejectmask variable;

* in polyfixveclk_sample_hyperball(), with a new rejectmask variable;

I also declassified some existing variables:

* the "accepted" variable in sample_gauss();

* the "hb_z1" and "h" variables in crypto_sign_signature()---these
aren't for rejection sampling but seem to be public variables that
are then encoded in a variable-time way.

I fixed some TIMECOP warnings about sample_gauss_sigma76() in a
different way. The arguments to that function include "rand" pointing to
bytes, and then there's code

const uint64_t *rand_gauss16_ptr = (uint64_t *)rand,
*rand_rej_ptr = (uint64_t *)(&rand[2]);
const uint64_t rand_gauss16 = (*rand_gauss16_ptr) & ((1ULL << 16) - 1);
const uint64_t rand_rej = (*rand_rej_ptr) & ((1ULL << 48) - 1);

that isn't a safe way to read 16-bit and 48-bit pieces from bytes: some
CPUs don't allow unaligned reads, and big-endian CPUs will give the
wrong results. I changed this to portable code:

const uint64_t rand_gauss16 = rand[0] | ((uint64_t) rand[1] << 8);
const uint64_t rand_rej = rand[2] | ((uint64_t) rand[3] << 8) | ((uint64_t) rand[4] << 16) | ((uint64_t) rand[5] << 24) | ((uint64_t) rand[6] << 32) | ((uint64_t) rand[7] << 40);

This left just one function that TIMECOP was complaining about, namely
poly_challenge() using a "b" index derived from secret data. I decided
to just declassify the b variable since I think it's public. Other
approaches would be to eliminate the index as in my previous message, or
to generate constant-weight vectors using different algorithms cited in
my previous message.

These changes gave the first timecop_pass results that I've seen so far
in KpqC. However, I encourage the submitters to check whether all of
these variables are in fact safe to declassify.

---D. J. Bernstein

D. J. Bernstein

Apr 22, 2024, 8:57:32 AMApr 22
I looked at eliminating TIMECOP warnings for ntruplus. This turned out
to be the easiest case so far: there's just one warning, and it's for a
simple rejection-sampling loop in key generation. So I added

#include "crypto_declassify.h"

and added

crypto_declassify(&r,sizeof r);

right before the "while(r)" line; TIMECOP then reported timecop_pass. I
tried all of ntruplus576/{avx2,opt,ref}.

---D. J. Bernstein

D. J. Bernstein

Apr 24, 2024, 1:51:21 PMApr 24
I also looked at MQ-Sign. This was another case where TIMECOP warnings
were easy to eliminate, at least for mqsignlr2567246/ref, as in the
patch below:

* There are some rejection-sampling loops in mqs.c. I added

* There's a real timing leak in rng.c. I rewrote those lines to use
crypto_uint8_zero_mask(). (I mentioned in an earlier message that
there are some compiler "optimizations" that can introduce
variable-time code into the mask functions; the next version of
SUPERCOP will tweak how the mask functions work internally to stop
those optimizations.)

This rng.c comes from code that NIST encouraged people to use for seed
expansion but that was never designed to run in constant time. I think
that, in context, the timing leak is limited to a small number of bytes
from a fresh buffer for each message, but the usage of AES inside the
NIST code will probably leak further information in some platforms, as
illustrated by My recommendation is to
eliminate that seed expander and call SHAKE256 with the seed as input to
generate the desired number of bytes of output.

---D. J. Bernstein

diff -ruw ref/mqs.c ref2/mqs.c
--- ref/mqs.c 2024-04-13 17:30:47.363673556 -0500
+++ ref2/mqs.c 2024-04-24 12:01:24.938727667 -0500
@@ -11,6 +11,8 @@
#include "utils_hash.h"
#include "utils_malloc.h"

+#include "crypto_declassify.h"
#define _MAX_O _O
#define _MAX_O_BYTE _O
@@ -97,6 +99,7 @@

rr = gf256mat_solve_linear_eq(mat_tmp, _O_BYTE, B, C, D, A_inv);
+ crypto_declassify(&rr,sizeof rr);
goto rej;

@@ -115,6 +118,7 @@
gf256mat_prod(temp_vec2, A_inv, H_half, H_half, temp_vec);

rr &= gf256mat_gaussian_elim(temp_mat, temp_vec + H_half, H_half);
+ crypto_declassify(&rr,sizeof rr);
if(!rr) goto rej;

gf256mat_back_substitute(temp_vec + H_half, temp_mat, H_half);
diff -ruw ref/rng.c ref2/rng.c
--- ref/rng.c 2024-04-13 17:30:47.359673467 -0500
+++ ref2/rng.c 2024-04-24 11:38:02.521684078 -0500
@@ -181,6 +181,8 @@

+#include "crypto_uint8.h"
AES256_CTR_DRBG_Update(unsigned char *provided_data,
unsigned char *Key,
@@ -190,16 +192,13 @@

for(int i = 0 ; i < 3 ; i++)
+ crypto_uint8 carry = -1;
//increment V
for(int j = 15 ; j >= 0 ; j--)
- if(V[j] == 0xff)
- V[j] = 0x00;
- else
- {
- V[j]++;
- break;
- }
+ V[j] -= carry;
+ carry &= crypto_uint8_zero_mask(V[j]);

AES256_ECB(Key, V, temp + 16 * i);
@@ -242,14 +241,12 @@
int i = 0;

while ( xlen > 0 ) {
- //increment V
- for (int j=15; j>=0; j--) {
- if ( states->V[j] == 0xff )
- states->V[j] = 0x00;
- else {
- states->V[j]++;
- break;
- }
+ crypto_uint8 carry = -1;
+ for(int j = 15 ; j >= 0 ; j--)
+ {
+ states->V[j] -= carry;
+ carry &= crypto_uint8_zero_mask(states->V[j]);
AES256_ECB(states->Key, states->V, block);
if ( xlen > 15 ) {

D. J. Bernstein

Apr 26, 2024, 4:49:20 PMApr 26
TIMECOP identifies quite a few issues with the Paloma reference code. I
would expect these issues to allow timing attacks. This message looks at
one of the issues, namely how to multiply in F_{2^{13}}.

The code represents F_{2^{13}} as F_2[x]/(x^13 + x^7 + x^6 + x^5 + 1),
and stores 13-bit polynomials as uint16_t. One of TIMECOP's complaints
is about the variable array indexing in the following function:

gf gf2m_mul_w_tab(IN gf in1, IN gf in2, IN const gf2m_tab* gf2m_tables)
in1 &= MASKBITS;
in2 &= MASKBITS;

gf result = 0;

gf int1high = (in1 >> GF2M_SPLIT_LOW) & GF2M_SPLIT_MASK_HIGH_BIT;
gf int1low = (in1)&GF2M_SPLIT_MASK_LOW_BIT;
gf int2high = (in2 >> GF2M_SPLIT_LOW) & GF2M_SPLIT_MASK_HIGH_BIT;
gf int2low = (in2)&GF2M_SPLIT_MASK_LOW_BIT;

result ^= gf2m_tables->mul_11_tab[int1high][int2high];
result ^= gf2m_tables->mul_10_tab[int1high][int2low];
result ^= gf2m_tables->mul_10_tab[int2high][int1low];
result ^= gf2m_tables->mul_00_tab[int1low][int2low];

return result;

This is splitting each 13-bit input into 6 high bits and 7 low bits, and
separately multiplying high*high, high*low, low*high, low*low by looking
up the products in tables having 2^(6+6), 2^(6+7), and 2^(7+7) entries.
The tables are precomputed.

There's a gf2m_mul() function that was used to compute the multiplication
tables in the first place. Eliminating the tables is simply a matter of
calling that function:

gf gf2m_mul_w_tab(IN gf in1, IN gf in2, IN const gf2m_tab* gf2m_tables)
return gf2m_mul(in1,in2);

There are also tables for some other operations such as squarings; I'll
focus here on multiplications.

Calling gf2m_mul works correctly, but then TIMECOP complains about
variable branches inside that function:

gf gf2m_mul(IN gf in1, gf in2)
in1 &= MASKBITS;
in2 &= MASKBITS;

gf result = 0;
gf t1 = in1;
gf t2 = in2;

for (; t2; t2 >>= 1)
result ^= (t1 * (t2 & 1));
if (t1 & 0x1000)
t1 = ((t1 << 1)) ^ IRR_POLY;
t1 <<= 1;

return result & MASKBITS;

I changed this to always do 13 iterations of the outer loop, and to use
bit operations for the inner branch:

gf gf2m_mul(IN gf in1, gf in2)
in1 &= MASKBITS;
in2 &= MASKBITS;

gf result = 0;
gf t1 = in1;
gf t2 = in2;
int i;

for (i = 0;i < 13;++i)
result ^= t1 * (t2 & 1);
t1 <<= 1;
t1 ^= IRR_POLY * (t1 >> 13);
t2 >>= 1;

return result & MASKBITS;

This eliminated the TIMECOP warnings for gf2m_mul.

There are more than 100 arithmetic operations here. Here's a rewrite
using 63 operations:

gf gf2m_mul(IN gf x, gf y)
uint64_t xy, xy0, xy1, z0, z1, q0, q1;
uint32_t x0, x1, y0, y1, top0, top1, bot0, bot1;

xy = x | (((uint64_t) y) << 32); // will handle x and y in parallel

// bits of x from bottom: 0123456789abcdef................
xy = ((xy & 0x0000ff000000ff00) << 8) | (xy & 0x000000ff000000ff);
// 01234567........89abcdef........
xy = ((xy & 0x00f000f000f000f0) << 4) | (xy & 0x000f000f000f000f);
// 0123....4567....89ab....cdef....
xy = ((xy & 0x0c0c0c0c0c0c0c0c) << 2) | (xy & 0x0303030303030303);

xy0 = xy & 0x1111111111111111; // 0...2...4...6...8...a...c...e...
xy1 = xy & 0x2222222222222222; // .1...3...5...7...9...b...d...f..
x0 = xy0 & 0xffffffff; y0 = xy0 >> 32;
x1 = xy1 & 0xffffffff; y1 = xy1 >> 32;

// actually had only 13 bits in x and y, so 25 bits in z0 and z1 below

z0 = 0x1111111111111111 & (x0 * (uint64_t) y0 + ((x1 * (uint64_t) y1) << 2));
// 0...2...4...6...8...a...c...e...g...i...k...m...o
z1 = 0x2222222222222222 & (x0 * (uint64_t) y1 + x1 * (uint64_t) y0);
// .1...3...5...7...9...b...d...f...h...j...l...n

top0 = z1 >> 25; // d...f...h...j...l...n
top1 = z0 >> 27; // .e...g...i...k...m...o
q0 = 0x1111111111111111 & (top0 * (uint64_t) 0x1001 + top1 * (uint64_t) 0x8008800);
q1 = 0x2222222222222222 & (top0 * (uint64_t) 0x2002200 + top1 * (uint64_t) 0x1001);
z0 ^= q0; z1 ^= q1;

top0 = z1 >> 25; // d...f...h...j...l...n
top1 = z0 >> 27; // .e...g...i...k...m...o
q0 = 0x1111111111111111 & (top0 * (uint64_t) 0x1001 + top1 * (uint64_t) 0x8008800);
q1 = 0x2222222222222222 & (top0 * (uint64_t) 0x2002200 + top1 * (uint64_t) 0x1001);
z0 ^= q0; z1 ^= q1;

z0 |= z1; // 01..23..45..67..89..ab..c
z0 = 0xf0f0f0f & (z0 | (z0>>2)); // 0123....4567....89ab....c
z0 = 0xff00ff & (z0 | (z0>>4)); // 01234567........89abc
return 0xffff & (z0 | (z0>>8));

This is still many more operations than the original code, and overall I
see considerable slowdowns for Paloma on, e.g., Zen 2, such as dec
slowing down from 11 million cycles to 39 million cycles. However:

* About half of the operations above are converting back and forth
between the packed gf format and an unpacked format that's better
for this style of arithmetic. If more of the code were written in
the unpacked format then most of the conversions would disappear.
Probably the reduction steps (q0, q1) can also be sped up.

* Further speedups are possible from "bitslicing" arithmetic
operations across CPU words, for example carrying out 64 bit
operations in parallel with a single 64-bit operation. I would
expect to be the fastest approach
to the Paloma computations.

* Many CPUs have instructions for binary-polynomial multiplication,
and then a multiplication in F_{2^{13}} is just a few instructions.
Typically these can again be carried out on more than one input at

All of these approaches avoid table lookups, and the bitslicing approach
avoids problems on CPUs with variable-time multipliers.

---D. J. Bernstein

D. J. Bernstein

Apr 28, 2024, 12:01:56 PMApr 28
The following bug in the NCC-Sign code makes the output array much less
random than it should be, although this code applies only to parameter
sets where ETA is 1:

if (t0 < 3) {
a[ctr++] = 1 - t0;
if (t1 < 3 && ctr < len) {
a[ctr++] = 1 - t0;
if (t2 < 3 && ctr < len) {
a[ctr++] = 1 - t0;
if (t3 < 3 && ctr < len) {
a[ctr++] = 1 - t0;

I noticed this while looking at functions that TIMECOP was complaining
about. As I commented in an earlier message, there are also differences
in SUPERCOP checksums between the original ref and opt implementations.

Some of the TIMECOP complaints came from rejection sampling, which I
fixed with

t0less3 = crypto_uint32_smaller_mask(t0,3);
crypto_declassify(&t0less3,sizeof t0less3);
if (t0less3) {
a[ctr++] = 1 - t0;

etc. in this function, and similar changes in poly_chknorm() (where I
think the early return is, in context, only for rejection sampling),
poly_uniform(), poly_challenge(), and crypto_sign_signature().

Further complaints came from lines

c->coeffs[i] = c->coeffs[b];
c->coeffs[b] = 1 - 2 * (signs & 1);

appearing (twice) in poly_challenge(). I addressed those the same way as
in my earlier message regarding SMAUG-T, again with a big slowdown.
Changing the spec to use one of the approaches from

would improve efficiency.

There are more problems. For example, TIMECOP complains about the
branches in decompose(); looking at that function, I also see "% Q",
which TIMECOP won't detect and which (here and in other functions) needs
to be replaced with constant-time divisions.

---D. J. Bernstein
Reply all
Reply to author
0 new messages