Apr 14, 2024, 9:58:03 AM

to kpqc-b...@googlegroups.com

https://cr.yp.to/2024/20240414/kpqc-supercop.sh is a script that

downloads SUPERCOP, downloads the submissions, patches SUPERCOP for a

preliminary integration of software from the submissions (see below for

details), and runs various SUPERCOP tests. I ran this script (modulo

tweaks of comments) on a 3GHz Skylake CPU core, where it used 8GB of

disk space and finished in 8 hours. The results are linked here:

https://cr.yp.to/2024/20240414/results.html

After any feedback based on this preliminary integration, I'm planning

to include similar integrations in an upcoming SUPERCOP release, and

then https://bench.cr.yp.to will collect results on many CPUs.

As a general note, SUPERCOP's built-in tests cover much more than NIST's

KATs do. For example, SUPERCOP automatically tries multiple compilers

and multiple compiler options; includes a TIMECOP option to check for

data flow from secrets to array indices and branch conditions; tries

modified ciphertexts, for example noticing PALOMA crashing on those

ciphertexts; and compares checksums (which are hashes; no need to

distribute large KAT files) across implementations.

In some cases, the integrated code covers fewer options than the

original code (and there might be further options that I missed):

* For NCC-Sign, the space of options in the original code wasn't

immediately clear to me. The integration uses specifically

NIMS_TRI_NTT_MODE and NIMS_TRI_USE_AES.

* For NTRU+, the integration covers only the KEM, not the PKE.

* For PALOMA, the integration uses only w_openssl, not wo_openssl.

* For REDOG, my understanding is that the original C code is still in

progress; e.g., rbc_97_vec_clear varies in the number of arguments.

The integration of REDOG doesn't currently work.

* For SMAUG-T, I didn't try to figure out whether the additional

implementation is usable (and whether it's the TiMER mentioned in

the documentation).

The script changes various API details to meet SUPERCOP's requirements,

removes various files that aren't needed for SUPERCOP, and makes various

further changes described below.

For the options that are included, the results of the script include the

following:

* "report": a short summary;

* "bench": SUPERCOP output, including compiler output and benchmarks;

* "timecop": TIMECOP output.

The short summary includes the following:

* Sizes.

* Cycle counts (quartiles and medians) for the main operations such

as "keypair".

* Checksums.

* A ranking of each (implementation,compiler) pair by sign+open time

for signatures or dec+enc time for KEMs.

I encourage submitters to check that the timings are as expected. Some

submissions mentioned Comet Lake, which should generally have the same

cycle counts as Skylake except when DRAM speed is an issue. I did a few

spot-checks and noticed some cases where the integration is faster than

the documentation, perhaps because SUPERCOP is trying more compiler

options, but I haven't looked systematically, and there could be cases

where the integration is slower.

I did check all sizes against the submission documents. All sizes match

for AIMer, MQ-Sign, NTRU+, and PALOMA. The private-key sizes in the

integration are longer than the private-key sizes in the documentation

for HAETAE, NCC-Sign, and SMAUG-T. The largest difference is for
SMAUG-T: the integration extends the secret key to include a copy of the
public key so that it can use the standard dec API; the original code
changed the API to take the public key as a separate argument to dec.
The smallest difference is for HAETAE, 32 bytes.

One of SUPERCOP's signature tests is whether sm can overlap m. For

MQ-Sign, the original code doesn't allow this; the integration changes

that. Other SUPERCOP tests detect failures for MQLR that the integration

eliminates by 0-initializing pk and sk; I encourage the submitters to

investigate this. Variable declarations after labels are allowed by some

compilers but not others; the integration changes "label: int foo;" to

"label: ; int foo;" to fix that. I also noticed that the original code

assumes message lengths fit into 32 bits; I didn't try to fix this.

For NCC-Sign, SUPERCOP reports different checksums, indicating that

different (implementation,compiler) pairs are producing different

results even when randombytes() is returning the same results. I

encourage the NCC-Sign submitters to investigate this.

For PALOMA, the original code has uninitialized hash output buffers in

utility.c. The integration changes this in a way that seems to produce

consistent output, but I don't know if this matches the specification.

The integration also removes the exit-on-weight-0 test; it's not

immediately obvious to me what the specification asks for here, but I

would guess that weights different from t should trigger implicit

rejection. I encourage the submitters to investigate both of these

issues.

The "bench" files often show "fromcompiler" lines that I encourage

submitters to investigate, although it's safe to ignore lines saying

that clang -mcpu=native doesn't support AVX2. Developers can often catch

bugs by running gcc or clang with options -g -fsanitize=address -Wall

-Wextra, although -Wall and -Wextra very often have false positives. One

way to add more confidence in C implementations is to write a Python

implementation from the spec and check that the Python implementation

produces the same results as the C implementations.

For constant-time code, the "timecop" files should show "timecop_pass"

lines with no "timecop_error"/"timecop_fail" lines. Currently these

files show many "timecop_error"/"timecop_fail" lines that I recommend

changing code to eliminate. If data is actually public then it's safe

to call crypto_declassify() on the data before it's used for a branch or

array index, and this will remove the TIMECOP complaint. However, each

crypto_declassify() is something reviewers have to check for safety, so

it's better to avoid it if possible. Meanwhile there are many cases

where data is secret and using it for a branch or array index can be a

security problem. In the case of AIMer, my impression is that the data

is public, but that it would also be cheap to use the relevant bits as

masks to avoid the issue.

---D. J. Bernstein


Apr 16, 2024, 5:29:24 PM

to kpqc-b...@googlegroups.com

Here's an example of how to modify code to convert variable array

indices into constant array indices.

In the timecop-aimer128f file produced by the script mentioned in my

last message, there are "timecop_fail" lines saying

Use of uninitialised value of size 8

at 0x10F4F3: reveal_all_but (in /home/kpqc2/kpqc-supercop-build/supercop-20240107/bench/samba/work/compile/try-timecop)

coming from, e.g., the reference software having data flow from secrets

to the index used in the following line in the reveal_all_but()

function:

memcpy(reveal_path[depth], nodes[index ^ 1], AIMER_SEED_SIZE);

One way to stop TIMECOP from complaining about this would be to add

something like crypto_declassify(sig->proofs,sizeof(sig->proofs)) in

aimer.c (after including "crypto_declassify.h") before the loop that

calls reveal_all_but(). But then a reviewer has to check that it's safe

to declare this data to be public---which seems likely given that "sig"

sounds like a signature being created, but sometimes signing code puts

secret data into signature buffers and overwrites the data later.

An alternative is to replace the nodes[index ^ 1] with arithmetic that

has the same effect. Here's a helper function that does arithmetic on

in[0],...,in[inlen-1] to extract in[inpos] without using inpos as an

array index:

#include <stdint.h>
#include <string.h>
#include "crypto_int64.h"

static void lookup(uint8_t out[AIMER_SEED_SIZE],
                   const uint8_t (*in)[AIMER_SEED_SIZE],
                   crypto_int64 inlen,
                   crypto_int64 inpos)
{
  memset(out,0,AIMER_SEED_SIZE);
  while (inlen > 0) {
    size_t i;
    crypto_int64 mask = crypto_int64_zero_mask(inpos);
    for (i = 0;i < AIMER_SEED_SIZE;++i)
      out[i] |= mask & (*in)[i];
    ++in;
    --inlen;
    --inpos;
  }
}

The crypto_int64_zero_mask() function used here returns -1 if its input

is 0, and returns 0 otherwise, using bit operations rather than a

branch. (https://pqsrc.cr.yp.to/downloads.html has a saferewrite tool

that uses SMT solvers to efficiently test that the bit operations work

correctly for all 64-bit inputs; also, there are only a few possible

inputs anyway inside this lookup() function, and all inputs are tested

on each run.)

The memcpy() line can then be replaced with

lookup(reveal_path[depth], nodes, 2 * AIMER_NUM_MPC_PARTIES, index ^ 1);

and I would presume that this is just fine in speed since the nodes

array is small. An alternative is to replace the original function

void reveal_all_but(const uint8_t nodes[2 * AIMER_NUM_MPC_PARTIES][AIMER_SEED_SIZE],
                    const size_t cover_index,
                    uint8_t reveal_path[AIMER_NUM_MPC_PARTIES_LOG2][AIMER_SEED_SIZE])
{
  size_t index = cover_index + AIMER_NUM_MPC_PARTIES;
  for (size_t depth = 0; depth < AIMER_NUM_MPC_PARTIES_LOG2; depth++)
  {
    // index ^ 1 is sibling index
    memcpy(reveal_path[depth], nodes[index ^ 1], AIMER_SEED_SIZE);
    // go to parent node
    index >>= 1;
  }
}

with

void reveal_all_but(const uint8_t nodes[2 * AIMER_NUM_MPC_PARTIES][AIMER_SEED_SIZE],
                    size_t index,
                    uint8_t reveal_path[AIMER_NUM_MPC_PARTIES_LOG2][AIMER_SEED_SIZE])
{
  size_t indexrange = AIMER_NUM_MPC_PARTIES;
  for (size_t depth = 0; depth < AIMER_NUM_MPC_PARTIES_LOG2; depth++)
  {
    lookup(reveal_path[depth], nodes + indexrange, indexrange, index ^ 1);
    // go to parent node
    index >>= 1;
    indexrange >>= 1;
  }
}

so that lookup() is given the actual range of nodes used on each loop.

I tried both ways, and they produced the same SUPERCOP checksums as the

original software.

---D. J. Bernstein


Apr 18, 2024, 1:46:11 PM

to kpqc-b...@googlegroups.com

Here's another example of converting variable array indices into

constant array indices. This example also shows how to eliminate a

variable branch, shows how to eliminate a variable division, and points

out an implementation bug beyond timing issues.

For this example, unlike the example in my previous message, I would

guess that the original code is exploitable by timing attacks similar to

the attacks from https://eprint.iacr.org/2021/1485.

The topic here is SMAUG-T, specifically the "HWT" subroutine, which

converts a secret seed into a secret ternary vector of weight h and

length n. Weight h means that there are exactly h nonzero coefficients.

Ternary means that the nonzero coefficients are +-1.

It's easy to choose secret signs, so the main problem is to compute the
h secret positions of the nonzero coefficients within the length-n
vector. Some algorithms that naturally handle this without variable
array indices are explained in

https://cr.yp.to/papers.html#divergence

https://eprint.iacr.org/2024/548

but SMAUG-T specifies a different algorithm based on Fisher--Yates

shuffling. My understanding is that the specified algorithm has to be

used for interoperability between enc and dec.

The timecop-smaugt1 file shows "timecop_fail" lines saying, e.g.,

Conditional jump or move depends on uninitialised value(s)

at 0x115091: cryptolab_smaug1_hwt (in /home/kpqc2/kpqc-supercop-build/supercop-20240107/bench/samba/work/compile/try-timecop)

where cryptolab_smaug1_hwt implements HWT. TIMECOP doesn't look for

divisions, and doesn't notice that in this subroutine the line

remain = 0xfffffffff / (DIMENSION - hmwt + pos);

has a secret-dependent denominator and will take variable time on many

CPUs; I'll deal with that below. TIMECOP _does_ catch data flow from

secrets to branches and indices, in particular in the following code:

if (((0xffffffff - div) > deg) && (pos < hmwt)) {
  res[DIMENSION - hmwt + pos] = res[div];
  res[div] =
      ((buf[(xof_block * 32 + (i >> 4))] >> (i & 0x0f)) & 0x02) - 1;
  pos++;
} else {
  garbage = res[div];
  garbage =
      ((buf[(xof_block * 32 + (i >> 4))] >> (i & 0x0f)) & 0x02) - 1;
}

Presumably the "garbage" code here is intended to balance timings

between the two branches, but there are three reasons this isn't safe:

(1) the compiler will probably remove that code; (2) even if the code

isn't removed, microarchitectural effects mean that the branch timing

almost certainly won't be balanced; (3) even if timings are balanced,

the code path will be exposed through, e.g., branch-prediction attacks.

Also, the variable indices into res[] will influence cache timing.

So let's remove the variable indices and the variable branch. First,

here's a generic function to read in[inpos] from in[0...inlen-1],

similar to the lookup() function in my previous message but slightly

simpler since the array entries are just bytes:

#include "crypto_int64.h"

uint8_t loadbyte(const uint8_t *in,crypto_int64 inlen,crypto_int64 inpos)
{
  uint8_t result = 0;
  while (inlen > 0) {
    result |= *in & crypto_int64_zero_mask(inpos);
    ++in;
    --inlen;
    --inpos;
  }
  return result;
}

I've noticed that, unfortunately, ARM patched various compilers in 2021

in a way that can turn crypto_int64_zero_mask into variable-time code.

(Assembly language is safer, of course.) One way to fight against that

compiler problem is the following slightly different version of

loadbyte():

uint8_t loadbyte(const uint8_t *in,crypto_int64 inlen,crypto_int64 inpos)
{
  uint8_t result = 0;
  while (inlen > 0) {
    result |= *in & ~((inpos >> 62) | (-inpos >> 62));
    ++in;
    --inlen;
    --inpos;
  }
  return result;
}

I also wrote a similar function storebyte(), and then replaced the line

res[DIMENSION - hmwt + pos] = res[div];

with the following:

storebyte(res,DIMENSION,DIMENSION - hmwt + pos,loadbyte(res,DIMENSION,div));

To test this so far, I copied smaugt1/opt to smaugt1/opt2, made this

change in smaugt1/opt2, and then checked that the SUPERCOP checksums

still matched (by running ./do-part crypto_kem smaugt1 in the supercop

directory and looking at the "try" lines in bench/*/data).

I then similarly changed the "res[div] = ..." line to use storebyte(),

and removed the "garbage" lines. This still left the branch:

if (((0xffffffff - div) > deg) && (pos < hmwt)) { ... }

As a first step in eliminating this branch, I changed the variables from

unsigned (specifically uint32_t) to signed (specifically crypto_int64),

and computed a mask as -1 if the branch condition holds, else 0:

crypto_int64 mask = ((pos - hmwt) >> 62) & ((deg - (0xffffffff - div)) >> 62);

With that setup, the following lines compute DIMENSION - hmwt + pos if

the branch condition holds, else -1:

crypto_int64 storepos1 = -1;

storepos1 ^= mask & (storepos1 ^ (DIMENSION - hmwt + pos));

The point here is that storebyte() doesn't do anything if the position

is -1, so it's safe to do

storebyte(res,DIMENSION,storepos1,loadbyte(res,DIMENSION,div));

which won't change any bytes unless the branch condition holds.

After similarly adjusting the second storebyte(), I looked at the

division. There's a bug in the original code at this point: the line

remain = 0xfffffffff / (DIMENSION - hmwt + pos);

that I quoted above has 9 f's instead of the 8 listed in the spec. I

assume this bug produces an incorrect distribution of outputs, although

maybe not something exploitable.

I changed 0xfffffffff to 0xffffffff in opt/hwt.c and ref/hwt.c; as

expected, this changed the SUPERCOP checksums. For the constant-time

code in progress in opt2/hwt.c, since the goal of

uint32_t remain;
remain = 0xfffffffff / (DIMENSION - hmwt + pos);
div = 0xffffffff - remain * (DIMENSION - hmwt + pos);

was actually to compute 0xffffffff modulo DIMENSION-hmwt+pos, I wrote a
function two321mod() shown below to compute mod in constant time.

This left two more obvious timing variations in the hwt() code. One was

if (pos != hmwt)

fprintf(stderr, "hwt sampling error\n");

which I simply removed: the code should be verified to work in the first

place, and shouldn't be generating debugging messages. The other was

size_t cnt_arr_idx = 0;
for (i = 0; i < DIMENSION; ++i) {
  cnt_arr_idx = ((i & 0x700) >> 8) & (-(res[i] & 0x01));
  cnt_arr[cnt_arr_idx] += (res[i] & 0x01);
}

using something based on res[i] as an array index; I also changed this

to use loadbyte() and storebyte().

At this point running TIMECOP---

env TIMECOP=1 ./do-part crypto_kem smaugt1

---showed that clang no longer produced "timecop_fail" lines for hwt,

but gcc still did. I looked at the assembly that gcc was generating for

loadbyte() and saw that, before the loop, gcc was doing, basically,

if (inlen <= 0) return 0;

counter = -inpos;

inlen -= inpos;

result = 0;

in += inpos;

indicating that, unfortunately, the gcc optimizer was figuring out that

nothing was happening until inpos moved down to 0. (This is gcc 11.4.) I

fixed that by changing "--inpos" as shown in the code below.

With these changes, I don't see further TIMECOP complaints about the

hwt() function, although there are still TIMECOP complaints about other

parts of the SMAUG-T code.

---D. J. Bernstein

#include "hwt.h"
#include "crypto_int64.h"

uint8_t loadbyte(const uint8_t *in,crypto_int64 inlen,crypto_int64 inpos)
{
  uint8_t result = 0;
  while (inlen > 0) {
    result |= *in & ~((inpos >> 62) | (-inpos >> 62));
    ++in;
    --inlen;
    inpos += (inpos >> 62) | ((-1-inpos) >> 62);
  }
  return result;
}

void storebyte(uint8_t *out,crypto_int64 outlen,
               crypto_int64 outpos,uint8_t byte)
{
  while (outlen > 0) {
    uint8_t olddata = *out;
    uint8_t change = olddata ^ byte;
    change &= ~((outpos >> 62) | (-outpos >> 62));
    *out = olddata ^ change;
    ++out;
    --outlen;
    --outpos;
  }
}

crypto_int64 two321mod(crypto_int64 m)
{
  crypto_int64 result = 0xffffffff;
  crypto_int64 sub = m;
  sub <<= 31;
  int j;
  for (j = 0;j < 32;++j) {
    crypto_int64 resultminus = result - sub;
    crypto_int64 mask = resultminus >> 62;
    result = resultminus ^ (mask & (resultminus ^ result));
    sub >>= 1;
  }
  return result;
}

/*************************************************
 * Name: hwt
 *
 * Description: Hamming weight sampling deterministically to generate sparse
 *              polynomial r(x) from a seed. shake256 is the Extendable-Output
 *              Function from the SHA-3 family.
 *
 * Arguments:   - uint8_t *res: pointer to output polynomial r(x)
 *                (of length LWE), assumed to be already initialized
 *              - uint8_t *input: pointer to input seed (of length input_size)
 *              - size_t input_size: length of input seed
 *              - uint16_t hmwt: Hamming weight to sample
 **************************************************/
void hwt(uint8_t *res, uint8_t *cnt_arr, const uint8_t *input,
         const size_t input_size, const uint16_t hmwt) {
  crypto_int64 i = 0, pos = 0;
  uint32_t buf[SHAKE256_RATE * 2] = {0};
  uint8_t xof_block = (hmwt == HS) ? HS_XOF : HR_XOF;

  shake256((uint8_t *)buf, xof_block * SHAKE256_RATE, input, input_size);

  for (i = 0; i < xof_block * 32; i++) {
    crypto_int64 deg = buf[i];
    crypto_int64 div = two321mod(DIMENSION - hmwt + pos) + 1;
    crypto_int64 mask = ((pos - hmwt) >> 62) & ((deg - (0xffffffff - div)) >> 62);
    crypto_int64 storepos1 = -1;
    crypto_int64 storepos2 = -1;
    storepos1 ^= mask & (storepos1 ^ (DIMENSION - hmwt + pos));
    storepos2 ^= mask & (storepos2 ^ div);
    storebyte(res,DIMENSION,storepos1,loadbyte(res,DIMENSION,div));
    storebyte(res,DIMENSION,storepos2,((buf[(xof_block * 32 + (i >> 4))] >> (i & 0x0f)) & 0x02) - 1);
    pos -= mask;
  }

  for (i = 0; i < DIMENSION; ++i) {
    uint8_t bit = res[i] & 0x01;
    size_t cnt_arr_idx = ((i & 0x700) >> 8) & -bit;
    uint8_t old = loadbyte(cnt_arr,DIMENSION,cnt_arr_idx);
    storebyte(cnt_arr,DIMENSION,cnt_arr_idx,old + bit);
  }
}

constant array indices. This example also shows how to eliminate a

variable branch, shows how to eliminate a variable division, and points

out an implementation bug beyond timing issues.

For this example, unlike the example in my previous message, I would

guess that the original code is exploitable by timing attacks similar to

the attacks from https://eprint.iacr.org/2021/1485.

The topic here is SMAUG-T, specifically the "HWT" subroutine, which

converts a secret seed into a secret ternary vector of weight h and

length n. Weight h means that there are exactly h nonzero coefficients.

Ternary means that the nonzero coefficients are +-1.

It's easy to choose secret signs, so the main problem is to compute h

secret positions for the n nonzero coefficients. Some algorithms that

naturally handle this without variable array indices are explained in

https://cr.yp.to/papers.html#divergence

https://eprint.iacr.org/2024/548

but SMAUG-T specifies a different algorithm based on Fisher--Yates

shuffling. My understanding is that the specified algorithm has to be

used for interoperability between enc and dec.

The timecop-smaugt1 file shows "timecop_fail" lines saying, e.g.,

Conditional jump or move depends on uninitialised value(s)

at 0x115091: cryptolab_smaug1_hwt (in /home/kpqc2/kpqc-supercop-build/supercop-20240107/bench/samba/work/compile/try-timecop)

where cryptolab_smaug1_hwt implements HWT. TIMECOP doesn't look for

divisions, and doesn't notice that in this subroutine the line

remain = 0xfffffffff / (DIMENSION - hmwt + pos);

has a secret-dependent denominator and will take variable time on many

CPUs; I'll deal with that below. TIMECOP _does_ catch data flow from

secrets to branches and indices, in particular in the following code:

if (((0xffffffff - div) > deg) && (pos < hmwt)) {

res[DIMENSION - hmwt + pos] = res[div];

res[div] =

((buf[(xof_block * 32 + (i >> 4))] >> (i & 0x0f)) & 0x02) - 1;

pos++;

} else {

garbage = res[div];

garbage =

((buf[(xof_block * 32 + (i >> 4))] >> (i & 0x0f)) & 0x02) - 1;

}

Presumably the "garbage" code here is intended to balance timings

between the two branches, but there are three reasons this isn't safe:

(1) the compiler will probably remove that code; (2) even if the code

isn't removed, microarchitectural effects mean that the branch timing

almost certainly won't be balanced; (3) even if timings are balanced,

the code path will be exposed through, e.g., branch-prediction attacks.

Also, the variable indices into res[] will influence cache timing.

So let's remove the variable indices and the variable branch. First,

here's a generic function to read in[inpos] from in[0...inlen-1],

similar to the lookup() function in my previous message but slightly

simpler since the array entries are just bytes:

#include "crypto_int64.h"

uint8_t loadbyte(const uint8_t *in,crypto_int64 inlen,crypto_int64 inpos)
{
    uint8_t result = 0;
    while (inlen > 0) {
        result |= *in & crypto_int64_zero_mask(inpos);
        ++in;
        --inlen;
        --inpos;
    }
    return result;
}

I've noticed that, unfortunately, ARM patched various compilers in 2021

in a way that can turn crypto_int64_zero_mask into variable-time code.

(Assembly language is safer, of course.) One way to fight against that

compiler problem is the following slightly different version of

loadbyte():

uint8_t loadbyte(const uint8_t *in,crypto_int64 inlen,crypto_int64 inpos)
{
    uint8_t result = 0;
    while (inlen > 0) {
        result |= *in & ~((inpos >> 62) | (-inpos >> 62));
        ++in;
        --inlen;
        --inpos;
    }
    return result;
}
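As a quick sanity check of this loop outside SUPERCOP, here is a self-contained sketch; crypto_int64 is taken to be a plain int64_t stand-in rather than the real definition from crypto_int64.h. loadbyte() should return in[inpos] for in-range positions and 0 otherwise, while touching every byte of the array:

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t crypto_int64; /* stand-in for SUPERCOP's crypto_int64.h */

/* Same loop as above: scans the whole array, keeping only in[inpos].
   The mask is all-ones exactly when inpos is 0. */
uint8_t loadbyte(const uint8_t *in, crypto_int64 inlen, crypto_int64 inpos)
{
    uint8_t result = 0;
    while (inlen > 0) {
        result |= *in & ~((inpos >> 62) | (-inpos >> 62));
        ++in;
        --inlen;
        --inpos;
    }
    return result;
}
```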

I also wrote a similar function storebyte(), and then replaced the line

res[DIMENSION - hmwt + pos] = res[div];

with the following:

storebyte(res,DIMENSION,DIMENSION - hmwt + pos,loadbyte(res,DIMENSION,div));

To test this so far, I copied smaugt1/opt to smaugt1/opt2, made this

change in smaugt1/opt2, and then checked that the SUPERCOP checksums

still matched (by running ./do-part crypto_kem smaugt1 in the supercop

directory and looking at the "try" lines in bench/*/data).

I then similarly changed the "res[div] = ..." line to use storebyte(),

and removed the "garbage" lines. This still left the branch:

if (((0xffffffff - div) > deg) && (pos < hmwt)) { ... }

As a first step in eliminating this branch, I changed the variables from

unsigned (specifically uint32_t) to signed (specifically crypto_int64),

and computed a mask as -1 if the branch condition holds, else 0:

crypto_int64 mask = ((pos - hmwt) >> 62) & ((deg - (0xffffffff - div)) >> 62);

With that setup, the following lines compute DIMENSION - hmwt + pos if

the branch condition holds, else -1:

crypto_int64 storepos1 = -1;

storepos1 ^= mask & (storepos1 ^ (DIMENSION - hmwt + pos));

The point here is that storebyte() doesn't do anything if the position

is -1, so it's safe to do

storebyte(res,DIMENSION,storepos1,loadbyte(res,DIMENSION,div));

which won't change any bytes unless the branch condition holds.
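The storepos1 computation is an instance of the usual mask-select idiom, x ^= mask & (x ^ y), which yields y when the mask is all-ones and leaves x alone when the mask is 0. A minimal sketch (ct_select is an illustrative name, not from the SMAUG-T code):

```c
#include <assert.h>
#include <stdint.h>

/* Constant-time select: returns b if mask is all-ones, a if mask is 0.
   Same x ^= mask & (x ^ y) idiom as the storepos1 computation above. */
int64_t ct_select(int64_t mask, int64_t a, int64_t b)
{
    a ^= mask & (a ^ b);
    return a;
}
```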

After similarly adjusting the second storebyte(), I looked at the

division. There's a bug in the original code at this point: the line

remain = 0xfffffffff / (DIMENSION - hmwt + pos);

that I quoted above has 9 f's instead of the 8 listed in the spec. I

assume this bug produces an incorrect distribution of outputs, although

maybe not something exploitable.

I changed 0xfffffffff to 0xffffffff in opt/hwt.c and ref/hwt.c; as

expected, this changed the SUPERCOP checksums. For the constant-time

code in progress in opt2/hwt.c, since the goal of

uint32_t remain;

remain = 0xfffffffff / (DIMENSION - hmwt + pos);

div = 0xffffffff - remain * (DIMENSION - hmwt + pos);

was actually to compute 0xffffffff modulo DIMENSION-hmwt+pos, I wrote a

function two321mod() shown below to compute mod in constant time.

This left two more obvious timing variations in the hwt() code. One was

if (pos != hmwt)
    fprintf(stderr, "hwt sampling error\n");

which I simply removed: the code should be verified to work in the first

place, and shouldn't be generating debugging messages. The other was

size_t cnt_arr_idx = 0;
for (i = 0; i < DIMENSION; ++i) {
    cnt_arr_idx = ((i & 0x700) >> 8) & (-(res[i] & 0x01));
    cnt_arr[cnt_arr_idx] += (res[i] & 0x01);
}

using something based on res[i] as an array index; I also changed this

to use loadbyte() and storebyte().

At this point running TIMECOP---

env TIMECOP=1 ./do-part crypto_kem smaugt1

---showed that clang no longer produced "timecop_fail" lines for hwt,

but gcc still did. I looked at the assembly that gcc was generating for

loadbyte() and saw that, before the loop, gcc was doing, basically,

if (inlen <= 0) return 0;

counter = -inpos;

inlen -= inpos;

result = 0;

in += inpos;

indicating that, unfortunately, the gcc optimizer was figuring out that

nothing was happening until inpos moved down to 0. (This is gcc 11.4.) I

fixed that by changing "--inpos" as shown in the code below.

With these changes, I don't see further TIMECOP complaints about the

hwt() function, although there are still TIMECOP complaints about other

parts of the SMAUG-T code.

---D. J. Bernstein

#include "hwt.h"
#include "crypto_int64.h"

uint8_t loadbyte(const uint8_t *in,crypto_int64 inlen,crypto_int64 inpos)
{
    uint8_t result = 0;
    while (inlen > 0) {
        result |= *in & ~((inpos >> 62) | (-inpos >> 62));
        ++in;
        --inlen;
        inpos += (inpos >> 62) | ((-1-inpos) >> 62);
    }
    return result;
}

void storebyte(uint8_t *out,crypto_int64 outlen,
               crypto_int64 outpos,uint8_t byte)
{
    while (outlen > 0) {
        uint8_t olddata = *out;
        uint8_t change = olddata ^ byte;
        change &= ~((outpos >> 62) | (-outpos >> 62));
        *out = olddata ^ change;
        ++out;
        --outlen;
        --outpos;
    }
}

crypto_int64 two321mod(crypto_int64 m)
{
    crypto_int64 result = 0xffffffff;
    crypto_int64 sub = m;
    sub <<= 31;
    int j;
    for (j = 0;j < 32;++j) {
        crypto_int64 resultminus = result - sub;
        crypto_int64 mask = resultminus >> 62;
        result = resultminus ^ (mask & (resultminus ^ result));
        sub >>= 1;
    }
    return result;
}

/*************************************************
 * Name:        hwt
 *
 * Description: Hamming weight sampling deterministically to generate sparse
 *              polynomial r(x) from a seed. shake256 is the Extendable-Output
 *              Function from the SHA-3 family.
 *
 * Arguments:   - uint8_t *res: pointer to output polynomial r(x)
 *                (of length LWE), assumed to be already initialized
 *              - uint8_t *input: pointer to input seed (of length input_size)
 *              - size_t input_size: length of input seed
 *              - uint16_t hmwt: hamming weight to sample
 **************************************************/

void hwt(uint8_t *res, uint8_t *cnt_arr, const uint8_t *input,
         const size_t input_size, const uint16_t hmwt) {
    crypto_int64 i = 0, pos = 0;
    uint32_t buf[SHAKE256_RATE * 2] = {0};
    uint8_t xof_block = (hmwt == HS) ? HS_XOF : HR_XOF;

    shake256((uint8_t *)buf, xof_block * SHAKE256_RATE, input, input_size);

    for (i = 0; i < xof_block * 32; i++) {
        crypto_int64 deg = buf[i];
        crypto_int64 div = two321mod(DIMENSION - hmwt + pos) + 1;
        crypto_int64 mask = ((pos - hmwt) >> 62) & ((deg - (0xffffffff - div)) >> 62);
        crypto_int64 storepos1 = -1;
        crypto_int64 storepos2 = -1;
        storepos1 ^= mask & (storepos1 ^ (DIMENSION - hmwt + pos));
        storepos2 ^= mask & (storepos2 ^ div);
        storebyte(res,DIMENSION,storepos1,loadbyte(res,DIMENSION,div));
        storebyte(res,DIMENSION,storepos2,((buf[(xof_block * 32 + (i >> 4))] >> (i & 0x0f)) & 0x02) - 1);
        pos -= mask;
    }

    for (i = 0; i < DIMENSION; ++i) {
        uint8_t bit = res[i] & 0x01;
        size_t cnt_arr_idx = ((i & 0x700) >> 8) & -bit;
        uint8_t old = loadbyte(cnt_arr,DIMENSION,cnt_arr_idx);
        storebyte(cnt_arr,DIMENSION,cnt_arr_idx,old + bit);
    }
}
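As a standalone check of the helper in the listing above, here is a sketch with crypto_int64 taken to be a plain int64_t stand-in; two321mod(m) should agree with 0xffffffff % m for any positive modulus in the ranges used here:

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t crypto_int64; /* stand-in for SUPERCOP's crypto_int64.h */

/* Same bit-serial constant-time reduction as two321mod() above:
   32 conditional subtractions of m<<31, m<<30, ..., m<<0. */
crypto_int64 two321mod(crypto_int64 m)
{
    crypto_int64 result = 0xffffffff;
    crypto_int64 sub = m;
    sub <<= 31;
    int j;
    for (j = 0; j < 32; ++j) {
        crypto_int64 resultminus = result - sub;
        crypto_int64 mask = resultminus >> 62;
        result = resultminus ^ (mask & (resultminus ^ result));
        sub >>= 1;
    }
    return result;
}
```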

Apr 20, 2024, 2:04:25 PM

to kpqc-b...@googlegroups.com

Here's an example of simpler code changes to eliminate TIMECOP warnings

for rejection sampling.

In HAETAE, a buffer generated from a random (presumably secret) seed is

passed to a rej_uniform() function with a loop around the following:

t = buf[pos++];
t |= (uint32_t)buf[pos++] << 8;
if (t < Q)
    a[ctr++] = t;

Obviously t < Q is branching based on the input data, and the array

index ctr also ends up dependent on the input data. This triggers

warnings from TIMECOP.

In rejection sampling, typically it's safe for the rejection condition

to be public. What I'll do in this message is illustrate how to use

crypto_declassify() to tell TIMECOP that the rejection condition is

public. Of course, a reviewer then has to check whether this public

information is leaking anything about the secrets that are being used.

(It isn't absolutely necessary to make rejection conditions public. An

alternative strategy is to convincingly analyze the amount of data

needed so that rejection sampling will fail with probability below, say,

1/2^256. Then the code can change to always generate that amount of

data, and to perform the necessary computations on that data in constant

time. But in this message I'll focus on using crypto_declassify().)

I added

#include "crypto_uint32.h"

#include "crypto_declassify.h"

at the top of the file and changed

if (t < Q)
    a[ctr++] = t;

to the following:

crypto_uint32 tlessQ = crypto_uint32_smaller_mask(t,Q);
crypto_declassify(&tlessQ,sizeof tlessQ);
if (tlessQ)
    a[ctr++] = t;

Here crypto_uint32_smaller_mask(t,Q) computes -1 if t < Q, otherwise 0;

and crypto_declassify(&tlessQ,sizeof tlessQ) informs TIMECOP to stop

tracking data flow from the tlessQ variable. This stops TIMECOP from

complaining about the branch based on tlessQ, the effect of tlessQ on

ctr, etc.; meanwhile TIMECOP continues to track data flow from, e.g.,

the t variable. The reviewer has to look for crypto_declassify() lines

and analyze whether it's safe for those variables to be public.
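For readers who want to try this pattern outside SUPERCOP, here is a self-contained sketch. crypto_declassify is modeled as a no-op (its runtime behavior when TIMECOP is off), crypto_uint32_smaller_mask is re-implemented with a standard branch-free comparison, and filter_below_q is an illustrative stand-in for the rej_uniform() loop, not the HAETAE function itself:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t crypto_uint32;

/* Outside TIMECOP, crypto_declassify has no runtime effect; model it as a no-op. */
#define crypto_declassify(ptr, len) ((void)(ptr), (void)(len))

/* -1 if x < y, else 0, via a standard branch-free 64-bit comparison
   (SUPERCOP's real version lives in crypto_uint32.h) */
crypto_uint32 crypto_uint32_smaller_mask(crypto_uint32 x, crypto_uint32 y)
{
    return (crypto_uint32)(((uint64_t)x - (uint64_t)y) >> 32);
}

/* rej_uniform-style filtering with a declassified rejection condition */
size_t filter_below_q(uint32_t *a, const uint32_t *buf, size_t n, uint32_t Q)
{
    size_t ctr = 0;
    for (size_t pos = 0; pos < n; ++pos) {
        uint32_t t = buf[pos];
        crypto_uint32 tlessQ = crypto_uint32_smaller_mask(t, Q);
        crypto_declassify(&tlessQ, sizeof tlessQ);
        if (tlessQ)
            a[ctr++] = t;
    }
    return ctr;
}
```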

This change eliminated all TIMECOP warnings for rej_uniform(), while

SUPERCOP checksums remained the same. I then did the same

* in rej_eta(), with new variables tless243, t0less15, t1less15;

* in crypto_sign_keypair(), with a new rejectmask variable (used at

two positions);

* in crypto_sign_signature(), with a new rejectmask variable;

* in polyfixveclk_sample_hyperball(), with a new rejectmask variable.

I also declassified some existing variables:

* the "accepted" variable in sample_gauss();

* the "hb_z1" and "h" variables in crypto_sign_signature()---these

aren't for rejection sampling but seem to be public variables that

are then encoded in a variable-time way.

I fixed some TIMECOP warnings about sample_gauss_sigma76() in a

different way. The arguments to that function include "rand" pointing to

bytes, and then there's code

const uint64_t *rand_gauss16_ptr = (uint64_t *)rand,
               *rand_rej_ptr = (uint64_t *)(&rand[2]);
const uint64_t rand_gauss16 = (*rand_gauss16_ptr) & ((1ULL << 16) - 1);
const uint64_t rand_rej = (*rand_rej_ptr) & ((1ULL << 48) - 1);

that isn't a safe way to read 16-bit and 48-bit pieces from bytes: some

CPUs don't allow unaligned reads, and big-endian CPUs will give the

wrong results. I changed this to portable code:

const uint64_t rand_gauss16 = rand[0] | ((uint64_t) rand[1] << 8);

const uint64_t rand_rej = rand[2] | ((uint64_t) rand[3] << 8) | ((uint64_t) rand[4] << 16) | ((uint64_t) rand[5] << 24) | ((uint64_t) rand[6] << 32) | ((uint64_t) rand[7] << 40);
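The portable loads are easy to check in isolation. In this sketch, load16 and load48 are illustrative names for the two expressions above; on any platform they assemble the bytes little-endian, independent of alignment and host byte order:

```c
#include <assert.h>
#include <stdint.h>

/* little-endian assembly of a 16-bit value from bytes 0..1 */
uint64_t load16(const uint8_t *rand)
{
    return rand[0] | ((uint64_t)rand[1] << 8);
}

/* little-endian assembly of a 48-bit value from bytes 2..7 */
uint64_t load48(const uint8_t *rand)
{
    return rand[2] | ((uint64_t)rand[3] << 8) | ((uint64_t)rand[4] << 16)
         | ((uint64_t)rand[5] << 24) | ((uint64_t)rand[6] << 32)
         | ((uint64_t)rand[7] << 40);
}
```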

This left just one function that TIMECOP was complaining about, namely

poly_challenge() using a "b" index derived from secret data. I decided

to just declassify the b variable since I think it's public. Other

approaches would be to eliminate the index as in my previous message, or

to generate constant-weight vectors using different algorithms cited in

my previous message.

These changes gave the first timecop_pass results that I've seen so far

in KpqC. However, I encourage the submitters to check whether all of

these variables are in fact safe to declassify.

---D. J. Bernstein

Apr 22, 2024, 8:57:32 AM

to kpqc-b...@googlegroups.com

I looked at eliminating TIMECOP warnings for ntruplus. This turned out

to be the easiest case so far: there's just one warning, and it's for a

simple rejection-sampling loop in key generation. So I added

#include "crypto_declassify.h"

and added

crypto_declassify(&r,sizeof r);

right before the "while(r)" line; TIMECOP then reported timecop_pass. I

tried all of ntruplus576/{avx2,opt,ref}.

---D. J. Bernstein

Apr 24, 2024, 1:51:21 PM

to kpqc-b...@googlegroups.com

I also looked at MQ-Sign. This was another case where TIMECOP warnings

were easy to eliminate, at least for mqsignlr2567246/ref, as in the

patch below:

* There are some rejection-sampling loops in mqs.c. I added

crypto_declassify().

* There's a real timing leak in rng.c. I rewrote those lines to use

crypto_uint8_zero_mask(). (I mentioned in an earlier message that

there are some compiler "optimizations" that can introduce

variable-time code into the mask functions; the next version of

SUPERCOP will tweak how the mask functions work internally to stop

those optimizations.)

This rng.c comes from code that NIST encouraged people to use for seed

expansion but that was never designed to run in constant time. I think

that, in context, the timing leak is limited to a small number of bytes

from a fresh buffer for each message, but the usage of AES inside the

NIST code will probably leak further information in some platforms, as

illustrated by https://eprint.iacr.org/2019/996. My recommendation is to

eliminate that seed expander and call SHAKE256 with the seed as input to

generate the desired number of bytes of output.

---D. J. Bernstein

diff -ruw ref/mqs.c ref2/mqs.c

--- ref/mqs.c 2024-04-13 17:30:47.363673556 -0500

+++ ref2/mqs.c 2024-04-24 12:01:24.938727667 -0500

@@ -11,6 +11,8 @@

#include "utils_hash.h"

#include "utils_malloc.h"

+#include "crypto_declassify.h"

+

#define MAX_ATTEMPT_FRMAT 128

#define _MAX_O _O

#define _MAX_O_BYTE _O

@@ -97,6 +99,7 @@

rr = gf256mat_solve_linear_eq(mat_tmp, _O_BYTE, B, C, D, A_inv);

+ crypto_declassify(&rr,sizeof rr);

if(!rr)

goto rej;

@@ -115,6 +118,7 @@

gf256mat_prod(temp_vec2, A_inv, H_half, H_half, temp_vec);

rr &= gf256mat_gaussian_elim(temp_mat, temp_vec + H_half, H_half);

+ crypto_declassify(&rr,sizeof rr);

if(!rr) goto rej;

gf256mat_back_substitute(temp_vec + H_half, temp_mat, H_half);

diff -ruw ref/rng.c ref2/rng.c

--- ref/rng.c 2024-04-13 17:30:47.359673467 -0500

+++ ref2/rng.c 2024-04-24 11:38:02.521684078 -0500

@@ -181,6 +181,8 @@

return RNG_SUCCESS;

}

+#include "crypto_uint8.h"

+

void

AES256_CTR_DRBG_Update(unsigned char *provided_data,

unsigned char *Key,

@@ -190,16 +192,13 @@

for(int i = 0 ; i < 3 ; i++)

{

+ crypto_uint8 carry = -1;

+

//increment V

for(int j = 15 ; j >= 0 ; j--)

{

- if(V[j] == 0xff)

- V[j] = 0x00;

- else

- {

- V[j]++;

- break;

- }

+ V[j] -= carry;

+ carry &= crypto_uint8_zero_mask(V[j]);

}

AES256_ECB(Key, V, temp + 16 * i);

@@ -242,14 +241,12 @@

int i = 0;

while ( xlen > 0 ) {

- //increment V

- for (int j=15; j>=0; j--) {

- if ( states->V[j] == 0xff )

- states->V[j] = 0x00;

- else {

- states->V[j]++;

- break;

- }

+ crypto_uint8 carry = -1;

+

+ for(int j = 15 ; j >= 0 ; j--)

+ {

+ states->V[j] -= carry;

+ carry &= crypto_uint8_zero_mask(states->V[j]);

}

AES256_ECB(states->Key, states->V, block);

if ( xlen > 15 ) {
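The patched counter increment can be exercised on its own. In this sketch, crypto_uint8 and crypto_uint8_zero_mask are stand-ins for SUPERCOP's crypto_uint8.h (zero_mask returns 0xff exactly when its argument is 0), and increment_and_get is just a test helper of mine:

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t crypto_uint8;

/* 0xff if x is 0, else 0; branch-free stand-in for crypto_uint8.h */
crypto_uint8 crypto_uint8_zero_mask(crypto_uint8 x)
{
    return (crypto_uint8)(((uint32_t)x - 1) >> 8);
}

/* constant-time big-endian increment of a 16-byte counter,
   as in the patched rng.c: subtracting a 0xff carry adds 1 mod 256,
   and the carry survives only while bytes wrap to 0 */
void increment_V(uint8_t V[16])
{
    crypto_uint8 carry = -1;
    for (int j = 15; j >= 0; j--) {
        V[j] -= carry;
        carry &= crypto_uint8_zero_mask(V[j]);
    }
}

/* test helper: increment, then return byte j */
uint8_t increment_and_get(uint8_t V[16], int j)
{
    increment_V(V);
    return V[j];
}
```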

Apr 26, 2024, 4:49:20 PM

to kpqc-b...@googlegroups.com

TIMECOP identifies quite a few issues with the Paloma reference code. I

would expect these issues to allow timing attacks. This message looks at

one of the issues, namely how to multiply in F_{2^{13}}.

The code represents F_{2^{13}} as F_2[x]/(x^13 + x^7 + x^6 + x^5 + 1),

and stores 13-bit polynomials as uint16_t. One of TIMECOP's complaints

is about the variable array indexing in the following function:

gf gf2m_mul_w_tab(IN gf in1, IN gf in2, IN const gf2m_tab* gf2m_tables)
{
    in1 &= MASKBITS;
    in2 &= MASKBITS;

    gf result = 0;
    gf int1high = (in1 >> GF2M_SPLIT_LOW) & GF2M_SPLIT_MASK_HIGH_BIT;
    gf int1low = (in1)&GF2M_SPLIT_MASK_LOW_BIT;
    gf int2high = (in2 >> GF2M_SPLIT_LOW) & GF2M_SPLIT_MASK_HIGH_BIT;
    gf int2low = (in2)&GF2M_SPLIT_MASK_LOW_BIT;

    result ^= gf2m_tables->mul_11_tab[int1high][int2high];
    result ^= gf2m_tables->mul_10_tab[int1high][int2low];
    result ^= gf2m_tables->mul_10_tab[int2high][int1low];
    result ^= gf2m_tables->mul_00_tab[int1low][int2low];

    return result;
}

This is splitting each 13-bit input into 6 high bits and 7 low bits, and

separately multiplying high*high, high*low, low*high, low*low by looking

up the products in tables having 2^(6+6), 2^(6+7), and 2^(7+7) entries.

The tables are precomputed.

There's a gf2m_mul() function that was used to compute the multiplication

tables in the first place. Eliminating the tables is simply a matter of

calling that function:

gf gf2m_mul_w_tab(IN gf in1, IN gf in2, IN const gf2m_tab* gf2m_tables)
{
    return gf2m_mul(in1,in2);
}

There are also tables for some other operations such as squarings; I'll

focus here on multiplications.

Calling gf2m_mul works correctly, but then TIMECOP complains about

variable branches inside that function:

gf gf2m_mul(IN gf in1, gf in2)
{
    in1 &= MASKBITS;
    in2 &= MASKBITS;

    gf result = 0;
    gf t1 = in1;
    gf t2 = in2;

    for (; t2; t2 >>= 1)
    {
        result ^= (t1 * (t2 & 1));
        if (t1 & 0x1000)
            t1 = ((t1 << 1)) ^ IRR_POLY;
        else
            t1 <<= 1;
    }
    return result & MASKBITS;
}

I changed this to always do 13 iterations of the outer loop, and to use

bit operations for the inner branch:

gf gf2m_mul(IN gf in1, gf in2)
{
    in1 &= MASKBITS;
    in2 &= MASKBITS;

    gf result = 0;
    gf t1 = in1;
    gf t2 = in2;
    int i;

    for (i = 0;i < 13;++i)
    {
        result ^= t1 * (t2 & 1);
        t1 <<= 1;
        t1 ^= IRR_POLY * (t1 >> 13);
        t2 >>= 1;
    }
    return result & MASKBITS;
}

This eliminated the TIMECOP warnings for gf2m_mul.
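The fixed-iteration loop can be checked against basic field identities in isolation. This sketch assumes MASKBITS = 0x1fff and IRR_POLY = 0x20e1 (the latter encoding x^13 + x^7 + x^6 + x^5 + 1 with its degree-13 bit included, matching the t1 >> 13 reduction; the actual Paloma macro definitions may differ) and takes gf to be uint16_t:

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t gf;
#define MASKBITS 0x1fff /* 13-bit field elements (assumed value) */
#define IRR_POLY 0x20e1 /* x^13 + x^7 + x^6 + x^5 + 1 (assumed value) */

/* same fixed-iteration, branch-free multiplication loop as above */
gf gf2m_mul(gf in1, gf in2)
{
    in1 &= MASKBITS;
    in2 &= MASKBITS;

    gf result = 0;
    gf t1 = in1;
    gf t2 = in2;
    int i;

    for (i = 0; i < 13; ++i)
    {
        result ^= t1 * (t2 & 1);   /* add t1 if current bit of t2 is set */
        t1 <<= 1;
        t1 ^= IRR_POLY * (t1 >> 13); /* reduce t1 back below x^13 */
        t2 >>= 1;
    }
    return result & MASKBITS;
}
```

Multiplying by 1 is the identity, x * x^12 reduces to x^7 + x^6 + x^5 + 1 = 0xe1, and the operation is commutative and distributes over XOR.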

There are more than 100 arithmetic operations here. Here's a rewrite

using 63 operations:

gf gf2m_mul(IN gf x, gf y)
{
    uint64_t xy, xy0, xy1, z0, z1, q0, q1;
    uint32_t x0, x1, y0, y1, top0, top1;

    xy = x | (((uint64_t) y) << 32); // will handle x and y in parallel
    // bits of x from bottom: 0123456789abcdef................
    xy = ((xy & 0x0000ff000000ff00) << 8) | (xy & 0x000000ff000000ff);
    // 01234567........89abcdef........
    xy = ((xy & 0x00f000f000f000f0) << 4) | (xy & 0x000f000f000f000f);
    // 0123....4567....89ab....cdef....
    xy = ((xy & 0x0c0c0c0c0c0c0c0c) << 2) | (xy & 0x0303030303030303);
    // 01..23..45..67..89..ab..cd..ef..
    xy0 = xy & 0x1111111111111111; // 0...2...4...6...8...a...c...e...
    xy1 = xy & 0x2222222222222222; // .1...3...5...7...9...b...d...f..
    x0 = xy0 & 0xffffffff; y0 = xy0 >> 32;
    x1 = xy1 & 0xffffffff; y1 = xy1 >> 32;

    // actually had only 13 bits in x and y, so 25 bits in z0 and z1 below
    z0 = 0x1111111111111111 & (x0 * (uint64_t) y0 + ((x1 * (uint64_t) y1) << 2));
    // 0...2...4...6...8...a...c...e...g...i...k...m...o
    z1 = 0x2222222222222222 & (x0 * (uint64_t) y1 + x1 * (uint64_t) y0);
    // .1...3...5...7...9...b...d...f...h...j...l...n

    top0 = z1 >> 25; // d...f...h...j...l...n
    top1 = z0 >> 27; // .e...g...i...k...m...o
    q0 = 0x1111111111111111 & (top0 * (uint64_t) 0x1001 + top1 * (uint64_t) 0x8008800);
    q1 = 0x2222222222222222 & (top0 * (uint64_t) 0x2002200 + top1 * (uint64_t) 0x1001);
    z0 ^= q0; z1 ^= q1;
    top0 = z1 >> 25; // d...f...h...j...l...n
    top1 = z0 >> 27; // .e...g...i...k...m...o
    q0 = 0x1111111111111111 & (top0 * (uint64_t) 0x1001 + top1 * (uint64_t) 0x8008800);
    q1 = 0x2222222222222222 & (top0 * (uint64_t) 0x2002200 + top1 * (uint64_t) 0x1001);
    z0 ^= q0; z1 ^= q1;

    z0 |= z1;                        // 01..23..45..67..89..ab..c
    z0 = 0xf0f0f0f & (z0 | (z0>>2)); // 0123....4567....89ab....c
    z0 = 0xff00ff & (z0 | (z0>>4));  // 01234567........89abc
    return 0xffff & (z0 | (z0>>8));
}

This is still many more operations than the original code, and overall I

see considerable slowdowns for Paloma on, e.g., Zen 2, such as dec

slowing down from 11 million cycles to 39 million cycles. However:

* About half of the operations above are converting back and forth

between the packed gf format and an unpacked format that's better

for this style of arithmetic. If more of the code were written in

the unpacked format then most of the conversions would disappear.

Probably the reduction steps (q0, q1) can also be sped up.

* Further speedups are possible from "bitslicing" arithmetic

operations across CPU words, for example carrying out 64 one-bit

operations in parallel with a single 64-bit operation. I would

expect https://eprint.iacr.org/2017/793 to be the fastest approach

to the Paloma computations.

* Many CPUs have instructions for binary-polynomial multiplication,

and then a multiplication in F_{2^{13}} is just a few instructions.

Typically these can again be carried out on more than one input at

once.

All of these approaches avoid table lookups, and the bitslicing approach

avoids problems on CPUs with variable-time multipliers.

---D. J. Bernstein

would expect these issues to allow timing attacks. This message looks at

one of the issues, namely how to multiply in F_{2^{13}}.

The code represents F_{2^{13}} as F_2[x]/(x^13 + x^7 + x^6 + x^5 + 1),

and stores 13-bit polynomials as uint16_t. One of TIMECOP's complaints

is about the variable array indexing in the following function:

gf gf2m_mul_w_tab(IN gf in1, IN gf in2, IN const gf2m_tab* gf2m_tables)

{

in1 &= MASKBITS;

in2 &= MASKBITS;

gf result = 0;

gf int1high = (in1 >> GF2M_SPLIT_LOW) & GF2M_SPLIT_MASK_HIGH_BIT;

gf int1low = (in1)&GF2M_SPLIT_MASK_LOW_BIT;

gf int2high = (in2 >> GF2M_SPLIT_LOW) & GF2M_SPLIT_MASK_HIGH_BIT;

gf int2low = (in2)&GF2M_SPLIT_MASK_LOW_BIT;

result ^= gf2m_tables->mul_11_tab[int1high][int2high];

result ^= gf2m_tables->mul_10_tab[int1high][int2low];

result ^= gf2m_tables->mul_10_tab[int2high][int1low];

result ^= gf2m_tables->mul_00_tab[int1low][int2low];

return result;

}

This is splitting each 13-bit input into 6 high bits and 7 low bits, and

separately multiplying high*high, high*low, low*high, low*low by looking

up the products in tables having 2^(6+6), 2^(6+7), and 2^(7+7) entries.

The tables are precomputed.

There's a gf2m_mul() function that was used to compute the multiplication

tables in the first place. Eliminating the tables is simply a matter of

calling that function:

gf gf2m_mul_w_tab(IN gf in1, IN gf in2, IN const gf2m_tab* gf2m_tables)

{

return gf2m_mul(in1,in2);

}

There are also tables for some other operations such as squarings; I'll

focus here on multiplications.

Calling gf2m_mul works correctly, but then TIMECOP complains about

variable branches inside that function:

gf gf2m_mul(IN gf in1, gf in2)

{

in1 &= MASKBITS;

in2 &= MASKBITS;

gf result = 0;

gf t1 = in1;

gf t2 = in2;

for (; t2; t2 >>= 1)

{

result ^= (t1 * (t2 & 1));

if (t1 & 0x1000)

t1 = ((t1 << 1)) ^ IRR_POLY;

else

t1 <<= 1;

}

return result & MASKBITS;

}

I changed this to always do 13 iterations of the outer loop, and to use

bit operations for the inner branch:

gf gf2m_mul(IN gf in1, gf in2)

{

in1 &= MASKBITS;

in2 &= MASKBITS;

gf result = 0;

gf t1 = in1;

gf t2 = in2;

int i;

for (i = 0;i < 13;++i)

{

result ^= t1 * (t2 & 1);

t1 <<= 1;

t1 ^= IRR_POLY * (t1 >> 13);

t2 >>= 1;

}

return result & MASKBITS;

}

This eliminated the TIMECOP warnings for gf2m_mul.

There are more than 100 arithmetic operations here. Here's a rewrite

using 63 operations:

gf gf2m_mul(IN gf x, gf y)

{

uint64_t xy, xy0, xy1, z0, z1, q0, q1;

uint32_t x0, x1, y0, y1, top0, top1, bot0, bot1;

xy = x | (((uint64_t) y) << 32); // will handle x and y in parallel

// bits of x from bottom: 0123456789abcdef................

xy = ((xy & 0x0000ff000000ff00) << 8) | (xy & 0x000000ff000000ff);

// 01234567........89abcdef........

xy = ((xy & 0x00f000f000f000f0) << 4) | (xy & 0x000f000f000f000f);

// 0123....4567....89ab....cdef....

xy = ((xy & 0x0c0c0c0c0c0c0c0c) << 2) | (xy & 0x0303030303030303);

// 01..23..45..67..89..ab..cd..ef..

xy0 = xy & 0x1111111111111111; // 0...2...4...6...8...a...c...e...

xy1 = xy & 0x2222222222222222; // .1...3...5...7...9...b...d...f..

x0 = xy0 & 0xffffffff; y0 = xy0 >> 32;

x1 = xy1 & 0xffffffff; y1 = xy1 >> 32;

// actually had only 13 bits in x and y, so 25 bits in z0 and z1 below

z0 = 0x1111111111111111 & (x0 * (uint64_t) y0 + ((x1 * (uint64_t) y1) << 2));

// 0...2...4...6...8...a...c...e...g...i...k...m...o

z1 = 0x2222222222222222 & (x0 * (uint64_t) y1 + x1 * (uint64_t) y0);

// .1...3...5...7...9...b...d...f...h...j...l...n

top0 = z1 >> 25; // d...f...h...j...l...n

top1 = z0 >> 27; // .e...g...i...k...m...o

q0 = 0x1111111111111111 & (top0 * (uint64_t) 0x1001 + top1 * (uint64_t) 0x8008800);

q1 = 0x2222222222222222 & (top0 * (uint64_t) 0x2002200 + top1 * (uint64_t) 0x1001);

z0 ^= q0; z1 ^= q1;

top0 = z1 >> 25; // d...f...h...j...l...n

top1 = z0 >> 27; // .e...g...i...k...m...o

q0 = 0x1111111111111111 & (top0 * (uint64_t) 0x1001 + top1 * (uint64_t) 0x8008800);

q1 = 0x2222222222222222 & (top0 * (uint64_t) 0x2002200 + top1 * (uint64_t) 0x1001);

z0 ^= q0; z1 ^= q1;

z0 |= z1; // 01..23..45..67..89..ab..c

z0 = 0xf0f0f0f & (z0 | (z0>>2)); // 0123....4567....89ab....c

z0 = 0xff00ff & (z0 | (z0>>4)); // 01234567........89abc

return 0xffff & (z0 | (z0>>8));

}

This is still many more operations than the original code, and overall I
see considerable slowdowns for Paloma on, e.g., Zen 2, such as dec
slowing down from 11 million cycles to 39 million cycles. However:

* About half of the operations above are converting back and forth

between the packed gf format and an unpacked format that's better

for this style of arithmetic. If more of the code were written in

the unpacked format then most of the conversions would disappear.

Probably the reduction steps (q0, q1) can also be sped up.

   * Further speedups are possible from "bitslicing" arithmetic
     operations across CPU words, for example carrying out 64 one-bit
     operations in parallel with a single 64-bit operation. I would
     expect https://eprint.iacr.org/2017/793 to be the fastest approach
     to the Paloma computations.

   * Many CPUs have instructions for binary-polynomial multiplication,
     and then a multiplication in F_{2^{13}} is just a few instructions.
     Typically these can again be carried out on more than one input at
     once.

All of these approaches avoid table lookups, and the bitslicing approach
avoids problems on CPUs with variable-time multipliers.

---D. J. Bernstein

Apr 28, 2024, 12:01:56 PM

to kpqc-b...@googlegroups.com

The following bug in the NCC-Sign code makes the output array much less
random than it should be, although this code applies only to parameter
sets where ETA is 1:

   if (t0 < 3) {
      a[ctr++] = 1 - t0;
   }
   if (t1 < 3 && ctr < len) {
      a[ctr++] = 1 - t0;
   }
   if (t2 < 3 && ctr < len) {
      a[ctr++] = 1 - t0;
   }
   if (t3 < 3 && ctr < len) {
      a[ctr++] = 1 - t0;
   }

I noticed this while looking at functions that TIMECOP was complaining
about. As I commented in an earlier message, there are also differences
in SUPERCOP checksums between the original ref and opt implementations.

Some of the TIMECOP complaints came from rejection sampling, which I
fixed with

   t0less3 = crypto_uint32_smaller_mask(t0,3);
   crypto_declassify(&t0less3,sizeof t0less3);
   if (t0less3) {
      a[ctr++] = 1 - t0;
   }

etc. in this function, and similar changes in poly_chknorm() (where I
think the early return is, in context, only for rejection sampling),
poly_uniform(), poly_challenge(), and crypto_sign_signature().

Further complaints came from lines

   c->coeffs[i] = c->coeffs[b];
   c->coeffs[b] = 1 - 2 * (signs & 1);

appearing (twice) in poly_challenge(). I addressed those the same way as
in my earlier message regarding SMAUG-T, again with a big slowdown.

Changing the spec to use one of the approaches from

   https://cr.yp.to/papers.html#divergence
   https://eprint.iacr.org/2024/548

would improve efficiency.

There are more problems. For example, TIMECOP complains about the
branches in decompose(); looking at that function, I also see "% Q",
which TIMECOP won't detect and which (here and in other functions) needs
to be replaced with constant-time divisions.

---D. J. Bernstein

