Here's another example of converting variable array indices into
constant array indices. This example also shows how to eliminate a
variable branch, shows how to eliminate a variable division, and points
out an implementation bug beyond timing issues.
For this example, unlike the example in my previous message, I would
guess that the original code is exploitable by timing attacks similar to
the attacks from
https://eprint.iacr.org/2021/1485.
The topic here is SMAUG-T, specifically the "HWT" subroutine, which
converts a secret seed into a secret ternary vector of weight h and
length n. Weight h means that there are exactly h nonzero coefficients.
Ternary means that the nonzero coefficients are +-1.
It's easy to choose secret signs, so the main problem is to compute h
secret positions for the n nonzero coefficients. Some algorithms that
naturally handle this without variable array indices are explained in
https://cr.yp.to/papers.html#divergence
https://eprint.iacr.org/2024/548
but SMAUG-T specifies a different algorithm based on Fisher--Yates
shuffling. My understanding is that the specified algorithm has to be
used for interoperability between enc and dec.
The timecop-smaugt1 file shows "timecop_fail" lines saying, e.g.,
Conditional jump or move depends on uninitialised value(s)
at 0x115091: cryptolab_smaug1_hwt (in /home/kpqc2/kpqc-supercop-build/supercop-20240107/bench/samba/work/compile/try-timecop)
where cryptolab_smaug1_hwt implements HWT. TIMECOP doesn't look for
divisions, and doesn't notice that in this subroutine the line
remain = 0xfffffffff / (DIMENSION - hmwt + pos);
has a secret-dependent denominator and will take variable time on many
CPUs; I'll deal with that below. TIMECOP _does_ catch data flow from
secrets to branches and indices, in particular in the following code:
if (((0xffffffff - div) > deg) && (pos < hmwt)) {
res[DIMENSION - hmwt + pos] = res[div];
res[div] =
((buf[(xof_block * 32 + (i >> 4))] >> (i & 0x0f)) & 0x02) - 1;
pos++;
} else {
garbage = res[div];
garbage =
((buf[(xof_block * 32 + (i >> 4))] >> (i & 0x0f)) & 0x02) - 1;
}
Presumably the "garbage" code here is intended to balance timings
between the two branches, but there are three reasons this isn't safe:
(1) the compiler will probably remove that code; (2) even if the code
isn't removed, microarchitectural effects mean that the branch timing
almost certainly won't be balanced; (3) even if timings are balanced,
the code path will be exposed through, e.g., branch-prediction attacks.
Also, the variable indices into res[] will influence cache timing.
So let's remove the variable indices and the variable branch. First,
here's a generic function to read in[inpos] from in[0...inlen-1],
similar to the lookup() function in my previous message but slightly
simpler since the array entries are just bytes:
#include "crypto_int64.h"
uint8_t loadbyte(const uint8_t *in,crypto_int64 inlen,crypto_int64 inpos)
{
uint8_t result = 0;
while (inlen > 0) {
result |= *in & crypto_int64_zero_mask(inpos);
++in;
--inlen;
--inpos;
}
return result;
}
I've noticed that, unfortunately, ARM patched various compilers in 2021
in a way that can turn crypto_int64_zero_mask into variable-time code.
(Assembly language is safer, of course.) One way to fight against that
compiler problem is the following slightly different version of
loadbyte():
uint8_t loadbyte(const uint8_t *in,crypto_int64 inlen,crypto_int64 inpos)
{
uint8_t result = 0;
while (inlen > 0) {
result |= *in & ~((inpos >> 62) | (-inpos >> 62));
++in;
--inlen;
--inpos;
}
return result;
}
I also wrote a similar function storebyte(), and then replaced the line
res[DIMENSION - hmwt + pos] = res[div];
with the following:
storebyte(res,DIMENSION,DIMENSION - hmwt + pos,loadbyte(res,DIMENSION,div));
To test this so far, I copied smaugt1/opt to smaugt1/opt2, made this
change in smaugt1/opt2, and then checked that the SUPERCOP checksums
still matched (by running ./do-part crypto_kem smaugt1 in the supercop
directory and looking at the "try" lines in bench/*/data).
I then similarly changed the "res[div] = ..." line to use storebyte(),
and removed the "garbage" lines. This still left the branch:
if (((0xffffffff - div) > deg) && (pos < hmwt)) { ... }
As a first step in eliminating this branch, I changed the variables from
unsigned (specifically uint32_t) to signed (specifically crypto_int64),
and computed a mask as -1 if the branch condition holds, else 0:
crypto_int64 mask = ((pos - hmwt) >> 62) & ((deg - (0xffffffff - div)) >> 62);
With that setup, the following lines compute DIMENSION - hmwt + pos if
the branch condition holds, else -1:
crypto_int64 storepos1 = -1;
storepos1 ^= mask & (storepos1 ^ (DIMENSION - hmwt + pos));
The point here is that storebyte() doesn't do anything if the position
is -1, so it's safe to do
storebyte(res,DIMENSION,storepos1,loadbyte(res,DIMENSION,div));
which won't change any bytes unless the branch condition holds.
After similarly adjusting the second storebyte(), I looked at the
division. There's a bug in the original code at this point: the line
remain = 0xfffffffff / (DIMENSION - hmwt + pos);
that I quoted above has 9 f's instead of the 8 listed in the spec. I
assume this bug produces an incorrect distribution of outputs, although
maybe not something exploitable.
I changed 0xfffffffff to 0xffffffff in opt/hwt.c and ref/hwt.c; as
expected, this changed the SUPERCOP checksums. For the constant-time
code in progress in opt2/hwt.c, since the goal of
uint32_t remain;
remain = 0xfffffffff / (DIMENSION - hmwt + pos);
div = 0xffffffff - remain * (DIMENSION - hmwt + pos);
was actually to compute 0xffffffff modulo DIMENSION-hwmt+pos, I wrote a
function two321mod() shown below to compute mod in constant time.
This left two more obvious timing variations in the hwt() code. One was
if (pos != hmwt)
fprintf(stderr, "hwt sampling error\n");
which I simply removed: the code should be verified to work in the first
place, and shouldn't be generating debugging messages. The other was
size_t cnt_arr_idx = 0;
for (i = 0; i < DIMENSION; ++i) {
cnt_arr_idx = ((i & 0x700) >> 8) & (-(res[i] & 0x01));
cnt_arr[cnt_arr_idx] += (res[i] & 0x01);
}
using something based on res[i] as an array index; I also changed this
to use loadbyte() and storebyte().
At this point running TIMECOP---
env TIMECOP=1 ./do-part crypto_kem smaugt1
---showed that clang no longer produced "timecop_fail" lines for hwt,
but gcc still did. I looked at the assembly that gcc was generating for
loadbyte() and saw that, before the loop, gcc was doing, basically,
if (inlen <= 0) return 0;
counter = -inpos;
inlen -= inpos;
result = 0;
in += inpos;
indicating that, unfortunately, the gcc optimizer was figuring out that
nothing was happening until inpos moved down to 0. (This is gcc 11.4.) I
fixed that by changing "--inpos" as shown in the code below.
With these changes, I don't see further TIMECOP complaints about the
hwt() function, although there are still TIMECOP complaints about other
parts of the SMAUG-T code.
---D. J. Bernstein
#include "hwt.h"
#include "crypto_int64.h"
uint8_t loadbyte(const uint8_t *in,crypto_int64 inlen,crypto_int64 inpos)
{
uint8_t result = 0;
while (inlen > 0) {
result |= *in & ~((inpos >> 62) | (-inpos >> 62));
++in;
--inlen;
inpos += (inpos >> 62) | ((-1-inpos) >> 62);
}
return result;
}
void storebyte(uint8_t *out,crypto_int64 outlen,
crypto_int64 outpos,uint8_t byte)
{
while (outlen > 0) {
uint8_t olddata = *out;
uint8_t change = olddata ^ byte;
change &= ~((outpos >> 62) | (-outpos >> 62));
*out = olddata ^ change;
++out;
--outlen;
--outpos;
}
}
crypto_int64 two321mod(crypto_int64 m)
{
crypto_int64 result = 0xffffffff;
crypto_int64 sub = m;
sub <<= 31;
int j;
for (j = 0;j < 32;++j) {
crypto_int64 resultminus = result - sub;
crypto_int64 mask = resultminus >> 62;
result = resultminus ^ (mask & (resultminus ^ result));
sub >>= 1;
}
return result;
}
/*************************************************
* Name: hwt
*
* Description: Hamming weight sampling deterministically to generate sparse
* polynomial r(x) from a seed. shake256 is the Extendable-Output
* Function from the SHA-3 family.
*
* Arguments: - uint8_t *res: pointer to ouptput polynomial r(x)
* (of length LWE), assumed to be already initialized
* - uint8_t *input: pointer to input seed (of length input_size)
* - size_t input_size: length of input seed
* - uint16_t hmwt: hamming weight to sample
**************************************************/
void hwt(uint8_t *res, uint8_t *cnt_arr, const uint8_t *input,
const size_t input_size, const uint16_t hmwt) {
crypto_int64 i = 0, pos = 0;
uint32_t buf[SHAKE256_RATE * 2] = {0};
uint8_t xof_block = (hmwt == HS) ? HS_XOF : HR_XOF;
shake256((uint8_t *)buf, xof_block * SHAKE256_RATE, input, input_size);
for (i = 0; i < xof_block * 32; i++) {
crypto_int64 deg = buf[i];
crypto_int64 div = two321mod(DIMENSION - hmwt + pos) + 1;
crypto_int64 mask = ((pos - hmwt) >> 62) & ((deg - (0xffffffff - div)) >> 62);
crypto_int64 storepos1 = -1;
crypto_int64 storepos2 = -1;
storepos1 ^= mask & (storepos1 ^ (DIMENSION - hmwt + pos));
storepos2 ^= mask & (storepos2 ^ div);
storebyte(res,DIMENSION,storepos1,loadbyte(res,DIMENSION,div));
storebyte(res,DIMENSION,storepos2,((buf[(xof_block * 32 + (i >> 4))] >> (i & 0x0f)) & 0x02) - 1);
pos -= mask;
}
for (i = 0; i < DIMENSION; ++i) {
uint8_t bit = res[i] & 0x01;
size_t cnt_arr_idx = ((i & 0x700) >> 8) & -bit;
uint8_t old = loadbyte(cnt_arr,DIMENSION,cnt_arr_idx);
storebyte(cnt_arr,DIMENSION,cnt_arr_idx,old + bit);
}
}