Compiler-introduced timing leak in Kyber reference implementation

Antoon Purnal

unread,

Jun 3, 2024, 8:11:41 AM6/3/24

to pqc-forum

Hi all,

I'd like to inform you that recent versions of Clang (15-18) can produce a secret-dependent branch in the Kyber / ML-KEM reference implementation. The vulnerable function is poly_frommsg.

Please find a vulnerability report below this message. Key facts:

The resulting timing leak is small but exploitable (a crude PoC on my laptop needs <10 minutes for ML-KEM 512 key recovery)
One example of a vulnerable instance is clang -Os, but there are more.

I also wrote a more educational piece about this issue, which is available here.

I’ve been in touch with Peter Schwabe from the Kyber team to get this patched upstream in the pqcrystals/kyber reference. The updated version is already available.

We’ve notified several libraries that integrate the reference implementation, as they are directly affected. Note that this does not rule out the possibility that other libraries, which have only slightly modified the relevant function, are vulnerable - either now or in the future.

I would like to thank Peter for the prompt and collaborative communication surrounding this issue.

Kind regards,
Antoon Purnal (PQShield)

===

Vulnerability Report

Subject: control-flow timing leak in Kyber reference implementation when compiled with Clang 15-18 for -Os, -O1 and other options

Details: the function poly_frommsg produces a polynomial based on the bits of m during both encapsulation and decapsulation. Despite the source-level mitigations in poly_frommsg, the latest generations of Clang recognize that the code essentially performs a bit test and produces a secret-dependent branch for several compiler options.

Here are a few compiler options (Clang 15-16-17-18 on x86) which produce a branch (Godbolt link):
- -Os
- -O1
- -O2 -fno-vectorize
- -O3 -fno-vectorize
Note that plain -O2 and -O3 (i.e., without fno-vectorize) produce branch and vectorized code side by side in the binary. In practice, the vectorized instructions appear to be selected for execution at runtime.

Exploitable?: The timing leak can be used to implement a plaintext-checking oracle. On my laptop, a PoC local attack on the reference implementation leaks the ML-KEM 512 secret key in ~10 minutes using end-to-end decapsulation timing measurements.

D. J. Bernstein

unread,

Jun 3, 2024, 10:06:48 AM6/3/24

to pqc-...@list.nist.gov

'Antoon Purnal' via pqc-forum writes:
> I’ve been in touch with Peter Schwabe from the Kyber team to get this
> patched upstream in the pqcrystals/kyber

> <https://github.com/pq-crystals/kyber> reference.

Isn't that patch broken as soon as someone compiles with -flto?

I commented on a similar compiler-introducing-branch disaster in

https://microblog.cr.yp.to/1713627640/

(after tracking down what was causing some TIMECOP alerts in code for
another LPR-based KEM) and I had the same first reaction of splitting
code across files (see supercop-20240425/inttypes/README), but after
further analysis I concluded that what does the best job of blocking
peephole analysis is an intermediate xor with a global variable that's
declared as volatile and initialized to 0. See "optblocker" in

https://lib.mceliece.org/libmceliece-20240513/inttypes/crypto_intN.h.html
https://lib.mceliece.org/libmceliece-20240513/inttypes/intN_optblocker.c.html

for an example of how this works.

The first line of defense here is the idea that, no matter how much
compiler writers claim they can do whatever they want, they won't dare
to optimize away a volatile load. The second line of defense is the idea
that compiler writers focus on easy speedups; figuring out whether a
global variable is touched isn't easy even with LTO.

I'm not sure whether it's good to also force the optblocker variable to
have default visibility. With default visibility, the possibility of
interposition means that---assuming dynamic linking---it's always
possible for post-linking code to touch the global variable, so a
compiler claiming at link time to know that the variable is 0 is
definitely wrong. On the other hand, I don't know how most of the people
writing optimizations would learn this; much more blatant optimizer bugs
happen all the time. Meanwhile interposition also adds annoying side
conditions to formal verification.

---D. J. Bernstein

P.S. Maybe I should also mention another feature that made it into
libmceliece-20240513: the library now has its own built-in equivalent of
TIMECOP. See https://lib.mceliece.org/install.html under "valgrind".

signature.asc

Falko Strenzke

unread,

Jun 12, 2024, 4:11:49 AM6/12/24

to pqc-...@list.nist.gov

I came up with this solution for our Libgcrypt-based implementation (not the official one, though):

void _gcry_mlkem_poly_frommsg (gcry_mlkem_poly *r, const unsigned char msg[GCRY_MLKEM_INDCPA_MSGBYTES]) { unsigned int i, j; s16 mask; s16 local_opt_blocker = _gcry_u32_opt_blocker_mask_zero; for (i = 0; i < GCRY_MLKEM_N / 8; i++) { for (j = 0; j < 8; j++) { mask = -(s16)((msg[i] >> j) & 1); r->coeffs[8 * i + j] = (mask ^ local_opt_blocker) & ((GCRY_MLKEM_Q + 1) / 2); } } }

where

extern volatile u32 _gcry_u32_opt_blocker_mask_zero;

and has value zero, just as in the solution Daniel pointed to.

I think it should do the job both reliably and efficiently: (mask ^ local_opt_blocker) can be any value at run-time to the compiler, making it impossible to skip the computation (except the compiler would care to create a branch for local_opt_blocker to be either 0 or -1, which would be extremely unlikely). Yet the volatile variable has to be loaded only once at the beginning of the function call.

- Falko

Am 03.06.24 um 16:06 schrieb D. J. Bernstein:

--

MTG AG
Dr. Falko Strenzke

Phone: +49 6151 8000 24
E-Mail: falko.s...@mtg.de
Web: mtg.de

MTG AG - Dolivostr. 11 - 64293 Darmstadt, Germany
Commercial register: HRB 8901
Register Court: Amtsgericht Darmstadt
Management Board: Jürgen Ruf (CEO), Tamer Kemeröz
Chairman of the Supervisory Board: Dr. Thomas Milde

This email may contain confidential and/or privileged information. If you are not the correct recipient or have received this email in error,
please inform the sender immediately and delete this email.Unauthorised copying or distribution of this email is not permitted.

Data protection information: Privacy policy

Falko Strenzke

unread,

Jun 12, 2024, 6:17:00 AM6/12/24

to pqc-...@list.nist.gov

Am 03.06.24 um 16:06 schrieb D. J. Bernstein:

I'm not sure whether it's good to also force the optblocker variable to
have default visibility. With default visibility, the possibility of
interposition means that---assuming dynamic linking---it's always
possible for post-linking code to touch the global variable, so a
compiler claiming at link time to know that the variable is 0 is
definitely wrong. On the other hand, I don't know how most of the people
writing optimizations would learn this; much more blatant optimizer bugs
happen all the time. Meanwhile interposition also adds annoying side
conditions to formal verification.

I personally don't think it is necessary to make the variable visible. One well established use case for volatile variables are memory mapped registers. That means in order to make any assumption about the constancy such a variable the compiler would have to know that it is not modified by the underlying platform. So even when the linker is assigning the address to variable, on modern "large" systems this will be a virtual address and the operating system could still decide where to map it physically at load time. Generally I think such a feature of a compiler or linker, namely to be able to exclude that a volatile variable is mapped to a special address, would require a lot of cooperation between different parts of the compiler and linker, very detailed knowledge about the hardware platform, and it still be highly error prone. Maybe most importantly it will hardly be ever motivated to implement anything like this, since the when the developer declares a variable as volatile, it should be assumed they know what they want to achieve by this and consciously accept the cost.

- Falko

D. J. Bernstein

unread,

Jun 13, 2024, 7:02:15 AM6/13/24

to pqc-...@list.nist.gov

I think what's critical to realize about these compiler optimizations is
that they're actively looking for various patterns that produce 1-bit
results, and turning those patterns into the branch-condition data type
(bool). This breaks the idea that secret-dependent conditional branches
come from secret bool in source code. For comparison, the examples in

https://www.cl.cam.ac.uk/~rja14/Papers/whatyouc.pdf

are bool; the paper "When constant-time source yields variable-time
binary" cited there might _sound_ like it's an optimizer introducing
bool, but, no, it's just talking about bool operations in Microsoft's
old 32-bit int64 library; https://www.bearssl.org/constanttime.html says
"Avoid boolean types"; etc.

As usual, the most convincing way to stop compiler optimizations from
screwing things up is to write code in asm (for as large chunks of code
as possible; separate .S files are best). For portable code, the best
broad-spectrum defense is automated binary analysis (which of course
should also be applied as a double-check to binaries produced from asm).
The easiest tool to use according to the study

https://www.usenix.org/system/files/sec24fall-prepub-760-fourne.pdf

is TIMECOP, although TIMECOP (1) doesn't guarantee path coverage (so
secret branches etc. might be missed), (2) occasionally bumps into
limits in the set of machine instructions supported by the underlying
Valgrind tool (so the analysis might give up on valid code), and (3)
doesn't magically fix whatever problems it encounters.

For proactively avoiding problems in portable code, it's good to watch
not just for secret branch conditions and more generally secret bool but
also for all secret 1-bit values. (Hopefully compiler writers are busy
enough with easy optimizations that they never start thinking about what
they can do with 2-bit values!) What's under test for the next SUPERCOP
release is a bunch of support functions looking like this:

__attribute__((unused))
static inline
crypto_uint64 crypto_uint64_bottombit_mask(crypto_uint64 crypto_uint64_x) {
#if defined(__GNUC__) && defined(__x86_64__)
__asm__ ("andq $1,%0" : "+r"(crypto_uint64_x) : : "cc");
#else
crypto_uint64_x &= 1 ^ crypto_uint64_signed_optblocker;
#endif
return -crypto_uint64_x;
}

The 1^optblocker prevents the compiler from seeing that there's a 1-bit
result to begin with.

Falko Strenzke writes:
> mask = -(s16)((msg[i] >> j) & 1);

This is where I'd already say "uh-oh, &1, potential problem". The
compiler can see that the shift and mask can be compressed into a
bit-test operation---which, sure, puts results into a flag instead of an
integer register, but that's usable in all sorts of ways.

> r->coeffs[8 * i + j] = (mask ^ local_opt_blocker) & ((GCRY_MLKEM_Q + 1) / 2);

This doesn't look safe to me even if it changes to a global optblocker.
Instead of

shift by j bits
mask
negate
load optblocker
xor

the compiler can do

load optblocker
test bit j
conditionally jump after next instruction
xor with -1

which is 4 instructions instead of 5.

> Yet the volatile variable has to be loaded only once at
> the beginning of the function call.

I worry that this might trigger a compiler to do the load and then test
whether the variable is 0, so as to split the subsequent loop into a
fast version and a slow version. That sort of splitting shouldn't happen
with -Os, but it can save time on some microbenchmarks, and in general
the way that compiler writers decide to implement an optimization (see,
e.g., https://reviews.llvm.org/D36858?id=111605) is by observing that
they've found _some_ microbenchmark where the optimization saves time.

I agree that having optblocker loads in the loop will often produce a
measurable slowdown (mainly from the load cost on small platforms and
less vectorization on large platforms). But if we're talking about a hot
spot where speed matters then the code should generally be rewritten in
asm anyway purely for speed reasons, never mind the security benefits.

There's a valid objection to asm as being harder to read and thus more
likely to contain bugs. But writing code in C doesn't magically avoid
bugs---and there's increasing evidence that our best defense against
bugs, namely a computer-checked proof of software correctness, is easier
and more robust for asm than for C. For example, the fast s2n-bignum asm
for X25519 is backed by a theorem concluding that the code computes
X25519 correctly for all inputs (under hypotheses about the semantics of
each asm instruction used), whereas the old-fashioned path of inserting
a compiler into the verification story turns into a mess of

* certification or at least translation validation for compilation
(how many people are actually _using_ CompCert?) and then

* "reducing" the problem to verifying X25519 C code, which has the
disadvantage of having to deal with C's complicated semantics and
has no evident advantages for the verification process.

Similarly, for post-quantum crypto, asm is an attractive path for both
speed reasons and verification reasons. More use of asm also makes the
speed of the portable fallback code less important.

---D. J. Bernstein

signature.asc

Brent Kimberley

unread,

Jun 13, 2024, 7:31:14 AM6/13/24

to pqc-...@list.nist.gov

From a schedule perspective:

Was the logic compiled and assembled via controlled certified toolchain?

Was a Faghan inspection performed on the emitted code?

From: pqc-...@list.nist.gov on behalf of D. J. Bernstein
Sent: Thursday, June 13, 2024 7:01 AM
To: pqc-...@list.nist.gov
Subject: Re: [pqc-forum] Compiler-introduced timing leak in Kyber reference implementation

--
You received this message because you are subscribed to the Google Groups "pqc-forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pqc-forum+...@list.nist.gov.
To view this discussion on the web visit https://groups.google.com/a/list.nist.gov/d/msgid/pqc-forum/20240613110152.1142772.qmail%40cr.yp.to.

THIS MESSAGE IS FOR THE USE OF THE INTENDED RECIPIENT(S) ONLY AND MAY CONTAIN INFORMATION THAT IS PRIVILEGED, PROPRIETARY, CONFIDENTIAL, AND/OR EXEMPT FROM DISCLOSURE UNDER ANY RELEVANT PRIVACY LEGISLATION. No rights to any privilege have been waived. If you are not the intended recipient, you are hereby notified that any review, re-transmission, dissemination, distribution, copying, conversion to hard copy, taking of action in reliance on or other use of this communication is strictly prohibited. If you are not the intended recipient and have received this message in error, please notify me by return e-mail and delete or destroy all copies of this message.

D. J. Bernstein

unread,

Jun 22, 2024, 5:58:50 AM6/22/24

to pqc-...@list.nist.gov

I wrote:
> What's under test for the next SUPERCOP
> release is a bunch of support functions looking like this:

That release is planned for a few days from now. The full suite of
support functions currently planned for, e.g., int32 can be found here:

https://pqsrc.cr.yp.to/saferewrite-20240622/src/int32_max/supercopnew/crypto_int32.h.html

Obvious clicks lead to the functions for any of {int,uint}{8,16,32,64}.
Except for separate .c files for optblocker, the .h files are
self-contained, so they can be easily integrated into libraries.

With gcc or clang on 64-bit Intel/AMD or 64-bit ARM, many of these
functions automatically switch over to inline asm internally. This means
fewer opportunities for compilers to introduce variable-time code, and
fewer speed excuses for implementors to avoid the functions. The
portable fallbacks use optblocker.

For each function, switching from "supercopnew" to the "ref" directory
in saferewrite shows a simple description of the function semantics. As

https://pqsrc.cr.yp.to/saferewrite-20240622/src/int32_max/ref/max.c.html

illustrates, "ref" in saferewrite prioritizes simplicity over being
constant-time; the point is to make clear what the function is doing.

saferewrite unrolls compiled binaries (via valgrind's VEX, via angr) and
then uses Z3 (via angr) to test equivalence with the ref code (meaning
that outputs match for all inputs), so it avoids the risks of manual
translation to SMT input. I've also done all sorts of other tests. My
main concern is that the API for gcc's inline asm is error-prone, with
errors (e.g., missing "clobber" declarations) capable of producing bugs
in some callers while other tests pass; so I designed a DSL to eliminate
all of the usual traps, and generated the inline asm from that.

---D. J. Bernstein

signature.asc

Michael Scott

unread,

Jun 23, 2024, 7:11:26 AM6/23/24

to pqc-...@list.nist.gov

This problem of compiler introduced timing leaks is I suspect much more prevalent than most people appreciate.

Its useful to play around with godbolt.org

Consider this function generated by Fiat-Crypto

void fiat_p1913_cmovznz_u64(uint64_t* out1, unsigned char arg1, uint64_t arg2, uint64_t arg3) {
unsigned char x1;
uint64_t x2;
uint64_t x3;
x1 = (!(!arg1));
x2 = ((signed char)(0x0 - x1) & UINT64_C(0xffffffffffffffff));
x3 = ((x2 & arg3) | ((~x2) & arg2));
*out1 = x3;
}

When compiled with -O3 optimization for x86-64 we might initially be delighted to observe the succinct output. And when inlined this becomes a single cmovne instruction

test esi, esi
cmovne rdx, rcx
mov qword ptr [rdi], rdx
ret

But the problem is that clang is being too clever. It has identified that the function wants to implement a conditional move from one variable to another. When the instruction set supports such a conditional move instruction (like cmov in x64 and csel in arm64) then it uses that. But if no such instruction exists?

When compiled for 64-bit RISCV

beqz a1, .LBB0_2
mv a2, a3
.LBB0_2:
sd a2, 0(a0)
ret

it will insert a branch.

Mike Scott

--
You received this message because you are subscribed to the Google Groups "pqc-forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pqc-forum+...@list.nist.gov.

To view this discussion on the web visit https://groups.google.com/a/list.nist.gov/d/msgid/pqc-forum/20240622095827.325626.qmail%40cr.yp.to.

D. J. Bernstein

unread,

Jun 23, 2024, 9:30:51 AM6/23/24

to pqc-...@list.nist.gov

'Michael Scott' via pqc-forum writes:
> void fiat_p1913_cmovznz_u64(uint64_t* out1, unsigned char arg1, uint64_t
> arg2, uint64_t arg3) {
> unsigned char x1;
> uint64_t x2;
> uint64_t x3;
> x1 = (!(!arg1));
> x2 = ((signed char)(0x0 - x1) & UINT64_C(0xffffffffffffffff));
> x3 = ((x2 & arg3) | ((~x2) & arg2));
> *out1 = x3;
> }

This Fiat code is violating the traditional rule of avoiding booleans
(and thus also the broader rule of avoiding visible 1-bit values).
Secrets in C or C++ should never be fed to !, <, &&, etc.

Here's a safe way to rewrite this with the functions I linked to:

#include "crypto_uint64.h"

void fiat_p1913_cmovznz_u64(uint64_t* out1, unsigned char arg1, uint64_t arg2, uint64_t arg3) {

uint64_t mask = crypto_uint64_nonzero_mask(arg1);
*out1 = ((mask & arg3) | ((~mask) & arg2));
}

---D. J. Bernstein

signature.asc

Michael Scott

unread,

Jun 23, 2024, 2:32:41 PM6/23/24

to pqc-...@list.nist.gov

My ad-hoc solution was to return a value from the function which will only be calculated if the compiler follows my intention. In this case returning the x2 value from the function and taking steps to prevent function inlining seemed to fix the problem.

But a more general solution like yours is probably to be preferred

Mike

--
You received this message because you are subscribed to the Google Groups "pqc-forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pqc-forum+...@list.nist.gov.

To view this discussion on the web visit https://groups.google.com/a/list.nist.gov/d/msgid/pqc-forum/20240623133031.78502.qmail%40cr.yp.to.

Watson Ladd

unread,

Jun 23, 2024, 8:47:28 PM6/23/24

to pqc-...@list.nist.gov

sssh! The compiler writers are watching and will introduce an
optimization for this next!

More seriously the only way to solve this is to introduce
invariant-time semantics to a subset language and force compilers to
obey them. Otherwise we're at the mercy of future smarter compilers.

Sincerely,
Watson Ladd

>
> ---D. J. Bernstein
>
> --
> You received this message because you are subscribed to the Google Groups "pqc-forum" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to pqc-forum+...@list.nist.gov.

> To view this discussion on the web visit https://groups.google.com/a/list.nist.gov/d/msgid/pqc-forum/20240623133031.78502.qmail%40cr.yp.to.

--
Astra mortemque praestare gradatim

D. J. Bernstein

unread,

Jun 24, 2024, 12:09:28 AM6/24/24

to pqc-...@list.nist.gov

Watson Ladd writes:
> sssh! The compiler writers are watching and will introduce an
> optimization for this next!

Just to be clear about what the defenses are inside the crypto_int* and
crypto_uint* functions:

* The inline asm is for __GNUC__, where the compiler documentation
says the string is simply sent directly to the assembler after %
substitution. ("GCC does not parse the assembler instructions
themselves and does not know what they mean or even whether they
are valid assembler input.")

* For the portable code, the 1-bit data paths systematically use a
volatile global optblocker variable (e.g., x&(1^optblocker)).

There's also news coming soon about continually operating large-scale
detection of compiler-introduced timing variations; and, as a last line
of defense, the equivalent of TIMECOP is now built into the test program
for (e.g.) libmceliece.

As I mentioned before, writing separate asm files is the most convincing
way to stop compilers from screwing things up. The gold standard here is
illustrated by

https://github.com/awslabs/s2n-bignum/blob/main/x86/proofs/curve25519_x25519.ml

giving a complete correctness proof in HOL Light for the fast machine
language that's generated by

https://github.com/awslabs/s2n-bignum/blob/main/arm/curve25519/curve25519_x25519.S

for X25519. It's not hard to check that the code is constant-time, and
obviously there's no concern about the compiler getting in the way.

> More seriously the only way to solve this is to introduce
> invariant-time semantics to a subset language and force compilers to
> obey them. Otherwise we're at the mercy of future smarter compilers.

There have indeed been various proposals to make compilers aware of
timing declarations in source code. More progress here would be useful.

---D. J. Bernstein

signature.asc

D. J. Bernstein

unread,

Jun 26, 2024, 6:22:51 AM6/26/24

to pqc-...@list.nist.gov

I wrote:
> There's also news coming soon about continually operating large-scale
> detection of compiler-introduced timing variations

This is now operational. All implementations in SUPERCOP that declare
goal-constbranch and goal-constindex are now automatically fed through
TIMECOP 2 on each machine that regularly runs SUPERCOP. (Credits for
this will come later.) Currently there are 1307 such implementations,
and results are already online from one machine.

As a small example,

https://bench.cr.yp.to/web-impl/amd64-rome0-crypto_verify-127vartime.html

shows "Failed TIMECOP" results with all tested compilers, and those
results point to line 6 of crypto_verify/127vartime/ref/verify.c, which
says the following:

if (memcmp(x,y,127)) return -1;

Needless to say, 127vartime was deliberately written as a test to
trigger timing variations.

As another example, there's a very large page

https://bench.cr.yp.to/web-impl/amd64-rome0-crypto_sign-dilithium5.html

showing various "Failed TIMECOP" runs for Dilithium, in each case
pointing to the source line numbers that produce the failures. I presume
that these Dilithium issues are from rejection-sampling conditions that
are safe to feed to crypto_declassify(), but I haven't checked.

I did modify the Kyber implementations in SUPERCOP to pass TIMECOP, so

https://bench.cr.yp.to/web-impl/amd64-rome0-crypto_kem-kyber768.html

shows "passed TIMECOP" results with various implementations and various
compilers. That page also shows "TIMECOP error" for some compilers,
which is neither a pass nor a failure; those compilers are using XOP
instructions on this machine (AMD Zen 2), and valgrind doesn't
understand those instructions, so the analysis gives up. I should also
note that "passed TIMECOP" isn't a guarantee of safety, since TIMECOP's
tests won't necessarily trigger all relevant code paths.

Marking goal-const{branch,index} is declaring that an implementation is
designed to avoid secret-dependent branches and secret-dependent array
indices. Other implementations are marked with a red "T:" in SUPERCOP's
tables and aren't fed through TIMECOP. Usually this is because the code
wasn't in fact designed to run in constant time, but sometimes it's just
because the necessary declaration hasn't been made.

---D. J. Bernstein

signature.asc

Madhavan Pallan - WhiteHouse HNPW Ministerial

unread,

Jul 2, 2024, 9:03:07 AM7/2/24

to pqc-...@list.nist.gov, d...@cr.yp.to

Dear PQC Community,

I am madhavan, and I advise the cyber technical community EOP of White House.

I have experience working in the QKD(Quantum Key Distribution) Community of ITU(at United Nations).

I will be around sharing my expertise.

Sending well wishes. Happy rest of the day.

Sincerely,

Madhavan Pallan

--
You received this message because you are subscribed to the Google Groups "pqc-forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pqc-forum+...@list.nist.gov.

To view this discussion on the web visit https://groups.google.com/a/list.nist.gov/d/msgid/pqc-forum/20240622095827.325626.qmail%40cr.yp.to.

--

Madhavan Pallan

Ministerial Member, Member of the Ministerial Unit- White House.

Ministerial Member in UN. (#HNPW Status)

White House - Scientist #AI4Good.

Primary:- Google Email/ Meet-Video calls (madhp...@gmail.com)

X(Twitter):- @madhavanpallan (Running website).

Federal community working experience:- 20 years

Secondary(All AI Communities) #UN #AI4Good Slack member/ #PETs, #UN Microsoft teams/Viva Engage(Yammer) & few to many UN private communities.

Secondary +91-8800996753 (Whatsapp Video Calls)

Guillaume Endignoux

unread,

Jul 9, 2024, 8:49:42 AM7/9/24

to Michael Scott, pqc-...@list.nist.gov

To add to Mike's comment about inspecting assembly in https://godbolt.org/, I also find it useful to inspect the LLVM intermediate representation (for clang).

For C++, it's available out-of-the-box with "Add new > LLVM IR", for Rust you can add the "--emit=llvm-ir" compiler flag.

Taking the fiat_p1913_cmovznz_u64() example reveals that it's simplified into 3 simple statements: a comparison into a boolean (hence "violating the traditional rule of avoiding booleans" mentioned by Dan), a select based on this boolean and a store.

With that, it's not surprising that a branch is introduced on targets that lack a branch-less select instruction.

```llvm-ir

define dso_local void @fiat_p1913_cmovznz_u64(unsigned long*, unsigned char, unsigned long, unsigned long)(ptr nocapture noundef writeonly %out1, i8 noundef zeroext %arg1, i64 noundef %arg2, i64 noundef %arg3) local_unnamed_addr {
entry:
%tobool.not = icmp eq i8 %arg1, 0
%or = select i1 %tobool.not, i64 %arg2, i64 %arg3
store i64 %or, ptr %out1, align 8
ret void

}

```

You may also be surprised by how much LLVM can simplify SIMD intrinsics.

My favorite example (although not leading to variable timing): intrinsics that shuffle lanes aren't mapped to specific instructions (as one might think from Intel's intrinsics guide), but to a generic "shufflevector" in LLVM IR, that is then converted into a series of instructions for the target platform.

So something like the generic permutation intrinsic "_mm256_permutevar8x32_epi32" on Intel may be turned into something more specific (and faster) than the "vpermd" instruction if there is a better instruction for the given permutation.

```c++

#include <immintrin.h>
#include <xmmintrin.h>

__m256i unpackhi_epi32(__m256i x, __m256i y) {
    return _mm256_unpackhi_epi32(x, y);
}

__m256i structured_shuffle(__m256i x) {
    return _mm256_permutevar8x32_epi32(x, _mm256_set_epi32(6, 7, 4, 5, 2, 3, 0, 1));
}

__m256i arbitrary_shuffle(__m256i x) {
    return _mm256_permutevar8x32_epi32(x, _mm256_set_epi32(2, 7, 4, 1, 3, 5, 0, 6));

}

```

Compiled with "-O3 -march=haswell":

```asm

unpackhi_epi32(long long vector[4], long long vector[4]):
    vunpckhps ymm0, ymm0, ymm1
    ret
structured_shuffle(long long vector[4]):
    vshufps ymm0, ymm0, ymm0, 177
    ret
.LCPI2_0:
    .long 6 # 0x6
    .long 0 # 0x0
    .long 5 # 0x5
    .long 3 # 0x3
    .long 1 # 0x1
    .long 4 # 0x4
    .long 7 # 0x7
    .long 2 # 0x2
arbitrary_shuffle(long long vector[4]):
    vmovaps ymm1, ymmword ptr [rip + .LCPI2_0]
    vpermps ymm0, ymm1, ymm0
    ret

```

All of these intrinsics map to LLVM's shufflevector:

```llvm-ir
define dso_local noundef <4 x i64> @unpackhi_epi32(long long vector[4], long long vector[4])(<4 x i64> noundef %x, <4 x i64> noundef %y) local_unnamed_addr {
entry:
%0 = bitcast <4 x i64> %x to <8 x i32>
%1 = bitcast <4 x i64> %y to <8 x i32>
%shuffle.i = shufflevector <8 x i32> %0, <8 x i32> %1, <8 x i32> <i32 2, i32 10, i32 3, i32 11, i32 6, i32 14, i32 7, i32 15>
%2 = bitcast <8 x i32> %shuffle.i to <4 x i64>
ret <4 x i64> %2
}

define dso_local noundef <4 x i64> @structured_shuffle(long long vector[4])(<4 x i64> noundef %x) local_unnamed_addr {
entry:
%0 = bitcast <4 x i64> %x to <8 x i32>
%1 = shufflevector <8 x i32> %0, <8 x i32> poison, <8 x i32> <i32 1, i32 0, i32 3, i32 2, i32 5, i32 4, i32 7, i32 6>
%2 = bitcast <8 x i32> %1 to <4 x i64>
ret <4 x i64> %2
}

define dso_local noundef <4 x i64> @arbitrary_shuffle(long long vector[4])(<4 x i64> noundef %x) local_unnamed_addr {
entry:
%0 = bitcast <4 x i64> %x to <8 x i32>
%1 = shufflevector <8 x i32> %0, <8 x i32> poison, <8 x i32> <i32 6, i32 0, i32 5, i32 3, i32 1, i32 4, i32 7, i32 2>
%2 = bitcast <8 x i32> %1 to <4 x i64>
ret <4 x i64> %2
}

```

Best,

Guillaume

To view this discussion on the web visit https://groups.google.com/a/list.nist.gov/d/msgid/pqc-forum/CAEseHRqmd%2Brm6o3dDAiyjkctwgQcXLnU72tv%2BiRrLKsJ4-k-CA%40mail.gmail.com.

D. J. Bernstein

unread,

Jul 15, 2024, 5:00:28 AM7/15/24

to pqc-...@list.nist.gov

I had written:

> This is now operational. All implementations in SUPERCOP that declare
> goal-constbranch and goal-constindex are now automatically fed through
> TIMECOP 2 on each machine that regularly runs SUPERCOP. (Credits for
> this will come later.) Currently there are 1307 such implementations,
> and results are already online from one machine.

TIMECOP 2 results are online from many more machines at this point. I'll
take HAETAE as an example of how to navigate the results. Clicking on
"List of primitives ... sign" from the top bar gets to

https://bench.cr.yp.to/primitives-sign.html

and then clicking on "haetae2" gets to

https://bench.cr.yp.to/impl-sign/haetae2.html

which shows some "failed TIMECOP" entries. (There are also many "passed
TIMECOP" entries, and some "TIMECOP error" entries; the error entries
can be valgrind bugs.) Clicking on the first machine name with a "failed
TIMECOP" entry gets to

https://bench.cr.yp.to/web-impl/amd64-hydra5-crypto_sign-haetae2.html

which shows red "T!!!" for the ref implementation with some (but not
all) clang options. Scrolling down to the "Failed TIMECOP" section shows
valgrind output:

Conditional jump or move depends on uninitialised value(s)
at 0x...: minmax (polyvec.c:465)

The lines around line 465 of crypto_sign/haetae2/ref/polyvec.c are

int32_t c = b - a;
c ^= ab & (c ^ b);
c >>= 31;
c &= ab;
*x = a ^ c;

where, unfortunately, these clang options are---in context of the
calling function---now turning the 1-bit quantity c >> 31 into a branch
condition. (https://godbolt.org/z/a7W3ehET3 seems like a minimal example
of what's happening here, with "js" being data-dependent and presumably
slower than the original code. Great job, clang!)

To eliminate this timing variation, I'm eliminating the whole minmax()
function and replacing the minmax() call with crypto_int32_minmax()
after including "crypto_int32.h". The crypto_int32_minmax() function is
provided by SUPERCOP (see inttypes/crypto_int32.h) and defends against
these compilers.

There's also a nearby function minmaxmask() that I'm rewriting to use
SUPERCOP's crypto_int32_smaller_mask() as follows:

int32_t a = *x;
int32_t b = *y;
int32_t oldmask = *mask;
int32_t swap = oldmask & crypto_int32_smaller_mask(b,a);
int32_t ab = swap & (a ^ b);
*mask = oldmask ^ swap;
*x = a ^ ab;
*y = b ^ ab;

Looking beyond HAETAE: One machine (so far) is running debian-testing
with 2023-era clang 16, which adds timing variations to implementations
of a wide range of primitives (not just Kyber). I'm sweeping through the
TIMECOP results there and making appropriate code changes in SUPERCOP.

Top recommendation at this point is for cryptographic libraries to
include TIMECOP-like tests, as in libmceliece, so that basic CI catches
any TIMECOP violations. The main danger here is that this doesn't
guarantee path coverage. One might hope that compiler issues will always
be discovered on the covered paths, but this could just be selection
bias given that TIMECOP is how this wave of compiler issues was
discovered in the first place:

https://microblog.cr.yp.to/1713627640/index.html

It's probably good to do C-level static analysis looking for &1 etc.

---D. J. Bernstein

signature.asc

Marco Palumbi

unread,

Jul 16, 2024, 11:20:20 AM7/16/24

to pqc-...@list.nist.gov

Hello all,

I just found a slightly different implementation for the case of the Kyber poly_frommsg() function that I want to propose for your evaluation.
The difference is that the volatile opt_blocker applies to the entire 1-bit mask which causes the boolean evaluation.
The volatile variable can optionally be made global.

void poly_frommsg(poly *r, const uint8_t msg[32])
{
  unsigned int i,j;
  int16_t mask;
  volatile unsigned opt_blocker = 1;
  const unsigned one_bit_mask = opt_blocker;
  for(i=0;i<KYBER_N/8;i++) {
    for(j=0;j<8;j++) {
      mask = -(int16_t)((msg[i] >> j) & one_bit_mask);
      r->coeffs[8*i+j] = mask & ((KYBER_Q+1)/2);
    }
  }
}

This should prevent the compiler from making further speculations.

Marco

Reply all

Reply to author

Forward