Non-constant timing for instructions, including XOR

Thomas Koenig

Jan 27, 2023, 12:44:14 PM

This is _really_ rich.

It seems that you have to set special CPU flags on both Intel and
AMD to get constant time for such simple operations as XOR or ADD
and not latency which depends on data.

XOR? ADD? Seriously? If there ever was a single cycle latency
instruction, XOR's the one, and ADD has been down to a single
cycle for ages.

See https://www.openwall.com/lists/oss-security/2023/01/25/3 which
includes links to the vendor documentation and a discussion on
the Linux kernel list.

It seems people at Intel and AMD have chosen execution speed
over sanity.

Scott Lurndal

Jan 27, 2023, 1:19:12 PM

Thomas Koenig <tko...@netcologne.de> writes:
>This is _really_ rich.
>
>It seems that you have to set special CPU flags on both Intel and
>AMD to get constant time for such simple operations as XOR or ADD
>and not latency which depends on data.

ARM has addressed this architecturally over the last few years,
specifying that certain sets of instructions may not have timing
dependent upon the data, including all crypto instructions (the DIT feature).


The architecture makes no statement about the timing properties
when the PSTATE.DIT bit is not set. However, it is likely that
many of these instructions have timing that is invariant of the data in
many situations.

In particular, Arm strongly recommends that the Armv8.3 pointer
authentication instructions do not have their timing dependent on the
key value used in the pointer authentication in all cases,
regardless of the PSTATE.DIT bit.

Anton Ertl

Jan 27, 2023, 1:42:35 PM

Thomas Koenig <tko...@netcologne.de> writes:
>This is _really_ rich.
>
>It seems that you have to set special CPU flags on both Intel and
>AMD to get constant time for such simple operations as XOR or ADD
>and not latency which depends on data.
>
>XOR? ADD? Seriously? If there ever was a single cycle latency
>instruction, XOR's the one, and ADD has been down to a single
>cycle for ages.
>
>See https://www.openwall.com/lists/oss-security/2023/01/25/3 which
>includes links to the vendor documentation and a discussion on
>the Linux Kernel list.

Nearly all instructions are affected on Intel, the exceptions I have
noticed are multiplication instructions.

Given that the data-dependent timing was introduced only with Ice
Lake, but apparently affects almost all instructions there, including
bitwise instructions, I wonder what data-dependent timing was
introduced. I found nothing about that in the posting you linked to
or in the Intel document I followed from there.

What I can imagine is that the renamer now knows that it can treat an
add of X and 0 (where the renamer already knows that it is 0) as X.
That would certainly be easy enough to disable with a single flag.
But is this a frequent-enough case to make it worth adding as a
hardware optimization? And why does it affect AND, but not multiply
instructions?

Given the limited speedup from Skylake to Ice Lake (and its brethren
Tiger Lake and Rocket Lake), the speedup from this particular
optimization is likely to be minuscule.

Another, less likely theory is that Ice Lake uses a staggered addition
or somesuch, which takes 2 cycles if there is a carry from the lower
to the upper half and one cycle if there is not. However, in that
case I would expect bitwise operations to be unaffected.
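
A toy model of the staggered-addition theory above may make the expectation concrete. It is only a sketch: the function name, the 64-bit default width, and the 1-cycle/2-cycle figures are assumptions taken from the wording of the theory, not from any Intel documentation.

```python
def staggered_add_latency(a: int, b: int, width: int = 64) -> int:
    """Hypothetical latency in cycles of a two-stage (staggered) adder.

    Assumption: the lower half is added first, and a second cycle is
    needed only when a carry propagates into the upper half.
    """
    half = width // 2
    low_mask = (1 << half) - 1
    # Add only the lower halves; any bit above `half` is the carry-out.
    carry_into_upper = ((a & low_mask) + (b & low_mask)) >> half
    return 2 if carry_into_upper else 1


print(staggered_add_latency(1, 2))           # no carry across the halves
print(staggered_add_latency(0xFFFFFFFF, 1))  # carry into the upper half
```

In such a model only carry-propagating operations vary; XOR, AND and OR have no carry chain, which is why bitwise operations would be expected to stay unaffected.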

Does anybody know what is really the reason for this data-dependent
timing?

>It seems people at Intel and AMD have chosen execution speed
>over sanity.

AMD?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Thomas Koenig

Jan 27, 2023, 2:21:52 PM

Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> Thomas Koenig <tko...@netcologne.de> writes:

> AMD?

That should have been ARM.

MitchAlsup

Jan 27, 2023, 3:55:23 PM

Does anyone (else) see that this is just another manifestation of
"SIMD considered harmful", this time for cryptography?

BTW:: XOR is not in the list of Intel OpCodes that can be subject to
different timings on different implementations. The only Intel OpCodes
listed all start with P or V.

robf...@gmail.com

Jan 28, 2023, 5:37:33 AM

Is the non-constant timing for instructions possibly meant to combat viruses
and other malware, by making the execution speed inconsistent?

Terje Mathisen

Jan 28, 2023, 9:53:50 AM

That list looks like all SIMD PMADD variations, plus one (VPLZCNTD)
which I'm guessing is counting the leading zero bits?

Terje

EricP

Jan 28, 2023, 10:28:38 AM

The instructions marked as affected are:

PMADDUBSW
PMADDWD
PMULDQ
PMULHRSW
PMULHUW
PMULHW
PMULLD
PMULLW
PMULUDQ

VPLZCNTD
VPLZCNTQ
VPMADD52HUQ
VPMADD52LUQ
VPMADDUBSW
VPMADDWD
VPMULDQ
VPMULHRSW
VPMULHUW
VPMULHW
VPMULLD
VPMULLQ
VPMULLW
VPMULUDQ

Possibly predication masking could affect the latency if
there are fewer calculation units than SIMD elements
and they only calculate the valid elements.
But then why wouldn't it affect the latency of
all predicate masked SIMD operations?
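
The masking hypothesis above can be sketched as a small model. Everything here is hypothetical: `masked_op_passes`, the idea that each pass handles up to `units` active lanes, and the floor of one pass for a fully masked operation are illustrative assumptions, not a description of any real core.

```python
import math
from typing import Sequence


def masked_op_passes(mask: Sequence[bool], units: int) -> int:
    """Passes needed if only the unmasked (valid) lanes are computed.

    Assumption: `units` lanes are processed per pass, and even a fully
    masked operation still occupies one pass.
    """
    active = sum(1 for lane_enabled in mask if lane_enabled)
    return max(1, math.ceil(active / units))


print(masked_op_passes([True] * 8, units=4))         # all 8 lanes valid
print(masked_op_passes([True, False] * 4, units=4))  # only 4 lanes valid
```

Under this model the latency depends on the mask only, so it would apply equally to every masked operation, which is exactly the objection raised above.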



Quadibloc

Jan 28, 2023, 12:11:14 PM

On Saturday, January 28, 2023 at 8:28:38 AM UTC-7, EricP wrote:

> But then why wouldn't it affect the latency of
> all predicate masked SIMD operations?

There are different numbers of calculation units of
different kinds?

Predicate masking is carried out by still going through
the motions for certain operations, because that's easier
or saves a few transistors?

Without a copy of the schematics, modern CPUs are
so complex that it's hard to even begin speculating on
the answers to questions like that. Well, maybe unless
your name is Mitch Alsup.

John Savard

Anton Ertl

Jan 28, 2023, 12:24:26 PM

EricP <ThatWould...@thevillage.com> writes:
>MitchAlsup wrote:
>>> See https://www.openwall.com/lists/oss-security/2023/01/25/3 which
>> BTW:: XOR is not in the list of Intel OpCodes that can be subject to
>> different timings on different implementations.

In that case I misinterpreted the documents by Intel. Apparently I
confused what false and true mean in
<https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/resources/data-operand-independent-timing-instructions.html>.

The title talks about "Data Operand Independent Timing", then we get
"Instructions that May Exhibit MCDT Behavior", without explanation of
what MCDT means. Later it turns out it is "MXCSR Configuration
Dependent Timing", without explanation of what MXCSR is; it turns out to
be a configuration register. What does that configuration register
have to do with the operands of other instructions? Intel needs to
hire someone who knows how to write if they want people to take notice
of such documents and apply the software mitigations (but do they want
that?).

I think they discuss predication masking effects as an additional
source of timing variation somewhere in
<https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/best-practices/data-operand-independent-timing-isa-guidance.html>,
but AFAIK they have no mechanism like you suggest. And the P*
instructions have no such masking.

I am somewhat surprised that SIMD instructions are affected. Even if
you can have a shorter latency in some lanes, what is the probability
that no lane needs the longest latency?
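
The probability question above can be put into numbers with a small sketch. It assumes that every lane independently qualifies for the short latency with the same probability `p_fast`; both the independence and the parameter are illustrative assumptions, and independence clearly does not hold for inputs chosen by an adversary.

```python
def prob_all_lanes_fast(p_fast: float, lanes: int) -> float:
    """Probability that no lane needs the longest latency.

    Assumption: lanes are independent and each one is 'fast' with
    probability p_fast.
    """
    return p_fast ** lanes


# Even a per-lane chance of one half shrinks quickly with the lane count.
for lanes in (2, 4, 8, 16):
    print(lanes, prob_all_lanes_fast(0.5, lanes))
```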

Absent instructions: The DIV and IDIV instructions on Skylake
certainly have operand-dependent timing and are not listed here. The
configuration register probably does not control the
operand-dependence of DIV/IDIV.

MitchAlsup

Jan 28, 2023, 12:52:32 PM

My guess is that in certain power modes, the width of the
SIMD calculation units varies. At low temperatures, you get
512 bits = 8×64-bit lanes, at higher temperatures you get
256 bits = 4×64-bit lanes, and at still higher temperatures
you get 128 bits = 2×64-bit lanes. All done to smooth out
the power dissipation of the chip.
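
As a sketch of that guess: if a 512-bit operation is pushed through a narrower datapath, the number of passes follows from simple division. The function name and the assumption that the datapath width divides the vector width evenly are illustrative only.

```python
def simd_passes(vector_bits: int, datapath_bits: int) -> int:
    """Passes needed to process one vector on a (possibly narrowed) datapath."""
    if datapath_bits <= 0 or vector_bits % datapath_bits != 0:
        raise ValueError("datapath width must evenly divide the vector width")
    return vector_bits // datapath_bits


for width in (512, 256, 128):
    print(width, simd_passes(512, width))
```

Note that in this model the timing varies with the power or thermal state rather than with the operand values.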

Scott Lurndal

Jan 28, 2023, 1:08:43 PM

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>EricP <ThatWould...@thevillage.com> writes:
>>MitchAlsup wrote:
>>>> See https://www.openwall.com/lists/oss-security/2023/01/25/3 which
>>> BTW:: XOR is not in the list of Intel OpCodes that can be subject to
>>> different timings on different implementations.
>

>
>I am somewhat surprised that SIMD instructions are affected. Even if
>you can have a shorter latency in some lanes, what is the probability
>that no lane needs the longest latency?

The problem occurs when the attacker chooses the data that the instruction
is performed upon, so the probability is 1.

Anton Ertl

Jan 28, 2023, 5:41:07 PM

sc...@slp53.sl.home (Scott Lurndal) writes:
>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>>I am somewhat surprised that SIMD instructions are affected. Even if
>>you can have a shorter latency in some lanes, what is the probability
>>that no lane needs the longest latency?
>
>The problem occurs when the attacker chooses the data that the instruction
>is performed upon, so the probability is 1.

The question is: If the probability of getting a shorter latency is
low, why did Intel put the operand-dependent timing (and the
user-controlled chicken bit to disable it) into the hardware at all?

Michael S

Jan 28, 2023, 7:22:13 PM

PMULUDQ operations with all inputs < 2**16 are probably common, and it
makes sense to execute this case faster.
For the rest of them I don't see a point.
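
To illustrate the PMULUDQ case: the instruction multiplies the low unsigned 32 bits of each 64-bit lane and produces a 64-bit product per lane. The sketch below models that, plus a hypothetical check for the "all inputs < 2**16" condition; whether any hardware actually has such a fast path is speculation, and `all_inputs_below` is an invented helper.

```python
MASK32 = 0xFFFFFFFF


def pmuludq(a_lanes, b_lanes):
    """Per 64-bit lane: multiply the low unsigned 32 bits of each operand."""
    return [(a & MASK32) * (b & MASK32) for a, b in zip(a_lanes, b_lanes)]


def all_inputs_below(a_lanes, b_lanes, bits=16):
    """Hypothetical fast-path condition: every 32-bit input fits in `bits` bits."""
    limit = 1 << bits
    return all((x & MASK32) < limit for x in list(a_lanes) + list(b_lanes))


a = [3, 0x1_0000_0002]  # the upper 32 bits of a lane are ignored
b = [5, 4]
print(pmuludq(a, b), all_inputs_below(a, b))
```

When both inputs are below 2**16 the product fits in 32 bits, so a narrower multiplier array would suffice for that case.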

John Dallman

Jan 28, 2023, 7:26:28 PM

In article <2023Jan2...@mips.complang.tuwien.ac.at>,
an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> The question is: If the probability of getting a shorter latency is
> low, why did Intel put the operand-dependent timing (and the
> user-controlled chicken bit to disable it) into the hardware at all?

Maybe it provides a way to make benchmarks look better?

John

MitchAlsup

Jan 28, 2023, 8:22:09 PM

The converse is true:: it is easier to use a side channel attack on a
non-fixed-timing critical calculation loop (such as crypto).
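
A software-level analogue of such a leak, as a minimal sketch: an early-exit comparison does an amount of work that depends on where the first mismatch occurs, while `hmac.compare_digest` from the Python standard library is designed to avoid that. The helper `leaky_equal` and its step counter are hypothetical and exist only to make the data dependence visible; this is not the hardware behaviour discussed in the thread.

```python
import hmac


def leaky_equal(a: bytes, b: bytes):
    """Early-exit comparison; returns (equal, bytes_examined)."""
    if len(a) != len(b):
        return False, 0
    examined = 0
    for x, y in zip(a, b):
        examined += 1
        if x != y:
            # Stopping here is what leaks the position of the mismatch.
            return False, examined
    return True, examined


secret = b"secret"
print(leaky_equal(secret, b"xecret"))  # stops after the first byte
print(leaky_equal(secret, b"secreX"))  # runs to the last byte
print(hmac.compare_digest(secret, b"secreX"))
```

Code written in the second style still relies on the underlying instructions having data-independent timing.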

MitchAlsup

Jan 28, 2023, 8:23:57 PM

On Saturday, January 28, 2023 at 11:24:26 AM UTC-6, Anton Ertl wrote:

> Absent instructions: The DIV and IDIV instructions on Skylake
> certainly have operand-dependent timing and are not listed here. The
> configuration register probably does not control the
> operand-dependence of DIV/IDIV.
<
Unlikely to be used in a tight crypto loop.

MitchAlsup

Jan 28, 2023, 8:24:58 PM

On Saturday, January 28, 2023 at 4:41:07 PM UTC-6, Anton Ertl wrote:
> sc...@slp53.sl.home (Scott Lurndal) writes:
> >an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
> >>I am somewhat surprised that SIMD instructions are affected. Even if
> >>you can have a shorter latency in some lanes, what is the probability
> >>that no lane needs the longest latency?
> >
> >The problem occurs when the attacker chooses the data that the instruction
> >is performed upon, so the probability is 1.
> The question is: If the probability of getting a shorter latency is
> low, why did Intel put the operand-dependent timing (and the
> user-controlled chicken bit to disable it) into the hardware at all?
<
It is not low, as up to 50% of matrix multiplications have zero terms.

Scott Lurndal

Jan 29, 2023, 9:11:01 AM

Isolated systems, not dependent upon data from outside, don't
need the DIT protection, and can thus tune for performance over
security. Not all systems are connected to the internet processing
untrusted data.