
x86 microcode paper


Rod Pemberton

Dec 31, 2017, 11:54:44 PM

x86 microcode paper

Does anyone here follow USENIX or Hackaday.com?

It came up on Hackaday that someone at USENIX this year hacked the x86
microcode instructions. Their pdf report documents the microcode
instruction set.

http://syssec.rub.de/research/publications/microcode-reversing/


What I'm wondering is if more efficient code can be generated by C
compilers if they "understand" the x86 micro-code?


Rod Pemberton
--
How can it be domestic terrorism if your country is a dictatorship? ...


Robert Wessel

Jan 1, 2018, 2:24:54 AM
On Sun, 31 Dec 2017 23:46:26 -0500, Rod Pemberton
<EmailN...@nospicedham.voenflacbe.cpm> wrote:

>
>x86 microcode paper
>
>Does anyone here follow USENIX or Hackaday.com?
>
>It came up on Hackaday that someone at USENIX this year hacked the x86
>microcode instructions. Their pdf report documents the microcode
>instruction set.
>
>http://syssec.rub.de/research/publications/microcode-reversing/
>
>
>What I'm wondering is if more efficient code can be generated by C
>compilers if they "understand" the x86 micro-code?


Not really. Any instruction that's microcoded pretty much needs to be
avoided in any performance critical code. There may be some minor
exception for things like the repeated string instructions on some
processors, but those aren't really microcode, except in setup, or in
special case handling (which you'd want to avoid anyway).

That's all rather processor dependent, though.

Of course if AMD or Intel fix a bug, an instruction may go from
hardwired to microcoded, and then take a considerable performance hit.

In the reference manuals, you can usually tell the microcode
instructions because they have both high latency and low throughput.
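
To make the latency/throughput distinction concrete, here is a minimal
sketch (plain C; loop bodies and counts are arbitrary and not tied to any
particular instruction): a serial dependency chain is limited by latency,
while independent chains are limited by throughput. On an out-of-order
core the second loop does four times the work in roughly the same time; a
microcoded instruction tends to look bad on both measures.

#include <stdio.h>
#include <time.h>

/* Latency-bound: every multiply depends on the previous result. */
static unsigned long dep_chain(unsigned long x, long n)
{
    for (long i = 0; i < n; i++)
        x = x * 3 + 1;
    return x;
}

/* Throughput-bound: four independent chains can overlap in the pipeline. */
static unsigned long indep_chains(unsigned long x, long n)
{
    unsigned long a = x, b = x + 1, c = x + 2, d = x + 3;
    for (long i = 0; i < n; i++) {
        a = a * 3 + 1;
        b = b * 3 + 1;
        c = c * 3 + 1;
        d = d * 3 + 1;
    }
    return a + b + c + d;
}

int main(void)
{
    const long n = 100000000;
    clock_t t0 = clock();
    volatile unsigned long r1 = dep_chain(1, n);
    clock_t t1 = clock();
    volatile unsigned long r2 = indep_chains(1, n);
    clock_t t2 = clock();
    (void)r1; (void)r2;
    printf("1 chain:  %.2fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("4 chains: %.2fs (4x the work)\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}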

In the past we had many processors that had most or all instructions
microcoded, so understanding the performance of the microcoded
instructions was more important, but back then processors were fairly
sequential, and you had to understand little beyond how many cycles
the instruction would require. On some of those machines it was even
possible to load custom microcode, which might improve performance on
some sequences.

Anton Ertl

Jan 1, 2018, 5:40:07 AM
Robert Wessel <robert...@nospicedham.yahoo.com> writes:
>On Sun, 31 Dec 2017 23:46:26 -0500, Rod Pemberton
>>http://syssec.rub.de/research/publications/microcode-reversing/

This came up a while ago at comp.arch, and a few days ago I saw a
presentation of this stuff at 34C3.

>>What I'm wondering is if more efficient code can be generated by C
>>compilers if they "understand" the x86 micro-code?
>
>
>Not really. Any instruction that's microcoded pretty much needs to be
>avoided in any performance critical code. There may be some minor
>exception for things like the repeated string instructions on some
>processors, but those aren't really microcode, except in setup, or in
>special case handling (which you'd want to avoid anyway).

I have yet to see a string instruction on AMD and Intel CPUs that
cannot be beat by using hardwired instructions.
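
For anyone who wants to check on their own machine, here is a minimal
sketch (assuming x86-64 and GCC/Clang extended inline asm; the buffer
size, iteration count, and names are arbitrary). It times REP MOVSB
against a plain 8-bytes-at-a-time copy loop; which one wins varies quite
a bit with the microarchitecture and the buffer size.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define NBYTES (1u << 16)            /* 64 KiB per copy, arbitrary */

static uint64_t src[NBYTES / 8], dst[NBYTES / 8];

/* The string instruction: REP MOVSB copies RCX bytes from [RSI] to [RDI]. */
static void copy_rep_movsb(uint64_t *d, const uint64_t *s, size_t nbytes)
{
    void *dp = d;
    const void *sp = s;
    __asm__ volatile("rep movsb"
                     : "+D"(dp), "+S"(sp), "+c"(nbytes)
                     :
                     : "memory");
}

/* A "hardwired" alternative: a simple 8-bytes-at-a-time copy loop. */
static void copy_words(uint64_t *d, const uint64_t *s, size_t nbytes)
{
    for (size_t i = 0; i < nbytes / 8; i++)
        d[i] = s[i];
}

static double time_copy(void (*f)(uint64_t *, const uint64_t *, size_t))
{
    clock_t t0 = clock();
    for (int i = 0; i < 20000; i++)
        f(dst, src, NBYTES);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    memset(src, 0xAB, sizeof src);
    printf("rep movsb: %.2fs\n", time_copy(copy_rep_movsb));
    printf("word loop: %.2fs\n", time_copy(copy_words));
    return 0;
}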

IIRC only one microcoded instruction per cycle can be decoded on both
Intel and AMD CPUs, and no other instruction can be decoded in the
same cycle, so you lose superscalarity right from the start. I don't
know how much the microinstructions of a microcoded instruction can
overlap with other stuff, but I would not be surprised if this was
more limited than for the hardwired instructions, too: the common case
that they make fast is the hardwired instructions, and the microcoded
instructions are the unloved stepchild.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Frank Tetzel

Jan 6, 2018, 10:40:52 AM
> x86 microcode paper
>
> Does anyone here follow USENIX or Hackaday.com?
>
> It came up on Hackaday that someone at USENIX this year hacked the x86
> microcode instructions. Their pdf report documents the microcode
> instruction set.
>
> http://syssec.rub.de/research/publications/microcode-reversing/

They recently also gave a talk at 34C3 [1]. It's a very interesting talk,
which among other things shows CPU backdoors implemented in microcode.


> What I'm wondering is if more efficient code can be generated by C
> compilers if they "understand" the x86 micro-code?

There's already quite a lot of information publicly available on how a
given instruction behaves: latency, throughput, and which ports it uses.
The only "bigger" blind spot is the interaction of instructions in a
sequence; there are things like micro- and macro-fusion [2] happening.
IACA [3] can analyze this statically at the assembly level.
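
As a toy example of the kind of thing those tools model, here is a minimal
sketch (assuming x86-64 and GCC/Clang extended asm; the loop body is
arbitrary and n must be non-zero). The flag-setting SUB immediately
followed by JNZ at the bottom of the loop is a typical macro-fusion pair
on many recent Intel decoders (AMD fuses CMP/TEST with a branch), so the
two instructions can be turned into a single micro-op; a static analyzer
like IACA should report whether such a pair counts as fused on the chosen
target.

/* Spin for n iterations (n > 0 assumed). The SUB/JNZ pair at the bottom is
 * the adjacent arithmetic+branch sequence that macro-fuses into one micro-op
 * on many recent Intel decoders; AMD fuses CMP/TEST+Jcc instead. */
static void spin(unsigned long n)
{
    unsigned long scratch = 0;
    __asm__ volatile(
        "1:\n\t"
        "add $1, %[tmp]\n\t"   /* placeholder payload work            */
        "sub $1, %[cnt]\n\t"   /* flag-setting decrement ...          */
        "jnz 1b\n\t"           /* ... consumed directly by the branch */
        : [cnt] "+r"(n), [tmp] "+r"(scratch)
        :
        : "cc");
}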

You would probably need to understand all of the CPU's internal state and
how it is manipulated by micro-ops. Then maybe you could find an x86
instruction sequence that triggers a more efficient micro-op sequence and
makes better use of the CPU's functional units.

I find the possibility of adding new instructions to the ISA with microcode
updates very interesting. Why not add a 1:1 mapping of micro-ops to the
ISA?


[1] https://media.ccc.de/v/34c3-9058-everything_you_want_to_know_about_x86_microcode_but_might_have_been_afraid_to_ask
[2] https://en.wikichip.org/wiki/macro-operation_fusion
[3] https://software.intel.com/en-us/articles/intel-architecture-code-analyzer/

Robert Wessel

Jan 6, 2018, 2:41:06 PM
On Sat, 6 Jan 2018 16:34:49 +0100, Frank Tetzel
<s144...@nospicedham.mail.zih.tu-dresden.de> wrote:

>I find the possibility of adding new instructions to the ISA with microcode
>updates very interesting. Why not add a 1:1 mapping of micro-ops to the
>ISA?


Back when machines were mostly microcoded, that was not an uncommon
ability. Some machines were even designed with reloadable microcode for
different applications: the Burroughs B1700s, for example, had different
microcode, and thus different ISAs, for Fortran or Cobol programs. Many
other machines had microcode, and on more than a few occasions ISA
extensions or assists of various sorts were programmed. The recent(-ish)
IBM z9s gained decimal floating-point support with a microcode update,
for example. It was slow, but it worked.

I saw a fair number of user-developed microcode extensions for S/370s
(all with at least some IBM help, since the required internals
documentation was not generally available), but these were always very
machine-specific (IOW, the microcode assist you developed for a 370/158
was of no help at all when you upgraded to a /168, or shipped to a
customer in the field with a /148).

Anyway, it's not a fashionable technique on modern machines, since
microcode tends to be pretty slow under the best of circumstances, and
is used more for handling excessively complex legacy instructions,
special cases, and internal operations.

Making the micro-ops visible has certainly been proposed, but it's not a
great idea, since they tend to be radically different from machine to
machine, often even between minor steppings. They also tend to be very
fragile, may have serious limitations from a general-purpose point of
view, and can have major side effects (again, all of which may vary
greatly from implementation to implementation).

To some extent, the more straightforward RISC machines are an attempt
to do just that, but that tends to freeze a given set of technological
tradeoffs.

James Harris

Jan 8, 2018, 9:28:23 AM
On 06/01/2018 15:34, Frank Tetzel wrote:

...

> Why not add a 1:1 mapping of micro-ops to the ISA?

Perhaps the main reasons are to do with compatibility. For example,
Intel CPUs of today have to run many of the binaries of years ago, so
they need the same encodings. New encodings could be provided in
addition to the current ones, but they would require lengthy
instructions. And they themselves would define new encodings that would
be expected to be preserved in future CPUs.

In fact, any micro-op encoding now would tend to lock itself in whereas
designers of later CPUs might want to change the internal encoding - and
we would be back to square one. So a translation from machine code to
internal encoding is not unreasonable.

Of course, the use of micro-coded instructions is not usually thought of
as a good idea now, with RISC approaches being faster.


--
James Harris

Frank Tetzel

Jan 10, 2018, 6:17:28 PM
> In fact, any micro-op encoding now would tend to lock itself in
> whereas designers of later CPUs might want to change the internal
> encoding - and we would be back to square one. So a translation from
> machine code to internal encoding is not unreasonable.

Sounds reasonable if they really make radical changes. It's quite similar
to what GPUs do: the CUDA compiler emits PTX [1], which is then translated
by the graphics driver into the actual machine code of the current GPU.
That's a luxury only a co-processor has, of course. The CPU has to do
the transformation in the instruction decoder.

[1] https://en.wikipedia.org/wiki/Parallel_Thread_Execution

Walter H.

Jan 14, 2018, 2:10:01 PM
On 01.01.2018 08:10, Robert Wessel wrote:
> On Sun, 31 Dec 2017 23:46:26 -0500, Rod Pemberton
> <EmailN...@nospicedham.voenflacbe.cpm> wrote:
>
>>
>> x86 microcode paper
>>
>> Does anyone here follow USENIX or Hackaday.com?
>>
>> It came up on Hackaday that someone at USENIX this year hacked the x86
>> microcode instructions. Their pdf report documents the microcode
>> instruction set.
>>
>> http://syssec.rub.de/research/publications/microcode-reversing/
>>
>>
>> What I'm wondering is if more efficient code can be generated by C
>> compilers if they "understand" the x86 micro-code?
>
>
> Not really. Any instruction that's microcoded pretty much needs to be
> avoided in any performance critical code.

Can you really say this?

Because within a CPU generation the more expensive parts have less microcode
and the cheaper ones have more microcode, or is this wrong these days?

That was common knowledge in the 1990s, and I guess it hasn't really changed.

> Of course if AMD or Intel fix a bug, an instruction may go from
> hardwired to microcoded, and then take a considerable performance hit.

Is this true for the expensive CPUs of a generation?

Because this could kill any speed benefit of the more expensive CPUs ...

Robert Wessel

Jan 18, 2018, 8:01:12 PM
On Sun, 14 Jan 2018 19:59:11 +0100, "Walter H."
<Walter...@nospicedham.mathemainzel.info> wrote:

>On 01.01.2018 08:10, Robert Wessel wrote:
>> On Sun, 31 Dec 2017 23:46:26 -0500, Rod Pemberton
>> <EmailN...@nospicedham.voenflacbe.cpm> wrote:
>>
>>>
>>> x86 microcode paper
>>>
>>> Does anyone here follow USENIX or Hackaday.com?
>>>
>>> It came up on Hackaday that someone at USENIX this year hacked the x86
>>> microcode instructions. Their pdf report documents the microcode
>>> instruction set.
>>>
>>> http://syssec.rub.de/research/publications/microcode-reversing/
>>>
>>>
>>> What I'm wondering is if more efficient code can be generated by C
>>> compilers if they "understand" the x86 micro-code?
>>
>>
>> Not really. Any instruction that's microcoded pretty much needs to be
>> avoided in any performance critical code.
>
>Can you really say this?
>
>Because within a CPU generation the more expensive parts have less microcode
>and the cheaper ones have more microcode, or is this wrong these days?
>
>That was common knowledge in the 1990s, and I guess it hasn't really changed.


Certainly smaller and less expensive CPUs tend to punt more special
cases and complex instructions to slow execution paths. For example,
consider the handling of denormals in FP: they *can* be handled at
approximately full speed with some hardware overhead, but many smaller
processors with FP hardware handle them much more slowly than normal
numbers. Some processors implement the longer versions of the vector
instructions as multiple shorter operations.
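
The denormal penalty is easy to see for yourself; here is a minimal
sketch (plain C, arbitrary constants; the size of the slowdown varies a
lot by core, and it disappears entirely if flush-to-zero and
denormals-are-zero are enabled in MXCSR):

#include <stdio.h>
#include <time.h>

/* Repeatedly multiply by a factor just below 1.0. Started from a subnormal
 * value, the operand and result stay subnormal for the whole loop, so cores
 * that handle denormals via a slow assist run this far slower than the same
 * loop started from a normal value. */
static double burn(double x, long n)
{
    for (long i = 0; i < n; i++)
        x *= 0.9999999999;
    return x;
}

int main(void)
{
    const long n = 50000000;
    clock_t t0 = clock();
    volatile double a = burn(1.0, n);     /* normal operands    */
    clock_t t1 = clock();
    volatile double b = burn(1e-310, n);  /* subnormal operands */
    clock_t t2 = clock();
    (void)a; (void)b;
    printf("normal:    %.2fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("subnormal: %.2fs\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}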

But this isn't the 60s, 70s or 80s, or even the 90s anymore. For the
most part, "fast" CPUs are more likely to have more OoO resources,
cache, copies of execution units, etc., but not so much differences in
the actual execution of most instructions. A Silvermont and a Skylake
can both do a single register-to-register add in a cycle, and the
Skylake can do more of them at once, but microcode is involved in
neither.

But that doesn't matter. If you want fast performance on a processor,
avoid the places where that processor is slow, and almost everything
that invokes actual microcode on modern processors is slow. IOW, fast
code on an Intel Silvermont will be different than on a Skylake.


>> Of course if AMD or Intel fix a bug, an instruction may go from
>> hardwired to microcoded, and then take a considerable performance hit.
>
>Is this true for the expensive CPUs of a generation?
>
>Because this could kill any speed benefit of the more expensive CPUs ...


Well, sure. The point of the patch is to fix a bug; it doesn't matter
whether the bug is in a Silvermont or a Skylake. In fact, the smaller
processor might well have a (slightly) better chance of seeing only a
minimal performance hit: since more of it is microcoded anyway, there's
some chance the bug falls in an already-microcoded path, which will
presumably not be impacted as badly as going from straight hardware
execution to a microcoded fix.

And the bugs can be subtle (consider the Meltdown and Spectre stuff),
and the fixes still painful (and partial).