I-cache coherency and pipeline flushing on modern x86/x64 CPUs

Rick C. Hodgin

May 16, 2017, 9:53:12 AM
A thread in comp.lang.c++ has come up regarding a particular well-optimized
video decoder whose code uses a double-jump sequence:

cmp something
jz target1
; Other code bytes
target1:
jmp target2
; Other code bytes
target2:

The OP was able to take the disassembled code, alter it as follows, and
reassemble it:

cmp something
jz target2
; Other code bytes
target1:
jmp target2
; Other code bytes
target2:

... and it still assembled, indicating that target2 was not out of range
of the jz. The OP asked why such a double jump would be present.

I suggested that the algorithm might use self-modifying code (SMC), and
that the unconditional JMP is there to flush the pipeline so that any
recent changes to the L1 instruction cache are re-fetched.

Another responder replied that this is no longer necessary on modern
Intel/AMD x86/x64 CPUs, because they all snoop the linear addresses for
SMC and automatically flush and refill the instruction pipeline, without
the explicit JMP instruction that the 486 and earlier CPUs required.
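A minimal sketch of the same-linear-address case, assuming a POSIX system on x86/x86-64 that permits a readable, writable and executable anonymous mapping (many hardened systems refuse one, in which case the helper just reports failure). The helper name smc_patch_demo and the 6-byte stub (B8 imm32 = mov eax, imm32; C3 = ret) are illustrative choices, not anything taken from the decoder under discussion. The patching store uses the same address the code is fetched from, and the indirect call that follows is itself a control transfer, so it would also satisfy the old jump-after-write rule.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS on glibc */
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: builds "mov eax, 1 ; ret", runs it, patches the
   immediate to 42 through the same address, and runs it again.
   Returns 0 on success, -1 if not x86 or the RWX mapping is refused. */
int smc_patch_demo(int *before, int *after)
{
#if defined(__x86_64__) || defined(__i386__)
    static const unsigned char stub[] = {
        0xB8, 0x01, 0x00, 0x00, 0x00, /* mov eax, 1 */
        0xC3                          /* ret        */
    };
    const size_t len = 4096;
    unsigned char *page = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;
    memcpy(page, stub, sizeof stub);

    int (*fn)(void) = (int (*)(void))page;
    *before = fn();

    /* Self-modifying store: same linear address as the instruction fetch. */
    page[1] = 42;

    /* The indirect call is a jump to the new code. */
    *after = fn();

    munmap(page, len);
    return 0;
#else
    (void)before;
    (void)after;
    return -1;
#endif
}
```

When the mapping succeeds, the stub returns 1 before the patch and 42 after it.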

Is this true? I could not find a reference to that feature in the
Intel IA-32/Intel64 manual:

https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf

The closest I found is on page 3710, and it reads:

On the Intel486 processor, a write to an instruction in the cache
will modify it in both the cache and memory. If the instruction was
prefetched before the write, however, the old version of the
instruction could be the one executed. To prevent this problem, it
is necessary to flush the instruction prefetch unit of the Intel486
processor by coding a jump instruction immediately after any write
that modifies an instruction.

The P6 family and Pentium processors, however, check whether a write
may modify an instruction that has been prefetched for execution.
This check is based on the linear address of the instruction. If
the linear address of an instruction is found to be present in the
prefetch queue, the P6 family and Pentium processors flush the
prefetch queue, eliminating the need to code a jump instruction
after any writes that modify an instruction.

-----
NOTE

The check on linear addresses described above is not in practice a
concern for compatibility. Applications that include self-modifying
code use the same linear address for modifying and fetching the
instruction. System software, such as a debugger, that might possibly
modify an instruction using a different linear address than that
used to fetch the instruction must execute a serializing operation,
such as IRET, before the modified instruction is executed.
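To illustrate the NOTE's debugger scenario, here is a sketch that writes code through one virtual address and executes it through another. It assumes Linux on x86-64 (for memfd_create and the inline CPUID); alias_patch_demo and serialize_cpuid are hypothetical names. Because the write and the fetch use different linear addresses, it does not rely on the linear-address check, and instead executes a serializing CPUID between the store and the call, as the NOTE requires.

```c
#define _GNU_SOURCE /* memfd_create */
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#if defined(__linux__) && defined(__x86_64__)
/* CPUID is a serializing instruction; the "memory" clobber also keeps the
   compiler from moving the code store past it. */
static void serialize_cpuid(void)
{
    unsigned int eax = 0, ebx, ecx = 0, edx;
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    (void)ebx;
    (void)edx;
}
#endif

/* Hypothetical helper: writes "mov eax, value ; ret" through a writable
   alias and runs it through a separate executable alias of the same page.
   Returns 0 on success, -1 if unsupported or refused by the system. */
int alias_patch_demo(int32_t value, int32_t *result)
{
#if defined(__linux__) && defined(__x86_64__)
    const size_t len = 4096;
    int fd = memfd_create("smc-alias", 0);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }
    unsigned char *w = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    unsigned char *x = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);
    close(fd); /* the mappings keep the memory alive */
    if (w == MAP_FAILED || x == MAP_FAILED) {
        if (w != MAP_FAILED) munmap(w, len);
        if (x != MAP_FAILED) munmap(x, len);
        return -1;
    }

    unsigned char stub[6] = { 0xB8, 0, 0, 0, 0, 0xC3 }; /* mov eax, imm32 ; ret */
    memcpy(stub + 1, &value, sizeof value);              /* little-endian imm32 */
    memcpy(w, stub, sizeof stub);                        /* write via address w */

    serialize_cpuid(); /* needed: the fetch uses a different linear address */

    *result = ((int32_t (*)(void))x)();                  /* execute via address x */
    munmap(w, len);
    munmap(x, len);
    return 0;
#else
    (void)value;
    (void)result;
    return -1;
#endif
}
```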

I know you don't want to reply to me ... but I am still asking for
help. As I understand it (and my knowledge may be specific to the
old 486-and-prior way of doing things), whenever you used SMC you
always added a JMP $+2 so that the pipeline would be refilled with
any changes that were now in the instruction cache but not yet in
the pre-decoded pipeline.

Thank you,
Rick C. Hodgin

Rick C. Hodgin

May 16, 2017, 11:10:22 AM
I did find this guidance for current architectures (page 2918). It
warns that SMC behavior is model-specific, and one of the options it
suggests is the one mentioned above: using a JMP instruction.

8.1.3 Handling Self- and Cross-Modifying Code

The act of a processor writing data into a currently executing code
segment with the intent of executing that data as code is called
self-modifying code. IA-32 processors exhibit model-specific behavior
when executing self modified code, depending upon how far ahead of
the current execution pointer the code has been modified.

As processor microarchitectures become more complex and start to
speculatively execute code ahead of the retirement point (as in P6
and more recent processor families), the rules regarding which code
should execute, pre- or post-modification, become blurred. To write
self-modifying code and ensure that it is compliant with current
and future versions of the IA-32 architectures, use one of the
following coding options:

(* OPTION 1 *)
Store modified code (as data) into code segment;
Jump to new code or an intermediate location;
Execute new code;

(* OPTION 2 *)
Store modified code (as data) into code segment;
Execute a serializing instruction; (* For example, CPUID instruction *)
Execute new code;

The use of one of these options is not required for programs
intended to run on the Pentium or Intel486 processors, but are
recommended to ensure compatibility with the P6 and more recent
processor families.

It goes on to describe steps for ensuring that cross-modifying code in
a multi-CPU environment maintains immediate coherency. This is based
more on protocol than on architecture:

(* Action of Modifying Processor *)
Memory_Flag <- 0; (* Set Memory_Flag to value other than 1 *)
Store modified code (as data) into code segment;
Memory_Flag <- 1;

(* Action of Executing Processor *)
WHILE (Memory_Flag != 1)
Wait for code to update;
ELIHW;
Execute serializing instruction; (* For example, CPUID instruction *)
Begin executing modified code;

That seems to make sense anyway because you're essentially dispatching
code on the other CPU(s) by some executive control on the current one.
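A sketch of the quoted protocol in C11, under stated assumptions: memory_flag, publish_code, wait_for_code and serialize are hypothetical names. The flag uses release/acquire atomics so that neither the compiler nor the CPU can let the executing side see the flag before the code bytes. On x86-64, CPUID provides the serializing step; the fence used on other targets is only a placeholder and does not serialize instruction fetch. The code buffer is handled here as plain bytes; actually running it would also need an executable mapping (for example, mmap with PROT_EXEC).

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical shared flag ("Memory_Flag" in the SDM pseudocode). */
static atomic_int memory_flag;

static inline void serialize(void)
{
#if defined(__x86_64__)
    unsigned int eax = 0, ebx, ecx = 0, edx;
    __asm__ __volatile__("cpuid" /* serializing instruction */
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    (void)ebx;
    (void)edx;
#else
    /* Placeholder only: NOT an instruction-stream serialization. */
    atomic_thread_fence(memory_order_seq_cst);
#endif
}

/* Action of the modifying processor. */
void publish_code(unsigned char *code_seg, const unsigned char *new_code,
                  size_t len)
{
    /* Flag must be != 1 before the code changes (assumes the executing
       side has not already consumed a stale 1 from a previous round). */
    atomic_store_explicit(&memory_flag, 0, memory_order_relaxed);
    memcpy(code_seg, new_code, len); /* store modified code as data */
    atomic_store_explicit(&memory_flag, 1, memory_order_release);
}

/* Action of the executing processor; the caller then jumps to the code. */
void wait_for_code(void)
{
    while (atomic_load_explicit(&memory_flag, memory_order_acquire) != 1)
        ; /* wait for code to update */
    serialize();
}
```

In use, one thread calls publish_code and another calls wait_for_code and then branches to the updated code, for example through a function pointer.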