
compiler generate jmp to another jmp


qak

May 15, 2017, 12:56:49 PM
I disassembled 2 highly optimized DirectShow filters (CoreAVC and Lentoid,
known to be the fastest for H.264 and H.265) and noticed many (more than a
few dozen) patterns like the following:
jz Label_1 ; usually a short jump, but some are unconditional jumps
...
Label_1:
jmp Label_123 ; quite a few of them jump again to another jump
...
...
Label_123:
jmp Label_234

I could replace the first line with 'jz Label_234' and reassemble without
a problem.
Could the optimizer have missed the opportunity? Or do I know too little!
The first line is not always a short jump; sometimes it is even 'call Label_1'.
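
If I understand the terminology, what I did by hand is what compiler
writers call "jump threading". A minimal sketch of the rewrite, with
the same hypothetical labels as above:

jz Label_234 ; was: jz Label_1 -> jmp Label_123 -> jmp Label_234
...
Label_1:
jmp Label_123 ; chain left in place for any other jumps that use it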

Rick C. Hodgin

May 15, 2017, 1:14:31 PM
Some CPUs only allow conditional jumps which reach a certain
distance (number of bytes) from their own instruction. As such,
it's a common occurrence to see the conditional jump move only to
a nearby instruction which then does a hard jump to some location
further away. A more common practice is to jump locally on the
inverted condition, and then hard-jump when the original condition
holds. However, for optimization, when a particular branch is
known to be more likely or less likely to be taken, it can be
coded either way.
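
A sketch of that second form (hypothetical labels; the short Jcc
encodings take only a rel8 displacement of -128 to +127 bytes):

jnz Local_skip ; inverted condition stays within short range
jmp Far_target ; unconditional near jmp takes a rel32, reaches anywhere
Local_skip:
...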

In this case, since you are able to modify the source code and
generate a binary without an error, it's likely legacy code
left over from a 32-bit codec, which supported only -128 to +127
bytes of branch target offset, whereas in 64-bit code it can
go pretty much anywhere needed.

It may also be done that way for alignment, so the code
nearby is aligned on a particular boundary. Without knowing
the specifics of the algorithm, it would be hard to say for
certain, but it's not uncommon to see what you found there,
and the reasons it exists are typically limitations of
the ISA, or something related to optimization.

Thank you,
Rick C. Hodgin

Rick C. Hodgin

May 15, 2017, 1:42:20 PM
Something else ... a hard JMP instruction forces a refill of the
instruction cache. If the algorithm is using self-modifying code,
that could be the reason for the dual-target jump. On a conditional
branch, the pipeline is only invalidated and refilled if the branch
unit mis-predicted the branch. By forcing a hard branch, it will
then re-fill the instruction code in the pipeline, which will read
the recently altered self-modifying code (if it existed).

If you re-post your question to alt.lang.asm, or to comp.lang.asm.x86
if it's x86-specific, then you'll get better answers.

Scott Lurndal

May 15, 2017, 1:54:54 PM
"Rick C. Hodgin" <rick.c...@gmail.com> writes:
>On Monday, May 15, 2017 at 1:14:31 PM UTC-4, Rick C. Hodgin wrote:
>> On Monday, May 15, 2017 at 12:56:49 PM UTC-4, qak wrote:
>> > I disassembled 2 highly optimized DirectShow filters (CoreAVC and Lentoid,
>> > known to be the fastest for H.264 and H.265) and noticed many (more than a
>> > few dozen) patterns like the following:
>> > jz Label_1 ; usually a short jump, but some are unconditional jumps
>> > ...
>> > Label_1:
>> > jmp Label_123 ; quite a few of them jump again to another jump
>> > ...
>> > ...
>> > Label_123:
>> > jmp Label_234
>> >
>> > I could replace the first line with 'jz Label_234' and reassemble without
>> > a problem.
>> > Could the optimizer have missed the opportunity? Or do I know too little!
>> > The first line is not always a short jump; sometimes it is even 'call Label_1'.
>>
>> Some CPUs only allow conditional jumps which reach a certain
>> distance (number of bytes) from their own instruction.
>>


>
>Something else ... a hard JMP instruction forces a refill of the
>instruction cache.

No, it doesn't. The only difference between an unconditional
branch operation and a conditional branch operation is whether or
not instructions are speculatively executed. Modern branch
predictors are pretty good at preventing speculation down the
wrong branch path.

Intel's instruction cache is snooped specifically to support
self-modifying code without the programmer having to do anything
special (like flushing the L1I cache).


> By forcing a hard branch, it will
>then re-fill the instruction code in the pipeline, which will read
>the recently altered self-modifying code (if it existed).

No, the unconditional branch does nothing other than
changing the flow of execution. It has no effect on cache
coherency.

Rick C. Hodgin

May 15, 2017, 2:20:09 PM
That is information beyond my existing understanding. I just
did a search online and Randall Hyde wrote on page 447 of his
book:

"Write Great Code, Vol. 2: Thinking Low-Level, Writing High-Level"
https://books.google.com/books?id=mM58oD4LATUC&pg=PA447

"Although these statements typically compile to a single
machine instruction (jmp), don't get the impression they
are efficient to use. Even ignoring the fact that a jmp
can be somewhat expensive (because it forces the CPU to
flush the instruction pipeline), statements that branch
out of a loop can have..."

When did the JMP instruction stop forcing a refill of the cache?

Rick C. Hodgin

May 15, 2017, 2:27:59 PM
On Monday, May 15, 2017 at 1:54:54 PM UTC-4, Scott Lurndal wrote:
> "Rick C. Hodgin" <rick.c...@gmail.com> writes:
> >Something else ... a hard JMP instruction forces a refill of the
> >instruction cache.

I just realized that I wrote "instruction cache" here when I meant
to write "instruction pipeline." I realize the cache is not
invalidated; only what has already been pre-loaded into the
instruction pipeline is.

Those pre-decoded instructions already in the pipeline, which may
have come from reads made before SMC updated something, are
invalidated by the JMP, and the pipeline re-fills from the
instruction cache.

If that is incorrect, then my information is notably out of date
because that's how it used to work.

> No, it doesn't. The only difference between an unconditional
> branch operation and a conditional branch operation is whether or
> not instructions are speculatively executed. Modern branch
> predictors are pretty good at preventing speculation down the
> wrong branch path.
>
> Intel's instruction cache is snooped specifically to support
> self-modifying code without the programmer having to do anything
> special (like flushing the L1I cache).
>
> > By forcing a hard branch, it will
> >then re-fill the instruction code in the pipeline, which will read
> >the recently altered self-modifying code (if it existed).
>
> No, the unconditional branch does nothing other than
> changing the flow of execution. It has no effect on cache
> coherency.

How does the CPU synchronize instructions which have been pre-fetched
from now stale instruction data for an upcoming instruction that's
already begun decoding for its pipeline, to then later signal without
the hard JMP to know that it's going to be executing stale SMC? That
would be quite a slick feature to have in a CPU, so that no matter
when SMC was used, it always executed the correct version.

Scott Lurndal

May 16, 2017, 8:38:13 AM
"Flush the instruction pipeline" != "refill the cache".

At any point in time, there may be a dozen or more instructions
_currently being executed_ in various stages of the processor
pipeline. In addition, the processor will use a branch predictor
to select the direction of a conditional branch in order to keep
the pipeline full. If the choice was poor, instructions speculatively
fetched and executed will need to be discarded from the pipeline, causing
a pipeline stall. This has nothing to do with the cache, but it
does have an effect on performance.

The advice given in the book above is incorrect in that it doesn't
account for the branch predictors, which are quite good in modern
processors (see, e.g., TAGE).

Scott Lurndal

May 16, 2017, 8:40:15 AM
"Rick C. Hodgin" <rick.c...@gmail.com> writes:

>> No, the unconditional branch does nothing other than
>> changing the flow of execution. It has no effect on cache
>> coherency.
>
>How does the CPU synchronize instructions which have been pre-fetched
>from now stale instruction data for an upcoming instruction that's
>already begun decoding for its pipeline, to then later signal without

All x86 processors (intel, amd) snoop the L1 cache, and if the
line changes, the pipeline is flushed.

Self-modifying code should be avoided on all processors, under all
circumstances (I'm not counting JIT as self-modifying in this context).

Rick C. Hodgin

May 16, 2017, 9:09:30 AM
I realize that. I said in a later message that I used the wrong
words in these cases. I apologize for the confusion. I meant
instruction pipeline at all points, not instruction cache.

qak

May 16, 2017, 9:15:24 AM
"Rick C. Hodgin" <rick.c...@gmail.com> wrote in
news:1df79527-2593-4a01...@googlegroups.com:

> On Monday, May 15, 2017 at 1:14:31 PM UTC-4, Rick C. Hodgin wrote:
>> Some CPUs only allow conditional jumps which reach a certain
>> distance (number of bytes) from their own instruction. As such,
>> it's a common occurrence to see the conditional jump move only to
>> a nearby instruction which then does a hard jump to some location
>> further away. A more common practice is to jump locally on the
>> inverted condition, and then hard-jump when the original condition
>> holds. However, for optimization, when a particular branch is
>> known to be more likely or less likely to be taken, it can be
>> coded either way.
>>
>> In this case, since you are able to modify the source code and
>> generate a binary without an error, it's likely legacy code
>> left over from a 32-bit codec, which supported only -128 to +127
>> bytes of branch target offset, whereas in 64-bit code it can
>> go pretty much anywhere needed.
>>
>> It may also be done that way for alignment, so the code
>> nearby is aligned on a particular boundary. Without knowing
>> the specifics of the algorithm, it would be hard to say for
>> certain, but it's not uncommon to see what you found there,
>> and the reasons it exists are typically limitations of
>> the ISA, or something related to optimization.
>
> Something else ... a hard JMP instruction forces a refill of the
> instruction cache. If the algorithm is using self-modifying code,
> that could be the reason for the dual-target jump. On a conditional
> branch, the pipeline is only invalidated and refilled if the branch
> unit mis-predicted the branch. By forcing a hard branch, it will
> then re-fill the instruction code in the pipeline, which will read
> the recently altered self-modifying code (if it existed).
>

Thanks for your thought.
I'm glad "it's not uncommon", so compiler writers must know about them.

Scott Lurndal

May 16, 2017, 9:21:18 AM
Fortunately, a JMP instruction will _not_ flush the pipeline, since
it has a 100% prediction rate.

Rick C. Hodgin

May 16, 2017, 9:27:06 AM
On Tuesday, May 16, 2017 at 8:40:15 AM UTC-4, Scott Lurndal wrote:
> "Rick C. Hodgin" <rick.c...@gmail.com> writes:
> >How does the CPU synchronize instructions which have been pre-fetched
> >from now stale instruction data for an upcoming instruction that's
> >already begun decoding for its pipeline, to then later signal without
> All x86 processors (intel, amd) snoop the L1 cache, and if the
> line changes, the pipeline is flushed.

I am aware that all processors snoop data writes and update the L1
instruction cache, but I am not aware that they will automatically
flush the pipeline if they detect a write to an address that's
already been decoded and is in the pipe.


I don't see in the IA-32/Intel64 architecture manual where it says
the pipeline will be flushed if it detects changes in the L1
instruction cache:

https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf

I would appreciate a reference to where this happens, because to my
current knowledge it does not automatically happen.

> Self-modifying code should be avoided on all processors, under all
> circumstances (I'm not counting JIT as self-modifying in this
> context).

I have understood the reason why applications should avoid SMC is
because CPUs will not automatically flush the pipeline when they detect
changes. The L1 instruction cache will be updated, but that will
only affect the next pass through the code when a new set of load-
and-decode operations takes place on those opcode bytes.

Rick C. Hodgin

May 16, 2017, 9:46:40 AM
On Tuesday, May 16, 2017 at 9:27:06 AM UTC-4, Rick C. Hodgin wrote:
> On Tuesday, May 16, 2017 at 8:40:15 AM UTC-4, Scott Lurndal wrote:
> > "Rick C. Hodgin" <rick.c...@gmail.com> writes:
> > >How does the CPU synchronize instructions which have been pre-fetched
> > >from now stale instruction data for an upcoming instruction that's
> > >already begun decoding for its pipeline, to then later signal without
> > All x86 processors (intel, amd) snoop the L1 cache, and if the
> > line changes, the pipeline is flushed.
>
> I am aware that all processors snoop data writes and update the L1
> instruction cache, but I am not aware that they will automatically
> flush the pipeline if they detect a write to an address that's
> already been decoded and is in the pipe.
>
>
> I don't see in the IA-32/Intel64 architecture manual where it says
> the pipeline will be flushed if it detects changes in the L1
> instruction cache:
>
> https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf
>
> I would appreciate a reference to where this happens, because to my
> current knowledge it does not automatically happen.

I see this on page 3709 of the manual linked above:

On the Intel486 processor, a write to an instruction in the cache
will modify it in both the cache and memory. If the instruction was
prefetched before the write, however, the old version of the
instruction could be the one executed. To prevent this problem, it
is necessary to flush the instruction prefetch unit of the Intel486
processor by coding a jump instruction immediately after any write
that modifies an instruction.

The P6 family and Pentium processors, however, check whether a write
may modify an instruction that has been prefetched for execution.
This check is based on the linear address of the instruction. If
the linear address of an instruction is found to be present in the
prefetch queue, the P6 family and Pentium processors flush the
prefetch queue, eliminating the need to code a jump instruction
after any writes that modify an instruction.

-----
NOTE

The check on linear addresses described above is not in practice a
concern for compatibility. Applications that include self-modifying
code use the same linear address for modifying and fetching the
instruction. System software, such as a debugger, that might possibly
modify an instruction using a different linear address than that
used to fetch the instruction must execute a serializing operation,
such as IRET, before the modified instruction is executed.

I don't see any references describing what happens on more modern
architectures, only these notes regarding compatibility with prior
architectures (Pentium, Pentium Pro and derivatives).

On this website, I find a reference to more modern behavior, namely
that the CPU will snoop the instruction cache and invalidate the
pipeline:

http://blog.onlinedisassembler.com/blog/?p=133

... but the exact behavior is model-specific, and not an
inherent trait of the architecture which exists across all models in
the families.
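
For what it's worth, the Intel486-era idiom from the first quoted
paragraph would look something like this (a sketch only; the labels,
the patched byte, and a writable code section are all assumed):

mov byte [Patch_site], 0x90 ; store a new opcode (here NOP) into code as data
jmp Flush_486 ; on the Intel486, any jump flushes the prefetch unit
Flush_486:
Patch_site:
nop ; the instruction byte that was just rewritten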

Scott Lurndal

May 16, 2017, 10:44:59 AM
"Rick C. Hodgin" <rick.c...@gmail.com> writes:
>On Tuesday, May 16, 2017 at 8:40:15 AM UTC-4, Scott Lurndal wrote:
>> "Rick C. Hodgin" <rick.c...@gmail.com> writes:
>> >How does the CPU synchronize instructions which have been pre-fetched
>> >from now stale instruction data for an upcoming instruction that's
>> >already begun decoding for its pipeline, to then later signal without
>> All x86 processors (intel, amd) snoop the L1 cache, and if the
>> line changes, the pipeline is flushed.
>
>I am aware that all processors snoop data writes and update the L1
>instruction cache, but I am not aware that they will automatically
>flush the pipeline if they detect a write to an address that's
>already been decoded and is in the pipe.
>
>
>I don't see in the IA-32/Intel64 architecture manual where it says
>the pipeline will be flushed if it detects changes in the L1
>instruction cache:

Volume 3A:

"A write to a memory location in a code segment that is currently
cached in the processor causes the associated cache line (or lines)
to be invalidated. This check is based on the physical address
of the instruction. In addition, the P6 family and Pentium
processors check whether a write to a code segment may modify
an instruction that has been prefetched for execution. If the
write affects a prefetched instruction, the prefetch queue is
invalidated. This latter check is based on the linear address
of the instruction. For the Pentium 4 and Intel Xeon processors,
a write or a snoop of an instruction in a code segment, where the
target instruction is already decoded and resident in the trace
cache, invalidates the entire trace cache. The latter behavior
means that programs that self-modify code can cause severe
degradation of performance when run on the Pentium 4 and Intel Xeon processors."


>
>I have understood the reason why applications should avoid SMC is
>because CPUs will not automatically flush the pipeline when they detect
>changes. The L1 instruction cache will be updated, but that will
>only affect the next pass through the code when a new set of load-
>and-decode operations takes place on those opcode bytes.

Besides being a performance killer, self-modifying code on non-intel
architectures requires flushing the instruction cache, which non-privileged
code may or may not be able to accomplish.

Just don't do it. Ever. The days of 4KW address spaces are long in the past.

Rick C. Hodgin

May 16, 2017, 11:01:25 AM
I see similar language in section 11.6 of that manual on page 3090, but
it is only referring to operation of P6 and Pentium families. We are
several families beyond that generation now.

I do find this guidance for current architectures, with the warning that
SMC behavior is model-specific (page 2918), and one of the options it
suggests is the one mentioned above (to use a JMP instruction):

8.1.3 Handling Self- and Cross-Modifying Code

The act of a processor writing data into a currently executing code
segment with the intent of executing that data as code is called
self-modifying code. IA-32 processors exhibit model-specific behavior
when executing self modified code, depending upon how far ahead of
the current execution pointer the code has been modified.

As processor microarchitectures become more complex and start to
speculatively execute code ahead of the retirement point (as in P6
and more recent processor families), the rules regarding which code
should execute, pre- or post-modification, become blurred. To write
self-modifying code and ensure that it is compliant with current
and future versions of the IA-32 architectures, use one of the
following coding options:

(* OPTION 1 *)
Store modified code (as data) into code segment;
Jump to new code or an intermediate location;
Execute new code;

(* OPTION 2 *)
Store modified code (as data) into code segment;
Execute a serializing instruction; (* For example, CPUID instruction *)
Execute new code;

The use of one of these options is not required for programs
intended to run on the Pentium or Intel486 processors, but are
recommended to ensure compatibility with the P6 and more recent
processor families.

Option (1) above may be what this code in the video
codec is doing.
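
A sketch of option (2), assuming a writable code section and a
hypothetical label (CPUID clobbers EAX, EBX, ECX and EDX, so the
sketch also assumes those registers are free at this point):

mov byte [Target], 0xC3 ; store modified code (a RET opcode) as data
xor eax, eax
cpuid ; serializing instruction; drains everything in flight
call Target ; execute the freshly written instruction
...
Target:
nop ; placeholder byte overwritten above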

> >I have understood the reason why applications should avoid SMC is
> >because CPUs will not automatically flush the pipeline when they detect
> >changes. The L1 instruction cache will be updated, but that will
> >only affect the next pass through the code when a new set of load-
> >and-decode operations takes place on those opcode bytes.
>
> Besides being a performance killer, self-modifying code on non-intel
> architectures requires flushing the instruction cache, which non-privileged
> code may or may not be able to accomplish.
>
> Just don't do it. Ever. The days of 4KW address spaces are long
> in the past.

SMC is a requirement for true optimization. It is more of a startup
modification once the capabilities of the machine are assessed, but
you can minimize instruction cache pollution by dynamically altering
your code at runtime so that algorithms which won't be required in
this runtime instance are no longer present, etc.

It's tricky, but you'll never reach the performance levels of code
which uses it, even if it's only an up-front modification or a
first-pass modification. That pass does take the performance hit,
but from that point on there is no further penalty, and the newly
applied optimizations result in fewer code bytes being cached for
the same algorithm, and in more optimized code based on variations
known about at compile-time which could only be fully resolved at
run-time, once the machine state was identified.

David Brown

May 17, 2017, 5:39:29 PM
On 16/05/17 15:26, Rick C. Hodgin wrote:
> On Tuesday, May 16, 2017 at 8:40:15 AM UTC-4, Scott Lurndal wrote:
>> "Rick C. Hodgin" <rick.c...@gmail.com> writes:
>>> How does the CPU synchronize instructions which have been pre-fetched
>> >from now stale instruction data for an upcoming instruction that's
>>> already begun decoding for its pipeline, to then later signal without
>> All x86 processors (intel, amd) snoop the L1 cache, and if the
>> line changes, the pipeline is flushed.
>
> I am aware that all processors snoop data writes and update the L1
> instruction cache, but I am not aware that they will automatically
> flush the pipeline if they detect a write to an address that's
> already been decoded and is in the pipe.
>

Minor point - perhaps all /x86/ processors snoop data writes and update
the L1 instruction cache and/or flush instruction caches. But that
certainly does not apply to /all/ processors. Most processors, I think,
assume that self-modifying code is a bygone technique that dropped out
of fashion many decades ago, and do not waste the rather significant
design effort and silicon space needed for such detection.

On the ARM and PPC, IIUIC, if you want to change code you write it to
memory (as data writes), then flush the relevant addresses in the data
cache to push the change into memory or combined cache levels, then
discard the relevant lines in instruction cache, then issue an
instruction pipeline flush instruction. /Then/ you can start executing
from the new code.


David Brown

May 17, 2017, 5:43:44 PM
On 16/05/17 17:01, Rick C. Hodgin wrote:
> On Tuesday, May 16, 2017 at 10:44:59 AM UTC-4, Scott Lurndal wrote:
>> Besides being a performance killer, self-modifying code on non-intel
>> architectures requires flushing the instruction cache, which non-privileged
>> code may or may not be able to accomplish.
>>
>> Just don't do it. Ever. The days of 4KW address spaces are long
>> in the past.
>
> SMC is a requirement for true optimization. It is more of a startup
> modification once the capabilities of the machine are assessed, but
> you can minimize instruction cache pollution by dynamically altering
> your code at runtime so that algorithms which won't be required in
> this runtime instance are no longer present, etc.
>

What is "true optimisation" ? Do you think that being able to modify
code at run-time somehow brings orders of magnitude greater performance?

The whole point of instruction caches is that code that is often used
gets into the cache, while code that is not used, stays out. This
happens automatically in the cache - you don't need to /modify/ the code
to achieve the effect. In some operating systems (such as Linux), code
that is not executed on a particular platform might not even be loaded
off the disk when you run a program.

qak

May 18, 2017, 10:18:22 AM
David Brown <david...@hesbynett.no> wrote in
news:ofifvt$7i3$1...@dont-email.me:

> On 16/05/17 17:01, Rick C. Hodgin wrote:
>> On Tuesday, May 16, 2017 at 10:44:59 AM UTC-4, Scott Lurndal wrote:
>>> Besides being a performance killer, self-modifying code on non-intel
>>> architectures requires flushing the instruction cache, which
>>> non-privileged code may or may not be able to accomplish.
>>>
>>> Just don't do it. Ever. The days of 4KW address spaces are long
>>> in the past.
>>
>> SMC is a requirement for true optimization. It is more of a startup
>> modification once the capabilities of the machine are assessed, but
>> you can minimize instruction cache pollution by dynamically altering
>> your code at runtime so that algorithms which won't be required in
>> this runtime instance are no longer present, etc.
>>
>
> What is "true optimisation" ? Do you think that being able to modify
> code at run-time somehow brings orders of magnitude greater
> performance?
>
> The whole point of instruction caches is that code that is often used
> gets into the cache, while code that is not used, stays out. This
> happens automatically in the cache - you don't need to /modify/ the
> code to achieve the effect. In some operating systems (such as
> Linux), code that is not executed on a particular platform might not
> even be loaded off the disk when you run a program.
>

Can an individual instruction be 'hot', or can only a whole block of
code be cached? I always wish I could write:
if(AMD) do REP RET
else do RET
and then, after the first run, have all those first lines disappear
from every PROC.

Rick C. Hodgin

May 18, 2017, 10:30:24 AM
On Thursday, May 18, 2017 at 10:18:22 AM UTC-4, qak wrote:
> > On 16/05/17 17:01, Rick C. Hodgin wrote:
> >> On Tuesday, May 16, 2017 at 10:44:59 AM UTC-4, Scott Lurndal wrote:
> >>> Besides being a performance killer, self-modifying code on non-intel
> >>> architectures requires flushing the instruction cache, which
> >>> non-privileged code may or may not be able to accomplish.
> >>>
> >>> Just don't do it. Ever. The days of 4KW address spaces are long
> >>> in the past.
> >>
> >> SMC is a requirement for true optimization. It is more of a startup
> >> modification once the capabilities of the machine are assessed, but
> >> you can minimize instruction cache pollution by dynamically altering
> >> your code at runtime so that algorithms which won't be required in
> >> this runtime instance are no longer present, etc.
>
> Can an individual instruction be 'hot', or can only a whole block of
> code be cached? I always wish I could write:
> if(AMD) do REP RET
> else do RET
> and then, after the first run, have all those first lines disappear
> from every PROC.

Compilers do not do this today because SMC is generally regarded as
one of the biggest performance killers due to pipeline refills. But
if it is done properly as a first-pass operation, or as another form
of dynamic linking applied by the compiler at run-time, selecting
among the various options it found for maximal optimization and
included in the app's startup code, then it is merely the compiler
itself directing the final form of the code based on run-time
observations rather than static compile-time observations.

The compiler could determine there are 50 different models which could
exist, and based on which machine they're running on, which OS version,
how much memory is installed, how much memory is available, etc., give
options and choices for maximum performance.

By encoding those in the startup code, and then dynamically linking it
all together at runtime based on a runtime test of the operating
environment, the compiler could finish its work of producing the most
optimized version possible for the runtime machine. It would not
merely enable flags which traverse the code differently, but literally
re-arrange the code so that the footprint of the dynamically linked
version in memory only includes those things which are needed for this
instance.
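
A sketch of that start-up step in 64-bit assembly (the names and the
feature test are hypothetical, and this version patches a pointer
rather than the code bytes themselves, which captures most of the
benefit without any SMC penalty):

section .data
Decode_ptr: dq Decode_generic ; default chosen at assemble time

section .text
Startup:
mov eax, 1
cpuid ; leaf 1 returns feature bits in ECX
test ecx, 1 << 19 ; bit 19 = SSE4.1, as an example test
jz Keep_generic
lea rax, [rel Decode_sse41]
mov [rel Decode_ptr], rax ; swap in the specialized routine once
Keep_generic:
...
call [rel Decode_ptr] ; later call sites dispatch to the winner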

And by adding dynamic runtime analysis of what functions are called
most often, they could be rearranged to have a minimal impact on the
cache by moving the most frequently called functions into a common
area that would persist in the L1 cache longer because of its frequent
use.

A lot of options become possible when you look at a compiler as more
than just an intermediate translator between source code and object
code. When you recognize that the job isn't done until the code is
running in a real runtime environment, then you just need to produce
mechanical features which go along with the code at various stages,
able to act upon it and produce the best option for the environment
at hand.

Note also that I am considering these features mostly for larger
machines, including modern mobile devices and traditional PCs and
laptops / notebooks. I cannot see it being a usable feature in
the embedded environments, except for where the embedded CPUs are
now getting into more and more capable machines, with extra memory
where such factors would make a difference.

I can see this type of optimization being most desirable in server
farms, and on widely distributed applications in something like a
supercomputer center, where the goal is maximum app performance,
minimal machine use, and maximum throughput of jobs.

David Brown

May 18, 2017, 10:59:49 AM
Caches work on a line at a time - typically something like 16 to 64
bytes long. Details will vary from processor to processor, and can be
different for different cache levels.

> I always wish I could write:
> if(AMD) do REP RET
> else do RET
> and then, after the first run, have all those first lines disappear
> from every PROC.
>

What difference do you think this would make? "rep ret" is executed
exactly like "ret" on all processors, while avoiding a performance bug
in early AMD x86-64 processors. Since the instruction is typically at
the end of a function, and typically followed by padding to the next
16-byte boundary, it has only a one-in-sixteen chance of wasting a byte
of code space. That is, the cost is negligible, and far smaller than
keeping a list of addresses of function returns that need to be patched.

A compiler such as gcc may use "rep ret" if compiling for a target that
includes these old devices. If you use switches such as "-mtune" or
"-march" to give a newer minimum required processor (AMD or Intel), it
will generate "ret".
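
For reference, the idiom in question looks like this (a sketch; on
the AMD K8 and K10, a ret that is the direct target of a branch
predicts poorly, and the two-byte rep ret sidesteps that):

Some_proc:
test edi, edi
jz .done ; conditional jump straight to the return
...
.done:
rep ret ; decodes exactly like ret; some assemblers want db 0xF3 + ret

Nothing here needs patching at run-time; the prefix costs one byte
and is harmless on every other processor.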

qak

May 19, 2017, 8:41:08 AM
David Brown <david...@hesbynett.no> wrote in
news:ofkcmh$rff$1...@dont-email.me:

>>> Can an individual instruction be 'hot', or can only a whole block of
>>> code be cached?
>
> Caches work on a line at a time - typically something like 16 to 64
> bytes long. Details will vary from processor to processor, and can be
> different for different cache levels.
>
How does the cache system work when decoding video? Is the data cache
filled with previous frames which are not needed again? If I reopen the
movie with the NO_CACHE flag, then immediately close it, I find the file
cache nearly empty, so everything was evicted to make room for the movie.
Does it take so much code to decode a frame that the next frame finds the
code cache filled with now-useless code?
Thanks for sharing.

David Brown

May 19, 2017, 10:06:47 AM
You are mixing up a number of concepts here.

First, you are talking about an operating system's file or disk cache
here. That is not directly related to a processor's caches (though they
are both caches).

Secondly, you are talking about data caching - the thread here has been
about instruction or code caching. When playing a movie, the data
changes from frame to frame (or block to block, when it is compressed).
Once a block has been played, you don't need it again, and it will get
dropped from caches (processor caches, and OS caches). But the code
used to interpret that data stays the same, and you want that to stay in
your instruction caches.


Scott Lurndal

May 19, 2017, 10:08:34 AM
qak <q...@mts.net.NOSPAM> writes:
>David Brown <david...@hesbynett.no> wrote in
>news:ofkcmh$rff$1...@dont-email.me:
>
>>> Can an individual instruction be 'hot', or can only a whole block of
>>> code be cached?
>>
>> Caches work on a line at a time - typically something like 16 to 64
>> bytes long. Details will vary from processor to processor, and can be
>> different for different cache levels.
>>
>How does the cache system work when decoding video? Is the data cache
>filled with previous frames which are not needed again?

That depends entirely on the algorithm that is decoding the
video and whether it is hardware-assisted or not. If software
is doing the decoding, then unless it uses non-temporal operations,
it's likely that the encoded and decoded data will both occupy
cache lines. The cache replacement algorithms generally select
the least-recently-used line for replacement.


> If I reopen the
>movie with the NO_CACHE flag, then immediately close it, I find the file
>cache nearly empty, so everything was evicted to make room for the movie.

The file cache (maintained by the kernel) is unrelated to the data cache
which David was discussing. The file cache is simply otherwise unused
memory regions (generally discontiguous pages), while the data cache is
a high-speed caching structure located between memory and the processor.

An Intel processor, for example, has three levels of
cache between the processor and memory. The cache closest to
the processor (L1) is also the fastest, at about 4 cycles load-to-use
latency (1.3ns @3GHz); this cache is the smallest, with 32 Kbytes
for data and 32 Kbytes for instructions. The next level of cache
(L2) is around 256 KB and has a longer latency (perhaps 10ns), while
the third level of cache (which is shared by all processors) can
be up to 25 MBytes, but with even longer latency (perhaps 30-40ns).
DRAM latency is in the 50-150ns range, depending on many factors
including the speed and width of the memory bus and whether the
memory is local or remote to a given processor socket.