On Thursday, May 18, 2017 at 10:18:22 AM UTC-4, qak wrote:
> > On 16/05/17 17:01, Rick C. Hodgin wrote:
> >> On Tuesday, May 16, 2017 at 10:44:59 AM UTC-4, Scott Lurndal wrote:
> >>> Besides being a performance killer, self-modifying code on non-intel
> >>> architectures requires flushing the instruction cache, which
> >>> non-privileged code may or may not be able to accomplish.
> >>>
> >>> Just don't do it. Ever. The days of 4KW address spaces are long
> >>> in the past.
> >>
> >> SMC is a requirement for true optimization. It is more of a startup
> >> modification once the capabilities of the machine are assessed, but
> >> you can minimize instruction cache pollution by dynamically altering
> >> your code at runtime so that algorithms which won't be required in
> >> this runtime instance are no longer present, etc.
>
> Can an individual instruction be 'hot', or can only a whole block of code
> be cached? I always wish I could write:
>     if (AMD) do REP RET
>     else     do RET
> and then, after the first run, have those first lines disappear from every PROC.

Compilers do not do this today because SMC is generally regarded as
one of the biggest performance killers: every patch to live code forces
a pipeline refill. But if it is done properly, as a one-time first pass,
or as another form of dynamic linking in which the compiler records the
optimization choices it found and the app's startup code applies the
best of them, then it is simply the compiler directing the final
optimization from run-time observations rather than static compile-time
ones.
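
As a concrete illustration of that first-pass idea (and of the "REP RET"
wish above), here is a minimal sketch in C for GCC/clang on x86-64 Linux.
The routine name patchable_ret and its byte layout are my own assumptions,
not something any compiler emits today; the point is only that a single
patch made at startup, before any hot loop runs, replaces what would
otherwise be a branch executed on every call.

#include <stdint.h>
#include <unistd.h>
#include <sys/mman.h>
#include <cpuid.h>

/* A routine whose exact byte layout we control: a one-byte NOP (0x90)
   followed by RET (0xC3).  On AMD parts that prefer "REP RET", the NOP
   is rewritten once at startup into a REP prefix (0xF3). */
__asm__(
    ".text\n"
    ".p2align 4\n"           /* keep both bytes inside one page */
    ".globl patchable_ret\n"
    "patchable_ret:\n"
    "  .byte 0x90\n"
    "  ret\n"
);
extern void patchable_ret(void);

static int cpu_is_amd(void)
{
    unsigned a, b, c, d;
    if (!__get_cpuid(0, &a, &b, &c, &d))
        return 0;
    /* CPUID leaf 0 returns the vendor string "AuthenticAMD" in EBX, EDX, ECX. */
    return b == 0x68747541 && d == 0x69746e65 && c == 0x444d4163;
}

/* One-time startup patch: runs before main() and never again. */
__attribute__((constructor))
static void patch_for_amd(void)
{
    if (!cpu_is_amd())
        return;
    unsigned char *p = (unsigned char *)patchable_ret;
    long page = sysconf(_SC_PAGESIZE);
    unsigned char *base = (unsigned char *)((uintptr_t)p & ~(uintptr_t)(page - 1));
    /* Make the code page temporarily writable, patch, restore.  Strict
       W^X policies may refuse this, which is why it must stay optional. */
    if (mprotect(base, (size_t)page, PROT_READ | PROT_WRITE | PROT_EXEC) != 0)
        return;
    p[0] = 0xF3;   /* NOP -> REP prefix, i.e. "rep ret" */
    mprotect(base, (size_t)page, PROT_READ | PROT_EXEC);
}

int main(void)
{
    patchable_ret();   /* "rep ret" on AMD, "nop; ret" everywhere else */
    return 0;
}

On x86 the patch takes effect without an explicit instruction-cache flush;
on most other architectures something like __builtin___clear_cache() would
also be needed, which is exactly the portability problem Scott raised.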

The compiler could determine that, say, 50 different code models are
possible, and then, based on which machine the program is running on,
which OS version, how much memory is installed, how much memory is
available, and so on, offer options and choices for maximum performance.
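
A sketch of the kind of startup probe that could feed such a decision, for
a glibc/Linux target; _SC_PHYS_PAGES, _SC_AVPHYS_PAGES and
_SC_NPROCESSORS_ONLN are glibc extensions rather than strict POSIX, and
the idea of folding the answers into a "model index" is purely illustrative:

#include <stdio.h>
#include <unistd.h>
#include <sys/utsname.h>

int main(void)
{
    struct utsname u;
    long psize = sysconf(_SC_PAGESIZE);
    long phys  = sysconf(_SC_PHYS_PAGES);     /* installed memory (pages)  */
    long avail = sysconf(_SC_AVPHYS_PAGES);   /* currently available pages */
    long cores = sysconf(_SC_NPROCESSORS_ONLN);

    if (uname(&u) == 0)
        printf("OS: %s %s\n", u.sysname, u.release);
    printf("cores online : %ld\n", cores);
    printf("RAM installed: %ld MiB\n", phys  * psize / (1024 * 1024));
    printf("RAM available: %ld MiB\n", avail * psize / (1024 * 1024));

    /* Startup code could fold these observations into an index that
       selects one of the N code models the compiler shipped. */
    return 0;
}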

By encoding those models in the startup code, and then dynamically
linking them together based on a run-time test of the operating
environment, the compiler could finish its work of producing the most
optimized version possible for the machine at hand. It would not merely
set flags that steer execution down different paths, but literally
re-arrange the code so that the footprint of the dynamically linked
image in memory includes only those things needed for this instance.
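
What that run-time link could look like with nothing more exotic than
ordinary C is sketched below. The variant names and the AVX2 criterion are
invented for the example, but the shape of it, resolve once at startup and
never test again, is the same idea glibc already uses for its
IFUNC-dispatched string routines.

#include <stdio.h>
#include <string.h>
#include <cpuid.h>

/* Two hypothetical variants of one routine that the compiler could emit. */
static void copy_generic(void *dst, const void *src, size_t n) { memcpy(dst, src, n); }
static void copy_wide(void *dst, const void *src, size_t n)    { memcpy(dst, src, n); } /* stand-in for an AVX2 body */

static int cpu_has_avx2(void)
{
    unsigned a, b, c, d;
    if (!__get_cpuid_count(7, 0, &a, &b, &c, &d))   /* recent GCC/clang <cpuid.h> */
        return 0;
    return (b >> 5) & 1;                            /* CPUID.7.0:EBX bit 5 = AVX2 */
}

/* Resolved exactly once, before main() runs; the rest of the program never
   tests the CPU again, it just calls through the pointer. */
static void (*copy_impl)(void *, const void *, size_t);

__attribute__((constructor))
static void resolve_copy(void)
{
    copy_impl = cpu_has_avx2() ? copy_wide : copy_generic;
}

int main(void)
{
    char out[6];
    copy_impl(out, "hello", 6);
    puts(out);
    return 0;
}

Going further and never loading the unchosen variants at all is what would
shrink the in-memory footprint described above.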

And by adding dynamic run-time analysis of which functions are called
most often, the code could be rearranged to minimize its impact on the
cache: the most frequently called functions would be gathered into a
common region that persists in the L1 instruction cache longer because
of its frequent use.
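
Today's toolchains already expose a static approximation of this that
shows what the layout side would look like: GCC and clang place functions
marked hot into a .text.hot subsection and cold ones into .text.unlikely,
so the hot set shares as few instruction-cache lines as possible. The call
counters and function names below are invented for the sketch; a run-time
version would gather the counts itself and regroup the code accordingly.

#include <stdio.h>

/* Call counters a profiling pass might maintain (illustrative only). */
static unsigned long calls_parse, calls_emit;

/* The profile says these two are the hottest, so they are grouped together. */
__attribute__((hot, noinline))
static int parse_token(int c) { calls_parse++; return c + 1; }

__attribute__((hot, noinline))
static int emit_token(int c)  { calls_emit++;  return c * 2; }

/* Rarely used code is pushed out of the hot region entirely. */
__attribute__((cold, noinline))
static void report(int last)
{
    printf("parse: %lu calls, emit: %lu calls, last value: %d\n",
           calls_parse, calls_emit, last);
}

int main(void)
{
    int v = 0;
    for (int i = 0; i < 1000000; i++)
        v = emit_token(parse_token(v & 0xff));
    report(v);
    return 0;
}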

A lot of options become possible when you look at a compiler as more
than just an intermediate translator between source code and object
code. Once you recognize that the job isn't done until the code is
running in a real runtime environment, you just need to build the
machinery that travels with the code through its various stages, able
to act on it and produce the best version for the environment at hand.

Note also that I am considering these features mostly for larger
machines: modern mobile devices, traditional PCs, and laptops /
notebooks. I cannot see it being usable in embedded environments,
except where embedded CPUs are growing into more and more capable
machines, with enough extra memory for such factors to make a
difference.

I can see this type of optimization being most desirable in server
farms, and in widely distributed applications at something like a
supercomputer center, where the goals are maximum application
performance, minimal machine use, and maximum job throughput.