On Thursday, November 17, 2016 at 4:24:05 PM UTC-5,
supe...@casperkitty.com wrote:
> On Thursday, November 17, 2016 at 2:02:42 PM UTC-6, Rick C. Hodgin wrote:
> > On Thursday, November 17, 2016 at 2:35:46 PM UTC-5, supercat wrote:
> > > Optimizations without the keyword would require a multi-pass compiler.
> > > The cost of pass to determine register usage prior to code generation
> > > may not be huge, but if compilation speed is a priority, allowing variables
> > > to be fixed in registers using the "register" keyword would allow a
> > > fraction of the benefits of register optimization to be obtained at a lower
> > > compilation-time cost than would be needed for most sophisticated analysis.
> >
> > Are there still people targeting a one-pass compiler? I had people in
> > this C group tell me that I wouldn't be able to compile CAlive source
> > code in one pass, as if that was a reason to reconsider some of my
> > syntax allowances. My reply was, "So?"
>
> The thread was benchmarking, among other things, tcc. I may be mistaken,
> but I was under the impression that tcc is a single-pass compiler which
> places a substantial design emphasis on compilation speed.
It does. Its goals were to be compatible, fast, and small, without
trying to generate the best code. It met all of those goals. :-)
> > In this day and age with $25 quad-core CPUs running over 2 GHz, and
> > 16GB memory sticks that today cost $70 each, and 2TB hard drives which
> > cost $70 each ... what limitations on compute are we talking about?
>
> Primarily situations in which it would be necessary to be able to compile
> and generate code on the fly. For example, if one wanted to write a virtual
> machine with the ability to perform just-in-time compilation, a design that
> generates C code and compiles that could be more versatile than one that
> tries to generate machine code, but if the compiler isn't very fast the
> system might spend more time trying to optimize code than it would have
> spent running a non-optimized version.
To be honest, I would rather have a slower JIT that's able to produce
better code, because the compilation is run once, its result stored in
a cache, and then executed many times. In addition, more efficient
generated code would mean less energy use long-term.
> > > Not really. If a processor has enough registers that the code generator
> > > could get by even if some were unusable, all the generator needs to do is
> > > (1) allow a field in the symbol table to indicate whether a variable is in
> > > a register or memory, and (2) have a code generator support instruction
> > > forms that can use a register instead of memory. If the code generator
> > > would use registers as a rolling-window value stack, declaring variables
> > > as "register" might in some cases increase the number of register spills in
> > > code that isn't using those variables, but in other cases it might offer
> > > a big savings in variable-access costs without causing any extra register
> > > spills.
> >
> > My apologies. I thought we were discussing the 32-bit code generation
> > of TCC32 on this madel() function, and in this particular case the x86.
> > The 80386 and up have a very limited register set, and certain operations
> > can only be conducted in certain registers. Within that constraint,
> > it's very difficult to generate code which honors developer-specified
> > register assignments.
>
> Turbo C was able to get by reserving SI and DI for register variables
> when targeting the 8088/8086. I see no reason a version targeting the
> 80386 instruction set shouldn't be able to reserve at least ESI and EDI,
> and possibly EBX as well. Are there any cases where simple code generation
> would need more than EAX, ECX (for shift), EDX, and ESP/EBP (stack and
> frame)?
I think the compiler would gain more by using those registers for
short-lived local values across multiple instructions than by pinning
one variable in a register and keeping it there, which may force
additional memory accesses because there aren't enough spare registers
left to cycle other data through.
Modern CPUs are also out-of-order (OoO), so a re-load from the L1 data
cache can be hoisted a few instructions earlier: issue the memory read
in one instruction, schedule other work in between, and by the time
the value is actually referenced it has already arrived.
There are also hints you can give the CPU about what to prefetch into
which cache level, allowing oft-used memory to be brought into L1 and
explicitly kept there for a time.
> On the floating-point side, would there be any difficulty keeping
> track of which x87 registers hold clean or dirty variables and simply caching
> the register values on an as-convenient basis?
No. It's non-trivial, but it's also not overly complex. In the same
way, the ebp register can be freed by not using a stack frame: the
compiler keeps track of each variable's esp-relative offset as data is
pushed onto or popped off the stack. That's what happens when you use
the compiler option to omit frame pointers. ebp also has the
beneficial side effect that its default segment is SS:, which means it
accesses data on the stack without any segment-override prefix bytes,
though in many OSes all segment registers are mapped to the same base
and paging is used to isolate the visible portions.
> > But, for AMD64 or some RISC machines with 32+ registers, it would be
> > much easier and could be handled in a first pass effort, but there is
> > then also the issue of spill and fill as you don't want to trump some
> > values used in another function that may have also been optimizing
> > something to use registers ... so now you're into a depth of call
> > analysis to see how you can structure registers to affect the fewest
> > things in the least way. A single-pass compiler could not do that,
> > and the benefits of assigning registers could be easily undone by
> > the spill/fill overhead requirements.
>
> Multi-pass compilers are often useful, but there may be other use cases
> involving dynamic code generation where compiler performance is critical.
I can't see a real benefit to compiler performance being critical. And
given that multiple cores exist, I would go so far as to say that if
initial launch speed were so crucial that it required the fastest
possible turnaround, then introduce a fast single-pass stage as part
of a larger mechanism: the single pass generates the "rough" code that
begins running immediately, while an n-pass compiler performs more
advanced analysis in the background and follows up with the "smooth"
optimized code once the additional passes are completed.
But I would consider that to be such a niche application that I can
honestly not see a use for it.
> > My personal view is the issue of optimization is an essential component
> > of any program that will enter into production, and isn't required to
> > be changed often, or even need to be hot-patched. It only makes sense
> > as it is a maximal use of available resources with the least expenditure
> > of energy. But ... that being said, I don't see computers ever getting
> > slower. I see electrical engineers working on newer technologies which
> > continue to increase the width of CPU engines, and the speed at which
> > they operate, reducing energy footprints and expanding capabilities.
>
> Mobile changes things quite a bit. I don't think battery technology will
> ever improve to the point that devices that do a lot will be able to run
> so long that consumers wouldn't prefer that they run even longer.
Consider the power envelopes we're dealing with. The original 4004 CPU
was built on a 10,000 nm process. Introduced in 1971, it ran at a
clock speed of roughly 740 kHz, required 15V, had 2,300 transistors,
and consumed 0.45 watts, or about 0.195 milliwatts per transistor
(roughly 195,000 nanowatts).
Flash forward to 2016, 45 years later, and we now have CPUs like
Intel's Core M, which has 1.3 billion transistors and consumes 4.5
watts, or about 0.000003 milliwatts per transistor (roughly 3
nanowatts), on a 14 nm process. 10 nm processes are now arriving, and
roadmaps push toward 7 nm and below.
That's a tremendous improvement in 45 years. Where are we headed?
Physics is going to give us a hard wall in silicon feature sizes
within a few years. We won't be able to get down much below ~5 nm,
because leakage and quantum effects leave too little signal margin to
reliably distinguish the switch. So electrical engineers are going to
have to figure something else out. Whether it's carbon nanotubes,
photonics, or something else entirely ... I don't know, but it's
coming. Our ingenuity and research dollars will keep pushing forward.
> > I see a future where even simple things like greeting cards which play
> > music when you open them have a quad-core 64-bit CPU or greater. Why?
> > Because it will be so inexpensive to manufacture that model in bulk
> > that it will be used for a wide range of things.
>
> What kind of battery is that greeting card going to have? The number of
> computations that can be performed per joule of energy has increased over
> the years, but slower processors can often perform more computations per
> joule than faster ones.
Considering the power trend, a decrease of roughly 65,000x per
transistor in 45 years (a compound reduction of about 28% per year),
where will we be in 10 years? We'll be at the point where a tiny
64-bit CPU built at near-atomic scale can have the same 1.5+ billion
transistors for 1/10,000th the power budget, resulting in a birthday
card that consumes more power amplifying the digital signal into the
audible range than it does powering the entire digital engine. And
that digital engine could be written in C# running atop .NET 9.5, in
interpreted mode, because it also contains a gigabyte of on-die
memory. :-)
Who knows ... my point is, if you had told someone from the 1950s that
they'd someday have a greeting card they could open up that plays a
song for them, they'd think you were crazy. So when I say it will
someday soon be played on a quad-core 64-bit CPU where it takes more
power to amplify the digital signal to audible sound than it does to
run the computing device generating the sound ... it's not so far-
fetched as you might think.