Practicality of microarchitecture-optimized register allocation?

Paul A. Clayton

unread,

May 9, 2013, 11:58:41 AM5/9/13

to

How practical would it be for a compiler to coordinate register
allocation with microarchitecture? (Some coordination would be
based on ABIs.)

E.g., "Implicit Hints: Embedding Hint Bits in Programs without
ISA Changes" (Hans Vandierendonck and Koen De Bosschere, 2010)
proposed using register names for hints and "Knapsack: A
Zero-Cycle Memory Hierachy Component" (Todd Austin et al., 1993)
proposed using a register for a global or TLS pointer to allow direct
indexing using immediate offsets of a pre-loaded cache (adding
valid bits and/or version tags would be a simple extension of
this concept and similar optimizations might be applied to
accesses relative to a stack pointer).

Distinguishing between highly temporary values, moderate-lifespan
values, and pseudo-constants might be useful for optimizing the
register file (as well as forwarding, operand capture, and value
caching) and register renaming. Similarly, distinguishing use of
values (addressing, counter, etc.) might facilitate some
microarchitectural optimizations.

Such specialization of registers might also be used architecturally
to improve code density by encoding operand information in the
opcode (implicit operands) or by encoding operation information in
the operand.

(Instruction addresses could also be used to communicate metadata,
but the flexibility of code placement at fine granularity seems to
be more limited than for register allocation. For coarse-grained
metadata, placement is less of a problem. Likewise, coarse-grained
metadata would be more amenable to use of PTE bits, lower-
overhead predictors, or software mode setting.)

It seems to me that there would be a few issues with supporting
such. First, typical call interfaces seem generally not to be
sophisticated enough to support specialized uses (in many cases
the information may not be available--e.g., there might not be a
mechanism to communicate that the 'n' argument in strncpy is a
counter).

Second, the cost of developing the compiler infrastructure to
support such optimizations would need to be justified by the
benefit--which might be limited to specific hardware vendors or
even specific hardware implementations under specific workloads.
When system software is developed independently, the incentives for
cross optimization are reduced.

(A similar problem involves the difficulty of having various
hardware vendors cooperate in defining and using hint-level
information. Implementations which ignore such information might
discourage the use of such optimizations in software and
implementations that use different hint-level semantics would
seem to be even more fragmenting. In addition, what information
is important and able to be communicated might change with time
as well as with targeted uses. While using hint-level semantics
ensures result compatibility, substantial performance incompatibilty
can still restrict implementation choices. Some optimizations
might be for energy-efficiency only [so performance benefits would
be limited to the additional thermal/power headroom], but the
opportunities for such optimizations would seem to be very
limited.)

Third, optimizations that communicate additional semantic
information to the hardware would sometimes conflict with
optimizations to avoid load-store traffic or register-register
moves (or other optimizations). Properly weighing the tradeoffs
could be very complex, increasing compiler complexity and the time
required to generate optimized code.

If the compiler issues are not too great, such a mechanism for
communicating information to the hardware (with hint-level
semantics) would seem to have significant potential for software-
hardware codesign.

timca...@aol.com

unread,

May 9, 2013, 4:32:22 PM5/9/13

to

Andy Glew mentioned explicit operand chaining in another thread. That would be one way of giving a hint to the microarch, especially in relation to forwarding results (it also potentially allows fewer "temporary" registers).

It seems like the other big hit is letting the microarch know when a register's value will no longer be used. You can do this, somewhat, by XOR reg,reg on the x86, but it seems to me it might be more useful to have an instruction that explicitly says a register's value is dead. That can be potentially used to cause exceptions if it is used anyway and to reduce the state save/restore overhead for interrupts (and perhaps call/return).

Both of these should be fairly straightforward to implement with current compiler technology.

- Tim

Paul A. Clayton

unread,

May 9, 2013, 5:29:36 PM5/9/13

to

On May 9, 4:32 pm, timcaff...@aol.com wrote:
> Andy Glew mentioned explicit operand chaining in another
> thread. That would be one way of giving a hint to the
> microarch, especially in relation to forwarding results (it
> also potentially allows fewer "temporary" registers).

Some time ago I suggested using a zero register for one
time use results. After each read, the "zero" register
would be reset to zero, so it could be used for normal
zero register uses without extra overhead. (This would
not be binary compatible with most RISCs since nops and
certain hints [e.g., prefetches] are allowed to use the
zero register as a destination, but binary translation
would be fairly trivial [as would per-page modes to
support legacy operation, but PTE bits are a bit
scarce to be used for such a trivial use IMO].)

> It seems like the other big hit is letting the microarch
> know when a register's value will no longer be used. You
> can do this, somewhat, by XOR reg,reg on the x86, but it
> seems to me it might be more useful to have an instruction
> that explicitly says a register's value is dead. That can
> be potentially used to cause exceptions if it is used anyway
> and to reduce the state save/restore overhead for interrupts
> (and perhaps call/return).

Zero compression and similar special handling of zero values
might not be that difficult, so using register clear to avoid
save/restore overhead might be practical. In theory, one
could use SUB rX, rX as a signaling register clearing operation,
but preserving the signaling property through memory might be
problematic (at the same granularity as ECC coverage, a chunk
could be given a maximally uncorrectable error, perhaps with
the data value equal to the virtual address to indicate a
software issue rather than an actual detected uncorrectable
error).

> Both of these should be fairly straightforward to implement
> with current compiler technology.

Both of those are potential uses of specialized register
names, though the later might have a stronger architectural
component. (Explicit operand chaining might be provided as
a hint with the possibility of prediction being used to
choose to ignore the hint and choices about how expensive
incorrect hints would be [e.g., localized replay or
equivalent of an exception/branch misprediction].) Such are
also unlikely to greatly hinder register allocation
optimization.

Optimizations like grouping addresses by potential aliasing
(e.g., defining four collections of register names that
only allow internal aliasing when used as base addresses
for accessing memory) could facilitate some energy-saving
hardware optimizations (possibly even as hints), but such
would be more difficult for a compiler to use and would
interfere more with register allocation.

Using the same architectural register for all jump table
base addresses would potentially allow some optimization of
indirect branch predictors (both in making recognizing
load rX <- [rB + offset] as loading from "offset" entry
in a jump table and in potentially getting rB value
information to the front-end early enough to influence
branch prediction). (I think Nick Maclaren mentioned this
possibility some time ago.)

Presumably other pieces of information could also be
useful and be communicated by "redundant" encodings (as
register names are somewhat "redundant"). However, first
the compiler needs to be able to get such information
(and know how useful it is to the hardware) and then it
needs to be able to communicate the information.

Mark Thorson

unread,

May 9, 2013, 9:34:28 PM5/9/13

to

timca...@aol.com wrote:
>
> It seems like the other big hit is letting the microarch know
> when a register's value will no longer be used. You can do this,
> somewhat, by XOR reg,reg on the x86, but it seems to me it might
> be more useful to have an instruction that explicitly says a
> register's value is dead. That can be potentially used to cause
> exceptions if it is used anyway and to reduce the state
> save/restore overhead for interrupts (and perhaps call/return).

Yes, an invalid bit for each register does this.
It can be hidden from the architectural model,
so it's upward compatible from the x86 model.

MitchAlsup

unread,

May 10, 2013, 2:59:17 AM5/10/13

to

On Thursday, May 9, 2013 3:32:22 PM UTC-5, timca...@aol.com wrote:
> Andy Glew mentioned explicit operand chaining in another thread. That
> would be one way of giving a hint to the microarch, especially in relation
> to forwarding results (it also potentially allows fewer "temporary" registers).

Several GPUs Have Source Register specifiers being FW instead of Reg. That
is the instruction source register specifier has an encoding that states that
the operand will arrive via the forwarding network and does not need to be read
from the register file.

> It seems like the other big hit is letting the microarch know when a
> register's value will no longer be used.

Several GPUs have destination specifiers that explicity state the result is
consumed on the forwarding network and does not need to be written into the
register file.

All you really need is the ability to express such at the instruction set level.

Mitch

Ivan Godard

unread,

May 10, 2013, 12:31:03 PM5/10/13

to

An alternative is to specify what is needed later, rather than
specifying what can be discarded. The Mill took such a "keep this"
approach - transient values (i.e. what would be off the forwarding
network on a legacy machine, plus those that go to registers but whose
last use is soon) cease being addressable after they fall off the belt.
If something has a longer lifetime than it lives on the belt then an
explicit "writer" op places it in scratchpad, where it will live until
overwritten (or function exit) and can be retrieved (by a "reader" op)
as often as desired. Retrieved scratchpad values are belt transients
again, and can be addressed several times until they again fall off, but
don't have to be replaced in scratchpad because the values are already
there.

We did it that way because transients are more common than
longer-lifetime values, so it's cheaper in the encoding to save/restore
than to kill, so long as the save/restore is compact. It also saves a
lot on the operand distribution trees, because your high-speed network
has fewer nodes once all the not-right-now values are off somewhere with
only a few ports.

A typical Mill configuration has a 16-position belt (high-speed,
replaces bypass and transient register use) and 512 bytes of scratchpad
(low speed, three ports). Spill to memory due to running out of
scratchpad is surprisingly rare. The power/area is a quarter that
required by a machine with the several hundred physical registers that
is typical these days.

Stephen Fuld

unread,

May 10, 2013, 12:54:26 PM5/10/13

to

On 5/9/2013 11:59 PM, MitchAlsup wrote:

What do you do if you take an interrupt between the producing
instruction and the consuming instruction? Don't you need some place to
save the intermediate value across the interrupt?

I see this mechanism saving power (no need to write and read the
intermediate register most of the time, but not reducing the size of the
state needed to save on an interrupt. I guess you could prevent
interrupts between the two instructions, but that could lead to problems
too.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Paul A. Clayton

unread,

May 10, 2013, 1:52:04 PM5/10/13

to

On May 10, 12:54 pm, Stephen Fuld <SF...@alumni.cmu.edu.invalid>
wrote:

> On 5/9/2013 11:59 PM, MitchAlsup wrote:

[snip]

>> Several GPUs Have Source Register specifiers being FW instead of Reg. That
>> is the instruction source register specifier has an encoding that states that
>> the operand will arrive via the forwarding network and does not need to be read
>> from the register file.

[snip]

>> Several GPUs have destination specifiers that explicity state the result is
>> consumed on the forwarding network and does not need to be written into the
>> register file.
>
>> All you really need is the ability to express such at the instruction set level.
>
> What do you do if you take an interrupt between the producing
> instruction and the consuming instruction? Don't you need some place to
> save the intermediate value across the interrupt?

A GPU taking an interrupt??

Theoretically, one could preserve such "pipeline registers"
across an interrupt. (ISTR that the Motorola 88100 had
some funky [for modern times] pipeline register saving
mechanism for floating point operations when there was an
interrupt.)

> I see this mechanism saving power (no need to write and read the
> intermediate register most of the time, but not reducing the size of the
> state needed to save on an interrupt. I guess you could prevent
> interrupts between the two instructions, but that could lead to problems
> too.

If one only allowed interrupts at specific points (rather
than being operation precise), it might be possible to
reduce the number of temporary values that needed to be
preserved.

(There *might* also be some way to minimize temporaries
in an unrolled and software pipelined loop for an interrupt
by requiring a resumption to perform a ramp-up as in the
start of the loop. [Determining whether that could be done
would take more concentrated thought that I am willing to
expend now; I do not fully grok the complexities of
software pipelined loops.])

Ivan Godard

unread,

May 10, 2013, 3:25:20 PM5/10/13

to

It depends on your replay strategy. Most OOO uses an instruction replay.
This is, they simply discard in-flights, and restart after the interrupt
at the last point at which everything had retired. Any operations that
had issued but not retired are discarded.

The problem with this approach is that operations vary in latency, so
things retire in an order that differs from the issue order. To provide
issue-order semantics (which is necessary in this model) the early
retires must be deferred so that retire order is the same as issue
order, at least as far as visible state is concerned. This is the
genesis of the reorder buffer.

The advantage of instruction replay is that, if you have to reorder
retires anyway then there's no real need to issue in issue order either;
instead you issue in dataflow order, not issue order, and the
combination is OOO as we know it.

The alternative is to use result replay. In result replay, the interrupt
saves all already-retired state for later restore, just like in an OOO.
However, operands that are in-flight are not discarded when the
interrupt is taken, but are captured into runout buffers and not retired
in their expected cycle. Later, after the interrupt returns, the saved
retire state is restored and the program restarted after the last issued
instruction (in contrast to instruction replay, which restarts after the
last retired instruction). As execution proceeds, the runout buffers
retire their contents (the previously in-flight operands) in the cycle
in which they would have retired had the interrupt not happened.

The drawback to this approach is that you have to save all effective
state changes of the in-flights to the runout buffers, including any
faults or traps, so that they can be replayed later, and this can be
complicated if the rest of the machine is not designed to support
deferrable faults. You also have to be able to save and replay a fair
amount of state, possibly nested, without slowing down the rest of the
machine. And result replay requires that the retire timing is well
defined, which precludes any advantage of early-out ops (although
early-out is no longer used much except for LOAD).

The advantage of result replay is that no actual progress is thrown
away, and you save big on instruction buffers, schedule queues,
scoreboards or the like, and reorder buffers. It also lets you have
transparently simple call/return semantics because if you can run over
an interrupt then you can also run over an explicit call. Thus you can
start an FMA now, make a call, and have the FMA result retire after the
call returns.

The Mill uses result replay.

Paul A. Clayton

unread,

May 10, 2013, 5:04:33 PM5/10/13

to

On May 10, 3:25 pm, Ivan Godard <i...@ootbcomp.com> wrote:
[snip]

> defined, which precludes any advantage of early-out ops (although
> early-out is no longer used much except for LOAD).

[begin niggle]
If by "no longer used much" you mean 'not many implementations',
I would disagree--a fair number of recent implementations have
early out for integer division. (If you meant 'not many
operations' or 'not many dynamic occurrences', then I would
tend to agree. [There _might_ be embedded processors that
have early out for multiplication; ISTR that PowerPC750 had
early out for multiplication if a specific operand was only
16 bits in magnitude, but I would not trust my memory.])
[end niggle]

Paul A. Clayton

unread,

May 10, 2013, 5:11:14 PM5/10/13

to

On May 10, 5:04 pm, "Paul A. Clayton" <paaronclay...@gmail.com> wrote:
[snip]

> tend to agree. [There _might_ be embedded processors that
> have early out for multiplication; ISTR that PowerPC750 had
> early out for multiplication if a specific operand was only
> 16 bits in magnitude, but I would not trust my memory.])

I just found a PowerPC 750 data sheet (admittedly almost
eight years old!) that indicates that it did have "Early out
multiply".

MitchAlsup

unread,

May 11, 2013, 11:26:41 AM5/11/13

to

On Friday, May 10, 2013 11:54:26 AM UTC-5, Stephen Fuld wrote:
> What do you do if you take an interrupt between the producing
> instruction and the consuming instruction? Don't you need some place to
> save the intermediate value across the interrupt?

A) there is no such thing as an interrupts wrt GPUs at this level
B) the timing is such that the compiler has already guarenteed that no
storage other than the normal pipeline flip-flops have to exist.

C) lets say that there were an interrupt, to which of the 1024 cores would
it be directed? And if there were something catching interrupts wouldn't
you want it to be Turring Complete--which is why there is an ARM core
in there to catch it for you.

Mitch

Stephen Fuld

unread,

May 11, 2013, 12:53:24 PM5/11/13

to

On 5/11/2013 8:26 AM, MitchAlsup wrote:
> On Friday, May 10, 2013 11:54:26 AM UTC-5, Stephen Fuld wrote:
>> What do you do if you take an interrupt between the producing
>> instruction and the consuming instruction? Don't you need some place to
>> save the intermediate value across the interrupt?
>
> A) there is no such thing as an interrupts wrt GPUs at this level

While I understand that GPUs don't have interrupts at this level, I
thought the thrust of this thread was about optimizations applicable to
general purpose CPUs. My question was directed to how could you
implement this optimization on a CPU.

> B) the timing is such that the compiler has already guarenteed that no
> storage other than the normal pipeline flip-flops have to exist.
>
> C) lets say that there were an interrupt, to which of the 1024 cores would
> it be directed? And if there were something catching interrupts wouldn't
> you want it to be Turring Complete--which is why there is an ARM core
> in there to catch it for you.

You could have an architecture where one core took all external
interrupts and that core didn't use this optimization.

I think that you could also prevent interrupts while one of these
forwarding operations was in process (i.e.interrupts could occur not
after every instruction), but you would have to limit the length of
"chains" of such instructions to prevent interrupts lock outs forever.
Forcing external interrupts to wait for up to say 4-5 instructions isn't
too bad and would gain much of the benefit of the forwarding.

Brett Davis

unread,

May 11, 2013, 10:50:30 PM5/11/13

to

In article
<8cfc0711-c04d-4691...@e13g2000yqp.googlegroups.com>,

The Mips processor in the Playstation One does early out multiply.
http://en.wikipedia.org/wiki/PlayStation_(console)
http://en.wikipedia.org/wiki/R3000

R3000 dates to 1988, only the second arg is checked for high zeros
if I remember correct, think it was each high byte, could have been nibble.

A full multiplier takes too many gates for embedded, you loop on a
smaller multiplier, and it is easy to do early out checks.

MitchAlsup

unread,

May 12, 2013, 5:35:35 PM5/12/13

to

On Saturday, May 11, 2013 11:53:24 AM UTC-5, Stephen Fuld wrote:
> While I understand that GPUs don't have interrupts at this level, I
> thought the thrust of this thread was about optimizations applicable to
> general purpose CPUs. My question was directed to how could you
> implement this optimization on a CPU.

When the instruction sequence is performing these automatic data-flow
connections, just defer the interrupt! Then tell the compiler (and assembler)
to only allow a certain number of instructions to be done "in a row". You
can also not allow this trickery to be performed across cache boundaries.

All in all, perhaps the "not allows to cross a cache line boundary" would
be the best.

Mitch