Hey people, I would appreciate it if someone could give me the exact
advantages and disadvantages of general-purpose registers and condition
code registers.
One obvious advantage of having general-purpose registers instead of a
stack-based approach (as in the Java virtual machine) is speed (the
stack-based approach makes very frequent accesses to memory). The
stack-based approach also makes the code for a given program longer,
since many pushes and pops are required.
I also assume that having no GPRs and only the stack makes the
instruction set simpler and more compact, since only the operation would
need to be specified: the sources and destination are implicit (the
stack).
Having condition codes in an ISA makes life simpler, since they provide
an extra bit of functionality for free. For example, suppose you execute
an arithmetic instruction (like an add). If you did not have condition
codes, you would have to compare the result of this instruction with 0
to determine whether the result was > 0 or < 0. But if you have a
condition code bit 'N' which is set when the result is negative, you are
spared that extra compare.
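To make the "for free" part concrete, here is a C sketch (the struct and function names are invented for illustration; in hardware the N bit falls out of the same adder) of an ALU add that reports the negative flag, next to the explicit compare you would otherwise need:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: an ALU add that also reports the 'N' (negative)
 * condition code, so no separate compare-with-zero is needed. */
typedef struct { int32_t result; int n_flag; } alu_out;

alu_out add_with_flags(int32_t a, int32_t b) {
    alu_out out;
    out.result = a + b;          /* the arithmetic result */
    out.n_flag = out.result < 0; /* N comes from the same operation */
    return out;
}

/* Without condition codes, the extra compare is an explicit step: */
int is_negative(int32_t x) { return x < 0; }
```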
A disadvantage of condition codes is that they cause dependencies in an
instruction stream, i.e., they contribute to serializing it,
since a conditional branch instruction following
the arithmetic instruction must wait for the condition codes
to be set before it can resolve the branch. This necessitates
introducing stalls/delay slots into pipelines, or adopting
approaches like out-of-order execution.
Another disadvantage is that condition codes increase the
chances of a programming error: if you're not careful about the values
of the condition codes while writing a program, you may observe
unexpected behaviour.
Hope that helped. Others, please correct me if I'm wrong somewhere.
-Mayank
Condition flag settings are not obvious from source code.
Besides, condition flags are easily clobbered by any
arithmetic or logic instruction that might need to be inserted during
the maintenance phase.
There are condition evaluations that result in control transfer:
if( a >= b )
and there are condition evaluations that result in data:
c = a >= b;
and there is a surprising amount of code written:
if( a >= b ) return TRUE; else return FALSE;
rather than:
return a >= b;
And given the rather long latencies of hard-to-predict branches, being
able to do:
if( a >= b & c <= d )
in a single branch can be beneficial.
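As a small illustration (the function name is invented), the compound condition can be computed as data with the non-short-circuiting '&', which lets a compiler emit a single branch on the combined 0/1 result:

```c
#include <assert.h>

/* Sketch: a compound condition evaluated as data. Both comparisons
 * produce 0/1 values, and '&' combines them without an intermediate
 * branch, so only one branch is needed on the final result. */
int both_hold(int a, int b, int c, int d) {
    return (a >= b) & (c <= d);
}
```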
Given the choice on a new instruction set, I, personally, would "try
real hard" to put the condition evaluation into a GPR and have branch
instructions that consume a GPR and make reasonable control transfers.
The M88000 ISA has such a model. Speaking as one who wrote compilers for
a number of years before building hardware for a number of years, this
is close to the best balance between conditions as ancillary codes and
conditions as pure data values.
And then there is Carry, the access point to multiprecision arithmetic.
This is the driving force towards condition codes, because there is no
direct means to 'carry' (pun intended) the overflow information in one
'add' instruction to another 'add' without this container being
'carried' along with the result. Some computers have resorted to
'carrying' a condition code along with each GPR at the ISA level; many
microarchitectures do this even when it cannot be expressed at the ISA
level.
Carry isn't that hard to synthesize. On machines without condition code
registers, a single extra instruction (an unsigned compare checking
whether the result is less than either of the inputs) will usually do
the trick. It's less convenient than a carry flag and an add-with-carry
instruction, though.
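A minimal C sketch of that trick, assuming 32-bit registers (names invented):

```c
#include <assert.h>
#include <stdint.h>

/* Synthesizing carry on a machine without condition codes: after an
 * unsigned add, the sum wrapped around exactly when it is smaller
 * than either input, so one extra unsigned compare recovers carry. */
uint32_t add_carry_out(uint32_t a, uint32_t b, uint32_t *carry) {
    uint32_t sum = a + b;   /* wraps modulo 2^32 */
    *carry = sum < a;       /* the synthesized carry bit */
    return sum;
}
```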
There are two related problems:
* Condition codes increase the number of unintended dependencies.
A section of code A may do some arithmetic operation for example. It
leaves the condition codes in some state. Another section of code B
comes after this. It does not use the condition codes, but it is
difficult for the processor to derive this. So even if the processor
can issue many instructions at once it cannot begin on section B until
A has finished and the condition codes are known. A sophisticated
out-of-order processor can eliminate this problem, but ...
* It's harder to design an OOO processor with condition codes.
The condition codes are effectively an extra output for every
instruction that sets them. They are another resource that must be
renamed like registers. This means they must be forwarded to many
places.
In the K7 for example, there are 3 reservation stations which can
execute any integer macro-op. Each of these has a buffer of 15
instructions (AFAICR). Each of these buffers holds any calculated
source inputs, tags to pending inputs, and also a copy of the flag
register or a tag saying where to get it from.
Both of the above are simpler if you have several condition code
registers, as PPC does.
> Another disadvantage is that condition codes increase the
> chances of a programming error: if you're not careful about the values
> of the condition codes while writing a program, you may observe
> unexpected behaviour.
Hopefully you're not writing much (if any) of your code in assembly
language, though.
> Both of the above are simpler if you have several condition code
> registers, as PPC does.
Having separate versions of instructions that do and don't set
condition codes (like PPC) is a bigger part of the solution, I'd say.
Multiple condition code registers are for non-branchy boolean
evaluation.
--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
Yes, but...if you're already carrying around 64 bits for the register
contents alone, are three or four additional bits going to break your
implementation?
Jan
> * Condition codes increase the number of unintended dependencies.
>
> A section of code A may do some arithmetic operation for example. It
> leaves the condition codes in some state. Another section of code B
> comes after this. It does not use the condition codes, but it is
> difficult for the processor to derive this.
Which reminds me: I have on a couple of occasions thought that it
might be worthwhile for a microprocessor to specify when registers
(including condition codes) become dead. This will require one extra
instruction bit per register use, so it is by no means free, but it
would have a number of benefits, for example allowing the register
renamer to reuse the physical register even if it can't see a write to
the logical register in the lookahead buffer.
Does anyone know if this has been proposed before?
Torben
What would be the value read from a dead register? Or should it
raise an exception? If the value was zero then just setting the
reg to zero would allow reuse of the physical register.
>
> Does anyone know if this has been proposed before?
>
> Torben
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software.
> Which reminds me: I have on a couple of occasions thought that it
> might be worthwhile for a microprocessor to specify when registers
> (including condition codes) become dead. This will require one extra
> instruction bit per register use, so it is by no means free, but it
> would have a number of benefits, for example allowing the register
> renamer to reuse the physical register even if it can't see a write to
> the logical register in the lookahead buffer.
>
> Does anyone know if this has been proposed before?
There was a poster presentation on this topic at a UWisc computer
affiliates meeting circa 1998. Dunno if it became a conference paper.
Good work.
It wasn't clear if it was better to add a bit to the register ldst, or
to have an explicit "kill register" instruction. And, of course,
reg<-move(0) suffices to kill the register, *and* has the advantage of
having an architecturally precise meaning.
I suspect that taking advantage of dead registers is a good idea,
but that adding a bit to the instruction ldst is not.
IA-64 GPRs are 65 bits wide.
Could the additional bit be used as you propose?
(Probably not.)
<quote>
[The general registers] are numbered GR0 through GR127, and are
available to all programs at all privilege levels. Each general
register has 64 bits of normal data storage plus an additional bit,
the NaT bit (Not a Thing), which is used to track deferred
speculative exceptions.
</quote>
> Torben Ægidius Mogensen wrote:
> > robert...@antenova.com writes:
> >
> >>* Condition codes increase the number of unintended dependencies.
> >>
> >>A section of code A may do some arithmetic operation for example. It
> >>leaves the condition codes in some state. Another section of code B
> >>comes after this. It does not use the condition codes, but it is
> >>difficult for the processor to derive this.
> > Which reminds me: I have on a couple of occasions thought that it
> > might be worthwhile for a microprocessor to specify when registers
> > (including condition codes) become dead. This will require one extra
> > instruction bit per register use, so it is by no means free, but it
> > would have a number of benefits, for example allowing the register
> > renamer to reuse the physical register even if it can't see a write to
> > the logical register in the lookahead buffer.
>
> What would be the value read from a dead register? Or should it
> raise an exception?
The ISA could leave what happens undefined, but it would be safer to
trap. You could map all dead registers to a nonexistent physical
register, which would make it easy to detect reads from a dead
register.
> If the value was zero then just setting the reg to zero would allow
> reuse of the physical register.
That could work, but it would require extra instructions to kill
registers. My idea was to add a bit to every register use to specify
whether the register is dead after this read. This would cost two bits
per instruction (or one bit if you use two-address code, where one of
the argument registers is explicitly overwritten anyway).
Torben
> Torben Mogensen wrote:
> > I have on a couple of occasions thought that it might be
> > worthwhile for a microprocessor to specify when registers
> > (including condition codes) become dead. This will require one
> > extra instruction bit per register use, so it is by no means
> > free, but it would have a number of benefits, for example
> > allowing the register renamer to reuse the physical register even
> > if it can't see a write to the logical register in the lookahead
> > buffer.
> > Does anyone know if this has been proposed before?
>
> IA-64 GPRs are 65-bits wide.
> Could the additional bit be used as you propose?
My idea was to use an extra bit in the instruction word when you read
from a register (so if you have 32 registers, you would use 6 bits to
specify the register -- 5 for the register number and one to mark if
it is dead after the read). The registers themselves would need no
extra bits.
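A toy encoding of such an operand field might look like this in C (the bit layout is just one plausible choice, not from any real ISA):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical operand field for a 32-register machine: 6 bits per
 * source operand -- 5 for the register number plus 1 'dead after this
 * read' bit. */
#define DEAD_BIT 0x20u   /* bit 5 marks the operand's last use */

uint32_t encode_src(uint32_t regnum, int dead_after_read) {
    return (regnum & 0x1Fu) | (dead_after_read ? DEAD_BIT : 0u);
}

uint32_t operand_reg(uint32_t field)  { return field & 0x1Fu; }
int      operand_dead(uint32_t field) { return (field & DEAD_BIT) != 0; }
```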
An alternative to using an extra instruction bit per register read is
to make a subset of the registers automatically become dead after the
first read. You could even let _all_ registers become dead after the
first read, so you would explicitly have to copy them if you need the
value twice (like in data-flow architectures). The copy instruction
would not physically copy the value, but just map two new logical
registers to the same physical register (and unmap the logical
register that held the value, unless it is used as one of the two new
registers).
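A toy model of that rename map in C (all names and sizes invented) may make the copy semantics clearer:

```c
#include <assert.h>

/* Toy rename map where every logical register dies on its first read,
 * and 'copy' just maps two logical registers to the same physical one
 * instead of moving any data. */
#define NLOG 8
#define DEAD -1

static int map[NLOG];                 /* logical -> physical, or DEAD */

void rmap_write(int lreg, int preg) { map[lreg] = preg; }

int rmap_read(int lreg) {             /* read returns the mapping and kills it */
    int p = map[lreg];
    map[lreg] = DEAD;
    return p;
}

/* Both destinations alias the source's physical register; the source
 * mapping is unmapped unless it is one of the destinations. */
void rmap_copy(int src, int dst1, int dst2) {
    int p = map[src];
    if (src != dst1 && src != dst2) map[src] = DEAD;
    map[dst1] = p;
    map[dst2] = p;
}
```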
Torben
> tor...@app-2.diku.dk (Torben Ægidius Mogensen) writes:
>
> > Which reminds me: I have on a couple of occasions thought that it
> > might be worthwhile for a microprocessor to specify when registers
> > (including condition codes) become dead. This will require one extra
> > instruction bit per register use, so it is by no means free, but it
> > would have a number of benefits, for example allowing the register
> > renamer to reuse the physical register even if it can't see a write to
> > the logical register in the lookahead buffer.
> >
> > Does anyone know if this has been proposed before?
>
> There was a poster presentation on this topic at a UWisc computer
> affiliates meeting circa 1998. Dunno if it became a conference paper.
> Good work.
>
> It wasn't clear if it was better to add a bit to the register ldst, or
> to have an explicit "kill register" instruction. And, of course,
> reg<-move(0) suffices to kill the register, *and* has the advantage of
> having an architecturally precise meaning.
True, but since a large fraction of registers become dead after their
first use, it would add a lot of code to explicitly zero these. I
suppose you could do it only for the cases where the register is not
reused (written to) in the next few instructions, which would take
care of many cases.
> I suspect that taking advantage of dead registers is a good idea,
> but that adding a bit to the instruction ldst is not.
You might be right. What about my suggestion (in another reply) of
letting a subset of the registers automatically be killed when read?
These should probably be a subset of the caller-saves registers, as
they are unlikely to be live across a function call (if a variable is
live across a function call, you should map it to a callee-saves
register anyway).
Torben
Well, how often do you get single reads?
In an ISA with 2 operand instructions, at best you could get 50% for
data values...for 3 operand, the best is 100%.
For control values (branch or loop counters for example), you probably
get something very different.
I'm sure someone lurking here has good compiler statistics for the
usage of values in registers... I'd be willing to wager it looks like a
Pareto distribution.
Perhaps a more important question is, could this be done without any
ISA modifications, but with changes to the microarchitecture instead?
> An alternative to using an extra instruction bit per register read is
> to make a subset of the registers automatically become dead after the
> first read. You could even let _all_ registers become dead after the
> first read, so you would explicitly have to copy them if you need the
> value twice (like in data-flow architectures). The copy instruction
> would not physically copy the value, but just map two new logical
> registers to the same physical register (and unmap the logical
> register that held the value, unless it is used as one of the two new
> registers).
I think your first idea sounds more palatable.
DK
>Andy Glew <first...@employer.domain> writes:
>> It wasn't clear if it was better to add a bit to the register ldst, or
>> to have an explicit "kill register" instruction. And, of course,
>> reg<-move(0) suffices to kill the register, *and* has the advantage of
>> having an architecturally precise meaning.
>True, but since a large fraction of registers become dead after their
>first use, it would add a lot of code to explicitly zero these.
From the i-don't-know-what-i'm-talking-about dept.:
How large is that fraction?
Let each register have a reference count. Normal setting
of the register, by specifying it as a dst register, sets the
count to 1. Use of a register decrements the count, making the
register dead when the count reaches 0. For registers which
become dead after their first use, no code is added. But for
all other cases...:
Do compilers usually/often know, wherever a register value is generated,
how often it will be reused? Exactly?
A new instruction would be needed to adjust the reference counts of live
registers that do not become dead after their first use. The instruction
would probably also permit making a register 'permanent', i.e. disabling
reference counting for it (until it receives a new assignment).
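For concreteness, the counting rules sketched above could look like this (the interface is invented; a real implementation would live in the rename hardware, not software):

```c
#include <assert.h>

/* Sketch of the reference-count scheme: writing a register sets its
 * count to 1, each read decrements it, and the register is dead once
 * the count reaches 0. The hypothetical extra instruction bumps the
 * count for multi-use values. */
#define NREG 8
static int refcnt[NREG];

void reg_write(int r)         { refcnt[r] = 1; }     /* dst use */
void reg_addref(int r, int n) { refcnt[r] += n; }    /* the new instruction */
int  reg_read(int r)          { return --refcnt[r] == 0; } /* 1 => now dead */
```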
How much would it help to know earlier when registers become dead?
Does this make any sense at all? :)
best regards
Patrick
> tor...@app-5.diku.dk (Torben Ægidius Mogensen) writes:
>
> >Andy Glew <first...@employer.domain> writes:
>
> >> It wasn't clear if it was better to add a bit to the register ldst, or
> >> to have an explicit "kill register" instruction. And, of course,
> >> reg<-move(0) suffices to kill the register, *and* has the advantage of
> >> having an architecturally precise meaning.
>
> >True, but since a large fraction of registers become dead after their
> >first use, it would add a lot of code to explicitly zero these.
>
> From the i-don't-know-what-i'm-talking-about dept.:
>
> How large is that fraction?
The fraction depends on the program and language, of course, but
registers used for intermediate values in expression evaluation are
nearly always single-use, as are registers used for address
calculation in array lookups. User-declared variables are less often
single-use, but sometimes some assignments to them have only one use.
> Let each register have a reference count. Normal setting
> of the register, by specifying it as a dst register, sets the
> count to 1. Use of a register decrements the count, making the
> register dead when the count reaches 0.
That would require extra bits in the registers, not just in the
instruction words.
> Do compilers usually/often know, wherever a register value is generated,
> how often it will be reused? Exactly?
You can quite often see that a register is used exactly once, for
example those holding intermediate values of expressions and address
calculations as I mentioned above. In a compiler I once wrote, I let
the code generator specify single-use variables as a subsequent
optimisation used this information, and I was surprised by how often I
was able to know for sure that there would be only one use. Granted,
some of these registers were eliminated by the subsequent
optimisation, but there were still quite a few left.
A liveness analysis (as done by the register allocator) will be able
to identify even more cases of dead registers (as they can often
detect the last use of a variable that is used multiple times).
> A new instruction would be needed to adjust the reference counts of live
> registers that do not become dead after their first use. The instruction
> would probably also permit making a register 'permanent', i.e. disabling
> reference counting for it (until it receives a new assignment).
I still think run-time reference counts are a bad idea.
> How much would it help to know earlier when registers become dead?
As I mentioned, it frees physical registers for register renaming. It
can also avoid write-backs to the register file of values that are
computed and then used immediately afterwards (and never again), as
these values can live entirely in the forwarding network.
Torben
You could make these registers stacks: when you write to them, the new
value is pushed, when you read from them, the previous value
reappears. If the stack is empty on a read or full on a write, you
get an exception. This would be useful for expression evaluation and
other stack stuff.
Another idea is to make some of these registers FIFOs: when you write
to them, the value gets pushed into the FIFO at the front, when you
read from it, it is consumed at the end of the FIFO. Again,
exceptions on FIFO full and empty conditions. This would be useful
for software pipelining and other loop stuff.
These ideas would give you more architectural registers (with some
restrictions) than register names when you need them and fewer when
you don't.
A general-purpose architecture would probably support some combination
of these ideas.
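A minimal C model of such a stack register, with the trap conditions modelled as error returns (all names and the depth are invented):

```c
#include <assert.h>

/* Stack-register model: writes push, reads pop the most recent value;
 * underflow and overflow would raise exceptions, modelled here as a
 * -1 return code. */
#define DEPTH 4
typedef struct { int vals[DEPTH]; int top; } stackreg;

int sreg_write(stackreg *s, int v) {
    if (s->top == DEPTH) return -1;   /* stack full: exception */
    s->vals[s->top++] = v;
    return 0;
}

int sreg_read(stackreg *s, int *v) {
    if (s->top == 0) return -1;       /* stack empty: exception */
    *v = s->vals[--s->top];
    return 0;
}
```

A FIFO variant would differ only in consuming from the oldest end instead of the newest.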
- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
snip
>> How much would it help to know earlier when registers become dead?
>
> As I mentioned, it frees physical registers for register renaming. It
> can also avoid write-backs to the register file of values that are
> computed and then used immediately afterwards (and never again), as
> these values can live entirely in the forwarding network.
With some additional work, it could also speed up context switches as there
would be no need to save nor to restore dead registers.
--
- Stephen Fuld
e-mail address disguised to prevent spam
Sometimes the compiler won't know that a register is dead
at the time of last use. Consider the vector normalize code
sum = 0.0
for j from 0 to 2 do
    sum = sum + vec[j]^2
end for
mulby = (sum == 0 ? 0.0 : sqrt(1/sum))
for j from 0 to 2 do
    vec[j] = mulby*vec[j]
end for
The compiler can see that the register holding vec[j]
is dead after the squaring, and
the register holding vec[j]^2 is dead after the add.
The generated code might also have registers holding the
addresses &vec[j] and &vec[3] (the latter to compare against
&vec[j] for the end-of-loop test). The &vec[j]
register will be dead only after the final iteration of the first
loop. Whereas the &vec[3] register can remain alive for the
second loop. A single bit in the compare instruction cannot describe
this behavior.
Likewise, in the definition of mulby, the register holding
sum remains alive if sum is nonzero, but dead if sum is zero
(I'll ignore the case where sum is NaN).
After the second loop, the compiler needs to flag
mulby and &vec[3] registers as dead.
Assume the square root is implemented as a function call
(not inline). The square root code might need register R4 for scratch,
where the callee is responsible for saving R4. It generates
save R4 on stack
use R4 for other purposes
restore old R4 from stack
If the caller had already flagged R4 as dead,
how does the "save R4 on stack" behave?
Can we save the dead bit along with the register?
--
The USA must invest in interstate passenger rail. Amtrak needs high-speed
public railways, not private lines where it waits for freight trains to pass.
pmon...@cwi.nl Microsoft Research and CWI Home: Bellevue, WA
Re: Dead registers
It's easy to encode that a register is dead with no change to the ISA:
add r3 <- r3, r2
The old value of r3 is obviously no longer available. You can't get
away from marking values as dead in a 2-address ISA, which is one
of the nice things about the x86 ISA. (The more I look at x86, the
more I think there is a decent ISA lurking in that mess.)
One advantage to knowing that a register is dead is that you can
figure out if it had just one reader. Notice also that most of the
energy cost of an instruction is in distributing the results and
scheduling... ALUs cost almost nothing. So, identifying chains of
instructions which pass values from one to the next with no result
broadcast is valuable, since you can, at minimum, save power by not
broadcasting tag or data. Going further, you can have execution
hardware which consumes a string of instructions and generates and
broadcasts a single result, and does not even have the hardware to
distribute the intermediate results.
But going further may be pointless. Just knowing that you don't have
to broadcast tag+data saves most of the power.
Re: Condition codes
The trouble with condition codes is that they are an extra data input
to your instruction, and so the scheduler must take them into account.
Also, a branch to PC+offset (that's the common kind) which is
dependent on a condition code is a waste of issue bandwidth. When the
thing issues, it just checks that the condition code matches the
speculated value. A more power efficient microarchitecture has the
branch condition code check bundled with the instruction that does the
compare and generates the condition code, so you issue one operation
instead of two.
You can save more power if you avoid having to build this bundle from
discrete instructions, but then you need branch instructions with a
target, and two source registers, plus a rich explanation of how you
want to compare the two datums. That's a big instruction, so now you
need variable sized instructions or some other way to get extra bits
in the branch.
Re: Using general registers instead of condition codes
From the point of view of data going into an instruction, they are
just two different name spaces, which clutters up the rename hardware
a little bit.
Isn't it more likely to slow down context switches, because each register
becomes a condition that must be tested and acted upon or skipped? (Not
to mention the need to also save a map of the saved "live" registers, for
subsequent restoration.) With unconditional saves or loads, you get to
issue multiple reads to your multiple load/store units in parallel, or use
other "long word" memory operations.
I suppose that it depends on the number of architected registers, and on
what extra hardware/instruction set assistance you have. An architecture
like the Itanium may very well save some time that way, since there are so
many visible registers.
The Cell PEs (also 128 of them?) don't have the problem, since they don't
(preemptively) context switch (or even get interrupts).
--
Andrew
It could, but I was assuming some kind of hardware assist such as a store
multiple that automagically skipped the dead registers. But yes, you would
need to save some kind of "map" to indicate what to restore (with a special
load multiple instruction).
> Sometimes the compiler won't know that a register is dead
> at the time of last use.
Of course not. Any compile-time analysis is approximate, but it will
often be able to identify enough cases to be worthwhile.
> If the caller had already flagged R4 as dead,
> how does the "save R4 on stack" behave?
> Can we save the dead bit along eith the register?
My original idea was only to use it for register renaming and
forwarding, but the idea of only saving live registers across calls is
interesting.
For caller-saves, this is already true: The compiler will know which
registers hold live values and save/restore only these. So it is only
callee-saves that need changing: If there is a way to recognise dead
callee-saves registers, only live registers need to be saved and
restored. But as others have pointed out, this requires a map of
which registers were live to be stored as well. This would add one
more word (assuming number of callee-saves registers <= word size) to
the saved state, so it is break-even if one or more of the
callee-saves registers are dead.
It would require some hardware assist, at the very least special
instructions for saving only the live callee-saves registers (in a
certain range) along with the bitmap and restoring along the same
lines. These would need also to modify the stack pointer or return in
a register how many words were saved.
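A software sketch of what such a save helper would do (the instruction itself is hypothetical; this just models its semantics):

```c
#include <assert.h>
#include <stdint.h>

/* Model of a 'save live callee-saves' operation: store only the
 * registers whose bit is set in a liveness map, preceded by the map
 * itself, and report how many words were consumed on the stack. */
int save_live(uint32_t live_map, const uint32_t regs[32], uint32_t out[33]) {
    int n = 0;
    out[n++] = live_map;          /* the map must be saved for restore */
    for (int r = 0; r < 32; r++)
        if (live_map & (1u << r))
            out[n++] = regs[r];   /* dead registers are skipped */
    return n;
}
```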
But I'm not sure it is worth the effort: Most compilers allocate
variables that are dead across function calls to caller-saves
registers, so most callee-saves registers are presumably live -- not
necessarily in the calling procedure, but since it has saved only
those callee-saves registers that it itself uses, the remainder may be
live in the previous caller, etc., all the way down the call chain.
So, all considered, I think the best overall option is:
Let a fixed subset of the caller-saves registers be one-use: they are
presumed dead after the first read from them, and they cannot be live
across a procedure call.
This doesn't require any extra opcode bits or instructions, it is just
a matter of the register allocator using them only for variables that
have these properties.
If we have 32 registers, the following division would be plausible:
- registers 0- 3: one-use
- registers 4-11: caller-saves
- registers 12-31: callee-saves
Since one-use registers are used for short-lived values, there are not
likely to be too many of these at the same time, so 4 should be
sufficient. They can live entirely in the forwarding network and need
never be stored in the actual register file. At exceptions, they can
be stored directly from the forwarding network to memory (or shadow
registers).
Torben
With a per-register scheme, I agree that checking can be more expensive.
However, I can imagine having a single bit for a set of registers
instead of one for each register.
A similar mechanism exists on SPARC, which has a status register, the
"floating point registers state register" (FPRS), with two dirty bits
for the upper and lower floating point registers.
Those bits are set if any register in the corresponding set has been modified.
During a context switch, the kernel checks the bit
and avoids saving 16 64-bit registers if the bit isn't set.
Since FPRS is writable by non-privileged software,
in theory, a compiler can emit code to clear those bits
when it knows that the corresponding set is dead.
Unfortunately, writing to a status register is usually expensive
and there's not much gain to be had with extra clearing of those bits anyway
(I won't get into details why that's the case).
--
#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/"
I don't like this scheme because it bakes ABI decisions into the
hardware (it's difficult to get an ABI right), and the gain from such a
scheme doesn't warrant the severe restriction -- any particular ABI
may work well for certain applications/languages but not for others.
> If we have 32 registers, the following division would be plausible:
>
> - registers 0- 3: one-use
> - registers 4-11: caller-saves
> - registers 12-31: callee-saves
>
> Since one-use registers are used for short-lived values, there are not
> likely to be too many of these at the same time, so 4 should be
> sufficient. They can live entirely in the forwarding network and need
> never be stored in the actual register file. At exceptions, they can
This reminds me of an exotic VLIW processor a friend of mine worked on,
whose instruction set specified the input/output bus of each operation
explicitly (and each opcode mapped directly to a particular functional
unit). Hence it allowed the compiler to avoid using registers entirely
for short-lived values (there were explicit instructions to read from a
register and put the value onto any particular bus, and vice versa).
It generally made the compiler writer's life miserable :(
> be stored directly from the forwarding network to memory (or shadow
> registers).
>
> Torben
I have seen a few papers on this subject in the past.
Google "dead value" and "register alias table".
For example:
Exploiting Dead Value Information
Milo M. Martin, Amir Roth, and Charles N. Fischer
http://www.seas.upenn.edu/~milom/papers/micro97_deadvalueinfo.pdf
discusses these scenarios, explicit dead value instructions
and call/return/task_switch savings, and runs simulations on it
"Our results indicate that dynamic saves and restore instances can be
reduced by 46% for procedure calls and by 51% for context switches.
In addition, save/restore elimination for procedure calls can improve
overall performance by up to 5%."
Unfortunately citeseer is down at the moment so I can't
see all the citations and other references.
Eric
Such also allows condition codes to be set early, potentially
reducing the pipeline delay WRT branch condition evaluation.
(Condition codes are small and can be replicated in the
front-end with fast access.)
E.g.:
... // code that does not touch A,B,N, or M
if (A>N)
{ do_this(A);} // do_this() does not touch B or M
if (B>M)
{do_that(B);}
can become
cc1 = (A>N)
cc2 = (B>M)
...
if (cc1)
{ do_this(A);}
if (cc2)
{ do_that(B);}
If the branch can be evaluated one pipeline stage after
instruction fetch, the branch misprediction penalty
becomes very small.
Paul Aaron Clayton
just a technophile
> Torben Ægidius Mogensen <tor...@app-5.diku.dk> wrote:
> ...
> > Since one-use registers are used for short-lived values, there are not
> > likely to be too many of these at the same time, so 4 should be
> > sufficient. They can live entirely in the forwarding network and need
> > never be stored in the actual register file.
>
> This reminds me of an exotic VLIW processor a friend of mine worked
> on, whose instruction set explicitly specified the input/output bus
> of each operation (and each opcode mapped directly to a particular
> functional unit), hence it allowed the compiler to avoid using
> registers entirely for short-lived values (there were explicit
> instructions to read a value from a register and put it onto any
> particular bus, and vice versa). It generally made the compiler
> writer's life miserable :(
I have seen such processors suggested too (e.g., in a PhD thesis by
Phil Endecott (IIRC) from the Manchester AMULET project), and as you
say they were hard to compile for. But that was mainly because you
might not know where a value is going to be used when it is made, as
there might be branches between production and use. This required a
large number of values to be stored in registers anyway, even if they
were short-lived and had only one use. In short, such architectures
work great for sequential code, but not so well when there are
branches and function calls. I suppose they could be useful for
graphics processors.
By using "virtual" register numbers to identify the values in the
forwarding network, they can be used by different instructions in
different paths without having to be stored in physical registers.
Torben
However, data and control dependencies conspire to defeat
that goal. In practice, the vast majority of branch conditions
can't be evaluated until the instruction before the branch. Most
of the exceptions are well-predicted branches (branch decisions
determined by induction variable values, i.e., loop control).
You might find "Evaluating the Use Of Register Queues in Software
Pipelined Loops" (2001, Gary S. Tyson, Mikhail Smelyanskiy, Edward
S. Davidson) interesting.
There is also "Software-Directed Register Deallocation for
Simultaneous Multithreaded Processors" (Jack L. Lo,
Sujay S. Parekh, Susan J. Eggers, Henry M. Levy, Dean
M. Tullsen), which only looks at making more registers
available to other threads to reduce the total register
count in an SMT processor.
I know switch-case constructs are not common, but they are not
necessarily predictable. (BTW, how many case statements are
required before a table read plus indirect jump, or an indirect
jump to a computed address that contains a jump instruction,
becomes better than a tree of branches, assuming (unrealistically)
an even distribution of the variable's values?) I still like early
evaluation, even though it may not be worth the implementation
complexity overhead in practice.
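For what it's worth, the parenthetical trade-off can be sketched in C.
Below, the branch tree costs about log2(n) conditional branches while
the table costs one bounds check plus one indirect jump (modelled here
as an indirect call); the function names and the 4-case example are
mine, not from any particular compiler:

```c
/* Branch-tree dispatch: a compare tree, ~log2(n) conditional branches. */
static int dispatch_tree(int x) {
    if (x < 2) {
        return (x == 0) ? 10 : 11;   /* cases 0 and 1 */
    } else {
        return (x == 2) ? 12 : 13;   /* cases 2 and 3 */
    }
}

/* Table dispatch: one bounds check plus a single indirect jump,
 * here an indirect call through a function-pointer table. */
static int case0(void) { return 10; }
static int case1(void) { return 11; }
static int case2(void) { return 12; }
static int case3(void) { return 13; }

static int dispatch_table(int x) {
    static int (*const table[4])(void) = { case0, case1, case2, case3 };
    if (x < 0 || x > 3) return -1;   /* default case */
    return table[x]();               /* one possibly-mispredicted indirect jump */
}
```

With an even case distribution the tree's branches are each close to
50/50, so the crossover depends mostly on the mispredict penalty versus
the cost of one possibly-mispredicted indirect jump, which is exactly
the question above.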
Condition codes also make predication easier: the condition
is in a separate register bank, so there is no need for an
additional read port (per instruction issued) into the GPR file.
If the ISA is fully predicated, it is useful to have multiple
predicate registers.
> Niels Jørgen Kruse wrote:
> [snip]
> > Multiple condition code registers is for nonbranchy boolean evaluation.
>
> Such also allows condition codes to be set early, potentially
> reducing the pipeline delay WRT branch condition evaluation.
That too, for as much as that was ever worth.
> (Condition codes are small and can be replicated in the
> front-end with fast access.)
However, the front-end may be looking at the branch before the condition
code setting instruction reaches the stage where the condition code
register is renamed....
It is important to not slow down the fetch-branch prediction recurrence.
It will be interesting to see if the POWER6 reintroduces the ability to
resolve conditional branches in the fetch unit.
--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
Interesting. I read comp.arch irregularly, so may have missed the other reply.
I was about to (and am now going to) counter-propose a "multi-kill" instruction:
"kill multiple registers under a bitmask". Which would actually, probably, have
the semantics "zero multiple registers under bitmask".
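A functional model of the zero-* variant, just to pin down the proposed
semantics (the register count and types are illustrative assumptions):

```c
#include <stdint.h>

enum { NREGS = 32 };

/* "Multi-kill" with zero-* semantics: clear every register whose
 * bit is set in the mask, in one architecturally defined step. */
static void multi_kill(uint64_t regs[NREGS], uint32_t mask) {
    for (int r = 0; r < NREGS; r++)
        if (mask & (1u << r))
            regs[r] = 0;
}
```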
Should it be "zero-*" or "kill-and-trap-subsequent-access-*"?
The part of me that has learned from experience is reluctant to
have "undefined" behaviour in a computer instruction set. Users will break the rules...
Unfortunately, "zero-*" or "trap-subsequent-reads-*" are both potentially damned useful.
At least zero-* has defined existing behavior.
trap-subsequent-reads-* would require that someone - the OS, possibly the call/return sequence
- have the ability to save and restore the bit(s) that record the trap.
---
Ah heck, I gotta admit: if I were a language hacker, and you gave me a machine that had
"trap on register read", I would use it to create a thunking mechanism.
Damned useful - but probably NOT what you are trying to enable, short term.
How about "0xdeadbeef-*" for integers and "NaN-*" for floats?
--
Andrew
Good point, though such problems might be avoided or minimized
by using it just on leaf routines, so there is no mask nesting.
That might give (baseless guess) 80% of the call/return savings
without requiring a full-blown general mechanism.
However there are some important architectural pieces missing from
HPS which it seems to me would considerably change HPS's performance
characteristics: (a) a general uncommitted+committed store queue
with store to load forwarding, and (b) watch points and replay traps.
The store-load forwarding would be particularly useful for all
the weird and wonderful VAX addressing modes that _might_, but
usually don't, cause memory operand values to interact.
Without that, HPS requires all sorts of specialized logic to
check for interactions, and can only minimally read ahead.
If HPS had a hardware table of address watch points and replay traps,
then in conjunction with the store queue it might allow the register
restore mask stored on the stack to be speculatively read ahead,
similar to how return instruction addresses are read ahead now.
The generalized watch point table would allow mask value and return
address locations to be monitored for changes that occur after their
values are speculatively read until return instruction commits.
If a write occurs to a watched address, it triggers a replay trap.
HPS did have checkpoint/restore mechanism, but it was used
only for branch mispredict and exceptions and not replays.
Without these, HPS stalls more due to worst case potential
interactions that probably never actually occur.
For current designs this may not be a problem as these components
already exist, though I can see people might not want to introduce
features that require larger CAM watch point tables in order to
speculate ahead.
Eric
The VAX RET instruction would still be a problem even if HPS were
better at speculative execution, perhaps just not quite so bad.
However, this need not apply to dead value optimizations.
The VAX RET instruction decode problem is that the decoder wants to
crack the instruction and generate a series of register pop micro-ops,
but to do that it needs the register mask.
The register mask is stored separately in the stack, and the decoder
doesn't even know it needs the value until it decodes the RET
instruction so it can't even prefetch the value.
The killer is that the instruction therefore stalls inside the decoder
until the mask is available, rather than waiting in a function unit,
and blocks all subsequent instruction decoding.
Having predicated store and load cracked micro-ops as you suggest
looks to me as though it would have its own set of problems.
Since you don't know which registers will be affected by the mask
value, you must assume it is all of them, which means generating
1 micro-op per architecture register. Those are not all going
to enqueue in one clock, so we are talking a few clocks to do this.
The predicated stores just read registers so they don't change the
state, but they do use ROB and other resources. However the loads
would have to assign a new physical register and as each predicate
was determined to be true or false, load memory or copy the original
physical register value to the new register. And since the majority
of registers are not stored or loaded, most of this work is wasted.
For the dead value optimization it seems to me the problem can be
avoided simply by not having conditional StoreRegSet and LoadRegSet
style instructions (and therefore no conditional instruction cracking).
By having separate per register conditional load and store
instructions, like CMOV instruction but based on a mask register,
the decoder can dispatch the individual CSTR and CLOAD instructions
to their function units to wait and not block subsequent decoding.
To support nested routine calls the mask register would have
to be saved and restored, but this need not stall the decoder.
Eric
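Eric's CSTR/CLOAD idea can be modelled functionally. CSTR and CLOAD are
his hypothetical mnemonics; everything else below (struct layout,
register count, helper names) is an illustrative assumption. The key
property is that each micro-op touches exactly one register, so the
decoder can emit a fixed sequence without knowing the mask value:

```c
#include <stdint.h>

enum { NREGS = 16 };

/* Functional model of mask-driven conditional register save/restore:
 * CSTR r -- store register r to the next save slot only if bit r of
 * the mask register is set; CLOAD r is the symmetric restore.  Each
 * one touches a single register, so a decoder can emit them as
 * independent micro-ops that wait in a function unit for the mask. */
typedef struct {
    uint64_t regs[NREGS];
    uint16_t mask;        /* one save/restore bit per register */
} cpu_t;

/* Returns number of words written to save_area. */
static int cstr_all(const cpu_t *cpu, uint64_t *save_area) {
    int n = 0;
    for (int r = 0; r < NREGS; r++)
        if (cpu->mask & (1u << r))        /* predicate on one mask bit */
            save_area[n++] = cpu->regs[r];
    return n;
}

static void cload_all(cpu_t *cpu, const uint64_t *save_area) {
    int n = 0;
    for (int r = 0; r < NREGS; r++)
        if (cpu->mask & (1u << r))
            cpu->regs[r] = save_area[n++];
}

/* Round-trip check: save under mask, clobber, restore. */
static int cstr_cload_roundtrip(void) {
    cpu_t cpu = { .regs = {7, 8, 9}, .mask = 0x5 };  /* save r0 and r2 */
    uint64_t save[NREGS];
    int n = cstr_all(&cpu, save);
    cpu.regs[0] = cpu.regs[2] = 0;
    cload_all(&cpu, save);
    return n == 2 && cpu.regs[0] == 7 && cpu.regs[2] == 9;
}
```

Unlike VAX RET, nothing here stalls in decode: the micro-ops simply
wait in their function units for the mask register, at the cost of
wasted work for registers whose mask bit turns out clear.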
> The VAX RET instruction decode problem is that the decoder wants to
> crack the instruction and generate a series of register pop micro-ops,
> but to do that it needs the register mask.
> The register mask is stored separately in the stack, and the decoder
> doesn't even know it needs the value until it decodes the RET
> instruction so it can't even prefetch the value.
>
> The killer is that the instruction therefore stalls inside the decoder
> until the mask is available, rather than waiting in a function unit,
> and blocks all subsequent instruction decoding.
You could have a (shadow) stack of register masks, just as you have a
link stack of return addresses.
You could have an op that scans a mask to find the first set bits,
transfers those registers, and passes a mask without those bits set to the
next op. Because registers are saved adjacent to each other on the
stack, you could merge data before handing them over to the cache, and
vice versa. The limit on speed for this is the number of register file
ports and what hoops you have to jump through to get exclusive access to
all of them.
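A rough model of that scan-and-transfer op, where each invocation pops
as many set bits as there are register-file ports and leaves the rest
of the mask for the next op (the port count and the use of
__builtin_ctz, a GCC/Clang builtin, are my assumptions):

```c
#include <stdint.h>

/* Each invocation pops up to 'ports' set bits (lowest first) from
 * *mask, "transfers" those registers, and leaves the remaining mask
 * for the next op in the chain. */
static int transfer_op(uint32_t *mask, int ports, int *regs_out) {
    int n = 0;
    while (*mask != 0 && n < ports) {
        int r = __builtin_ctz(*mask);   /* index of lowest set bit */
        regs_out[n++] = r;              /* stand-in for the register move */
        *mask &= *mask - 1;             /* clear that bit */
    }
    return n;
}
```

Chaining calls until the mask reaches zero models the sequence of such
ops, each bounded by the register-file port limit.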
If I understand you correctly, you are talking about moving
responsibility for decisions about actual register set restore
out of the decoder. That is fine, but...
I was going on the (possibly erroneous) assumption the scheduler
would normally be designed to handle up to 2 register reads and 1
write per instruction, and that an instruction that can write up
to 16 registers would cause it to have kittens.
Wouldn't rename reservation be an N**2 algorithm? If so, trying to
reserve many rename registers at once seems prohibitively expensive.
So the scheduler's ability to reserve rename registers would limit
it to micro-ops that write just 1 register each, so you are forced
to crack the register restore set in the decoder/scheduler.
But maybe that assumption is wrong.
Eric
> Niels Jørgen Kruse wrote:
> >
> > You could have an op that scans a mask to find the first set bits,
> > transfers those registers, and passes a mask without those bits set to the
> > next op. Because registers are saved adjacent to each other on the
> > stack, you could merge data before handing them over to the cache, and
> > vice versa. The limit on speed for this is the number of register file
> > ports and what hoops you have to jump through to get exclusive access to
> > all of them.
>
> If I understand you correctly, you are talking about moving
> responsibility for decisions about actual register set restore
> out of the decoder. That is fine, but...
OK, the ideas I threw out above were half-baked. Still, I think you went
overboard in blackballing the call/return instructions.
> Wouldn't rename reservation be an N**2 algorithm? If so, trying to
> reserve many rename registers at once seems prohibitively expensive.
No one hands out awards for perfect allocation. If adding a few extra
registers is easier, you do that. You could, e.g., have each reservation
slot consider only a subset of the free registers.
She is a fair lady, of Rubenesque proportions,
and I would never impugn her virtue.
I'm just trying to understand the basic issues involved and why
the VAX problems need not apply to dead register optimization.
> > Wouldn't rename reservation be an N**2 algorithm? If so, trying to
> > reserve many rename registers at once seems prohibitively expensive.
>
> No one hands out awards for perfect allocation. If adding a few extra
> registers is easier, you do that. You could, e.g., have each reservation
> slot consider only a subset of the free registers.
Yes. It could do the pop micro-ops in groups of 4, as that seems to be
the current hardware limit, and the result should be the same as
the equivalent RISC sequence.
The key to minimizing RET's decode stalls looks like having a hardware
lookaside stack for the masks, with watch points on their memory
locations to detect any changes.
Eric
The forwarding itself is not the problem, the problem is more the
tracking.
For example, when a reservation station considers issuing an
instruction it must wait for the registers to be ready, but it must
also wait for the condition codes. Similarly, the bus to issue the new
condition code register at the end of execution is small, but the logic
to place the value into the reservation station is the same as it would
be for another register output.
If the implementation uses content addressable memory for tags, then
the same CAM is needed for the condition codes as for registers
themselves.
I think a possible solution is to attach CCs to the tags, though I
don't know if anyone does this.
> Jan Vorbrüggen wrote:
> > > The condition codes are effectively an extra output for every
> > > instruction that sets them. They are another resource that must be
> > > renamed like registers. This means they must be forwarded to many
> > > places.
> >
> > Yes, but...if you're already carrying around 64 bits for the register
> > contents alone, are three or four additional bits going to break your
> > implementation?
>
> The forwarding itself is not the problem, the problem is more the
> tracking.
>
> I think a possible solution is to attach CCs to the tags, though I
> don't know if anyone does this.
Guys, you are talking as if this is a new topic.
It should already be well known - i.e. it is public - that the Intel
P6 family renames the condition codes in conjunction with the data
values. More than 10 years already.
I.e. on the original P6, with 32 bit integer values, the integer value
stored in the ROB was actually 40 bits, to give you room to store the
condition codes (and some other stuff).
Because the x86 nearly always writes all of the condition codes at the
same time, only one physical destination register was assigned - the
condition codes were bits 32..39, the integer bits 0..31.
There was a separate RAT entry for the condition codes. The RAT entries
for the destination register and condition codes were updated with the
same physical destination register.
I.e. a single physical destination, 2 logical destinations.
The RAT has no CAMs. The RS has CAMs, but no extra CAMs are required
for condition codes. The condition codes just use one of the 2 input
CAMs.
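The packing described above can be pictured as follows; the field split
(bits 0..31 data, 32..39 flags) is taken from the description, while the
helper names are mine and this is only a sketch of the idea, not Intel's
actual design:

```c
#include <stdint.h>

/* Illustrative P6-style ROB payload: one physical destination holds
 * both the 32-bit integer result and the condition codes, so renaming
 * the flags costs no extra CAM -- just a second RAT entry pointing at
 * the same physical register. */
typedef struct {
    uint64_t bits;   /* [31:0] integer result, [39:32] flags */
} rob_payload_t;

static rob_payload_t writeback(uint32_t result, uint8_t flags) {
    rob_payload_t p;
    p.bits = (uint64_t)result | ((uint64_t)flags << 32);
    return p;
}

static uint32_t payload_result(rob_payload_t p) { return (uint32_t)p.bits; }
static uint8_t  payload_flags(rob_payload_t p)  { return (uint8_t)(p.bits >> 32); }
```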
---
AMD does it in a slightly different way, but not fundamentally differently.
When the EUs write back into the ROB they write back a 32-bit value +
flags = a 40 bit value. Then the RAT performs two updates, updating
the location of the flags and possibly changing where an architected
register is stored. Is that right?
So, when an instruction is dispatched it requires possibly three
inputs, two inputs providing values and the condition codes. And the
condition codes may be stored in a different ROB entry to either of the
value inputs. Doesn't this make things a little more complex?
I thought this stage, i.e. the stage "find registers X, Y and Z" was
implemented using CAM, maybe I'm wrong. Anyway, am I wrong in general
in saying that it requires extra logical complexity to deal with
condition codes?
(BTW I'm not necessarily arguing with you here; I know you were one of
the P6 architects, I'm just interested to hear your opinion.)
AFAIK, the condition codes are always stored along with the relevant
register, simply because that's where they belong!
The only hiccup in this process relates to things like INC and DEC, which
are architecturally defined to NOT affect the carry flag, allowing
things like a perfectly pipelined (without any unrolling) TCP/IP
copy+checksum.
Unfortunately, on a P6 this becomes effectively a Partial Register
Stall. :-(
Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
> So, when an instruction is dispatched it requires possibly three
> inputs, two inputs providing values and the condition codes. And the
> condition codes may be stored in a different ROB entry to either of the
> value inputs. Doesn't this make things a little more complex?
There are very few instructions that need three inputs. Since you can
split them into two micro-ops with two inputs each you don't need
another input CAM.
Surely, at any point in execution the relevant flag inputs are the
flags associated with the instruction executed previously. So when an
instruction that uses them is executed it must read flags from that
previous instruction.
Thinking about it further, I suppose the reason this works without any
more problems is that there aren't many (any?) instructions that read
two operands and the flags. The ones that read two operands mostly
write the flags. So for the few instructions where the flags remain
undefined, microcode can 1) intervene and store the flags,
2) issue the equivalent insn that clobbers all the flags, then 3) put
back the remaining flags afterwards. Or do something similarly
clever with microcode.
> The only hiccup in this process relates to things like INC and DEC, which
> are architecturally defined to NOT affect the carry flag, allowing
> things like a perfectly pipelined (without any unrolling) TCP/IP
> copy+checksum.
>
> Unfortunately, on a P6 this becomes effectively a Partial Register
> Stall. :-(
Is that because it does something similar to what I mention above?
I guess this is where the penalty for the flag register really lies.
You're right, as you note this works because except for conditional
branches, very few instructions (primarily ADC/SBB) use flags as inputs.
>
> Thinking about it further, I suppose the reason this works without any
> more problems is that there aren't many (any?) instructions that read
> two operands and the flags. The ones that read two operands mostly
> write the flags. So for the few instruction where the flags remain
> undefined a microcode instruction can 1)intervene and store the flags
> 2) issue the equivalent insn that clobbers all the flags, then 3) put
> back in the remaining flags afterwards. Or do something similarly
> clever with microcode.
>
>> The only hiccup in this process relates to things like INC and DEC, which
>> are architecturally defined to NOT affect the carry flag, allowing
>> things like a perfectly pipelined (without any unrolling) TCP/IP
>> copy+checksum.
>>
>> Unfortunately, on a P6 this becomes effectively a Partial Register
>> Stall. :-(
>
> Is that because it does something similar to what I mention above?
>
> I guess this is where the penalty for flag register really lies.
I believe that's correct.
Sure. We're talking about whether condition codes are worthwhile in
a new ISA, so you could conceivably have an instruction which
generates the carry out bit, so no branch and no extra dataflow
dependency.
The problem is that you have to add the carry into the next
base-2^64 digit. If you are originally adding two numbers, you
now have two adds per digit column, and either (but not both)
can produce a carry-out. So you need yet another add to
combine those outputs.
I count 5 instructions per digit column, instead of one
add-with-carry.
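That count can be spelled out in C. A minimal sketch of one digit
column with no carry flag, using the standard unsigned-overflow test
(sum < addend); the five numbered operations correspond to the five
instructions counted above:

```c
#include <stdint.h>

/* One digit column of a multiprecision add, with no carry flag:
 * about five operations where ADC would use one. */
static uint64_t add_column(uint64_t a, uint64_t b, uint64_t carry_in,
                           uint64_t *carry_out) {
    uint64_t t  = a + b;              /* 1: add the digits             */
    uint64_t c1 = (t < a);            /* 2: recover that add's carry   */
    uint64_t s  = t + carry_in;       /* 3: add the incoming carry     */
    uint64_t c2 = (s < t);            /* 4: recover that add's carry   */
    *carry_out  = c1 | c2;            /* 5: merge (at most one is set) */
    return s;
}
```

The two partial carries can never both be set (when the first add
wraps, its result is at most 2^64 - 2, so adding a carry of 1 cannot
wrap again), which is why a single OR suffices to merge them.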
But then, I find it hard to imagine that a 5x slowdown for the core
operation in multiprecision arithmetic is really so bad, compared to
the overhead of condition codes. Rather than do condition codes,
I'd handle three-source instructions, as there are some other
operations that really want three sources (conditional moves being
the biggie, and shift-and-add a frequent enough idiom to matter).
For something like ten years I've occasionally looked for a
convincing case for three-source instructions, and I haven't found
it yet. Two source and no condition codes seems to be the right
way, maybe with one source aliasing the destination (two address).
Three-source instructions where at least one source is a constant
from the instruction stream seems good too.
If multiprecision arithmetic is really all that useful, just have a
dedicated unit, perhaps memory mapped. Silicon has gotten
cheap enough that a half-dozen special-case units, each 1 mm^2,
seems totally reasonable for a few-per-person CPU (desktop,
laptop, etc), and would probably save significant amounts of
power. Encryption and image compression seem like good
candidates. But keep these things away from the instruction
scheduler.
Don't worry about that. Multi-precision arithmetic libraries often
use base 2^63 (or 2^31, or whatever) on architectures where that is
faster than using the carry condition code. It is
more important to the speed of MP libraries to have a fast
multiply-accumulate instruction with a double-word output than
it is to have condition codes.
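A sketch of that reduced-radix trick with 63-bit limbs in 64-bit words
(the function and macro names are mine): the spare bit per limb means
limb + limb + carry can never wrap a 64-bit word, so the carry is
recovered with a shift instead of a flag.

```c
#include <stdint.h>

#define LIMB_BITS 63
#define LIMB_MASK ((1ULL << LIMB_BITS) - 1)

/* Add two n-limb numbers stored base 2^63.  The raw sum fits in 64
 * bits (max is (2^63-1) + (2^63-1) + 1 = 2^64 - 1), and the carry is
 * simply bit 63 of that sum -- no carry flag needed. */
static uint64_t mp_add(uint64_t *dst, const uint64_t *a,
                       const uint64_t *b, int n) {
    uint64_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint64_t t = a[i] + b[i] + carry;  /* cannot wrap */
        dst[i] = t & LIMB_MASK;            /* low 63 bits */
        carry  = t >> LIMB_BITS;           /* 0 or 1 */
    }
    return carry;                          /* carry out of the top limb */
}
```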
--
David Hopwood <david.nosp...@blueyonder.co.uk>