Heidi Pan's "High Performance, Variable-Length Instruction
Encodings" implies that a 25% reduction in code size relative to
MIPS should be possible without making a superscalar
implementation excessively difficult. A more constrained
instruction encoding (with simpler decoding) might lose half of
that advantage, so the question might be: is a 10-15% reduction in
code size worth a significant effort in architecture design and a
moderate additional effort in implementations? (Even before
reading that paper, I had been thinking about a 64-bit instruction
word ISA that would, on average, encode more than two instructions
per word by exploiting unused bits in common RISC encodings, with
modestly more variability in encoding but fixed locations for
register identifiers and parts of opcodes. The ISA might also have
somewhat more complex instructions than a pure RISC--e.g.,
load/store register pair--and highly implicit encodings for some
operations--e.g., a very short form for subroutine return with
the link register being implicit. Unlike a traditional LIW
architecture, it would be possible to jump into the middle of a
word; the entire instruction word would still be fetched, but the
operations before the entry point would be converted to no-ops.)
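(As a very rough sketch of the decode I have in mind -- the field
widths, positions, and formats below are purely illustrative, nothing
worked out:)

    # Toy decode of a 64-bit instruction word: a format tag plus register
    # identifiers at fixed bit positions, so register fields can be read
    # out before the operation boundaries are fully known.
    NOP = ("nop",)

    def decode_word(word, entry_slot=0):
        fmt = word & 0x3                                 # format tag
        regs = [(word >> (2 + 5 * i)) & 0x1F for i in range(4)]
        if fmt == 0:                                     # two short operations
            ops = [("op", regs[0], regs[1]), ("op", regs[2], regs[3])]
        else:                                            # one long-immediate op
            ops = [("long_op", regs[0], regs[1], word >> 32)]
        # Jumping into the middle of the word: the whole word is fetched,
        # but operations before the entry point become no-ops.
        return [NOP] * entry_slot + ops[entry_slot:]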
In addition to decode complexity constraints, there is also the
question of whether such an encoding places an excessive burden on
compiler developers (at least in terms of generating dense code).
Unfortunately the standard methods for achieving greater code
density tend to increase hardware complexity, compiler
complexity, or both. (A smaller register count can make
CSE-optimizations more difficult to evaluate as well as generally
making optimized register allocation more complex. Implicit
arguments make instruction decoding more complex. Special
purpose registers [even a division as general as address and
integer] make register allocation more complex. Modal
extensions of opcodes add complexity to both hardware and
software. Variable length encodings increase hardware
complexity. Complex instructions add both hardware and software
complexity.)
(Since a new ISA in this target market has minimal chance of
being implemented [especially an ISA developed by a mere
technophile], this is a highly theoretical question.)
Paul A. Clayton
Not speaking for Big Sam's Anatomical Warehouse
http://home.earthlink.net/~paaronclayton/Big_Sams
You can get a 40% reduction with just a mix of 16 and 32-bit
instructions, without harming performance or making superscalar
execution more difficult than it otherwise would be. And yes, for
embedded apps, 40% reduction in code size is well worth it, that's why
we have MIPS-16, Thumb, ARCcompact etc.
Cheers,
Jon
In the past, I have estimated that you could get a 50-75% reduction
by a two-level ISA, where the top level was designed for the high
level language. Designing ISAs for ease of code generation was
first proposed in the late 1960s, as far as I know, but was never
done in a mainstream system, and hasn't been attempted at all in
recent decades.
In order to do this, you need a microcode approach, possibly with
a programmable microcode. In the heyday of the RISC dogma, this
was stated to be incompatible with performance, but the Pentium 4
and Opteron have shown that that is now false, if it ever were
true (which is very doubtful).
Regards,
Nick Maclaren.
> In article <1121333201.7...@g47g2000cwa.googlegroups.com>,
> "j...@beniston.com" <j...@beniston.com> writes:
> |> > Heidi Pan's "High Performance, Variable-Length Instruction
> |> > Encodings" implies that a 25% reduction in code size relative to
> |> > MIPS should be possible without making a superscalar
> |> > implementation excessively difficult.
> |>
> |> You can get a 40% reduction with just a mix of 16 and 32-bit
> |> instructions, without harming performance or making superscalar
> |> execution more difficult than it otherwise would be. And yes, for
> |> embedded apps, 40% reduction in code size is well worth it, that's why
> |> we have MIPS-16, Thumb, ARCcompact etc.
>
> In the past, I have estimated that you could get a 50-75% reduction
> by a two-level ISA, where the top level was designed for the high
> level language. Designing ISAs for ease of code generation was
> first proposed in the late 1960s, as far as I know, but was never
> done in a mainstream system,
You can argue that Digital's VAX did this.
> and hasn't been attempted at all in recent decades.
Mainly because, at this level, code generation is fairly trivial.
> In order to do this, you need a microcode approach, possibly with
> a programmable microcode. In the heyday of the RISC dogma, this
> was stated to be incompatible with performance, but the Pentium 4
> and Opteron have shown that that is now false, if it ever were
> true (which is very doubtful).
70's minicomputers often used interpreted microcode, which is bad for
performance. What you see in P4 etc. is compiled microcode where each
instruction is decoded by a combinatoric circuit and translated into
microinstructions all in hardware. This is quite different from
microcode that would decode instructions in software (or "firmware",
as it was often called).
In any case, there are many simple ways to increase code density
without seriously complicating implementation or impacting
performance:
- Use two-address instructions, i.e., A := A+B instead of C := A+B.
Coalescing register allocators can in most cases eliminate the
extra moves required by two-address code (assuming you have more
than a handful of available registers).
- Use mixed-length instructions. If you make sure no instruction
spans a word boundary, this isn't too difficult to handle. This
  can be handled, e.g., by using only a few different ways to pack
  instructions into a word: for example, either one 31-bit
  instruction, two 15-bit instructions, or three 10-bit instructions
  into one 32-bit word (using the first one or two bits to select the
  format); a small sketch of this packing follows below.
- Provide instructions to load/store an interval/set of registers on
stack to reduce size of procedure prologues/epilogues.
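A sketch of the packing mentioned in the second point (the
format-selector bit patterns are chosen arbitrarily):

    # One 32-bit word holds either one 31-bit, two 15-bit or three 10-bit
    # instructions, selected by the low one or two bits.
    def unpack(word):
        if word & 1 == 0:                  # ....0 -> one 31-bit instruction
            return [(word >> 1) & 0x7FFFFFFF]
        if word & 3 == 1:                  # ...01 -> two 15-bit instructions
            return [(word >> 2) & 0x7FFF, (word >> 17) & 0x7FFF]
        return [(word >> 2) & 0x3FF,       # ...11 -> three 10-bit instructions
                (word >> 12) & 0x3FF,
                (word >> 22) & 0x3FF]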
Torben
Only feebly. It was a very bad design if that was an objective,
for reasons that were well-known before 1975.
|> > and hasn't been attempted at all in recent decades.
|>
|> Mainly because, at this level, code generation is fairly trivial.
Yes. At the cost of larger code, much more difficulty debugging,
and making it much harder to write correct interrupt handling.
|> 70's minicomputers often used interpreted microcode, which is bad for
|> performance. What you see in P4 etc. is compiled microcode where each
|> instruction is decoded by a combinatoric circuit and translated into
|> microinstructions all in hardware. This is quite different from
|> microcode that would decode instructions in software (or "firmware",
|> as it was often called).
Hmm. What makes that different from the argument that interpreted
and compiled programming languages are entirely different? That
viewpoint is generally agreed to be misleading.
|> In any case, there are many simple ways to increase code density
|> without seriously complicating implementation or impacting
|> performance: ...
The techniques that I am thinking of could actually SIMPLIFY
implementation! They would do that by removing much of the need
for the baroque tweaks (often involving interrupt handling) to
deal with the cases that are almost, but not entirely, handled in
hardware.
Regards,
Nick Maclaren.
That paper, I discovered in a brief Google search, can be found from:
http://www.cag.lcs.mit.edu/scale/hat/
The Google hit for the paper itself doesn't appear to be quite
workable, but this page has a good link on it in an obvious place.
> (Since a new ISA in this target market has minimal chance of
> being implemented [especially an ISA developed by a mere
> technophile], this is a highly theoretical question.)
For something that truly has no chance of being implemented, and is not
the product of serious research, try my page at
http://www.quadibloc.com/arch/arcint.htm
I started out with a few pages describing a simple architecture
inspired by the 68000 and the System/360, which was oriented towards a
compact instruction encoding from the beginning. But in adding ways
of making the instruction format more complex, I wound up adding a
plethora of alternate addressing modes, struggling to squeeze more in.
(Also, I added just about every unusual feature ever known to the world
of computing, so as to have an opportunity to explain how each one
works...)
John Savard
IIRC, according to the paper, MIPS-16 suffers on the order
of a 20% performance penalty (presumably mainly from the extra
instructions caused by the smaller immediately available
register set, though the slight lengthening of the pipeline
might have a modest performance impact as well). Any variable
length encoding adds complexity to fetch/decode, especially
for a superscalar implementation. (It might be debated that
any performance-oriented implementation would already be so
complex that some extra decode complexity might be accepted.)
Note also that the target market did not include the lower-end
embedded market where Thumb et al. are so attractive. The
question remains: Would a modest improvement in code density
improve performance enough to justify the extra effort in ISA
design (and whatever [hopefully minimized] additional effort
in implementation)?
Paul A. Clayton
not speaking for Big Sam's Anatomical Warehouse
http://home.earthlink.net/~paaronclayton/Big_Sams
Torben Ægidius Mogensen wrote:
[snip]
> In any case, there are many simple ways to increase code density
> without seriously complicating implementation or impacting
> performance:
>
> - Use two-address instructions, i.e., A := A+B instead of C := A+B.
> Coalescing register allocators can in most cases eliminate the
> extra moves required by two-address code (assuming you have more
> than a handful of available registers).
The cited paper proposed this also, though it used a 3-operand
encoding when appropriate to avoid the extra move instruction.
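For concreteness, a minimal sketch of the lowering involved (register
names are arbitrary):

    # Lowering three-address "dst = src1 op src2" to two-address form.
    # The extra copy is what a coalescing register allocator can usually
    # remove, by giving dst and src1 the same register when src1 is dead.
    def lower_two_address(op, dst, src1, src2):
        if dst == src1:
            return [(op, dst, src2)]                 # dst op= src2
        return [("mov", dst, src1), (op, dst, src2)]

    print(lower_two_address("add", "r3", "r1", "r2"))
    # [('mov', 'r3', 'r1'), ('add', 'r3', 'r2')]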
> - Use mixed-length instructions. If you make sure no instruction
> spans a word boundary, this isn't too difficult to handle. This
> can be handled, e.g., by using only a few different ways to pack
> instruction into a word, for example, either one 31 bit
> instruction, two 15-bit instructions or three 10-bit instructions
> into one 32-bit word (using the first one or two bits to select the
> format).
This is something like the 64-bit instruction word ISA I was
thinking about except that I was thinking of making the
instruction boundaries less clear. E.g., a position in the word
could be reserved for a source register ID where only later
decoding would indicate which operation used that register.
Since the ISA would be targeted at performance-oriented
implementations, a 64-bit minimum fetch width is reasonable, and
it allows more flexibility in encoding instructions, including
long immediates. (I was thinking of providing a word-spanning
operation for true 64-bit immediates, primarily to load constant
addresses. This might also be extended to let the fetch bandwidth
supply a data stream while a tiny side buffer provides the loop
body of code, though that is probably not worth the significant
complexity it would add.)
> - Provide instructions to load/store an interval/set of registers on
> stack to reduce size of procedure prologues/epilogues.
Optimizing procedure overhead is an attractive idea,
but I am concerned that adding such would increase hardware
complexity and pipeline length. Load/store register pair
seemed a weaker compromise that would be simpler to implement
and could take advantage of cache structure.
Thanks for the comments.
Paul A. Clayton
not speaking for Big Sam's Anatomical Warehouse
http://home.earthlink.net/~paaronclayton/Big_Sams
> For something that truly has no chance of being implemented, and is not
> the product of serious research, try my page at
> http://www.quadibloc.com/arch/arcint.htm
In reference to Heidi Pan's invention, I have made modifications to the
pages
http://www.quadibloc.com/arch/ar507.htm
http://www.quadibloc.com/arch/ar502.htm
in order to add her idea (with modifications for application to a
pre-existing CISC architecture, rather than a customized
variable-length modification of a pre-existing RISC architecture) to my
architecture.
John Savard
> Torben Ægidius Mogensen wrote:
> > - Provide instructions to load/store an interval/set of registers on
> > stack to reduce size of procedure prologues/epilogues.
>
> Optimizing procedure overhead is an attractive idea,
> but I am concerned that adding such would increase hardware
> complexity and pipeline length.
It has been a part of the ARM ISA since ARM1, so it can't be that
difficult to implement. Nor does it have to affect pipeline length,
though it may require several cycles to complete. In addition to
working well with cache, it also exploits sequential memory access.
Torben
> nm...@cus.cam.ac.uk (Nick Maclaren) writes:
>
>> In article <1121333201.7...@g47g2000cwa.googlegroups.com>,
>> "j...@beniston.com" <j...@beniston.com> writes:
>> |> > Heidi Pan's "High Performance, Variable-Length Instruction
>> |> > Encodings" implies that a 25% reduction in code size relative to
>> |> > MIPS should be possible without making a superscalar
>> |> > implementation excessively difficult.
>> |>
>> |> You can get a 40% reduction with just a mix of 16 and 32-bit
>> |> instructions, without harming performance or making superscalar
>> |> execution more difficult than it otherwise would be. And yes, for
>> |> embedded apps, 40% reduction in code size is well worth it, that's
>> why
>> |> we have MIPS-16, Thumb, ARCcompact etc.
>>
>> In the past, I have estimated that you could get a 50-75% reduction
>> by a two-level ISA, where the top level was designed for the high
>> level language. Designing ISAs for ease of code generation was
>> first proposed in the late 1960s, as far as I know, but was never
>> done in a mainstream system,
>
> You can argue that Digital's VAX did this.
FWIW, our PL/I on Alpha generates executables ~ 3 times larger than it
does on VAX, which implies the need for larger cache and more bandwidth
on the memory cache channel.
A decrease in code size would most likely lead to an increase in
instruction cache hit rate. This obviously depends upon the
implementation of other microarchitectural features such as prefetch
though.
Cheers,
Jon
But ARM has also been very slow to go superscalar relative
to MIPS, PowerPC, and Alpha. (There are other reasons for
this, but I suspect that the ISA makes a superscalar
implementation more challenging.) Cracking and millicoding
(to use POWER4 terminology) requires extra work which might
not be easily done in parallel with other, necessary
decoding work. (32b PowerPC also has load/store multiple,
which loads/stores Rn through R31 using any GPR for the base
address register.)
Paul A. Clayton
http://home.earthlink.net/~paaronclayton
That's very poor reasoning indeed. Some ISA features are easy
in primitive systems that lack caches, out-of-order completion,
split transaction buses and MMUs, but become a pain in the ass
to get right when these things are introduced. For example,
in a non-MMU system, you don't have to worry about a page fault
occurring in the middle of your LDM/STM, which is a PITA.
LDM/STM is not too hard, but there are a lot of things you
have to keep in mind to make sure it gets done in the face
of all the events that might occur during its execution, and
you do need to add a little hardware to the pipeline to do it.
One of the motivational philosophies of RISC was to avoid
things that added complexity without adding significant
performance. Given the high hit rate of instruction caches,
you can argue that LDM/STM may not be a win compared to
a sequence of, say, LDRD/STRD (double-word load/stores),
especially if your ISA is efficiently encoded (like IA32 and Thumb).
And then you go off and simulate that, for the particular
implementation technology of the day, and see.
On a related note, one mistake naive architects often make
is thinking that the "goodness" of a particular ISA feature
is independent of the underlying implementation technology.
This is demonstrably untrue.
--
Dennis M. O'Connor dm...@primenet.com
Tom Linden wrote:
[snip]
> FWIW, our PL/I on Alpha generates executables ~ 3 times larger than it
> does on VAX, which implies the need for larger cache and more bandwidth
> on the memory cache channel.
Was this before the addition of byte and dual-byte load/store
instructions? Also, ISTR that _optimized_ Alpha code
included a lot of padding no-ops (the later implementations
liked to have branch targets 4-instruction aligned, IIRC,
and there might have been some other reasons for adding no-ops
such as to avoid having too many branches in a fetch block).
Other code-expanding optimizations like loop unrolling might
have been less effective on a VAX (because of a smaller register
count, less ability to schedule code, or even just greater cost
of larger code) and so not considered appropriate for VAX by the
compiler writers.
I suspect that Alpha executables were rarely optimized for
size (and that size-optimization was rarely a performance
benefit [relative to the loss from dropping padding no-ops
et al.] given a wide memory interface and large off-chip cache).
Also the weak support for large-branch-count codes in the EV6
branch predictor might imply that cache-busting commercial codes
were not a principal concern to the Alpha designers. OTOH,
perhaps later cache optimizers more carefully balanced code
size and ideal code scheduling?
Paul A. Clayton
http://home.earthlink.net/~paaronclayton/
>Torben Ægidius Mogensen wrote:
>[snip]
>> In any case, there are many simple ways to increase code density
>> without seriously complicating implementation or impacting
>> performance:
>>
>> - Use two-address instructions, i.e., A := A+B instead of C := A+B.
>> Coalescing register allocators can in most cases eliminate the
>> extra moves required by two-address code (assuming you have more
>> than a handful of available registers).
>
> The cited paper proposed this also, though it used a 3-operand
> encoding when appropriate to avoid the extra move instruction.
One variant that I have thought of that might make sense here is to spend a
single bit in the instruction that, when set, makes the instruction A+1 := A +
B. That is, if set, the destination is the "next" register after the first
source. This complicates register allocation but saves the extra moves
without using 4-5 bits for a full register specifier.
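Roughly, in decode terms (the field positions are invented for the
sketch):

    # "Destination is the register after the first source," using one
    # bit instead of a full 5-bit destination field.
    def decode_dest(insn):
        src1 = insn & 0x1F               # first source (also default dest)
        src2 = (insn >> 5) & 0x1F        # second source
        bump = (insn >> 10) & 1          # the single extra bit
        dst = (src1 + 1) & 0x1F if bump else src1
        return dst, src1, src2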
--
- Stephen Fuld
e-mail address disguised to prevent spam
snip
> On a related note, one mistake naive architects often make
> is thinking that the "goodness" of a particular ISA feature
> is independent of the underlying implementation technology.
> This is demonstrably untrue.
This is interesting. What "level" of implementation technology are you
talking about? At the level of say "CMOS" (as opposed to bipolar or GaAs),
that seems unlikely to change very much. But if you are talking at the
level of say a particular .08 micron CMOS process, you know that is going to
change several times over the life of a successful product. If at that
level, how do you know what features the several generation down the line
process will make easy or hard? Or do you mean that a particular design
fits well with say Intel's implementation technology but not TSMCs?
I suspect that Dennis is talking about simple scalar pipeline v superscalar
v out-of-order v trace caches - what sort of micro-architectural
implementation rather than process implementation. But I'm sure he can speak
for himself so I'll shut-up...
Peter
Around the time RISC was developing, memory was getting
much faster, relative to processor logic, and cheaper than it
had previously been. I contend this was one reason RISC could
successfully (for a while) abandon the densely-coded ISA's of
the CISCs for the faster-to-decode simplicity that characterized
the original RISC ISAs.
But times change, and logic got faster again. Code density mattered
again, and RISC architectures started incorporating denser
but harder to decode instructions (like ARM's Thumb and Thumb2
ISA extensions).
Now we are seeing a change within the processor logic itself:
the ratio of logic speed to communication speed is changing,
as is the ratio of power dissipated by wire loads as opposed to gates.
Six years ago, on-chip communication delays were small compared
to logic delays, and the power dissipated by wide busses going
across a chip could be neglected. Not so now. Communication
is now expensive in both time and power, the two things you
shouldn't be wasting in a modern CPU. This is what will eventually
kill wide superscalar designs: communication overhead. And I contend
that this will lead to a different "optimal" ISA for processors, an ISA that
minimizes the need for wasted communication. However, market forces
will probably cause that ISA to be hidden underneath translation hardware.
I don't know if anyone else at my employer agrees, though.
I'm not connected to processor design there anymore.
Anyway, does that clarify my meaning ?
--
So, if you can say, what kinds of things are you doing there now?
> Anyway, does that clarify my meaning ?
Yes, thanks. That makes a lot of sense. So, if I may ask, or at least try
to start a "general interest thread" (note subject change) what are the
characteristics of a thread that is designed for an era where communication
cost is expensive (in the sense Dennis talks about above). What are good
things and what are bad? How will this change underlying architecture and,
if one is willing to abandon legacy ISAs (say for an embedded processor),
how will it change ISAs?
Rather than use incrementing, I think one would want to use
bit replacement (which is logically simpler).
Designing ISAs for ease of code generation was first done in the late
50s and delivered in the early 60s on the B5000 et seq (perhaps
not "mainstream" systems). The 5000/5500/6700.../A-Series word mode
instruction sets were designed for easy code generation from Algol.
Somewhat later than the B5000 (delivery in the mid 60s) was the
Cobol-oriented ISA of the B3500/2500 et seq. Perhaps these too were
not "mainstream" systems.
I once worked on the B5500 with a guy who had worked on the Univac 9300
Cobol compiler (a 360 knock-off). One day I found him sitting on the
floor looking at a B3500 POO and muttering something awful, comparing
the effort he had to do to generate good Cobol code for the Univac
system to the very straightforward code generation from Cobol to the
3500.
Still later, Burroughs produced and sold the B1700 et seq, with
microcoded implementations of multiple language-oriented ISAs. At
least part of being language-oriented was being easy to compile to
the ISA from the language.
>
> In order to do this, you need a microcode approach, possibly with
> a programmable microcode. In the heyday of the RISC dogma, this
> was stated to be incompatible with performance, but the Pentium 4
> and Opteron have shown that that is now false, if it ever were
> true (which is very doubtful).
>
>
> Regards,
> Nick Maclaren.
--
"If you can't drink their booze, take their money, and then vote
against them anyway, you don't belong in this game."
L O'Donnell, Jr
Take a look at the log-log miss rate versus cache size plot on
http://en.wikipedia.org/wiki/CPU_cache . I'm going to claim the
falloff as you head right continues for quite a long way.
The implication is that increasing the cache size by a fixed ratio
will eliminate some fraction of cache misses. The curve is kinky, but
across a very wide range of cache sizes you get *something* for your
trouble.
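In model form: if the miss rate follows a power law in cache size (the
exponent is an empirical fit, often quoted near 0.5, and the numbers
below are placeholders rather than measurements), then each doubling
removes a fixed fraction of the remaining misses:

    # Idealized power-law miss model: miss_rate ~ m0 * (size/size0)**(-a)
    def miss_rate(size_kb, m0=0.05, size0_kb=8, a=0.5):
        return m0 * (size_kb / size0_kb) ** (-a)

    for kb in (8, 16, 32, 64, 128):
        print(kb, round(miss_rate(kb), 4))
    # Each doubling removes 1 - 2**-0.5 of the remaining misses, about 29%.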
As you say, tighter encodings should effectively improve hit rate in
applications where a significant portion of the on-chip cache is used
by instructions. My understanding is that's mostly the embedded and
business apps, for different reasons, and not the scientific apps.
But I've also heard the science apps are growing their instruction
cache footprint.
So the interesting thing is that there is some benefit across a wide
spectrum of implementations.
But how to best mitigate the cost?
The key things I'm looking for in a new instruction set are:
1) compatibility, but this ruins all the fun. Aside from that:
2) Before you see the instruction bits that you are fetching, you
should be able to cap how many register reads and writes those
instructions might make. MIPS16 and Thumb are fine here, since
they do an explicit mode switch between short and long formats,
and that mode bit can be propagated to the fetch stage of the
pipe.
The gain here is that the rename logic can be integrated into
the fetch pipeline, with no FIFO between the I$ and rename.
That FIFO adds fetch latency and thus mispredict latency.
PowerPC and HP-PA aren't so great in this department, because
both can have two writes per instruction but usually don't.
3) As Dennis says, cross-chip busses are expensive. I'd like to
see an instruction set that helps get the maximum amount of
work from each result broadcast, by making it possible for the
processor to determine that a result (and its tag) need *not*
be broadcast. Two address instructions can help here by
making it obvious that one read operand is dead at the end of
the instruction.
4) Fetch is power hungry, as it requires a lot more bandwidth
   from caches that are usually similar in size to the data cache.
Compressing the instruction stream is good. Arranging for
the CPU to be able to use as many of the fetched bytes as
possible is good too, since a lot of fetched bytes get wasted
these days on each taken branch. This basically comes down
to predicting branches before they are *fetched*.
Finally, note that fast x86 implementations do not have microcoded
fast paths. The instructions that compilers are encouraged to emit
generate a small number of microops, i.e. one or two on the K8. This
leaves the issue queues with the (at least bounded) problem of
enqueueing up to 6 microops per cycle.
I would guess that a new ISA would have more explicit
(software-managed) partitioning of work and more explicit
management of communication. (Offloading more of this to
software would allow simpler, smaller hardware and a more
efficient communication hierarchy/network.) This implies more
threading, architected clustering of registers and functional
units, possibly more explicit control of the memory hierarchy.
It might also be desirable for basic units of logical work to
become larger (and presumably this would be somewhat exposed in
the ISA to reduce hardware complexity/size) given 'cheap' logic.
In addition to horizontal clustering (perhaps somewhat like
MAJC?), vertical clustering (along the length of the pipeline)
might be exposed in the ISA. (E.g., one might have 'front-end'
registers [branch condition registers, jump target registers],
'early-stage' registers [global address registers and possibly a
stack register, registers whose values change relatively
infrequently and which are used to access associated
'early-stage' cache; this might include jump table lookups
{reducing the length of communication of jump targets}],
'middle-stage' registers [address and offset registers {with the
main fast cache}, perhaps flag values and even shift values],
'fast-execution-stage' registers [where result latency is
important] and perhaps a 'slow-execution-stage' set.)
There might also be some attraction to explicit small value
registers (reducing register file size and communication width),
obviously for condition registers but also for other values.
Since intra-core multithreading is likely to be attractive (to
hide communication latency) there might be an incentive to
provide for register-based communication between threads (this
can be helpful with threads in separate cores as well). There
would probably also be some incentive for non-privileged thread
control operations (in which the OS would only get involved when
a hardware resource limit was reached?).
One might also want an ISA that has a nice hierarchy of decoding
such that farther levels of the memory hierarchy might have
denser but less decode-friendly representations. Decompression
work can be less expensive in energy and latency than going to
the next level of the hierarchy. OTOH, one also does not want
the common case (L1 hit) to be excessively slow and/or
power-using since decode logic would tend to be faster and less
energetic than accessing a large memory array. Farther levels of
memory could work on larger blocks and so recognize a different
level of redundancy/compressibility.
Good things would seem to be those that hide latency
(multithreading [especially intra-core], e.g.), make the common
case fast/small (clustering and hierarchies of communication,
e.g.), and exploit pipelining (perhaps with the motto "do [even
partial] things as soon as possible"?). Bad things would seem to
be those that unnecessarily expose latency (e.g., a load
jump-pointer and jump instruction might be good for code density
but bad for not allowing the load to be scheduled earlier
[assuming prediction is problematic]), that excessively
generalize (e.g., a huge L1 cache to increase the hit rate but
which increases average power and average access time), or that
try to do everything at one time.
Paul A. Clayton
just a technophile, not even a CS student
http://home.earthlink.net/~paaronclayton
Sorry, I can't say.
>> Anyway, does that clarify my meaning ?
>
> Yes, thanks. That makes a lot of sense. So, if I may ask, or at least
> try to start a "general interest thread" (note subject change) what are
> the characteristics of a thread that is designed for an era where
> communication
By "thread" do you mean ISA ? Or microarchitecture ?
> cost is expensive (in the sense Dennis talks about above). What are good
> things and what are bad? How will this change underlying architecture
> and, if one is willing to abandon legacy ISAs (say for an embedded
> processor), how will it change ISAs?
<Sigh> Any ideas _I_ have on that subject are Intel property, I'm afraid.
[ Even if no one at Intel would care what I think, except the lawyers.
But then, I like the lawyers I've worked with at Intel. Given the choice
between a random Intel patent lawyer and a random Intel manager,
I'd pick the lawyer. ]
--
Another reason being the ability to implement a simple processor
on a single chip, especially with an on-chip Icache?
Interestingly memory is still cheap but is now slow (or is it
expensive and a little slow, considering on-chip L3 cache
comparable to early RISC memory?)
>But times change, and logic got faster again. Code density mattered
>again, and RISC architectures started incorporating denser
>but harder to decode instructions (like ARM's Thumb and Thumb2
>ISA extensions).
I take this as a vote that code density matters enough (even for
a high-performance ISA) that significant thought should be
devoted to compact representations of work. (I am not certain I
agree with the argument. Even for embedded systems in the
higher-performance part of the spectrum, the dense-code ISAs have
not yet taken root, much less flourished. I am guessing that code
density is more important in some embedded systems because
moderate-capacity persistent storage [flash] is relatively
expensive and the ratio of code memory to data memory is
relatively high [and one often pays twice for memory capacity,
once for persistent store and again for DRAM, correct?]. In more
price-constrained systems, cache [being redundant, low-density
memory] is probably also less desirable relative to more
performance-oriented systems.)
[snip]
>that this will lead to a different "optimal" ISA for processors, an ISA that
>minimizes the need for wasted communication. However, market forces
>will probably cause that ISA to be hidden underneath translation hardware.
Software translation, PLEASE! :-)
Paul A. Clayton
just a technophile
And I'm going to claim that for CPUs running full-feature OS's,
that graph is worthless, because it doesn't take into account
the task switching that happens in a full-feature OS, which
throws a big wrench into mid-size cache performance.
[ Similarly, analysis of single task benchmarks has often
led to too small TLBs (and inadequate TLB support in
some older ISAs as well) which really can whack performance. ]
> The implication is that increasing the cache size by a fixed ratio
> will eliminate some fraction of cache misses. The curve is kinky, but
> across a very wide range of cache sizes you get *something* for your
> trouble.
And you pay something for increasing the cache size: your cache leaks
more power, dissipates more energy every time you access it, and is slower.
The first two mean that within a particular power envelope (such as
might arise from packaging limitations, cost concerns, or the desire
for decent battery life) you are going to have to slow the thing down.
The last cost means it won't be as fast as a smaller one. At some
point, your bigger cache actually leads to lower performance,
because the better hit rate just doesn't make up for the slowdown.
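The usual average-memory-access-time arithmetic makes the crossover
explicit (the numbers here are invented, not measurements of any real
part):

    # AMAT = hit time + miss rate * miss penalty.  The bigger cache has
    # the lower miss rate but the slower hit.
    def amat(hit_cycles, miss_rate, miss_penalty):
        return hit_cycles + miss_rate * miss_penalty

    print(amat(2, 0.04, 100))   # smaller, faster cache: 6.0 cycles
    print(amat(4, 0.03, 100))   # bigger, slower cache:  7.0 cycles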
[...]
> The key things I'm looking for in a new instruction set are:
[...]
> 2) Before you see the instruction bits that you are fetching, you
> should be able to cap how many register reads and writes those
> instructions might make. MIPS16 and Thumb are fine here, since
> they do an explicit mode switch between short and long formats,
> and that mode bit can be propagated to the fetch stage of the
> pipe.
>
> The gain here is that the rename logic can be integrated into
> the fetch pipeline, with no FIFO between the I$ and rename.
> That FIFO adds fetch latency and thus mispredict latency.
I don't see how that gain comes from knowing how many
register reads and writes you are doing. What am I missing ?
> 4) Fetch is power hungry, as it requires a lot more bandwidth
> from usually similar-sized caches to the data cache.
> Compressing the instruction stream is good.
Or just use an ISA that is efficiently encoded to begin with.
Like the VAX ISA, for example: why use a whole 32-bit
word when a single byte will do ? So, the wheel turns,
the underlying technologies change, and now RISC isn't
the best fit anymore, and CISC is ...
But NOT that old "close the semantic gap" CISC !
Geesh, in retrospect, what nonsense that was. We just
need an efficiently coded ISA that's a good target for compilers,
and supports modern operating systems, that's all.
Now, should it be Load-Store, like the RISC guys used to think ?
Maybe not: if you are only going to use a value once (and the compiler
knows whether you are or not), why load it into an (architectural)
register?
Now MEM op MEM is such a pain, I'm all for not hitting the data TLB twice.
But some kind of MEM op REG, that just might be the new sweet spot ...
Thanks, though Dennis and Stephen have spawned
another from it.
[snip]
> As you say, tighter encodings should effectively improve hit rate in
> applications where a significant portion of the on-chip cache is used
> by instructions. My understanding is that's mostly the embedded and
There is also a reduction in bandwidth required.
[snip]
> 2) Before you see the instruction bits that you are fetching, you
> should be able to cap how many register reads and writes those
> instructions might make. MIPS16 and Thumb are fine here, since
How bad are 'pair' operations like load register pair?
[snip]
> 4) Fetch is power hungry, as it requires a lot more bandwidth
> from usually similar-sized caches to the data cache.
> Compressing the instruction stream is good. Arranging for
> the CPU to be able to use as many of the fetched bytes as
> possible is good too, since a lot of fetched bytes get wasted
> these days on each taken branch. This basically comes down
> to predicting branches before they are *fetched*.
Might a faster-cycled, narrower fetch be appropriate, allowing
more code information to be exploited before the next fetch and
possibly allowing faster corrections?
Well, yeah. Can you give me a pointer to some public data which
is better?
In any case, although you certainly wouldn't want to rely on the
specific shape of the curve, other results that I've read have
indicated that cache miss rate continues to fall as cache sizes
get bigger -- compulsory misses almost never dominate.
And yep, many folks I've worked with would agree that big
associative TLBs are better than simple sims indicate.
Dennis> At some point, your bigger cache actually leads to lower
Dennis> performance, because the better hit rate just doesn't
Dennis> make up for the slowdown.
But note that that point is greater than 2MB for secondary caches
for a low-power x86 implementation (named Banias). Or did I get
the number wrong?
Iain> The gain here is that the rename logic can be integrated into
Iain> the fetch pipeline, with no FIFO between the I$ and rename.
Iain> That FIFO adds fetch latency and thus mispredict latency.
Dennis> I don't see how that gain comes from knowing how many
Dennis> register reads and writes you are doing. What am I missing ?
If the fetch pipeline integrates rename with no intervening FIFO,
then for X bytes fetched, you will have Y read renames and Z write
renames. If you can predict, when launching the fetch into the pipe,
how many renames are needed, then you can throttle the pipe to match
your rename bandwidth at the top of the pipe.
Otherwise, with no FIFO, you'll occasionally fetch an instruction
which needs more rename ports than are available. The pipeline gets
the wrong answer and you restart it. I suppose you can tell the
compiler writers to put lotsa-register instructions near ops with
constants, to cut down how often this happens. I've never analyzed
that alternative.
And, of course, on some RISC pipes the prediction is trivial: each
32-bit instruction fetch will need two reads and one write. No
throttling necessary.
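In sketch form (the port counts and per-format operand caps are
assumptions):

    # Throttle fetch to rename bandwidth using per-format caps on register
    # reads/writes that are known before decode (e.g. from a mode bit).
    READ_PORTS, WRITE_PORTS = 8, 4

    def slots_to_fetch(formats):
        """formats: per-slot (max_reads, max_writes), known pre-decode."""
        reads = writes = count = 0
        for max_r, max_w in formats:
            if reads + max_r > READ_PORTS or writes + max_w > WRITE_PORTS:
                break
            reads, writes, count = reads + max_r, writes + max_w, count + 1
        return count

    # Classic RISC: every instruction capped at two reads, one write.
    print(slots_to_fetch([(2, 1)] * 6))    # -> 4, port limits reached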
Iain> Compressing the instruction stream is good.
Dennis> Now MEM op MEM is such a pain, I'm all for not hitting the
Dennis> data TLB twice. But some kind of MEM op REG, that just
Dennis> might be the new sweet spot ...
Yep. x86 looks pretty good if you use the right subset. And of
course MEM <- MEM op REG is just fine when it's the same MEM: one
TLB access.
Not offhand. Surely you can find some yourself, though.
I'm sure the IEEE and ACM journals have papers on such.
> In any case, although you certainly wouldn't want to rely on the
> specific shape of the curve, other results that I've read have
> indicated that cache miss rate continues to fall as cache sizes
> get bigger -- compulsory misses almost never dominate.
Misses caused by task switching aren't mainly compulsory misses.
> Dennis> At some point, your bigger cache actually leads to lower
> Dennis> performance, because the better hit rate just doesn't
> Dennis> make up for the slowdown.
>
> But note that that point is greater than 2MB for secondary caches
> for a low-power x86 implementation (named Banias). Or did I get
> the number wrong?
Why do you say that ? Do you have numbers to back that claim,
comparing, say, the performance of a CPU with a 2MB L2 against
that of a CPU with a faster 1MB cache ? I'm not saying you are
wrong, mind you. But you are making a claim without any apparent
support beyond "it must be better because that's what they did".
> Iain> The gain here is that the rename logic can be integrated into
> Iain> the fetch pipeline, with no FIFO between the I$ and rename.
> Iain> That FIFO adds fetch latency and thus mispredict latency.
>
> Dennis> I don't see how that gain comes from knowing how many
> Dennis> register reads and writes you are doing. What am I missing ?
>
> If the fetch pipeline integrates rename with no intervening FIFO,
> then for X bytes fetched, you will have Y read renames and Z write
> renames. If you can predict, when launching the fetch into the pipe,
> how many renames are needed, then you can throttle the pipe to match
> your rename bandwidth at the top of the pipe.
Ah, you are assuming a microarchitecture that depends heavily
on register renaming. Not all of them do, you know, and that
may not be the "sweet spot" in the future.
> Iain> Compressing the instruction stream is good.
>
> Dennis> Now MEM op MEM is such a pain, I'm all for not hitting the
> Dennis> data TLB twice. But some kind of MEM op REG, that just
> Dennis> might be the new sweet spot ...
>
> Yep. x86 looks pretty good if you use the right subset. And of
> course MEM <- MEM op REG is just fine when it's the same MEM: one
> TLB access.
But two data cache accesses, even if one can be done opportunistically
via a write buffer. I'm not sure that's worth it. REG <- MEM op REG
seems easier to pipeline, to me.
Er, yes. I should have made myself clearer :-( I was thinking of
somewhat more generic models, but I realise that I made a typo and
put 'language' instead of 'languages'. Anyway, I wasn't meaning
to put down the Burroughs, so much as to point out that there
was serious published analysis of ISA design over 35 years back,
covering many of the same topics that were rediscovered in the
mid-1980s and are being rediscovered again today.
In response to a remark of Dennis O'Connor, yes, the cheap memory
of the late 1980s was a factor in RISC getting away with it, but
the code size issue had ceased to be a primary constraint over a
decade before. It was, briefly, a rediscovered issue on the early
microprocessors and was important until recently on embedded ones,
but is almost always overstated.
My belief is that the enduring, if not endearing, infatuation with
code size is because it is one of the few issues simple enough for
inexperienced amateurs to grasp.
Regards,
Nick Maclaren.
Yes, Thumb-2 does mix 16 and 32-bit instructions and can achieve
the same performance as ARM, but at Thumb-1 code size.
> A decrease in code size would most likely lead to an increase in
> instruction cache hit rate. This obviously depends upon the
> implementation of other microarchitectural features such as prefetch
> though.
This effect can be quite large on a small cache (8 or 16KB), and is still
significant on 32 and 64KB I-caches. On a context switch you get the
working set back with far fewer cache misses/page faults. You can spend
extra transistors on the cache (ARM needs a 50% larger I-cache for identical
performance) or on the memory hierarchy (L2, wider bus - more area and
extra power) or on an instruction decoder that can deal with 16/32-bit
instructions (cheap compared to the other alternatives).
In the embedded world going for code density is definitely the best tradeoff.
Wilco
> ""Torben Ægidius Mogensen"" <tor...@app-4.diku.dk> wrote in ...
> > Dysthy...@aol.com writes:
> >> Torben Ægidius Mogensen wrote:
> >
> >> > - Provide instructions to load/store an interval/set of registers on
> >> > stack to reduce size of procedure prologues/epilogues.
> >>
> >> Optimizing procedure overhead is an attractive idea,
> >> but I am concerned that adding such would increase hardware
> >> complexity and pipeline length.
> >
> > It has been a part of the ARM ISA since ARM1, so it can't be that
> > difficult to implement.
>
> That's very poor reasoning indeed. Some ISA features are easy
> in primitive systems that lack caches, out-of-order completion,
> split transaction buses and MMUs, but become a pain in the ass
> to get right when these things are introduced. For example,
> in a non-MMU system, you don't have to worry about a page fault
> occurring in the middle of your LDM/STM, which is a PITA.
True. I believe early ARMs had an error exactly when this happened.
> LDM/STM is not too hard, but there are a lot of things you
> have to keep in mind to make sure it gets done in the face
> of all the events that might occur during its execution, and
> you do need to add a little hardware to the pipeline to do it.
>
> One of the motivational philosophies of RISC was to avoid
> things that added complexity without adding significant
> performance. Given the high hit rate of instruction caches,
> you can argue that LDM/STM may not be a win compared to
> a sequence of, say, LDRD/STRD (double-word load/stores),
> especially if your ISA is efficiently encoded (like IA32 and Thumb).
> And then you go off and simulate that, for the particular
> implementation technology of the day, and see.
The main issue here was code density, not efficiency, so replacing
LDM/STM by a sequence of load/stores is not a good idea.
> On a related note, one mistake naive architects often make
> is thinking that the "goodness" of a particular ISA feature
> is independent of the underlying implementation technology.
> This is demonstrably untrue.
Definitely. But code density less so than efficiency, though some
choices for making code more dense can adversely affect efficiency.
Torben
> <Dysthy...@aol.com> wrote in message
> news:1121349929.9...@g44g2000cwa.googlegroups.com...
>
> >Torben Ægidius Mogensen wrote:
> >[snip]
> >>
> >> - Use two-address instructions, i.e., A := A+B instead of C := A+B.
> >> Coalescing register allocators can in most cases eliminate the
> >> extra moves required by two-address code (assuming you have more
> >> than a handful of available registers).
> >
> > The cited paper proposed this also, though it used a 3-operand
> > encoding when appropriate to avoid the extra move instruction.
>
> One variient that I have thought of that might make sense here is to spend a
> single bit in the instruction that when set makes the instruction A+1 = A +
> B. That is, if set, the destination is the "next" register from the first
> source. This complicates register allocation but saves the extra moves
> without using 4-5 bits for a full register specifier.
A more drastic way of saving bits for register specification is to
allow only a small subset of the N^2 or N^3 possible pairs and triples
of registers, selected more or less at random. If the desired pair or
triple isn't available, a sequence of moves may be required to move
the operands into the right registers first (and it may get longer if
the registers on the shortest paths are occupied). The register
allocator will need to minimize the number of such moves.
The problem is obviously NP-complete, but heuristics may be found.
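A toy version of the fix-up, with an arbitrary allowed-pair table (a
real allocator would also have to respect the liveness of the copy
targets):

    ALLOWED_PAIRS = [("r0", "r1"), ("r2", "r3"), ("r1", "r4")]

    def fixup(op, a, b):
        """Emit moves until (a, b) becomes an encodable pair, then the op."""
        if (a, b) in ALLOWED_PAIRS:
            return [(op, a, b)]
        for x, y in ALLOWED_PAIRS:         # greedy: keep one operand in place
            if x == a:
                return [("mov", y, b), (op, a, y)]
            if y == b:
                return [("mov", x, a), (op, x, b)]
        x, y = ALLOWED_PAIRS[0]            # worst case: move both operands
        return [("mov", x, a), ("mov", y, b), (op, x, y)]

    print(fixup("add", "r2", "r4"))
    # [('mov', 'r3', 'r4'), ('add', 'r2', 'r3')]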
Torben
That depends. I've seen the distribution of the number
of registers saved and restored by compiler-generated
LDM/STM for ARM, and it is illuminating data. Given the
extra bits needed to spec which regs to load or store
in a LDM/STM, one can imagine an ISA where an equivalent
series of LD or ST ops would not take many more bits
to encode. Given that you could then remove all the decode
and pipeline cruft associated with LDM/STM, this might
be a win even in the face of the slightly higher I-Cache
miss rate. I'd have to build a simulator and run some
workloads to know for sure though, and I don't do that
kind of stuff for a living, although I used to.
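Back-of-the-envelope, with assumed encoding sizes rather than any real
ISA's numbers:

    # Encoding cost: one 32-bit LDM/STM with a 16-bit register mask versus
    # n compact 16-bit loads, or n/2 32-bit load-pair instructions.
    def bits_ldm(n):         return 32
    def bits_short_loads(n): return 16 * n
    def bits_load_pairs(n):  return 32 * ((n + 1) // 2)

    for n in (1, 2, 3, 4, 6, 8):
        print(n, bits_ldm(n), bits_short_loads(n), bits_load_pairs(n))
    # For one or two registers the sequences cost no more than the LDM;
    # how much the growth for longer lists hurts depends on the measured
    # distribution.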
>>On a related note, one mistake naive architects often make
>>is thinking that the "goodness" of a particular ISA feature
>>is independent of the underlying implementation technology.
>>This is demonstrably untrue.
>
> Definitely. But code density less so than efficiency, though some
> choices for making code more dense can adversely affect efficiency.
"Efficiency" can mean efficiency in use of gates, or time,
or power, or engineer-years to get it out the door. :-)
Soon, I think, only the latter two will really matter
in CPUs designed for full-featured OS's.
--
Dennis M. O'Connor
Opinions do not reflect ...
nm...@cus.cam.ac.uk (Nick Maclaren) writes:
> In response to a remark of Dennis O'Connor, yes, the cheap memory
> of the late 1980s was a factor in RISC getting away with it, but
> the code size issue had ceased to be a primary constraint over a
> decade before. It was, briefly, a rediscovered issue on the early
> microprocessors and was important until recently on embedded ones,
> but is almost always overstated.
"Memory System Characterization of Commercial Workloads"
(http://research.compaq.com/wrl/projects/Database/isca98_1.pdf)
states rather bad instruction cache behavior for an Oracle
OLTP application. Since then, software has grown, and, thanks
to object orientation, code that once was one big chunk is now
spread into methods all over the code space. Still, I am not
knowledgeable enough about caches to judge if this is really
a factor in overall performance.
Regards,
Chris
> >> This is interesting. What "level" of implementation technology are you
> > Anyway, does that clarify my meaning ?
> Yes, thanks. That makes a lot of sense. So, if I may ask, or at least try
> to start a "general interest thread" (note subject change) what are the
> characteristics of a thread that is designed for an era where communication
> cost is expensive (in the sense Dennis talkes about above). What are good
> things and what are bad? How will this change underlying architecture and,
> if one is willing to abandon legacy ISAs (say for an embedded processor),
> how will it change ISAs?
To make the topic even more general, I'm going to try to go 'back to
basics'.
Let us say that one has a very nice microprocessor that fits on a given
die - such as the Pentium IV or a comparable chip.
Then, the capability comes along to make bigger chips. What will you
add, so as to provide improved performance?
There are a number of nice things you could add.
You could add a second processor core.
You could add more cache.
You could still improve the arithmetic unit; instead of using the
multiplier unit several times in a row to do a divide, you could have
several multipliers lined up in a division pipeline, and start one
floating-point division every clock cycle.
You could widen the bus connecting main memory to the cache.
For today's chips, I would tend to vote for more cache ahead of a
second processor core, because effective utilization of a single
processor core already requires multithreading, and if multiple threads
are executing, the effective size of the cache is divided between them.
What I would *really* like to do, of course, is just make everything
faster: i.e., by using ECL instead of CMOS, or GaAs instead of silicon.
But at present, that isn't remotely possible - while technologies exist
to make faster gates than those in CMOS microprocessors, the number of
gates per chip falls off so drastically that improvement does not
appear possible in this direction.
Since there were some MIPS chips done in ECL, however, it _is_ possible
to make a microprocessor with hardware floating point in ECL; and as we
are getting into the range where the gains from using more transistors
on a chip are ever more marginal, if ECL is not totally abandoned, if
that technology continues to improve, it might once again become worth
considering. Improvements in chip size with GaAs or some other high
electron mobility semiconductor are also taking place - but very
gradually.
Possibly some other technology, like plastic semiconductors, or
amorphous silicon, will improve things in the other direction, allowing
unlimited transistor budgets but for transistors not quite as fast as
those on silicon CMOS chips.
And then there's the other radical way to increase transistor budgets:
replace lithography with something else, so that the 157 nanometer
barrier can be shattered - the difficulty of extreme ultraviolet
lithography being so great that extraordinary techniques are being used
to get as much mileage as possible out of current ultraviolet
wavelengths.
Perhaps, with current technologies improving as they are, a direction
of improvement might be to put a relatively simple microprocessor on
the same chip as some dynamic RAM instead of just cache - and perhaps
even dense non-volatile RAM so that the *hard drive* is in effect on
the chip too.
John Savard
No problem. It was worth asking.
>
>>> Anyway, does that clarify my meaning ?
>>
>> Yes, thanks. That makes a lot of sense. So, if I may ask, or at least
>> try to start a "general interest thread" (note subject change) what are
>> the characteristics of a thread that is designed for an era where
>> communication
>
> By "thread" do you mean ISA ? Or microarchitecture ?
Sorry. I was thinking about changing the thread subject and simply used the
word thread again when I should have used microarchitecture or, as I stated
below, ISA.
I understand that "proprietary issues" may prevent you from saying much, but
others may be more free to pontificate.
I think Dennis' point is that "bigger" chips won't provide more performance,
as communication costs dominate over transistor/computing costs. So simply
adding more functions, or even making basic transistors faster won't help
much if you can't reduce the distances that signals have to travel.
I'm not sure what you mean here. If you use a set bit to indicate that a
different register is used for the destination, then you could only use a subset,
say the even-numbered registers, for the basic specification and then
substitute the "next" odd-numbered one for the result (i.e., substituting
that bit for the low-order bit of the register specifier for the first
source). While this could certainly work, it halves the number of register
pairs you could use for the purpose.
Note that while the increment does take longer than "bit substitution", the
result of the addition is the destination register, which isn't needed
until after the operation of the instruction, so it isn't as bad as
delaying getting a source register specification.
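For concreteness, here is a tiny C sketch (purely illustrative, with invented
field handling) contrasting the two ways of deriving the implicit destination
register just discussed:

  /* Scheme 1: bit substitution.  The extra bit replaces the low-order bit
     of the first-source specifier, so the destination is limited to the
     even/odd partner of that source register. */
  static unsigned dest_by_substitution(unsigned rs1, unsigned alt_bit)
  {
      return (rs1 & ~1u) | (alt_bit & 1u);
  }

  /* Scheme 2: increment.  The destination is rs1 + 1 when the bit is set,
     which needs a small adder in decode but, as noted above, the result is
     not needed until the instruction writes back. */
  static unsigned dest_by_increment(unsigned rs1, unsigned alt_bit)
  {
      return (rs1 + (alt_bit & 1u)) & 31u;   /* wrap within 32 registers */
  }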
Nick Maclaren wrote:
> In article <1121333201.7...@g47g2000cwa.googlegroups.com>,
> "j...@beniston.com" <j...@beniston.com> writes:
> |> > Heidi Pan's "High Performance, Variable-Length Instruction
> |> > Encodings" implies that a 25% reduction in code size relative to
> |> > MIPS should be possible without making a superscalar
> |> > implementation excessively difficult.
> |>
> |> You can get a 40% reduction with just a mix of 16 and 32-bit
> |> instructions, without harming performance or making superscalar
> |> execution more difficult than it otherwise would be. And yes, for
> |> embedded apps, 40% reduction in code size is well worth it, that's why
> |> we have MIPS-16, Thumb, ARCcompact etc.
>
> In the past, I have estimated that you could get a 50-75% reduction
> by a two-level ISA, where the top level was designed for the high
> level language. Designing ISAs for ease of code generation was
> first proposed in the late 1960s, as far as I know, but was never
> done in a mainstream system, and hasn't been attempted at all in
> recent decades.
Debaere & Campenhout's Interpretation and Instruction Path Coprocessing
http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=5164
is a fine variation on this theme and was done in the early '90s. Their
idea was to speed up threaded code by tacking a "decode ROM" onto the
instruction-fetch logic of a 680x0 processor. The decode ROM contained
the native instruction sequence for each threaded-code operation. Threaded
code was fetched by external logic, looked up in the decode ROM, and the
corresponding native instructions were fed to the processor.
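As a rough software model of that arrangement (the token set and encodings
below are invented, and the real device did the expansion in fetch hardware,
not software):

  #include <stdint.h>
  #include <stddef.h>

  /* One decode-ROM entry: a short sequence of native instruction words. */
  typedef struct {
      size_t   count;
      uint16_t words[4];          /* hypothetical 680x0-style opcode words */
  } rom_entry;

  /* Hypothetical threaded-code tokens and their native expansions. */
  enum { TC_PUSH, TC_ADD, TC_RET, TC_COUNT };

  static const rom_entry decode_rom[TC_COUNT] = {
      [TC_PUSH] = { 2, { 0x2F00, 0x0000 } },   /* invented encodings */
      [TC_ADD]  = { 1, { 0xD190 } },
      [TC_RET]  = { 1, { 0x4E75 } },
  };

  /* Fetch threaded code, look each token up in the "ROM", and feed the
     corresponding native instructions to the core (here, a callback). */
  static void feed_core(const uint8_t *tokens, size_t n,
                        void (*issue)(uint16_t))
  {
      for (size_t i = 0; i < n; i++) {
          const rom_entry *e = &decode_rom[tokens[i]];
          for (size_t j = 0; j < e->count; j++)
              issue(e->words[j]);
      }
  }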
Using a predesigned processor and tacking on some support logic for a
higher-level language makes good economic sense. One saves much of the
design effort of a ground-up design, and can hitch a ride on a commodity
processor whose implementations follow Moore's law instead of being left
behind by a small market niche as were Lisp machines.
I suspect this would have been a better way to implement e.g. picojava.
It's also reminiscent of what one could do with configurable processors
like Tensilica, where you have a predefined RISC core (which
incidentally has two instruction sets, either 16-bit or 24-bit) and the
ability to define additional logic to interface to the core using tools
that compile C definitions of the logic.
[I don't know how Alpha PALcode is implemented, but it is perhaps a
similar idea]
> In order to do this, you need a microcode approach, possibly with
> a programmable microcode. In the heyday of the RISC dogma, this
> was stated to be incompatible with performance, but the Pentium 4
> and Opteron have shown that that is now false, if it ever were
> true (which is very doubtful).
The above examples show you don't need to use microcode at all. You
simply need to provide ways of interfacing additional logic to an
existing core. Using FPGA technology one can imagine, e.g., adding
associative memories to speed up specific language features such as
OO method lookup.
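As a purely illustrative software analogue of such an associative lookup
(the class/selector/method types here are invented), a direct-mapped method
cache backing a slower full lookup might look like:

  #include <stddef.h>
  #include <stdint.h>

  typedef struct klass  klass;     /* receiver's class   (invented type) */
  typedef struct method method;    /* compiled method    (invented type) */

  #define CACHE_SIZE 256           /* power of two, so hashing is a mask */

  typedef struct {
      klass  *cls;
      size_t  selector;            /* interned selector id               */
      method *target;
  } cache_line;

  static cache_line method_cache[CACHE_SIZE];

  /* Probe the cache; on a miss, fall back to the slow full lookup and
     fill the line, much as a hardware associative memory would be filled. */
  static method *lookup_method(klass *cls, size_t selector,
                               method *(*full_lookup)(klass *, size_t))
  {
      size_t i = (size_t)(((uintptr_t)cls ^ selector) & (CACHE_SIZE - 1));
      cache_line *line = &method_cache[i];
      if (line->cls == cls && line->selector == selector && line->target)
          return line->target;
      line->cls = cls;
      line->selector = selector;
      line->target = full_lookup(cls, selector);
      return line->target;
  }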
[and of course whether these kinds of approaches can beat compilation
technology is a separate question. Because compilation technology looks
at much more of the problem than a single bytecode it typically nets
much greater performance benefit.]
--
_______________,,,^..^,,,____________________________
Eliot Miranda Smalltalk - Scene not herd
A quick glance at that gives a strong sense of deja vu - that sort
of analysis dates from the 1960s and 1970s, and nothing has changed
except for the scale. Your point was made then, too, as a reason
that the then trendy 'structured programming' (with every little
basic operation made into a single function) was a performance
problem.
Then, as now, the effect of the Dcache has 'steps' in it, but for the
Icache it is more like 'doubling the cache size reduces the miss rate
by 30%'. One has to use a really HUGE cache to do anything useful
about a 19% miss rate (at 30% per doubling, getting down to about 2%
takes six or seven doublings, i.e. a cache around a hundred times
larger) and, conversely, one has to compress code beyond reason if
you take that approach.
One solution adopted to the 'method' problem (which is the same as
using a large, common set of basic operations) was and is inlining,
but that leads to bloat. A better one was tools for code ordering
(which sometimes allow replication), though that is closely related
to overlay management and suffered the same fate. These approaches
have the potential to tackle even 19% miss rates.
Potential. They aren't easy to design or use, and are very out of
fashion. One thing that most vendors seem to have missed is that
they are a natural for profile-based optimisation, which some of us
were doing in the 1970s for overlay designs. That could and should
be reinvented. It shouldn't be patented, but probably will be.
Regards,
Nick Maclaren.
Wow!!!
That is either the best reference I've ever seen to a group of lawyers,
or the worst reference to a group of managers. :-)
Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Interesting. That's the sort of thing that I mean, but see below
about its limitations.
>I suspect this would have been a better way to implement e.g. picojava.
Yes.
>
>Its also reminiscent of what one could do with configurable processors
>like Tensilica, where you have a predefined RISC core (which
>incidentally has two instruction sets, either 16-bit or 24-bit) and the
>ability to define additional logic to interface to the core using tools
>that compile C definitions of the logic.
Yes. There are many ways to approach this.
>The above examples show you don't need to use microcode at all. You
>simply need to provide ways of interfacing additional logic to an
>existing core. Using FPGA technology one can imagine e.g. adding
>associative memories used to speed-up specific language features such as
>oo method lookup.
No, you don't need microcode, or even a microcode approach, but I
don't think that the gain is big enough to be worthwhile without one.
A fixed front-end decoder is fine for a single language, but even
then it is fairly limited.
One of the reasons that I don't think that modern FPGA approaches
will work is you really need access to a lower-level ISA than is
normally defined (i.e. the same level as traditional microcode
used). A front-end can't be used to do anything that isn't easy
in the ISA (and many things aren't) and few applications have
single operations that dominate enough to be worth putting into
FPGA.
Regards,
Nick Maclaren.
Nick Maclaren wrote:
Well, if instruction density is important, then the above approach can
have a significant effect. Bytecode to native dynamic translators can
easily average 10 to 20 bytes of machine code for each byte of bytecode.
But a naive one-to-n decoder isn't going to allow the processor to run
at full tilt. One needs to be able to produce enough code for the
processor's lookahead facilities to work. But that's doable; similar
things are done in dynamic reorganization of binaries (there's work at
HP I can't track down).
> One of the reasons that I don't think that modern FPGA approaches
> will work is you really need access to a lower-level ISA than is
> normally defined (i.e. the same level as traditional microcode
> used). A front-end can't be used to do anything that isn't easy
> in the ISA (and many things aren't) and few applications have
> single operations that dominate enough to be worth putting into
> FPGA.
Right. For example one would like to be able to define instructions or
trap semantics for dealing with tags. [Shame that the SPARC hard-wired
the integer tag pattern to 00, instead of making it part of the
per-process state; our Smalltalk implementation still doesn't use the
tagged instructions because we find non-zero immediate tags more
convenient across all our platforms].
My point is that it isn't of major importance, because it is merely
the converse of increasing the Icache size.
> But a naive one-to-n decoder isn't going to allow the processor to run
>at full tilt. One needs to be able to produce enough code for the
>processor's lookahead facilities to work. But that's doable; similar
>things are done in dynamic reorganization of binaries (there's work at
>HP I can't track down).
That is a much more important issue, but it is NOT doable. Doubtless
the work you are referring to was related to the Itanium project;
assuming that was a soluble problem was the principal performance
mistake made by that project.
Regards,
Nick Maclaren.
Hmm? If the bit replaces the LSb of the register number,
an even 2nd source operand could specify either replacement
or use of the odd partner (incidentally the same as your
incrementing scheme) while an odd 2nd source operand could
specify either replacement or use of the even partner. It
does have the relative disadvantage of not providing new
destination registers for a series of operations in which
many intermediate results are preserved.
If one is willing to accept delay (and even more compiler
complexity!), one might even consider including the opcode
or the 1st source operand in a decision of which bit to
set or begin the (shifted) increment. (Adding complexity
is relatively easy, I suppose.)
Paul A. Clayton
http://home.earthlink.net/~paaronclayton
Nick Maclaren wrote:
It was work on reorganizing binaries on PA-RISC, improving performance
by reorganizing basic blocks to remove branches. It wasn't very
interesting performance-wise since they only saw increases in the 5%-10%
range (IIRC) but it did demonstrate organizing a cache of native code.
Transmeta has to do things like this internally, and JITs do it all the
time. The simple trick is to have any attempt to execute outside of the
currently cached code cause that external code to be brought into the
cache, possibly evicting less-recently-used cached code. It's similar to
paging.
Hand-waving (I'm not a processor architect) I imagine the scheme would
work something like... the compact-code to native-code converter
(CCNCC?) outputs to a bounded buffer containing at least enough native
code to keep the lookahead facilities fed. Branches outside the buffer
would be traps causing much the same problems as an Icache miss does
anyway. The CCNCC needs to go fetch and translate some compact code
PDQ. The better integrated the lookahead logic is with the CCNCC the
earlier the CCNCC can go fetch code for translation.
I've got data for that too for large applications, and while the average
isn't anywhere near the maximum 16, the code-size saving of having LDM
is about 15-25%... LDMs could allow at most 8 registers rather than 16
and still get 99% of the benefit.
>Given the
> extra bits needed to spec which regs to load or store
> in a LDM/STM, one can imagine an ISA where an equivalent
> series of LD or ST ops would not take many more bits
> to encode. Given that you could then remove all the decode
> and pipeline cruft associated with LDM/STM, this might
> be a win even in the face of the slightly higher I-Cache
> miss rate. I'd have to build a simulator and run some
> workloads to know for sure though, and I don't do that
> kind of stuff for a living, although I used to.
Splitting LDM/STM into a sequence of loads would be slower
unless you can execute at least 2 loads/stores per cycle (current
ARMs do 2 registers per cycle, future ones may do 4). Power
would be a LOT worse, unless you add all the cruft back in...
You could split into LDRD (2 registers per cycle) but then you
need to handle 4-byte aligned addresses efficiently to get the
performance back (and that is just as complex as doing general
LDMs).
LDMs definitely improve performance as they can more efficiently
deal with unaligned data. They are essential for SIMD where you
need to read unaligned structures and swizzle the data into vectors.
> "Efficiency" can mean efficiency in use of gates, or time,
> or power, or engineer-years to get it out the door. :-)
> Soon, I think, only the latter two will really matter
> in CPUs designed for full-featured OS's.
Constraints like these work differently in the embedded world.
For example adding a feature to improve codedensity will add gates,
use extra power in the decoder and take extra engineering time if
you just consider the core on its own. But if you consider the complete
system you may actually save gates, power, and engineering time.
Wilco
OK, I understand. Thanks for the explanation. Yes, it is another
alternative with some advantages and some disadvantages over my original
proposal. Ahh, tradeoffs! :-)
That's exactly what Visual C -- and I presume other commercial
compilers -- is doing when compiling applications with POGO:
* Compiler decides which functions should be optimized for speed,
and which for space; usually only ~5-15% of all functions are
optimized for speed.
* Compiler makes inlining decisions based on profile results,
not on the usual heuristics. That means that we inline
more aggressively in the very hot functions, less aggressively
in the "normal" functions, and only when saving size in the
cold functions.
* Compiler can often speculatively inline virtual functions,
thus fixing part of the OOP-style problems.
* Compiler uses profile results to lay out functions.
* Compiler uses profile results to lay out code inside functions.
(I listed only the optimizations relevant to the current discussion;
many more optimizations benefit from profiling data).
Large server applications benefit from POGO much more
than SPEC benchmarks. SQL Server gains more than 30%.
You can dig out that number from official MS document at
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dv_vstechart/html/profileguidedoptimization.asp
Thanks,
Eugene
>
> Regards,
> Nick Maclaren.
> Splitting LDM/STM into a sequence of loads would be slower
> unless you can execute at least 2 loads/stores per cycle (current
> ARMs do 2 registers per cycle, future ones may do 4). Power
> would be a LOT worse, unless you add all the cruft back in...
PowerPC 970 does two loads or stores per cycle, and LDM/STM are
internally split into pairs of two loads or stores. A pair of loads or
stores has the advantage that another two instructions can be dispatched
simultaneously, so LDM/STM is definitely not the fastest way.
Sun's compile-time/link-time/binary optimizers do the code reordering,
and our top ISVs use them to reduce I$ misses substantially
for their applications.
A new article posted at:
http://developers.sun.com/prodtech/cc/articles/codelayout.html
has a very graphical illustration of the effect of such optimizations
and how to use them.
--
#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/"
Precisely. That has been possible since time immemorial; the insoluble
problem is to get realistic optimisations (factors of 2-10) on
'ordinary' (i.e. convoluted) code.
Regards,
Nick Maclaren.
No, I am NOT talking about inlining - I am talking about code
reordering. As Seongbae Park says, Sun's compiler does something
along those lines, but I know of few others that do, and even Sun's
compiler doesn't tackle the whole of this problem.
Regards,
Nick Maclaren.
I believe that he was talking about Power as in Watts :-)
LDM/STM are microcoded, so they can't share dispatch groups with other
instructions. On the other hand, they pump out 4 load/stores per group.
But then again, there is a 3 clock dispatch bubble after an LDM/STM. And
there is a 1 clock bubble when a microcoded instruction follows a
non-microcoded one. See section 6.3.3.4 in the 970FX manual.
BTW, the POWER5+ is going to add quad word load instructions (presumably
for FP). This is really weird as the Gigaprocessor lineage has carefully
avoided microops with 2 destinations (of the same flavour) previously.
One way this could work, would be to put a different value in each copy
of the FP register file. Special instructions would then be needed, that
would double dispatch to each FP pipe and inhibit forwarding of results
to the opposite copy of the FP pipe. (Alternatively forward to the
opposite pipe, but inhibit storing to the local register file.)
Unfortunately, the slides with the info about the quad load instructions
make no mention of new FP arithmetic instructions.
The POWER5+ is supposed to be out RSN, so we should know what is
actually done pretty soon.
--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
I was indeed. Splitting LDMs in hardware like that is the right thing to do,
as you could still mark them as being sequential to get the power saving
(no need to check the TLB or look up the cache line N times unless you fall
off the end of a page/line). A substantial number of accesses use LDM on
ARM, as LDMs can also be used for loading structure fields, passing
structures, memcpy, memset etc.
> LDM/STM are microcoded, so they can't share dispatch groups with other
> instructions. On the other hand, they pump out 4 load/stores per group.
> But then again, there is a 3 clock dispatch bubble after an LDM/STM. And
> there is a 1 clock bubble when a microcoded instruction follows a
> non-microcoded one. See section 6.3.3.4 in the 970FX manual.
So it looks like LDM/STM are not faster than the equivalent sequence of
loads and stores on the 970... Are they still used a lot by PPC compilers or
do they always split them into loads/stores?
Wilco
The way I understood it was that we have to take cognizance of the
relative cost of memory versus the cost of chip size, and the
limitations of our buses, to determine how to most effectively make use
of them.
Making a superscalar design wider by sticking in multiple cores indeed
does cause problems; since communications to main memory are slow and
limited, it means each thread needs its own chunk of the cache. So,
instead of making the widest possible superscalar design, it is
preferable to ensure that there is lots of cache for each core first,
and decide how many cores to put in on that basis.
That makes sense to me. The new Cell design attempts to ration
communication by only allowing the master core to access main memory,
while the subsidiary cores can only access their own private on-chip
memories. Thus, one doesn't have to add circuitry to the chip that might
manage contention between processors for access to memory badly;
instead, the compiler can take plenty of time to work out the optimal
strategy.
I certainly don't think he was claiming there was no hope to improve
computer performance, just that the right direction to go in is
strongly influenced by what is available.
John Savard
> >However, market forces
> >will probably cause the ISA to be hidden underneath translation hardware.
> Software translation, PLEASE! :-)
Sorry, market forces dictate that if you want to build a computer in a
box and sell it, it must run Microsoft Windows without fuss, muss, or
bother, including all DRM features.
Hence, there shall ever be one ISA to rule them all.
John Savard
unfortunately the link-time / binary optimizer only works with
applications, not shared libraries, which makes it of rather
limited value.
--
Sander
+++ Out of cheese error +++
You may be surprised, but inlining *has* something to do with code
locality. And inlining was only part of my original message -- I
explicitly wrote that Visual C is reordering code depending on
profiling results. We are splitting the code into hot and cold regions,
reordering function layout depending on the dynamic call graph, and
inside each function we reorder the code to improve locality as well.
I know things that can be improved, and we may address some of the
issues in the next compiler version, but even current implementation
helps large server applications a lot -- much more than it helps SPEC
benchmarks.
Of course Dennis can respond himself to clarify his meaning, but I was
reacting to the following excerpt from his post
Quoted from Dennis O'Connor
Now we are seeing a change within the processor logic itself:
the ratio of logic speed to communication speed is changing,
as is the ratio of power dissipated by wire loads as opposed to gates.
6 years ago, on-chip communication delays were small compared
to logic delays, and the power dissipated by wide busses going
across a chip could be neglected. Not so now. Communication
is now expensive in both time and power, the two things you
shouldn't be wasting in a modern CPU. This is what will eventually
kill wide superscalar designs: communication overhead. And I contend
that this will lead to a different "optimal" ISA for processors, an ISA that
minimizes the need for wasted communication. However, market forces
will probably cause the ISA to be hidden underneath translation hardware.
end quotation
snip
I don't necessarily disagree with your position - I don't claim enough to
know. But I do think it was responding to a different problem than Dennis
talked about, as above.
> I certainly don't think he was claiming there was no hope to improve
> computer performance, just that the right direction to go in is
> strongly influenced by what is available.
On that, I think we all agree! :-)
HP's HP-UX compilers have been doing that for more than a decade,
and have been used by the kernel and many ISVs (including the major ones).
Some of the DEC research papers on I Cache performance and subsequent
(post link-time) optimizations would have had different results
if the GEM compiler had implemented those techniques at the time.
So don't bother reiterating your opinion that those probably "don't
tackle the whole of this problem". People who keep complaining
about youngsters not knowing their computer history should have
a better grasp of the current state of technology.
Profile-based optimizations (aka feedback-based, aka POGO) work well
in practice, despite the thought experiments we can all make. My own favorite
example is a profile of date-processing code being collected during a leap year
- and making the mistake of only using dates from the current year -
leading to 'wrong' profile information three years out of four.
Practice shows that almost all branches are very heavily biased,
that simple training runs are typically sufficient to obtain good-enough
information about those biases (or the absence of bias) and that modern
micro-architectures (BTB and caches) are doing a decent job of compensating
for compiler imperfections in predicting hot paths.
Every time PBO is mentioned in this newsgroup, some posters (who
should recognize themselves) will claim horror stories and theoretical
failings. Those are doing a disservice to software users. They are
the computer industry equivalents to those so-called ecologists
throwing specious arguments against sustainable energy sources
("windmills as bird killers", "photovoltaic cells require
more energy than they produce", "photovoltaic cells won't be
enough at night, so why bother") or recycling ("styrofoam more
friendly than recycling glass"). In both cases interesting
theoretical studies or rare special cases are employed to detract
attention from the overall results.
Eric
Nick Maclaren wrote:
> In article <7zhdexr...@app-4.diku.dk>,
> tor...@app-4.diku.dk (=?iso-8859-1?q?Torben_=C6gidius_Mogensen?=) writes:
> |> >
> |> > In the past, I have estimated that you could get a 50-75% reduction
> |> > by a two-level ISA, where the top level was designed for the high
> |> > level language. Designing ISAs for ease of code generation was
> |> > first proposed in the late 1960s, as far as I know, but was never
> |> > done in a mainstream system,
> |>
> |> You can argue that Digital's VAX did this.
>
> Only feebly. It was a very bad design if that was an objective,
> for reasons that were well-known before 1975.
Even a casual inspection shows that the VAX instruction set was
modelled after the imagined needs of high level languages (string
processing and diverse numeric types come to mind, as well as several
instructions which correspond directly to HL control constructs, such
as CASE. I'm away from my VAX references so I won't try to dredge up
the rest from memory).
In hindsight the effort largely went to waste; compiler designers by
and large shunned the "assists" - ironically, ignoring them probably
made the compiler simpler - and two decades later we know RISC was the
better direction. I doubt that was considered proven in 1975, if that's
what you're getting at. The VAX was one of the most ambitious (albeit
doomed) experiments in the CISC direction.
>
> |> > and hasn't been attempted at all in recent decades.
> |>
> |> Mainly because, at this level, code generation is fairly trivial.
>
> Yes. At the cost of larger code, much more difficulty debugging,
> and making it much harder to write correct interrupt handling.
These may be contributing factors to VAX compiler design mentioned
above.
>
> |> 70's minicomputers often used interpreted microcode, which is bad for
> |> performance. What you see in P4 etc. is compiled microcode where each
> |> instruction is decoded by a combinatoric circuit and translated into
> |> microinstructions all in hardware. This is quite different from
> |> microcode that would decode instructions in software (or "firmware",
> |> as it was often called).
>
> Hmm. What makes that different from the argument that interpreted
> and compiled programming languages are entirely different? That
> viewpoint is generally agreed to be misleading.
>
> |> In any case, there are many simple ways to increase code density
> |> without seriously complicating implementation or impacting
> |> performance: ...
>
> The techniques that I am thinking of could actually SIMPLIFY
> implementation! They would do that by removing much of the need
> for the baroque tweaks (often involving interrupt handling) to
> deal with the cases that are almost, but not entirely, handled in
> hardware.
I have no doubt you're right - I'm just an amateur observer in these
matters - but I just wanted to add some context to your remark on the
VAX.
--Toby
>
>
> Regards,
> Nick Maclaren.
> But ARM has also been very slow to go superscalar relative
> to MIPS, PowerPC, and Alpha. (There are other reasons for
>this, but I suspect that the ISA makes a superscalar
>implementation more challenging.)
ARM is perhaps not the simplest RISC to make superscalar, however
there is nothing in the instruction set that makes it extremely difficult.
The most complicated bit I imagine is being able to execute 2 flag-setting
instructions in parallel and merging the resulting flags. I believe that the
first superscalar ARM will easily outperform similar 2-way in-order
designs like the MIPS20k or PPC440 series.
I guess the main reason that it didn't happen sooner is that there was no
demand for a serious high-end ARM until recently (both Samsung and
Intel demonstrated 1 GHz ARM prototypes 3 years ago, but they haven't
gone to production). High performance means using long pipelines,
and this is bad for power. The only option to remain power efficient
is to improve CPI, and dual-issue is the obvious choice.
Wilco
The IBM RT had two forms of the shift instruction. One was the usual
two operand form,
R_n = R_n << op2
The other was called a "paired" shift and worked like
R_(n xor 1) = R_n << op2
The RT was an example of the kind of density improvement we are
talking about here. It had a mix of 2 and 4 byte instructions.
The two byte instructions included loads and stores with 4 bit
scaled offsets, three operand add, two operand ALU ops, and
load register with 4 bit constant. Using shorter instructions
potentially reduced contention between instruction and data
access; there was no cache. (Arguably, there was a 16 byte
instruction cache. Tight loops that fit in it didn't need instruction
fetches after the first iteration.)
--
John Carr (j...@mit.edu)
> Right. For example one would like to be able to define instructions
> or trap semantics for dealing with tags. [Shame that the SPARC
> hard-wired the integer tag pattern to 00, instead of making it part of
> the per-process state; our Smalltalk implementation still doesn't use
> the tagged instructions because we find non-zero immediate tags more
> convenient across all our platforms].
Is there a publicly available description of the tag patterns used in
your Smalltalk implementation?
Thanks,
Carl
> Seongbae Park <Seongb...@sun.com> wrote:
>> Sun's compile time/link time/binary optimizer do the code reordering,
>> and our top ISVs use them to reduce I$ miss substantially
>> for their applications.
>>
>> A new article posted at:
>>
>> http://developers.sun.com/prodtech/cc/articles/codelayout.html
>>
>> has a very graphical illustration of the effect of such optimizations
>> and how to use them.
>
> unfortunately the link time / binary optimizer only works with
> applications, not shared libraries, which makes it be of rather
> limited value.
No doubt displaying my ignorance here, but it would seem to me that
something like a database server (Oracle in the example) would be amenable
to whole-program optimization and static linking, and have no need of
shared libraries. If there's only one application to run, there's no one
to share with. No?
--
Andrew
Well, of course it does. When did I ever say anything different?
However, even a simple analysis shows that it is generally a poor
way of improving locality, both because the bloat it causes reduces
cache efficiency and because 'inlinable' functions are not the
dominant cause of locality problems. To take a trivial example, if a
function uses a lot of strcmp calls, the best solution is likely to
be including strcmp as an out-of-line internal function with a
non-standard interface. Some compilers have done this, but I don't
know how common it was or is.
>And inlining was only part of my original message -- I
>explicitly wrote that Visual C is reordering code depending on
>profiling results. We are splitting the code into hot and cold regions,
>reordering function layout depending on the dynamic call graph, and
>inside each function we reorder the code to improve locality as well.
Ah! I had misunderstood. Yes, it was the latter aspects I was
referring to - I have never used Microsoft's Visual C (which I assume
is unrelated to IBM's!), but clearly need to take a look.
Regards,
Nick Maclaren.
Amazing. I last used an HP-UX system less than a decade ago, and
those facilities were extremely limited and not the sort of thing
I was referring to. When I last looked at their specifications
(quite recently), I didn't see anything that sprang out as doing
what I am referring to.
Can you state more precisely what you are referring to, so that I
can track it down?
> So don't bother reiterating your opinion that those probably "don't
>tackle the whole of this problem". People who keep complaining
>about youngsters not knowing their computer history should have
>a better grasp of the current state of technology.
Well, so far there is no evidence to tell whether I have missed what
HP have done or whether you have misunderstood what I am talking
about. Please note that I am asking for better information, to
check this up - you might like to do the same.
Please also note that what I am referring to was NOT done by older
compilers, except in a few special cases. We knew that it was
possible but not worthwhile given the limited resources (mainly
memory) available at the time.
> Profile-based optimizations (aka feedback based, aka POGO) work well
>in practice, despite the thought experiments we can all make. ...
That is not the experience of most of the people I know of who have
tried them (including me, of course).
>Practice shows that almost all branches are very heavily biased,
>that simple training runs are typically sufficient to obtain good-enough
>information about those biases (or the absence of bias) and that modern
>micro-architectures (BTB and caches) are doing a decent job of compensating
>for compiler imperfections in predicting hot paths.
Jesus wept. Let me TRY to explain.
That assumption, and the usefulness of profile-based optimisations
in general, depends very much on the characteristics of code locality,
execution paths etc. being dependent more on the code being executed
than on the data being presented.
Experience is that this tends to be true for commercial codes (and
similar uses of kernels), but much less true for scientific ones.
Scientific codes tend to be far more extreme, are thus both very
optimisable and very pessimisable, and both the locality and the code
paths are very dependent on the data. The same is true for GUIs etc.,
which are so unscientific that they have wrapped around and reach
a similar point from the other direction!
For example, it is NORMAL for the memory characteristics and even
the code paths to be entirely different for different sizes and
shapes of data, often in ways that are not easy to predict. It is
NORMAL for unexpected alignments at high powers of two to cause
major performance issues, and these may be triggered by data
differences that have nothing to do with size.
Plus, of course, there are the shared library and related issues, which are
often closely tangled up in where the time goes (think BLAS).
> Everytime PBO is mentioned in this newsgroup, some posters (who
>should recognize themselves) will claim horror stories and theoretical
>failings. Those are doing a disservice to software users. ...
Most such people do so mainly as a counterbalance to the snake oil
pedlars who claim that profile-based optimisations are a nearly
universal solution to all sorts of performance problem.
Well, they ain't.
Regards,
Nick Maclaren.
Oh, yes, indeed. It was. But the designers had not investigated the
real requirements carefully enough, and got it wrong. Sorry, but ....
Regards,
Nick Maclaren.
Nick Maclaren wrote:
> In article <1121540813.3...@z14g2000cwz.googlegroups.com>,
> toby <to...@telegraphics.com.au> wrote:
> >
> >Even a casual inspection shows that the VAX instruction set was
> >modelled after the imagined needs of high level languages...
>
> Oh, yes, indeed. It was. But the designers had not investigated the
> real requirements carefully enough, and got it wrong. Sorry, but ....
That's why I used the word "imagined" :-)
--Toby
>
>
> Regards,
> Nick Maclaren.
I had assumed that :-)
The ARM project took a very similar approach, in some respects, but
had done their homework a lot better. While the VAX was a success,
one cannot honestly say that it was because of the ISA. One could
say that about the ARM (though I am not saying whether or not I
believe it).
Regards,
Nick Maclaren.
> Even a casual inspection shows that the VAX instruction set was
> modelled after the imagined needs of high level languages (string
> processing and diverse numeric types come to mind, as well as several
> instructions which correspond directly to HL control constructs, such
> as CASE. I'm away from my VAX references so I won't try to dredge up
> the rest from memory).
>
> In hindsight the effort largely went to waste; compiler designers by
> and large shunned the "assists" - ironically, ignoring them probably
> made the compiler simpler - and two decades later we know RISC was the
> better direction. I doubt that was considered proven in 1975, if that's
> what you're getting at. The VAX was one of the most ambitious (albeit
> doomed) experiments in the CISC direction.
IIRC the initial U of Washington Pascal compiler for the VAX was based on
a (the U of W?) Pascal compiler for the CDC 6600 and generated code that
used the most 6600-like instructions and addressing modes of the VAX
perhaps like
a := b + c
load 1, b
load 2, c
add 2, 1
store 2, a
IIRC a later version of the compiler generated more VAX-like instructions
and addressing modes
perhaps like
add b, c, a
Does anyone else remember this?
Does anyone have any records of sizes and performances of the two kinds
of output?
JKA
[Heavily encoded architectures like the VAX made sense when microcode
ROM was faster than RAM and RAM was expensive. Unfortunately, that
era ended around the time the VAX was introduced making it an instant
dinosaur. -John]
I do not see this limitation mentioned in the page above. I do not
quite understand why such a limitation would exist. FYI there is no
such limitation in HP's compilers and tools.
> No doubt displaying my ignorance here, but it would seem to me that
> something like a database server (Oracle in the example) would be amenable
> to whole-program optimization and static linking, and have no need of
> shared libraries.
- Oracle does use shared libraries, even though most of the code
is still in one binary, the 'oracle' executable.
- As a point of comparison DB2 uses shared libraries quite heavily.
I am not intimate with DB2 but I believe that the "core" logic
leaves in a shared library.
- Oracle is frequently re-linked at customer sites and ships
as a collection of .a files. Whole-program optimizations are pretty
much forbidden by this requirement.
- static linking is very much out of favor these days.
I know of at least two OSes where the user/kernel syscall ABI
is not public (although trivial to reverse engineer). Every
executable is supposed to link the C library shared, except for
a few vendor-supplied binaries used for system recovery purposes
(/sbin).
Yet it's not really relevant to the problem here.
> If there's only one application to run, there's no one
> to share with. No?
I don't understand what you mean. Given that all the high-end
Unix DB servers are running as a collection of processes, you at
least share with other instances of the same executable. There
are fewer differences between a 'main' binary and a shared library
than your question seems to imply.
Well-engineered shared library usage is not an issue. By that
I mean executables using few "large" libraries, each library
offering a small API and only exporting those API functions,
with most calls being made intra-module, and with few load-time
initializations.
Eric
Eric Gouriou wrote:
> Andrew Reilly wrote:
[...]
>>>> http://developers.sun.com/prodtech/cc/articles/codelayout.html
[...]
>> No doubt displaying my ignorance here, but it would seem to me that
>> something like a database server (Oracle in the example) would be
>> amenable
>> to whole-program optimization and static linking, and have no need of
>> shared libraries.
>
>
> - Oracle does use shared libraries, even though most of the code
> in still in one binary, the 'oracle' executable.
>
> - As a point of comparison DB2 uses shared libraries quite heavily.
> I am not intimate with DB2 but I believe that the "core" logic
> leaves in a shared library.
^^^^^^ lives
> - Oracle is frequently re-linked at customer sites and ships
> as a collection of .a . Whole-program optimizations are pretty
> much forbidden by this requirement
I forgot to mention:
On HP-UX (at least), Oracle ships with a link order file - a file
indicating to the linker how procedures should be laid out in the resulting
binary. This is done so that customers can benefit from those link-time
optimizations described in the page above, even though the link
step is often redone on site.
The profile-based intra-procedure layout optimizations are done
at compilation time and present in the .a files.
This Oracle-specific requirement disallows some aggressive "whole binary"
optimizations (+O4), yet many benefits of profile-based optimizations
are still being delivered to our customers.
Eric
On commercial code, how much performance difference does having
32 vs 16 registers make? I would guess it's not much given that
basic blocks tend to be quite small, 6 to 10 inst.
It seems to me that most of these dual format wide-narrow inst.
encoding problems arise when you try to shoehorn a 32 register
RISC cpu into a 16 bit inst. format. 16 registers allows an
8-4-4 format which gives complete register bank coverage and
a large opcode space.
So it would seem that if one wants dual format instructions,
you need to limit to a 16 register set.
Eric
> Potential. They aren't easy to design or use, and are very out of
> fashion. One thing that most vendors seem to have missed is that
> they are a natural for profile-based optimisation, which some of us
> were doing in the 1970s for overlay designs. That could and should
> be reinvented. It shouldn't be patented, but probably will be.
My impression (mostly based on reading PLDI, ISCA, and ASPLOS proceedings
and other rumors here and there) is rather that such techniques have become
pretty much standard, except maybe for gcc.
This is based on no hard fact at all, but I'd be very surprised to hear that
it's only rarely available.
Stefan
If you wanted to, you could design your own microcode and load it at boot
time, so you could have written microcode that directly executed
Fortran or BASIC or other languages.
I myself took a VAX 11/750 course and wrote a memory test program that ran
from microcode.
rtt
"John Ahlstrom" <ahlst...@comcast.net> wrote in message
news:05-0...@comp.compilers...
Much earlier work (early 1970s) found that almost no compilers used
more than a relatively few fairly simple instructions, even when
there were ones specifically designed for fancy addressing. The
reason was that the complexity of mapping the language constructs to
the instruction formats was excessive relative to the gain.
Exactly what proportion of this was because complicated instructions
tended (even then) to be slow and what was because compiler design
was and is based on a process of dismantling and reconstruction, I
can't say. I can remember being unconvinced by the arguments in the
papers (which I cannot remember references to).
Exactly where the optimum point lay (or lies) between extreme RISC
and (say) the VAX has varied in time, but has never been clear.
However, I have never known a time when the VAX level of complexity
was close to optimal (rather than too complex). The 68000 or even
the 68020, yes, but no further.
Regards,
Nick Maclaren.
FS in the early 70s went over to being super complex (but it was canceled
without ever being announced ... although possibly billions were spent on
it before it was canceled)
http://www.garlic.com/~lynn/subtopic.html#futuresys
I've periodically commented that 801/RISC was in large part a reaction to
the Future System failure ... swinging in the exact opposite direction.
There were periodic comments in the mid-70s about 801/RISC consistently
trading off software (& compiler) complexity for hardware simplicity.
http://www.garlic.com/~lynn/subtopic.html#801
What do you think they got wrong with the VAX ISA? PL/I code generation
on the VAX was certainly far easier than on the Alpha, which appeared to have a
C model in mind.
>
>
> Regards,
> Nick Maclaren.
snip
> On commercial code, how much performance difference does having
> 32 vs 16 registers make?
I don't know the answer to that. But I do note that a design probably should
not be targeted just at commercial code, though it is a relatively large
marketplace. Embedded is much larger, but even scientific (HPC) code, which
presumably needs lots of registers, is important to think about.
> I would guess its not much given that
> basic blocks tend to be quite small, 6 to 10 inst.
Could be. I just don't know, though I do remember that there was some
research on this.
> It seems to me that most of these dual format wide-narrow inst.
> encoding problems arise when you try to shoehorn a 32 register
> RISC cpu into a 16 bit inst. format. 16 registers allows an
> 8-4-4 format which gives complete registers bank coverage and
> a large opcode space.
But that still means two-register instructions, which necessitate at least
occasional "extra" register move instructions to preserve values for
subsequent use.
> So it would seem that if one wants dual format instructions,
> you need to limit to a 16 register set.
Perhaps. There is certainly interaction between the decisions on number of
registers and the issue of two versus three register specifiers. For
example, if you wanted 32 registers and were willing to sacrifice three
register instructions, you could do something like the following:
There are 32 registers that are logically divided into four "banks" of eight
registers each. The instructions contain a two bit "register bank
designator", which specify which of the four banks the destination and, for
most instructions, the source register are in. Thus for most instructions,
you need two plus two times three or a total of eight bits to specify the
registers. Note taht this is the same as your proposed design uses. One
instruction uses two other bits to fully specify a source for a register
move to get values between banks, but that should be needed infrequently..
This represents a compromise that might be workable, but I certainly have
not tried to work out all of the implications.
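A hypothetical decode sketch of that bank scheme (the field positions are made
up; only the 2 + 3 + 3 = 8 bits of register specification come from the
description above):

  #include <stdint.h>

  typedef struct { unsigned dst; unsigned src; } reg_pair;

  /* Decode a 2-bit bank designator plus two 3-bit within-bank register
     numbers into absolute register numbers 0..31. */
  static reg_pair decode_banked(uint32_t inst)
  {
      unsigned bank = (inst >> 6) & 3u;   /* 2-bit register bank designator */
      unsigned rd   = (inst >> 3) & 7u;   /* 3-bit destination within bank  */
      unsigned rs   =  inst       & 7u;   /* 3-bit source within bank       */
      reg_pair r = { bank * 8u + rd, bank * 8u + rs };
      return r;
  }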
Three things, all of which were well-known much earlier:
1) The more extreme operations (e.g. polynomial evaluation) were a
pointless complication, as they would not benefit more than a
trivial proportion of executed code (even numeric code), and would
not be generated by compilers (i.e. would be useful only for hand
coded assembler).
Similar remarks can be made about the packed format (even IBM was
backing off it by then) and the plethora of floating-point formats.
There were better software technologies for dealing with such
requirements.
2) Compilers almost never use very complex addressing modes or
instructions, as compiler technology is based on breaking down HLL
code into simple primitives and matching that to the architecture.
That doesn't mean that some extra complication here doesn't pay,
but that the VAX level doesn't.
3) The very complex addressing modes and operations are an absolute
nightmare in signal handlers, when dealing with SMP, TLB miss handling
and so on - and don't even think about going out-of-order if the decoder
can't even tell what location may be used for address calculations.
Regards,
Nick Maclaren.
>>Even a casual inspection shows that the VAX instruction set was
>>modelled after the imagined needs of high level languages (string
>>processing and diverse numeric types come to mind, as well as several
>>instructions which correspond directly to HL control constructs, such
>>as CASE. I'm away from my VAX references so I won't try to dredge up
>>the rest from memory).
(snip)
Two that I remember from the early days of VAX were the BOUND and POLY
instructions, for array bounds checking and polynomial evaluations.
Both were, at least on the 780, slower than code not using those
instructions. As a result, there was no reason for compilers to
ever use them.
> [Heavily encoded architectures like the VAX made sense when microcode
> ROM was faster than RAM and RAM was expensive. Unfortunately, that
> era ended around the time the VAX was introduced making it an instant
> dinosaur. -John]
The VAX 512 byte page size was also a result of RAM being considered
expensive.
I am now considering your statement in terms of S/360.
S/360 used a variety of different microcode ROM systems,
I believe different for each model. One used capacitors which could
be punched on a regular card punch, another used one turn transformers,
which could also be punched. (Maybe only one of those could use a
regular card punch.) The expense, then, was in the addressing and
decoding logic. RAM was magnetic core which was also pretty expensive.
For the S/360 ROMs, though, the cost would not be linear with storage
size, but closer to log of the storage size.
Though the solution when RAM is cheaper and faster than ROM is to
store microcode in RAM. The floppy disk was invented as the storage
media for loadable microcode...
-- glen
I skipped a number of steps in my explanation. I have been pondering
this question for a number of years. When I first saw the Alpha inst
set back in 1992 I liked it for the most part but I was bothered by
the "large number of zeros" in it. So I began doodling on a dual
format inst set.
- There is no mode bit. Both 32 bit wide and 16 bit narrow inst
formats coexist.
- Instructions come in a 64 bit aligned 'package'
- Each package contains two 32 bit instruction 'slots'
- Each slot contains either one 32 bit wide instruction or
two 16 bit narrow instructions. The opcode fields for wide
and one of the narrow inst overlap so you can tell which format
it is by looking at the slot's opcode field.
|31....16|15.....0|
wide | opcode|
narrow | opcode| opcode|
- Halt, Noop and Pause inst all use 16 bit formats so you
can always pad unused portions to align on a boundary.
- All slots in a 64 bit package must be filled in, possibly
with Noops, so a whole package can decode at once.
- There are both 3 operand wide inst and 2 operand narrow forms,
and these can be intermixed. This produces some redundancy in
the inst set but nothing too nasty. Obviously an
ADD3 R1=R1+R2 is the same as ADD2 R1+=R2
One consequence of not having a mode flag is that the opcodes
must coexist so you start running out of values. A 6 bit primary
opcode is not big enough to handle all the inst, which means
that it can't have a 5 bit register number so one must consider
alternatives.
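A rough C sketch of cracking such a package (the actual opcode test below is
invented; the point is only that the slot's own opcode bits, not a mode flag,
say whether the slot holds one wide or two narrow instructions):

  #include <stdint.h>

  /* Hypothetical rule: a slot whose low opcode field has its top four bits
     set holds two 16-bit narrow instructions; otherwise it is one 32-bit
     wide instruction.  The real overlap rule would be whatever the opcode
     map defines. */
  static int slot_is_narrow(uint32_t slot)
  {
      return ((slot >> 12) & 0xFu) == 0xFu;
  }

  static void crack_package(uint64_t pkg,
                            void (*emit_narrow)(uint16_t),
                            void (*emit_wide)(uint32_t))
  {
      for (int s = 0; s < 2; s++) {
          uint32_t slot = (uint32_t)(pkg >> (32 * s));
          if (slot_is_narrow(slot)) {
              emit_narrow((uint16_t)slot);          /* bits 15..0  */
              emit_narrow((uint16_t)(slot >> 16));  /* bits 31..16 */
          } else {
              emit_wide(slot);
          }
      }
  }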
> > So it would seem that if one wants dual format instructions,
> > you need to limit to a 16 register set.
>
> Perhaps. There is certainly interaction between the decisions on number of
> registers and the issue of two versus three register specifiers. For
> example, if you wanted 32 registers and were willing to sacrifice three
> register instructions, you could do something like the following:
>
> There are 32 registers that are logically divided into four "banks" of eight
> registers each. The instructions contain a two bit "register bank
> designator", which specify which of the four banks the destination and, for
> most instructions, the source register are in. Thus for most instructions,
> you need two plus two times three or a total of eight bits to specify the
> registers. Note taht this is the same as your proposed design uses. One
> instruction uses two other bits to fully specify a source for a register
> move to get values between banks, but that should be needed infrequently..
>
> This represents a compromise that might be workable, but I certainly have
> not tried to work out all of the implications.
Another option would be to have 32 registers for wide inst
and limit access to regs 0..15 for narrow inst. Then have two
narrow inst that MovHighToLow and MovLowToHigh to access between
the two sets on the rare occasion that was needed.
Or you could just use a wide 3 operand form.
Coding standards would define R15 and R14 as the stack and
frame pointers so they would always be accessible,
and R0..R6 for function value and arg passing.
The compiler could choose whether to emit wide inst
which can access all registers and have more options,
or narrow inst which pack in tightly together.
But it seemed worthwhile to question whether those extra
16 registers are worth keeping in the first place.
It is hard to tell whether something like this would be a benefit
without building a compiler for the hypothetical ISA and running
sample code through it.
Eric
Stephen> This represents a compromise that might be workable, but I
Stephen> certainly have not tried to work out all of the implications.
For scientific code, try writing a blocked matrix multiply with
register sets like that. My guess is that you're going to find you
need a lot of bank-to-bank copies.
This is not a lot of work. Take a look at the worked example at
OK
snipped more detailed explanation
I understand. It seems you have put a lot of thought into this.
snip
> Another option would be to have 32 registers for wide inst
> and limit access to regs 0..15 for narrow inst. Then have two
> narrow inst that MovHighToLow and MovLowToHigh to access between
> the two sets on the rare occasion that was needed.
Isn't this essentially what ARM/Thumb does, but with 8 for the narrow and 15
for the wide?
> But it seemed worthwhile to question whether those extra
> 16 registers are worth keeping in the first place.
Asking the question is usually worthwhile, even if the answer confirms your
initial beliefs. :-)
> It is hard to tell whether something like this would be a benefit
> without building a compiler for the hypothetical ISA and running
> sample code through it.
If your proposed ISA were close enough to an existing ISA, then perhaps you
could go through compiler output and see how many registers are really
needed. ISTM that registers are used for other things than "direct
computations". They are used as quickly available temporary storage, and
with memory, and even caches, getting farther away (in time) from the ALU, that
use becomes more and more important. But I don't know enough to evaluate
the effect of this.
One of my previous compilers (for an embedded system that used a VAX-
compatible CPU) emitted it. That pattern happened often enough in
the customers' code, and the customer was happy that we used it.
An equivalent sequence of simpler instructions probably would have worked
as well, but one of the greatest concerns was code size. For that
reason we had to use the VAX general function call mechanism instead
of a much faster one that used custom calling conventions but required
an extra instruction or two per function.
> Similar remarks can be made about the packed format (even IBM was
> backing off it by then) and the plethora of floating-point formats.
> There were better software technologies for dealing with such
> requirements.
>
> 2) Compilers almost never use very complex addressing modes or
> instructions, as compiler technology is based on breaking down HLL
> code into simple primitives and matching that to the architecture.
> That doesn't mean that some extra complication here doesn't pay,
> but that the VAX level doesn't.
The above-mentioned compiler used such addressing modes. It's not
hard to do global pattern matching over an entire function if you are
ready to implement such a pass, and you need such a pattern-recognition
pass to fully utilize the (symbol + integer offset + index + scale)
addressing mode on x86 as well.
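A toy sketch of the kind of pattern matching being described (the IR node
types are invented): walking an address expression tree and folding it into a
single symbol + offset + index*scale operand where the scale is one the target
supports. A real pass would handle bases, commutativity and failure cases more
carefully.

  /* Minimal expression-tree IR, invented for illustration. */
  typedef enum { N_ADD, N_MUL, N_SYM, N_CONST, N_REG } node_kind;

  typedef struct node {
      node_kind    kind;
      struct node *left, *right;   /* for N_ADD / N_MUL      */
      long         value;          /* for N_CONST            */
  } node;

  typedef struct {
      node *symbol;                /* symbol part, or NULL    */
      long  offset;                /* integer displacement    */
      node *index;                 /* index register, or NULL */
      long  scale;                 /* 1, 2, 4 or 8            */
  } addr_mode;

  /* Try to fold a well-formed tree into *out (caller zero-initializes *out).
     Returns 1 on success, 0 if the tree does not fit the mode. */
  static int match_addr(const node *n, addr_mode *out)
  {
      switch (n->kind) {
      case N_SYM:
          if (out->symbol) return 0;
          out->symbol = (node *)n;
          return 1;
      case N_CONST:
          out->offset += n->value;
          return 1;
      case N_REG:
          if (out->index) return 0;
          out->index = (node *)n;
          out->scale = 1;
          return 1;
      case N_MUL:
          if (out->index || n->left->kind != N_REG ||
              n->right->kind != N_CONST)
              return 0;
          if (n->right->value != 1 && n->right->value != 2 &&
              n->right->value != 4 && n->right->value != 8)
              return 0;
          out->index = n->left;
          out->scale = n->right->value;
          return 1;
      case N_ADD:
          return match_addr(n->left, out) && match_addr(n->right, out);
      }
      return 0;
  }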
Thanks,
Eugene
Don't forget CRC, the instruction whose sole purpose was to
quadruple the time for backups.
> > [Heavily encoded architectures like the VAX made sense when microcode
> > ROM was faster than RAM and RAM was expensive. Unfortunately, that
> > era ended around the time the VAX was introduced making it an instant
> > dinosaur. -John]
>
> The VAX 512 byte page size was also a result of RAM being considered
> expensive.
780 processor:
- circa 1979
- used standard TTL chips as I recall.
I don't recall seeing any custom chips inside.
The cpu board had 74181 ALU's on it.
- 24 KB Writable Control Store, 2048 * 96 bit +
48 KB ROM microcode, 4096 * 96 bit
200 ns access time
- optional 24 KB customer Writable Control Store
- used "state of the art" 4Kb (as in kilo bit) DRAM chips.
Due to memory board limitations this allowed a max of 1 MB
main memory which rose to 4 MB when the 16 Kb chips came out.
Memory had single correct, double detect ECC.
- VMS 1.0 could boot in 1/2 MB, leaving 1/2 MB for users IIRC
- memory speed was something like 800 ns read, 1400 ns write
- 8 KB two way set assoc cache, 300 ns
- 128 entry TLB, 64 system space, 64 user space
- fully configured system used about 15,000 watts
Eric
The 32-bit format is buried in some places, yes. But the formats are rather simple.
32-bit, tags are the least significant two bits.
00 - normal pointer
01 - immediate character in upper 30 bits
10 - not used
11 - 2's complement signed integer in upper 30 bits
64-bit, tags are the least significant three bits.
000 - normal pointer
001 - immediate character in upper 61 bits
010 - not used
011 - 2's complement signed integer in upper 61 bits
100 - not used
101 - not used
110 - not used
111 - rotated IEEE double with 8-bit exponent in upper 61 bits
So in both the 32-bit and the 64-bit implementations bit 0 is an "is
immediate" bit, where an immediate is an object encoded entirely in a
pointer, and bit 1 is an "is SmallInteger" (is fixnum) bit; in the
64-bit implementation bit 2 is an "is SmallDouble" bit. Single-bit tests
have a code density advantage on some processors.
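In C the basic tests look something like this (just a sketch; the names
are illustrative, not the VM's actual macros):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t oop;   /* a 64-bit object pointer or immediate */

enum { TagMask = 7, TagSmallInteger = 3 /* 011 */ };

static bool isImmediate(oop o)    { return (o & 1) != 0; }           /* bit 0 */
static bool isSmallInteger(oop o) { return (o & TagMask) == TagSmallInteger; }
static bool isSmallDouble(oop o)  { return (o & 4) != 0; }           /* bit 2 */

/* SmallInteger: 61-bit two's-complement value in the upper bits.
   Encoding assumes the value fits in 61 bits; decoding assumes an
   arithmetic right shift on signed values, as common compilers provide. */
static oop     integerToOop(int64_t v) { return ((uint64_t)v << 3) | TagSmallInteger; }
static int64_t oopToInteger(oop o)     { return (int64_t)o >> 3; }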
The rotated IEEE double with 8-bit exponent in the upper 61 bits looks like
| 8-bit exponent | 52-bit mantissa | sign bit | tags |
The 8-bit exponent is 0 for +/- 0 and the 11-bit exponent - 1023 for
non-zero exponents, and so encodes +/- 0 and the central eighth of the
IEEE double range. With the rotated sign a test for +/- zero is an
unsigned test <= 15, which simplifies/cheapens determining whether the
exponent bias needs to be applied on decode.
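Continuing the sketch above, the +/- zero test is literally one compare:

/* Given an oop already known to be a SmallDouble (tags 111), +0.0 encodes
   as 0b0111 (7) and -0.0 as 0b1111 (15): the exponent and mantissa fields
   are all zero and only the rotated sign bit differs. */
static bool smallDoubleIsZero(oop o) { return o <= 15; }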
In addition, in the 64-bit implementation the referent of a normal
pointer is a two-word header which can be aligned on an odd or even
boundary modulo 128 bits, so bit 3 distinguishes two heap spaces: old
space, which is incrementally scan-marked, and perm space, which isn't.
Some of the rationale for the tag format is historical, the
implementation having evolved since 1988. It may be worthwhile
flipping the immediate bit and using 00/000 for SmallIntegers/fixnums,
but on contemporary processors memory dominates, so I doubt there's a lot
of benefit to be had.
Any similarities/jarring dissonances with other dynamic language
implementations you're familiar with?
Regards,
--
_______________,,,^..^,,,____________________________
Eliot Miranda Smalltalk - Scene not herd
All right, I was over-generalising. But it was very rarely generated,
and yours is the first example I have heard of.
>The abovementioned compiler used such addressing modes. It's not
>hard to do global pattern matching over the entire function if you are
>ready to implement such a pass, and you need such a pattern-recognition
>pass to fully utilize the (symbol + integer offset + index*scale)
>addressing mode on x86 as well.
You used the full range of VAX addressing modes? Boggle. Yes, I
can see that would help with code size, but did it make any real
difference to the performance?
You are quite right to nitpick that I overstated the case, but the
generality was that compilers did not use the VAX ISA as it was
designed, but a 68000-like subset. Even DEC ones :-)
Regards,
Nick Maclaren.
PL/I was implemented with the VCG backend, which also supported Pascal, C,
SCAN, Pearl and (I think) Coral66; it used all the addressing modes
and performed very well.
>
>
> Regards,
> Nick Maclaren.
Yes, it did. The compiled program fit into ROM...
Thanks,
Eugene
No, it's like Thumb-2 and some other mixed-width RISCs. ARM/Thumb is
different, as you can't easily switch on a per-instruction basis, so you're stuck with
whole functions using either 16-bit or 32-bit instructions.
> > It is hard to tell whether something like this would be a benefit
> > without building a compiler for the hypothetical ISA and running
> > sample code through it.
You don't need to do anything complex to evaluate a new architecture like this.
Simply compile a lot of code with your base 32-register ISA (using a good
compiler that produces small code to start with is important), and then look for
the most frequently occurring (patterns of) instructions and create small instructions
for them. It turns out you only need a few key instructions to make a big saving.
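The counting step itself is trivial. A rough sketch in C (it assumes you
have already extracted one mnemonic or operand pattern per line from the
compiler's output, however you choose to do that):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PATTERNS 4096

struct entry { char text[64]; long count; };
static struct entry table[MAX_PATTERNS];
static int nentries;

/* Tally one pattern; a linear scan is fine for a sketch. */
static void tally(const char *line)
{
    for (int i = 0; i < nentries; i++)
        if (strcmp(table[i].text, line) == 0) { table[i].count++; return; }
    if (nentries < MAX_PATTERNS) {
        strncpy(table[nentries].text, line, sizeof table[0].text - 1);
        table[nentries].count = 1;
        nentries++;
    }
}

/* Sort most frequent first. */
static int byCount(const void *a, const void *b)
{
    const struct entry *x = a, *y = b;
    return (y->count > x->count) - (y->count < x->count);
}

int main(void)
{
    char line[64];
    while (fgets(line, sizeof line, stdin)) {
        line[strcspn(line, "\n")] = '\0';
        if (line[0] != '\0')
            tally(line);
    }
    qsort(table, nentries, sizeof table[0], byCount);
    for (int i = 0; i < nentries && i < 20; i++)   /* top 20 candidates */
        printf("%8ld  %s\n", table[i].count, table[i].text);
    return 0;
}

The patterns at the top of that list are the ones worth giving 16-bit encodings.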
Mixed width instructions have the advantage that the smaller instructions don't
need to form a complete instruction set - infrequently used instructions can
remain large without a penalty.
If you take this approach to the limit you can compress any fixed-size instruction
set to about 55% of its original size; practical variants get about 65%, however.
> If your proposed ISA were close enough to an existing ISA, then perhaps you
> could go through compiler output and see how many registers are really
> needed. ISTM that registers are used for other things than "direct
> computations". They are used as quickly available temporary storage, and
> with memory, and even caches, getting farther away (in time) from the ALU,
> that use becomes more and more important. But I don't know enough to evaluate
> the effect of this.
Yes, most functions only use a few registers, so you can get away with only
8 registers, like Thumb. If you have a mixed-width instruction set you don't
have to spill if you run out of registers: you simply use the larger instructions.
The same is true for 2-operand instructions: if you need a 3-operand instruction
you simply use it (rather than inserting a move - two 16-bit instructions would be
slower than one 32-bit instruction). If done right, this approach means you can get
the performance of the 32-bit instruction set but with the code size of a
16-bit instruction set.
Wilco