
The coming death of all RISC chips.


Brett Davis

Mar 15, 2009, 6:30:22 PM
The coming death of all RISC chips.

Also the coming death of the brain-damaged dogma of fixed width
instructions. Two dumb birds with one stone.
Don't get me wrong, I LIKE RISC and HATE x86; it's just that all the RISC
chips went fixed width, for the same stupid, shortsighted and dogmatic
reasons, so now they are all going to die for that very same reason.

The next generation of CPU chips will have ~256 floating point registers
that support four-operand instructions. (Think multiply-accumulate.) Just
try to fit that into your fixed 32 bit instructions: nope, won't fit.
Then try to tell people that we need to continue the brain damage and
convert to fixed 64 bit instructions. I will bet you that almost every
CPU company is blowing a million-plus dollars in a cover-your-ass exercise
to prove that fixed 64 bit instructions are stupid, even though most
know that the result of that wasted money will prove the obvious. (Some
will go Itanic packed instructions.)

MIPS and Power had an excuse, they were first; PowerPC/ARM/SPARC and the
rest did not.
SPARC went to another level of hubristic stupidity and added register
windows, which Itanic not only failed to learn from but actually copied!
And then there was Alpha, a supposed clean-slate design that was to
last something like 40 years. But the day the Alpha was designed it was
well known that a measly 32 floating point vector registers was not
going to cut it in the future. Failure to actually look into the future
and plan for it relegated Alpha to being just another RISC copy, and a
late one at that, as the party was already ending.

The real future is clear: 16 bit opcodes of variable length. Half the
smart people who read/post here in comp.arch are likely working on this
now, and as such will not reply usefully to this thread. Which is part of
the reason I am posting: to see which design teams will continue into
the future, and which will die. ;)

The 16 bit fixed width Thumb mode of ARM and MIPS16 deserve special
mention for the brain damage that these chips represent. (Software
engineers love going back to the limitations of 1980s CPUs, NOT.) I would
like to vent, but it would just make me look bad and serve no purpose, as
these chips are dead meat anyway in the long run.

Let's look at the alternatives:
8 bit opcodes: we all know x86, and it sucks massively.
32 bit opcodes: not compatible with old RISC instruction sets; 64
bit opcodes will only be used for floating/vector math. Lots of people
are likely looking at this seriously, but I bet it loses badly to 16
bit opcodes.
16 bit opcodes: ~75% of your instructions fit in one word (two operands
+ short branch), and smaller code equals faster code. The second opcode
word will fit most of the rest of the integer instruction set, plus maybe
some two-operand FPU instructions. The third opcode word will be used
for most floating point code, and loads that need a 32 bit immediate, etc.
(~75% of FPU code fits in 3 words.) The fourth opcode word finishes off
the floating point instruction set, plus whatever else is left over.

Cray used a 16+16 variable width instruction set; my proposal is
16+16+16+16.
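
A sketch of what finding instruction boundaries in such a 16+16+16+16
stream could look like, in C. The length rule is invented purely for
illustration (the post does not specify one): assume the top two bits
of the first 16 bit word give the length in words.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical rule: the top two bits of the first 16 bit word give
   the instruction length, 1 to 4 words. */
static size_t insn_words(uint16_t first)
{
    return ((first >> 14) & 0x3) + 1;
}

/* Count the instructions in a buffer of 16 bit words. */
size_t count_insns(const uint16_t *code, size_t nwords)
{
    size_t i = 0, n = 0;
    while (i < nwords) {
        i += insn_words(code[i]);
        n++;
    }
    return n;
}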

There are only two big CPU markets left that x86 does not own:
Consoles + servers: PowerPC owns this space, PS3+XBox360+Wii and non-x86
IBM mainframes/servers.
Embedded space: ARM, MIPS, SH, 68000(!), etc.

The embedded players will all say smaller opcodes mean smaller code, which
means a smaller/cheaper embedded die, plus smaller code equals faster code.
The consoles are just really big and powerful embedded systems. 16 bit
wins, a mad rush ensues, and the 32 bit opcode chips are ignored.

A variable width instruction format will of course cost you ~tens of
thousands of transistors, but will save you millions of transistors in
the embedded RAM array. The savings in the instruction cache alone for a
given level of performance probably pay for it, and a smaller cache is
faster...

For extra credit:
As a software engineer I have noticed and heard that having 32 visible
integer registers is just extra baggage; it helps a few small
benchmarks, but hurts most code compared to having 16 integer registers.
Is this true?

I do expect someone will do a CPU targeted at embedded systems that has,
say, 16 integer registers and ~64 floats, using a variable 16 bit opcode
format. The benefits being smaller code, a smaller CPU, a smaller die,
with similar speed...

Some useful links:
http://en.wikipedia.org/wiki/Addressing_mode
http://users.encs.concordia.ca/~tahar/coen6741/notes/Chapter2-4p.pdf
http://fit.hnue.edu.vn/itlib/uploads/f59.pdf

Brett

Jacko

Mar 16, 2009, 11:41:28 AM
On 15 Mar, 22:30, Brett Davis <gg...@yahoo.com> wrote:
> The coming death of all RISC chips.

All is a strong word.

> Also the coming death of the brain-damaged dogma of fixed width
> instructions. Two dumb birds with one stone.
> Don't get me wrong, I LIKE RISC and HATE x86; it's just that all the RISC
> chips went fixed width, for the same stupid, shortsighted and dogmatic
> reasons, so now they are all going to die for that very same reason.

Serial instruction formats? With registers moved into the instruction
register pipeline first, and the operation code last?

> The next generation of CPU chips will have ~256 floating point registers
> that support four-operand instructions. (Think multiply-accumulate.) Just
> try to fit that into your fixed 32 bit instructions: nope, won't fit.
> Then try to tell people that we need to continue the brain damage and
> convert to fixed 64 bit instructions. I will bet you that almost every
> CPU company is blowing a million-plus dollars in a cover-your-ass exercise
> to prove that fixed 64 bit instructions are stupid, even though most
> know that the result of that wasted money will prove the obvious. (Some
> will go Itanic packed instructions.)

A 256-register bottleneck, hmm. Why not have 2 inputs and one output
register, and set up the routing by programming the systolic switch?
Only fools rush into 64 bit instructions.

> MIPS and Power had an excuse, they were first; PowerPC/ARM/SPARC and the
> rest did not.
> SPARC went to another level of hubristic stupidity and added register
> windows, which Itanic not only failed to learn from but actually copied!
> And then there was Alpha, a supposed clean-slate design that was to
> last something like 40 years. But the day the Alpha was designed it was
> well known that a measly 32 floating point vector registers was not
> going to cut it in the future. Failure to actually look into the future
> and plan for it relegated Alpha to being just another RISC copy, and a
> late one at that, as the party was already ending.

My less-than-1GHz Pentium still cuts it. It should be much more power
efficient, though.

> The real future is clear: 16 bit opcodes of variable length. Half the
> smart people who read/post here in comp.arch are likely working on this
> now, and as such will not reply usefully to this thread. Which is part of
> the reason I am posting: to see which design teams will continue into
> the future, and which will die. ;)

They will all die to open source kids.

> The 16 bit fixed width Thumb mode of ARM and MIPS16 deserve special
> mention for the brain damage that these chips represent. (Software
> engineers love going back to the limitations of 1980s CPUs, NOT.) I would
> like to vent, but it would just make me look bad and serve no purpose, as
> these chips are dead meat anyway in the long run.

Software engineers get given a platform. Just because a feature is
assumed better for software does not mean it is in any way suitable
hardware. Failing to develop, with screams of "my tools don't work like
the hardware should", is just missing the ENGINEERING point. This is
the difference between a programmer and an engineer.

> Let's look at the alternatives:
> 8 bit opcodes: we all know x86, and it sucks massively.
> 32 bit opcodes: not compatible with old RISC instruction sets; 64
> bit opcodes will only be used for floating/vector math. Lots of people
> are likely looking at this seriously, but I bet it loses badly to 16
> bit opcodes.
> 16 bit opcodes: ~75% of your instructions fit in one word (two operands
> + short branch), and smaller code equals faster code. The second opcode
> word will fit most of the rest of the integer instruction set, plus maybe
> some two-operand FPU instructions. The third opcode word will be used
> for most floating point code, and loads that need a 32 bit immediate, etc.
> (~75% of FPU code fits in 3 words.) The fourth opcode word finishes off
> the floating point instruction set, plus whatever else is left over.

Cache waste and code load/decode waste, a nice routing pattern!! Yes,
how novel.

> Cray used a 16+16 variable width instruction set; my proposal is
> 16+16+16+16.

Bigger = better???

> There are only two big CPU markets left that x86 does not own:
> Consoles + servers: PowerPC owns this space, PS3+XBox360+Wii and non-x86
> IBM mainframes/servers.
> Embedded space: ARM, MIPS, SH, 68000(!), etc.
>
> The embedded players will all say smaller opcodes mean smaller code, which
> means a smaller/cheaper embedded die, plus smaller code equals faster code.
> The consoles are just really big and powerful embedded systems. 16 bit
> wins, a mad rush ensues, and the 32 bit opcode chips are ignored.

and?

> A variable width instruction format will of course cost you ~tens of
> thousands of transistors, but will save you millions of transistors in
> the embedded RAM array. The savings in the instruction cache alone for a
> given level of performance probably pay for it, and a smaller cache is
> faster...

Bring me my bloat of unfurled code, bring me my RAM of multi-gigabyte
shit lacking in subroutines!!!

> For extra credit:
> As a software engineer I have noticed and heard that having 32 visible
> integer registers is just extra baggage; it helps a few small
> benchmarks, but hurts most code compared to having 16 integer registers.
> Is this true?

Amdahl <= read and learn.

> I do expect someone will do a CPU targeted at embedded systems that has,
> say, 16 integer registers and ~64 floats, using a variable 16 bit opcode
> format. The benefits being smaller code, a smaller CPU, a smaller die,
> with similar speed...

and?

cheers jacko

http://nibz.googlecode.com

Brett Davis

Mar 21, 2009, 7:49:59 PM
In article
<ec38afb1-9160-45d8...@l16g2000yqo.googlegroups.com>,
Jacko <jacko...@gmail.com> wrote:

> On 15 Mar, 22:30, Brett Davis <gg...@yahoo.com> wrote:
> > The coming death of all RISC chips.
>
> All is a strong word.

It will take time; I am talking first about the high end 64 bit market,
and others that need high performance floating point.
Once a 16 bit variable instruction format takes off, I do not see how
MIPS and ARM can survive without doing the same.

CORRECTION: ARM is first, with Thumb-2, a 16+16 variable instruction size.

> > Also the coming death of the brain-damaged dogma of fixed width
> > instructions. Two dumb birds with one stone.
> > Don't get me wrong, I LIKE RISC and HATE x86; it's just that all the RISC
> > chips went fixed width, for the same stupid, shortsighted and dogmatic
> > reasons, so now they are all going to die for that very same reason.
>
> serial instruction formats? With registers moved into the instruction
> register pipeline first, and the operation code last?

I am no clueless academic who wants to revive some VAX barf because it
looks pretty. I need to get real work done, and instructions that impair
performance or the upgrade path are simply idiotic.

Clocks have not gotten faster in the past half decade, so the time is
right to rethink all the assumptions that led us here.

> > The next generation of CPU chips will have ~256 floating point registers
> > that support four-operand instructions. (Think multiply-accumulate.) Just
> > try to fit that into your fixed 32 bit instructions: nope, won't fit.
> > Then try to tell people that we need to continue the brain damage and
> > convert to fixed 64 bit instructions. I will bet you that almost every
> > CPU company is blowing a million-plus dollars in a cover-your-ass exercise
> > to prove that fixed 64 bit instructions are stupid, even though most
> > know that the result of that wasted money will prove the obvious. (Some
> > will go Itanic packed instructions.)
>
> A 256-register bottleneck, hmm. Why not have 2 inputs and one output
> register, and set up the routing by programming the systolic switch?
> Only fools rush into 64 bit instructions.

Google matrix multiply and vertex shader. Both use lots of MACs, because
you get a multiply and an add in one cycle. Yes, you can use a hidden ACC
register as the destination to get by with only 3 visible registers in
the instruction, but that causes issues for both software and hardware.
Now google VMX-128, which is used in the XBox360: a PowerPC with 128
vector float registers. Lots of compromises to shoehorn this into a 32
bit instruction format. (Only a few instructions can use the high 96
registers, and a bunch of useful instructions got cut to give you those
registers.)
To be competitive in the console market you NEED big vector units with
128+ registers; that is why Sony and Microsoft paid tens of millions of
dollars to have them designed.

OK, maybe Thumb mode makes sense if you really want a crippled CPU that
can fit in 1mm squared in a modern fab process, surrounded by a sea of
embedded DRAM you added for free because otherwise that space would be
wasted under the wire pads you need to connect the chip to the chip
carrier. ;)
Google "thumb mode performance": 20%-30% code size savings and 30%+
slower code, unless you are crippled by a 16 bit bus. CRAP.

> > Let's look at the alternatives:
> > 8 bit opcodes: we all know x86, and it sucks massively.
> > 32 bit opcodes: not compatible with old RISC instruction sets; 64
> > bit opcodes will only be used for floating/vector math. Lots of people
> > are likely looking at this seriously, but I bet it loses badly to 16
> > bit opcodes.

Actually, if you can shoehorn a bunch of the common FPU instructions
into the 32 bit opcodes, this might not suck too badly. ;)

> > 16 bit opcodes: ~75% of your instructions fit in one word (two operands
> > + short branch), and smaller code equals faster code. The second opcode
> > word will fit most of the rest of the integer instruction set, plus maybe
> > some two-operand FPU instructions. The third opcode word will be used
> > for most floating point code, and loads that need a 32 bit immediate, etc.
> > (~75% of FPU code fits in 3 words.) The fourth opcode word finishes off
> > the floating point instruction set, plus whatever else is left over.
>
> Cache waste and code load decode waste, a nice routing pattern!! Yes
> how novel.

I just found out about the Thumb-2/Cortex instruction set: 16+16 variable
sized instructions. ARM says "Thumb-2 has the performance close to or
better than that of the ARM instruction set and has the code density of
the original Thumb ISA."

Someone finally learned from Cray, though it took a lot of programmers
screaming that Thumb-1 was CRAP. ;)

> > Cray used a 16+16 variable width instruction set; my proposal is
> > 16+16+16+16.
>
> Bigger = better???

I went for the best instruction packing, as that gives the best L1 cache
performance and the best L2 cache performance, while also still giving
you lots of bits per instruction, for the best pipeline performance.

The brain-damaged x86 instruction encoding format does not seem to hurt
performance at all. (It does cost die size...)

> > There are only two big CPU markets left that x86 does not own:
> > Consoles + servers: PowerPC owns this space, PS3+XBox360+Wii and non-x86
> > IBM mainframes/servers.
> > Embedded space: ARM, MIPS, SH, 68000(!), etc.
> >
> > The embedded players will all say smaller opcodes mean smaller code, which
> > means a smaller/cheaper embedded die, plus smaller code equals faster code.
> > The consoles are just really big and powerful embedded systems. 16 bit
> > wins, a mad rush ensues, and the 32 bit opcode chips are ignored.
>
> and?
>
> > A variable width instruction format will of course cost you ~tens of
> > thousands of transistors, but will save you millions of transistors in
> > the embedded RAM array. The savings in the instruction cache alone for a
> > given level of performance probably pay for it, and a smaller cache is
> > faster...
>
> Bring me my bloat of unfurled code, bring me my RAM of multi-gigabyte
> shit lacking in subroutines!!!

I have posted here about RRAM; we are all shortly going to have all the
RAM we could ever want. But 64 bit instructions will still not make any
sense, due to the limitations on L1 and L2 cache sizes.

> > For extra credit:
> > As a software engineer I have noticed and heard that having 32 visible
> > integer registers is just extra baggage; it helps a few small
> > benchmarks, but hurts most code compared to having 16 integer registers.
> > Is this true?
>
> Amdahl <= read and learn.

I will make a different post on this as it deserves its own thread.

> > I do expect someone will do a CPU targeted at embedded systems that has,
> > say, 16 integer registers and ~64 floats, using a variable 16 bit opcode
> > format. The benefits being smaller code, a smaller CPU, a smaller die,
> > with similar speed...
>
> and?

Thumb-2 is already here; I just did not know about it. Mostly because ARM
will not tell anyone...
"ARM Architecture Reference Manual - This document is available only to
registered ARM customers."
Anyone want to leak the instruction encoding format? Messy?

> cheers jacko
>
> http://nibz.googlecode.com

Jacko

Mar 23, 2009, 1:01:39 PM
Hi

Let's suppose, for the sake of argument, that you may be right. Then there
will be no alternate thinking.

I do not dispute the want for higher floating/fixed point performance
in some application markets. But adding multiplexers and limiting the
number of execution units is counterproductive.

If you have half the number of registers with two execution units,
performance could be twice as good, considering that the latency of
instruction fetch is small compared to execution latency. Cross-coupling
these execution units would exact a minor speed penalty, unless alternate
register windows opened onto each set in alternate cycles (1 clock
delay for the other set).

Now, assuming each result calculated is placed at the head of the
register queue, no target register need be specified, but the
compiler would have to be smarter. This shift register queue model
could be effective.

In fact, if more than one queue per execution unit were maintained, then
processing could be distributed between register cells, and execution
bypass (alternates) could be selected from operations down the queue
chain.

The concept of injecting NaN opcodes, and its implications for exception
processing here, is not beyond imagination.

The instruction format itself, beyond the constraints of containment
of the operator set, is a secondary issue next to the flow control
strategy for achieving maximal throughput and acceptable latency.

The sweeping statement of instruction format standardization to 16 bits
is premature.

cheers jacko

Brett Davis

Mar 27, 2009, 1:12:13 AM
In article
<2bd56b7f-c703-4917...@c9g2000yqm.googlegroups.com>,
Jacko <jacko...@gmail.com> wrote:

> Hi
>
> Let's suppose, for the sake of argument, that you may be right. Then there
> will be no alternate thinking.

I am one of those few people who don't mind being wrong, because I will
learn something in the process. But I definitely did not expect to be
proved right so quickly. ARM is producing lots of Cortex designs that are
Thumb2 only, and dumping their own old ARM instruction set.
The low end Cortex is only 30,000 transistors, so the cost of a 16+16
variable instruction set cannot be much. ARM is aiming at the 8 bit
chips, and scales up to PC levels.

http://www.arm.com/products/CPUs/archi-thumb2.html

> I do not dispute the want for higher floating/fixed point performance
> in some application markets. But adding multiplexers and limiting the
> number of execution units is counterproductive.

The FPU market is one of the few that needs more flops and has the
dollars to buy them. These people are moving to programmable graphics
chips, where they can use the 800 vector processors on, say, an ATI 700
series.

> If you have half the number of registers with two execution units,
> performance could be twice as good, considering that the latency of
> instruction fetch is small compared to execution latency. Cross-coupling
> these execution units would exact a minor speed penalty, unless alternate
> register windows opened onto each set in alternate cycles (1 clock
> delay for the other set).

The last DEC Alpha did this to win the clock rate war for a year or two;
unfortunately, this chip was slower than the previous chip at getting
work done.

> Now, assuming each result calculated is placed at the head of the
> register queue, no target register need be specified, but the
> compiler would have to be smarter. This shift register queue model
> could be effective.

Sounds like the four-stack processor described here last year(s); I found
his web site to refresh my memory. For simplicity of implementation he
went with a 64 bit VLIW with lots of NOP subfields when a stack is not
used that cycle.
The 32 bit RISC chips went with fixed width instructions for the same
reasons, but it's not 1980 anymore; anyone can design a RISC chip
today. It's time to actually optimize the instruction path and get back
that 5% performance you lost by going VLIW.
Actually, ARM is claiming 31% smaller code and 38% faster code. Of
course the marketing department rigged those numbers. ;)

> In fact, if more than one queue per execution unit were maintained, then
> processing could be distributed between register cells, and execution
> bypass (alternates) could be selected from operations down the queue
> chain.
>
> The concept of injecting NaN opcodes, and its implications for exception
> processing here, is not beyond imagination.
>
> The instruction format itself, beyond the constraints of containment
> of the operator set, is a secondary issue next to the flow control
> strategy for achieving maximal throughput and acceptable latency.

Everyone seems to have the same flow control strategy: a three cycle
pipeline for the low end cheap/simple/small/slow market, a five cycle
design for the mid/high/fast market with dual instruction execution, and
longer pipelines with wider execution for PC class designs.

> The sweeping statement of instruction format standardization to 16 bits
> is premature.

In a mere half decade or so you will no longer be able to buy a new chip
with the old ARM instruction set. MIPS will do the same, if they still
exist. When IBM announces their variable width instruction set the fat
lady will have sung.

> cheers jacko

Salutations, Brett

nm...@cam.ac.uk

Mar 27, 2009, 5:25:19 AM
>> I do not dispute the want for higher floating/fixed point performance
>> in some application markets. But adding multiplexers and limiting the
>> number of execution units is counterproductive.
>
>The FPU market is one of the few that needs more flops and has the
>dollars to buy them. These people are moving to programmable graphics
>chips, where they can use the 800 vector processors on, say, an ATI 700
>series.

Not really. That was true in the 1970s, but ceased being so in the
1990s - there are virtually no significant applications that are
limited by floating-point performance any longer. The main limit
in the HPC area is memory access - sometimes bandwidth, sometimes
latency - followed by branch misprediction and communication (again,
sometimes bandwidth, sometimes latency).

One thing that people miss about most GPUs is that they deliver not
just lots of floating-point, but lots of memory bandwidth - and, because
rendering doesn't need it, they don't have all the complexity and delays
of full coherence. And THAT is why they can go a lot faster, not the
floating-point as such.


Regards,
Nick Maclaren.

Torben Ægidius Mogensen

Mar 27, 2009, 5:47:27 AM
Brett Davis <gg...@yahoo.com> writes:


> I am one of those few people who don't mind being wrong, because I will
> learn something in the process. But I definitely did not expect to be
> proved right so quickly. ARM is producing lots of Cortex designs that are
> Thumb2 only, and dumping their own old ARM instruction set.
> The low end Cortex is only 30,000 transistors, so the cost of a 16+16
> variable instruction set cannot be much. ARM is aiming at the 8 bit
> chips, and scales up to PC levels.

Cortex-M0 has only 12,000 gates, actually.

It doesn't seem to run full Thumb2, though. The information I could
find (http://www.arm.com/miscPDFs/24481.pdf) says Thumb + a few Thumb2
instructions.

> http://www.arm.com/products/CPUs/archi-thumb2.html

This says that Thumb2 is backwards compatible with existing ARM and
Thumb solutions, and the text on
http://www.arm.com/products/CPUs/architecture.html says

"All ARMv7 architecture profiles implement Thumb® -2 technology which
is built on the foundation of the ARM industry-leading Thumb code
compression technology, while retaining complete code compatibility
with existing ARM solutions."

I may be wrong, but this sounds to me like you can still execute ARM32
and Thumb instructions on ARMv7. So when the diagram says "Thumb-2
only" for v7-M, it might just mean no SIMD or FP, only the base
integer instruction set. On the other hand, the slides at
http://www.arm.com/miscPDFs/8498.pdf say "ARM and Thumb-2 execution"
for v7-A and v7-R but "Thumb-2 only" for v7-M, so it isn't exactly clear.

If someone from ARM is around, I hope they can clarify.


> In a mere half decade or so you will no longer be able to buy a new chip
> with the old ARM instruction set.

That is possible, but I wouldn't be sure. It is difficult making a
clean cut from backwards compatibility.

Torben

Wilco Dijkstra

Mar 27, 2009, 4:47:56 PM

"Torben "Ægidius" Mogensen" <tor...@pc-003.diku.dk> wrote in message news:7zd4c32...@pc-003.diku.dk...

> Brett Davis <gg...@yahoo.com> writes:
>
>
>> I am one of those few people who don't mind being wrong, because I will
>> learn something in the process. But I definitely did not expect to be
>> proved right so quickly. ARM is producing lots of Cortex designs that are
>> Thumb2 only, and dumping their own old ARM instruction set.
>> The low end Cortex is only 30,000 transistors, so the cost of a 16+16
>> variable instruction set cannot be much. ARM is aiming at the 8 bit
>> chips, and scales up to PC levels.
>
> Cortex-M0 has only 12,000 gates, actually.
>
> It doesn't seem to run full Thumb2, though. The information I could
> find (http://www.arm.com/miscPDFs/24481.pdf) says Thumb + a few Thumb2
> instructions.

Yes, Cortex-M0 and -M1 are Thumb-1 plus the minimum number of
Thumb-2 instructions to become a standalone instruction set. The original
Thumb-1 instruction set always needed ARM at startup and for things like
interrupt enable/disable.

>> http://www.arm.com/products/CPUs/archi-thumb2.html
>
> This says that Thumb2 is backwards compatible with existing ARM and
> Thumb solutions, and the text on
> http://www.arm.com/products/CPUs/architecture.html says
>
> "All ARMv7 architecture profiles implement Thumb® -2 technology which
> is built on the foundation of the ARM industry-leading Thumb code
> compression technology, while retaining complete code compatibility
> with existing ARM solutions."
>
> I may be wrong, but this sounds to me like you can still execute ARM32
> and Thumb instructions on ARMv7. So when the diagram says "Thumb-2
> only" for V7 M, it might just mean no SIMD or FP, only the base
> integer instruction set. On the other hand, the slides at
> http://www.arm.com/miscPDFs/8498.pdf say "ARM and Thumb-2 execution"
> for v7A and v7R but "Thumb-2 only" for v7M, so it isn't exactly clear.
>
> If someone from ARM is around, I hope they can clarify.

v7-M (Cortex-M3) doesn't implement ARM, while v7-A and v7-R support the
ARM instruction set. So v7-M is only backwards compatible with Thumb-1 code,
not ARM code.

>> In a mere half decade or so you will no longer be able to buy a new chip
>> with the old ARM instruction set.
>
> That is possible, but I wouldn't be sure. It is difficult making a
> clean cut from backwards compatibility.

More likely a decade. But indeed everybody converts to Thumb-2 when it is
available rather than sticking with ARM, so support for ARM will be needed
for backwards compatibility only. Once those old binaries are replaced there
won't be a need to support ARM any longer.

Wilco


Brett Davis

Mar 28, 2009, 3:59:22 PM
In article <YXazl.141830$IC4....@newsfe13.ams2>,
"Wilco Dijkstra" <Wilco.remove...@ntlworld.com> wrote:

> "Torben "Ægidius" Mogensen" <tor...@pc-003.diku.dk> wrote in message
> news:7zd4c32...@pc-003.diku.dk...
> > Brett Davis <gg...@yahoo.com> writes:
> >
> >
> >> I am one of those few people who don't mind being wrong, because I will
> >> learn something in the process. But I definitely did not expect to be
> >> proved right so quickly. ARM is producing lots of Cortex designs that are
> >> Thumb2 only, and dumping their own old ARM instruction set.
> >> The low end Cortex is only 30,000 transistors, so the cost of a 16+16
> >> variable instruction set cannot be much. ARM is aiming at the 8 bit
> >> chips, and scales up to PC levels.
> >
> > Cortex-M0 has only 12,000 gates, actually.
> >
> > It doesn't seem to run full Thumb2, though. The information I could
> > find (http://www.arm.com/miscPDFs/24481.pdf) says Thumb + a few Thumb2
> > instructions.
>
> Yes, Cortex-M0 and -M1 are Thumb-1 plus the minimum number of
> Thumb-2 instructions to become a standalone instruction set. The original
> Thumb-1 instruction set always needed ARM at startup and for things like
> interrupt enable/disable.

So the Cortex-M0 has only 16 registers, not 32, and only does two-operand
instructions, saving a read port on the register file. Together that
would save somewhere around ~10,000 gates.
The cost of variable instruction width decoding is looking really,
really small.

> >> In a mere half decade or so you will no longer be able to buy a new chip
> >> with the old ARM instruction set.
> >
> > That is possible, but I wouldn't be sure. It is difficult making a
> > clean cut from backwards compatibility.
>
> More likely a decade. But indeed everybody converts to Thumb-2 when it is
> available rather than sticking with ARM, so support for ARM will be needed
> for backwards compatibility only. Once those old binaries are replaced there
> won't be a need to support ARM any longer.

The benefits of Thumb-2 are huge and the cost is tiny. I would bet the
last ARM extensions are done: no new designs, and all new work and
extensions are being switched to Thumb-2.
ARM mode will stick around for a few new chips for backwards
compatibility. Thumb-1+ may outlive ARM, kept alive to steal 8 and 16
bit socket wins. Though the 32 bit H8 may be a better design in that
space; as it is an upgrade from the 8/16 bit versions, it's a plug-in
(recompile) and use upgrade:
http://documentation.renesas.com/eng/products/mpumcu/rej09b0348_h838776hm.pdf

> Wilco

Brett

Brett Davis

Mar 28, 2009, 4:19:41 PM
Variable length instructions, the future for all CPU designs.
http://www.arm.com/products/CPUs/archi-thumb2.html

How to split up a 16+16+ instruction format?
The biggest issue is that some hand-optimized functions can use lots of
registers, while most of the code I disassemble on MIPS tends to use
about 4 registers...
Then I thought about leveraging the AMD64 compilers, which bias code
generation to the first eight registers, as the second eight need
extension opcode bytes.

The first 16 bit opcode contains up to two 3 bit register specifiers (8
FPU or 8 INT registers), does loads/stores with short offsets, branches
with short offsets, etc.

The second 16 bit opcode (extension) adds 2 bits to each of the 2 register
specifiers and adds a third 5 bit register specifier. (9 bits, leaving 7
more for opcode.) The second opcode could also just be immediate data
for longer offsets or branches, etc. This gets you the classic RISC
three-operand, 32-register environment.

The third 16 bit opcode (extension 2) adds another 3 bits to the three
FPU registers and adds a fourth 8 bit register specifier. (17 bits; oops,
actually one of those bits was in the previous opcode.) This gives you
four-operand instructions with 256 FPU/vector registers. (Or more
immediate data/offset.) This is what you see in high end designs like
XBox360/PS3/ATI/NVidia/Cray, etc.

The compiler and programmer will of course think of this as a RISC or
VLIW chip; only the assembler or the last code generation pass will
realize that most instructions used will condense down to short
instructions. Basically all the branchy code and all the code with lots
of aliasing (most code) will tend to use the short opcodes.
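
A sketch in C of how one register specifier could be assembled across
those parcels. Only the widths (3 bits in the first word, +2 in the
first extension, +3 in the second) come from the proposal above; the
bit positions and the helper itself are invented for illustration.

#include <stdint.h>
#include <stdio.h>

/* Assemble one register specifier: 3 bits from the first parcel,
   2 more from the first extension, 3 more from the second, giving
   8 bits = 256 registers. Field positions are hypothetical. */
static unsigned reg_specifier(uint16_t p0, uint16_t p1, uint16_t p2)
{
    unsigned r = p0 & 0x7;          /* 3 bits: 8 registers    */
    r |= (p1 & 0x3) << 3;           /* +2 bits: 32 registers  */
    r |= (p2 & 0x7) << 5;           /* +3 bits: 256 registers */
    return r;
}

int main(void)
{
    /* Register 181 = binary 10110101, spread across three parcels. */
    printf("%u\n", reg_specifier(0x5, 0x2, 0x5));   /* prints 181 */
    return 0;
}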

There are lots of variations on where you deploy the bits in which
opcode words; the first opcode could contain 4 bits of register selector
instead, or three register selectors of 3 bits, etc. The most integer
registers I have ever used is 16, so I doubt you need 32; if you really
need 32 you could push them to the third opcode, etc. You could go
crazy/VAXy and have a few select three-operand codes in the first opcode
and most in the second. You could add a fourth opcode and redistribute
the bits, etc. I have not done an analysis of which combination is
"best", if there is a best.

To find the best combination of bit usage I would look at PowerPC code,
as it's the most modern, and look at the most commonly used registers.
IBM gives away a code compressor that gets an almost guaranteed 50% code
compression. (It gave exactly 50% or 0% per page of RAM.) On the Nintendo
64 it was used to page code off of compressed ROM cartridges: you would
have 4 megs of code, tell the OS to use half a meg of memory, and the
virtual memory system would page and decompress blocks off the ROM like
a UNIX system does off of disk.

You could also look at AVR32, or Thumb-2 if you can get the manuals out
of ARM. (A hint to ARM: the secret is out, give up the docs.)
http://en.wikipedia.org/wiki/AVR32
http://www.atmel.com/dyn/resources/prod_documents/doc32000.pdf

Brett

Brett Davis

Mar 29, 2009, 2:21:04 PM
Renesas intros the RX series, a Harvard CISC.
http://sg.renesas.com/fmwk.jsp?cnt=concept3.htm&fp=/new_cpu/child_folder/&title=The%20high-performance%20RX%20CPU%20architecture
http://documentation.renesas.com/eng/products/mpumcu/rej09b0435_rxsm.pdf

The RX series is an ordinary 32 bit load/store RISC chip with 16
registers. Renesas calls it CISC because the embedded market demographic
LIKES CISC, and because of the BYTE sized instruction encoding.
(Instructions vary from one byte to nine; most seem to be about
three bytes on average, or so.)

The byte encoding is obviously meant to save instruction RAM, but I am
willing to bet that my proposed 16 bit encoding design will give the same
code density. (I give priority to the low eight registers; the extra
register bits are in the extension word. Only the assembler needs to know
this, and the design looks like an ordinary RISC chip to the compiler and
most programmers.)

I am not a fan of byte-wise instruction encoding; I am pointing out this
new chip as yet another data point showing that designing a variable
width instruction set is not difficult, does not limit clocks, will not
impair performance, will not cost a lot of transistors, etc.

Fixed width 32 bit instructions will be facing extinction in another
decade.
http://www.arm.com/products/CPUs/archi-thumb2.html

Brett

ha...@watson.ibm.com

Mar 30, 2009, 2:46:08 PM
On Mar 27, 1:12 am, Brett Davis <gg...@yahoo.com> wrote:
> When IBM announces their variable width instruction set the fat lady
> will have sung.

Well, she's been singing since 1964, then. Today's z/Architecture still
has the same three instruction lengths of 16, 32 or 48 bits, determined
by the first two bits of the opcode.
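
That rule in C, for concreteness (this two-bit length code is a
documented part of the architecture):

#include <stdint.h>

/* S/360 through z/Architecture: the first two bits of the opcode
   give the instruction length.
   00 -> 16 bits, 01 or 10 -> 32 bits, 11 -> 48 bits. */
static int insn_len_bits(uint8_t opcode)
{
    switch (opcode >> 6) {
    case 0:  return 16;
    case 3:  return 48;
    default: return 32;
    }
}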

Michel.

ArarghMai...@not.at.arargh.com

Mar 30, 2009, 4:17:17 PM

Ahhh, the 1401 had variable length instructions. Somewhere around
1959, I think.
--
ArarghMail903 at [drop the 'http://www.' from ->] http://www.arargh.com
BCET Basic Compiler Page: http://www.arargh.com/basic/index.html

To reply by email, remove the extra stuff from the reply address.

Brett Davis

Mar 30, 2009, 11:16:59 PM
In article <i4a2t4tceeb8lft65...@4ax.com>,
ArarghMai...@NOT.AT.Arargh.com wrote:

> On Mon, 30 Mar 2009 11:46:08 -0700 (PDT), ha...@watson.ibm.com wrote:
>
> >On Mar 27, 1:12 am, Brett Davis <gg...@yahoo.com> wrote:
> >> When IBM announces their variable width instruction set the fat lady
> >> will have sung.
> >
> >Well, she's been singing since 1964, then. Today's z/Architecture still
> >has the same three instruction lengths of 16, 32 or 48 bits, determined
> >by the first two bits of the opcode.

Let me rephrase that: "When IBM announces their variable width LOAD/STORE
instruction set the fat lady will have sung."

I do like the IBM 360 (z/Architecture), but it's not a load/store
machine; like x86, you can do adds from memory. This adds extra
complications and costs to a modern implementation. (Costs which both
Intel/AMD and IBM can afford, to keep those customers.)
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/download/A2278713.pdf?DT=20070807125007&XKS=DZ9ZBK07
http://www.research.ibm.com/journal/rd/464/plambeck.html

Of course you could try to argue that a register-plus-memory
architecture is superior to load/store due to code density...
making me eat my words that code density rules the future. ;)
(I have no idea if the code density is better; it might be worth looking
into if the hardware/design costs are not excessive for the mid to high
end. Which would be dual issue to quad issue per cycle... ;)

> Ahhh, the 1401 had variable length instructions. Somewhere around
> 1959, I think.

I found a nice web site on the 1401: way cool cabinet, not a normal CPU
by any stretch of one's imagination. But that was your point. ;)
http://www.ed-thelen.org/1401Project/1401RestorationPage.html
http://www.ed-thelen.org/1401Project/1401BackAnnot.jpg
http://www.ed-thelen.org/1401Project/1401GuidePosterV9.rtf

Brett

ArarghMai...@not.at.arargh.com

Mar 31, 2009, 3:42:09 AM
On Tue, 31 Mar 2009 03:16:59 GMT, Brett Davis <gg...@yahoo.com> wrote:
> ArarghMai...@NOT.AT.Arargh.com wrote:
<snip>

>> Ahhh, the 1401 had variable length instructions. Somewhere around
>> 1959, I think.
>
>Found a nice web site on the 1401, way cool cabinet, not a normal CPU by
>any stretch of ones imagination. But that was your point. ;)
>http://www.ed-thelen.org/1401Project/1401RestorationPage.html
>http://www.ed-thelen.org/1401Project/1401BackAnnot.jpg
>http://www.ed-thelen.org/1401Project/1401GuidePosterV9.rtf
>
Except for the placement of the lights & switches, a 1401 CPU box
looked like most S/360 boxes. Big, and blue (at least most that I saw
- I think one PPOE had red boxes.)

Also some 1401 info here: http://www.piercefuller.com/

Torben Ægidius Mogensen

Mar 31, 2009, 6:08:11 AM
"Wilco Dijkstra" <Wilco.remove...@ntlworld.com> writes:

> "Torben "Ægidius" Mogensen" <tor...@pc-003.diku.dk> wrote in message news:7zd4c32...@pc-003.diku.dk...

>> Cortex-M0 has only 12,000 gates, actually.
>>
>> It doesn't seem to run full Thumb2, though. The information I could
>> find (http://www.arm.com/miscPDFs/24481.pdf) says Thumb + a few Thumb2
>> instructions.
>
> Yes, Cortex-M0 and -M1 are Thumb-1 plus the minimum number of
> Thumb-2 instructions to become a standalone instruction set. The original
> Thumb-1 instruction set always needed ARM at startup and for things like
> interrupt enable/disable.

> v7-M (Cortex-M3) doesn't implement ARM, while v7-A and v7-R support the
> ARM instruction set. So v7-M is only backwards compatible with Thumb-1 code,
> not ARM code.

Thanks for the clarifications. Is a complete description of the
Thumb-2 ISA (including bit patterns for instructions) publicly
available anywhere?

Torben

Torben Ægidius Mogensen

Mar 31, 2009, 6:50:16 AM
Brett Davis <gg...@yahoo.com> writes:

> Variable length instructions, the future for all CPU designs.
> http://www.arm.com/products/CPUs/archi-thumb2.html
>
> How to split up a 16+16+ instruction format?

There are many ways to have variable-length instructions. The
traditional way (PDP-11, 6502, ...) is to have the first byte/word
contain the opcode and the following bytes/words have operands for
this.

The Pentium (and others) added prefix opcodes: These modify the
meaning of the following opcode. It is in theory possible to have
operands before opcodes as long as no operand "looks like" an opcode.
The Transputer used prefix modifiers like this to extend constants.

Thumb-1 used modes: In one mode, all instructions are 16 bits long and
in another all are 32 bits long. That is only borderline
variable-length. Thumb-2 encodes 16-bit instructions so no 16-bit
instruction can be a prefix of a 32-bit instruction, so the decoder
can look at 32 bits and see if it is a full 32-bit instruction or if
it should use only the first 16 bits and shift only 16 new bits in for
the next decode cycle.

A simpler idea that I like is to make all jump targets 32-bit
boundaries and split a 32-bit word into one, two or three instructions
depending on the first two bits. One possible instantiation
is:

00: The remaining 30 bits encode one 30-bit instruction.

01: The remaining 30 bits encode one 10-bit instruction and one
20-bit instruction.

10: As 01, but the 20-bit instruction is before the 10-bit
instruction.

11: The remaining 30 bits encode three 10-bit instructions.

The advantage is simplicity and alignment (no instruction is split
over two words). The disadvantage is that you might need to add NOPs
if you need many 20-bit instructions in sequence. A good scheduler
would minimize this, though.

10 bits is not much for an instruction, but if you limit yourself to two
3-bit register specifiers, you have four bits left for the opcode. This
is enough to encode the most common operations (addition, subtraction,
move, bitwise logic, shift, load, store, ...). 20-bit instructions
can add larger register specifiers and immediates and more opcodes.
30-bit instructions are all those that can't easily fit in 10 or 20 bits:
Procedure calls, long jumps, coprocessor instructions (FP, media,
etc.), load/store with large offsets, etc.
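
One possible layout for such a 10-bit instruction, as C field
extractors. The field order is an invented example; only the widths
(4 bit opcode, two 3 bit registers) come from the text above.

/* 10-bit instruction: bits [9:6] opcode, [5:3] rd, [2:0] rs. */
#define OPCODE(i)  (((i) >> 6) & 0xF)
#define RD(i)      (((i) >> 3) & 0x7)
#define RS(i)      ((i) & 0x7)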

Torben

Brett Davis

Apr 1, 2009, 1:11:02 AM
In article <7zprfy6...@pc-003.diku.dk>,
tor...@pc-003.diku.dk (Torben AEgidius Mogensen) wrote:

> Brett Davis <gg...@yahoo.com> writes:
>
> > Variable length instructions, the future for all CPU designs.
> > http://www.arm.com/products/CPUs/archi-thumb2.html
> >
> > How to split up a 16+16+ instruction format?
>
> There are many ways to have variable-length instructions. The
> traditional way (PDP-11, 6502, ...) is to have the first byte/word
> contain the opcode and the following bytes/words have operands for
> this.
>
> The Pentium (and others) added prefix opcodes: These modify the
> meaning of the following opcode. It is in theory possible to have
> operands before opcodes as long as no operand "looks like" an opcode.
> The Transputer used prefix modifiers like this to extend constants.

You only use prefix opcodes when you have no choice; this causes code
density to crater, and the whole point here is good code density.

Good code density can get you better performance and lower cost for the
total system package. A new design needs an advantage; otherwise your
design will be just another "me too" design destined for obscurity.

> Thumb-1 used modes: In one mode, all instructions are 16 bits long and
> in another all are 32 bits long. That is only borderline
> variable-length. Thumb-2 encodes 16-bit instructions so no 16-bit
> instruction can be a prefix of a 32-bit instruction, so the decoder
> can look at 32 bits and see if it is a full 32-bit instruction or if
> it should use only the first 16 bits and shift only 16 new bits in for
> the next decode cycle.

Thumb-2 is my poster child, 31% smaller code, and 38% faster.
To stay even-handed, I keep pointing out that Thumb-1 was CRAP. ;)

> A simpler idea that I like is to make all jump targets 32-bit
> boundaries and split a 32-bit word into one, two or three instructions
> depending on the first two bits. One possible instantiation
> is:
>
> 00: The remaining 30 bits encode one 30-bit instruction.
>
> 01: The remaining 30 bits encode one 10-bit instruction and one
> 20-bit instruction.
>
> 10: As 01, but the 20-bit instruction is before the 10-bit
> instruction.
>
> 11: The remaining 30 bits encode three 10-bit instructions.
>
> The advantage is simplicity and alignment (no instruction is split
> over two words). The disadvantage is that you might need to add NOPs
> if you need many 20-bit instructions in sequence. A good scheduler
> would minimize this, though.

I am willing to bet that every fifth instruction will be a NOP, which
will just about put you back at RISC's poor code density.
Plus you have the extra complications, cost and heat of having to have
three simultaneous decoders for the three formats in that 32 bit word.
(With clever layout you can reduce that to two or so.)

Itanic went partway down that path; I know of no other examples.

A 16+16 format needs just one 32 bit decoder, and a simple 16 bit shift
when the opcode is 16 bits to get the next 32 bits to decode. Also a
latch to zero out any effects the extra 16 bits might cause for a 16 bit
instruction. Overall pretty trivial and nothing that will cause issues.
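
A sketch of that decode loop in C: one 32 bit window and a 16 bit
shift. The length rule here is hypothetical (a leading 16 bit parcel
starting with binary 11 marks a 32 bit instruction); Thumb-2's real
rule differs in detail but has the same shape.

#include <stdint.h>
#include <stddef.h>

/* One 32 bit decode window over a stream of 16 bit parcels. */
void decode_stream(const uint16_t *code, size_t n,
                   void (*emit)(uint32_t insn, int nbits))
{
    size_t i = 0;
    while (i < n) {
        if ((code[i] >> 14) == 0x3 && i + 1 < n) {  /* 32 bit insn */
            emit(((uint32_t)code[i] << 16) | code[i + 1], 32);
            i += 2;
        } else {                                    /* 16 bit insn */
            emit(code[i], 16);
            i += 1;
        }
    }
}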

Torben Ægidius Mogensen

Apr 1, 2009, 5:24:03 AM
Brett Davis <gg...@yahoo.com> writes:

> In article <7zprfy6...@pc-003.diku.dk>,
> tor...@pc-003.diku.dk (Torben AEgidius Mogensen) wrote:
>
>> Brett Davis <gg...@yahoo.com> writes:

>> The Pentium (and others) added prefix opcodes: These modify the
>> meaning of the following opcode. It is in theory possible to have
>> operands before opcodes as long as no operand "looks like" an opcode.
>> The Transputer used prefix modifiers like this to extend constants.
>
> You only use prefix opcodes when you have no choice; this causes code
> density to crater, and the whole point here is good code density.

I agree. I just mentioned past ways to get variable instruction
lengths.


> Thumb-2 is my poster child, 31% smaller code, and 38% faster.

The speed increase is hardly a result of the instruction encoding,
but rather a result of other improvements.

> To stay even-handed, I keep pointing out that Thumb-1 was CRAP. ;)

Crap is a bit harsh; it did improve code density without a large
impact on speed. But it could have been done a lot better, I agree.

>> A simpler idea that I like is to make all jump targets 32-bit
>> boundaries and split a 32-bit word into one, two or three instructions
>> depending on the first two bits. One possible instantiation
>> is:
>>
>> 00: The remaining 30 bits encode one 30-bit instruction.
>>
>> 01: The remaining 30 bits encode one 10-bit instruction and one
>> 20-bit instruction.
>>
>> 10: As 01, but the 20-bit instruction is before the 10-bit
>> instruction.
>>
>> 11: The remaining 30 bits encode three 10-bit instructions.
>>
>> The advantage is simplicity and alignment (no instruction is split
>> over two words). The disadvantage is that you might need to add NOPs
>> if you need many 20-bit instructions in sequence. A good scheduler
>> would minimize this, though.
>
> I am willing to bet that every fifth instruction will be a NOP, which
> will just about put you back at RISC's poor code density.

Even if every fifth instruction is a NOP (which I doubt), code density
is still better than fixed-length 32-bit instructions. You only need
NOPs if you have several instructions in one 32-bit word, so at worst,
you have one non-NOP per word, which is the same as fixed-length. But
that is absolutely worst case. A probable mix (IMO) is

- 1/4 30-bit instructions
- 1/2 20+10-bit pairs
- 1/4 3x10-bit instructions

Some of which may be NOPs.

Not considering NOPs, four 32-bit words would, hence, on average hold
8 instructions, which is an average of 16 bits per instruction.

If every fifth instruction is a NOP, the average number of
instructions in four 32-bit words is 8*4/5 = 32/5. So the average
number of bits per instruction is 4*32/(32/5) = 4*5 = 20. This is
37.5% less than fixed-length 32-bit code.

If you are more pessimistic and say

- 1/3 30-bit instructions
- 1/2 20+10-bit pairs
- 1/6 3x10-bit instructions

you get 11 instructions per six words, so with 1/5 NOPs, you get
4*11/5 non-NOPs per six words, so 6*32/(44/5) = 22 bits per
instruction, which is still 31% better than fixed-length 32-bit
instructions.

> > Plus you have the extra complications, cost and heat of having to have
> > three simultaneous decoders for the three formats in that 32 bit word.
> > (With clever layout you can reduce that to two or so.)

You only need multiple decoders if you intend to execute multiple
instructions simultaneously.

> Itanic went partway down that path; I know of no other examples.

Itanium does execute several instructions simultaneously, so that is
another kettle of fish.

> A 16+16 format needs just one 32 bit decoder, and a simple 16 bit shift
> when the opcode is 16 bits to get the next 32 bits to decode. Also a
> latch to zero out any effects the extra 16 bits might cause for a 16 bit
> instruction. Overall pretty trivial and nothing that will cause issues.

And you also need a state bit to see if you need to read in a new word
after decoding a 16-bit instruction.

The format I described is not much different: You have one state bit,
so you decode 33 bits rather than 32. When you read in a new word,
the state bit is set to 0. In each decode cycle you use the state bit
and the first two bits of the 32-bit word to decide what to do:

0 00: Decode the next 30 bits as one instruction and read a new word.
0 01: Decode the next 10 bits as one instruction, shift by 10 bits
and set the three bits to 1 10.
0 10: Decode the next 20 bits as one instruction, shift by 20 bits
and set the three bits to 1 01.
0 11: Decode the next 10 bits as one instruction, shift by 10 bits
and set the three bits to 1 11.
1 00: Can't happen.
1 01: Decode the next 10 bits as one instruction and read a new word.
1 10: Decode the next 20 bits as one instruction and read a new word.
1 11: Decode the next 10 bits as one instruction, shift by 10 bits
and set the three bits to 1 01.

So, you see, this is not really very complicated.
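
That table, transcribed directly into C as a sketch. emit() is a
placeholder for whatever consumes a decoded instruction; the field
positions follow the description above.

#include <stdint.h>

void decode_word(uint32_t word, void (*emit)(uint32_t insn, int nbits))
{
    uint32_t buf = word;
    unsigned state = 0;

    for (;;) {
        unsigned tag  = buf >> 30;            /* first two bits  */
        uint32_t body = buf & 0x3FFFFFFFu;    /* 30 payload bits */

        if (state == 0 && tag == 0) {         /* 0 00: one 30-bit  */
            emit(body, 30);
            return;                           /* read a new word   */
        } else if (state == 1 && tag == 1) {  /* 1 01: last 10-bit */
            emit(body >> 20, 10);
            return;
        } else if (state == 1 && tag == 2) {  /* 1 10: last 20-bit */
            emit(body >> 10, 20);
            return;
        } else if (tag == 2) {                /* 0 10: 20-bit first */
            emit(body >> 10, 20);
            buf = ((buf << 20) & 0x3FFFFFFFu) | (1u << 30); /* -> 1 01 */
            state = 1;
        } else {                              /* 0 01, 0 11, 1 11  */
            emit(body >> 20, 10);             /* emit a 10-bit insn */
            unsigned next = (tag == 1) ? 2u       /* 0 01 -> 1 10 */
                          : (state ? 1u : 3u);    /* 1 11 -> 1 01,
                                                     0 11 -> 1 11 */
            buf = ((buf << 10) & 0x3FFFFFFFu) | (next << 30);
            state = 1;
        }
    }
}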

Torben

nm...@cam.ac.uk

Apr 1, 2009, 6:32:44 AM
In article <7zvdpog...@pc-003.diku.dk>,
Torben Ægidius Mogensen <tor...@pc-003.diku.dk> wrote:
>
>The format I described is not much different: You have one state bit,
>so you decode 33 bits rather than 32. When you read in a new word,
>the state bit is set to 0. In each decode cycle you use the state bit
>and the first two bits of the 32-bit word to decide what to do:
>
> 0 00: Decode the next 30 bits as one instruction and read a new word.
> 0 01: Decode the next 10 bits as one instruction, shift by 10 bits
> and set the three bits to 1 10.
> 0 10: Decode the next 20 bits as one instruction, shift by 20 bits
> and set the three bits to 1 01.
> 0 11: Decode the next 10 bits as one instruction, shift by 10 bits
> and set the three bits to 1 11.
> 1 00: Can't happen.
> 1 01: Decode the next 10 bits as one instruction and read a new word.
> 1 10: Decode the next 20 bits as one instruction and read a new word.
> 1 11: Decode the next 10 bits as one instruction, shift by 10 bits
> and set the three bits to 1 01.
>
>So, you see, this is not really very complicated.

I find your way of describing it unnecessarily serial :-)

Essentially, the first two bits select between 4 formats - trivial
and fast.


Regards,
Nick Maclaren.

Brett Davis

Apr 2, 2009, 2:21:32 AM
In article <7zvdpog...@pc-003.diku.dk>,
tor...@pc-003.diku.dk (Torben AEgidius Mogensen) wrote:
> If you are more pessimistic and say
>
> - 1/3 30-bit instructions
> - 1/2 20+10-bit pairs
> - 1/6 3x10-bit instructions
>
> you get 11 instructions per six words, so with 1/5 NOPs, you get
> 4*11/5 non-NOPs per six words, so 6*32/(44/5) = 22 bits per
> instruction, which is still 31% better than fixed-length 32-bit
> instructions.

You are thinking straight-line code; how about:
  if (a > 0)
      b += 1;
  else
      b -= 1;

Two branches and two math ops; assuming nothing before packs with the IF
statement, you have three 32 bit blocks that contain four instructions
total. (Still slightly better than RISC, but not the win you thought.)
With 16+16 this is four short instructions.

You could use the two low bits of the jump address to select which
sub-instruction to branch to... Scary sounding, but perhaps not that
difficult. (Giving you four shorter instructions than 16+16.)

> > Plus you have the extra complications, cost and heat of having to have
> > three simultaneous decoders for the three formats in that 32 bit word.
> > (With clever layout you can reduce that to two or so.)
>

> > A 16+16 format needs just one 32 bit decoder, and a simple 16 bit shift
> > when the opcode is 16 bits to get the next 32 bits to decode. Also a
> > latch to zero out any effects the extra 16 bits might cause for a 16 bit
> > instruction. Overall pretty trivial and nothing that will cause issues.
>

> The format I described is not much different: You have one state bit,
> so you decode 33 bits rather than 32. When you read in a new word,
> the state bit is set to 0. In each decode cycle you use the state bit
> and the first two bits of the 32-bit word to decide what to do:

You schooled me on this; I did not think to treat it the same as 16+16.
Dual issue needed for the mid market looks a little scary, but that is
likely minor.

This is maybe worth a research paper; it is going to be hard to convince
someone to make it a real design.

Tensilica web page: (Sandblaster v2 architecture?)
http://www.tensillica.com/products/xtensa/xtensalx/arch/index.htm
http://www.tensillica.com/pdf/Xtensa_ISA.pdf

24 bit instructions with a 16 bit short instruction option; it looks like
you can branch to any byte.

Brett

Torben Ægidius Mogensen

Apr 2, 2009, 5:27:31 AM
Brett Davis <gg...@yahoo.com> writes:

> In article <7zvdpog...@pc-003.diku.dk>,
> tor...@pc-003.diku.dk (Torben AEgidius Mogensen) wrote:
>> If you are more pessimistic and say
>>
>> - 1/3 30-bit instructions
>> - 1/2 20+10-bit pairs
>> - 1/6 3x10-bit instructions
>>
>> you get 11 instructions per six words, so with 1/5 NOPs, you get
>> 4*11/5 non-NOPs per six words, so 6*32/(44/5) = 22 bits per
>> instruction, which is still 31% better than fixed-length 32-bit
>> instructions.
>
> You are thinking straight-line code; how about:
>   if (a > 0)
>       b += 1;
>   else
>       b -= 1;
>
> Two branches and two math ops; assuming nothing before packs with the IF
> statement, you have three 32 bit blocks that contain four instructions
> total. (Still slightly better than RISC, but not the win you thought.)
> With 16+16 this is four short instructions.

The above can be implemented in two words:

if a>0 goto L1 (20 bits)
b += 1 (10 bits)
-----------
L1: if a<=0 goto L2 (20 bits)
b -= 1 (10 bits)
-----------
L2:

where the lines denote word boundaries. All this needs is that the
instruction in the then-branch does not affect the condition at L1,
which is quite common.

> You could use the two low bits of the jump address to select which
> sub-instruction to branch to... Scary sounding, but perhaps not that
> difficult. (Giving you four shorter instructions than 16+16.)

This is possible, but using two bits for this seems wasteful. I also
don't think the need is that great, as the above example shows.

I would rather add an instruction that conditionally executes only one
of the two following instructions regardless of size or alignment.
This would make the above code encodable as

if a>0 select (20 bits)
b += 1 (10 bits)
b -= 1 (10 bits)

with no restriction on alignment.

>> > A 16+16 format needs just one 32 bit decoder, and a simple 16 bit shift
>>

>> The format I described is not much different
>

> You schooled me on this; I did not think to treat it the same as 16+16.
> Dual issue needed for the mid market looks a little scary, but that is
> likely minor.

I don't think it is any harder than for 16+16. The easiest option
restricts multiple issue to instructions in the same word, so at least
one of the simultaneous instructions is a simple 10-bit instruction.
So you need only duplicate functional units for simple instructions.

> This is maybe worth a research paper, going to be hard to convince
> someone to make it a real design.

I doubt it is even worth a research paper. Maybe a student project.

Torben

Torben Ægidius Mogensen

unread,
Apr 2, 2009, 5:29:26 AM4/2/09
to
nm...@cam.ac.uk writes:

> In article <7zvdpog...@pc-003.diku.dk>,
> Torben Ægidius Mogensen <tor...@pc-003.diku.dk> wrote:
>>
>>The format I described is not much different: You have one state bit,
>>so you decode 33 bits rather than 32. When you read in a new word,
>>the state bit is set to 0. In each decode cycle you use the state bit
>>and the first two bits of the 32-bit word to decide what to do:
>>

> I find your way of describing it unnecessarily serial :-)

The challenge was size, not speed.

> Essentially, the first two bits select between 4 formats - trivial
> and fast.

This would be the obvious solution for multiple issue, yes.

Torben

Andrew Reilly

unread,
Apr 2, 2009, 7:58:27 AM4/2/09
to
On Thu, 02 Apr 2009 11:27:31 +0200, Torben Ægidius Mogensen wrote:

> I would rather add an instruction that conditionally executes only one
> of the two following instructions regardless of size or alignment. This
> would make the above code encodable as
>
> if a>0 select (20 bits)
> b += 1 (10 bits)
> b -= 1 (10 bits)
>
> with no restriction on alignment.


I prefer something like:

b' = b + 1
b = b - 1
if (a > 0) b = b'

Conditional moves rock. No mucking about with branch targets or pipeline
bubbles or branch prediction. Makes quite a lot of "twisty" code look
straight.
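
At the source level the pattern is just "compute both, pick one"; a
minimal C sketch (whether the compiler actually emits a cmov for the
ternary depends on the target and optimization level):

/* branchless update: evaluate both arms, then select on the condition */
int update(int a, int b)
{
    int b_inc = b + 1;
    int b_dec = b - 1;
    return (a > 0) ? b_inc : b_dec;
}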

Cheers,

--
Andrew

Torben Ægidius Mogensen

unread,
Apr 2, 2009, 11:46:03 AM4/2/09
to
Andrew Reilly <andrew-...@areilly.bpc-users.org> writes:

> On Thu, 02 Apr 2009 11:27:31 +0200, Torben Ægidius Mogensen wrote:
>
>> I would rather add an instruction that conditionally executes only one
>> of the two following instructions regardless of size or alignment. This
>> would make the above code encodable as
>>
>> if a>0 select (20 bits)
>> b += 1 (10 bits)
>> b -= 1 (10 bits)
>>
>> with no restriction on alignment.
>
>
> I prefer something like:
>
> b' = b + 1
> b = b - 1
> if (a > 0) b = b'
>
> Conditional moves rock. No mucking about with branch targets or pipeline
> bubbles or branch prediction. Makes quite a lot of "twisty" code look
> straight.

That works too, but is less general. A conditional select like the
above works even if the branches are loads, stores or function calls.

It doesn't really require branch targets etc, just a way to
squash/bypass instructions in the pipeline.

ARM has/had all instructions conditional, which is neat but perhaps
overdoing it a bit (using 4 bits in every instruction word for
something that is used rarely).

Torben

Terje Mathisen

unread,
Apr 2, 2009, 3:08:37 PM4/2/09
to
Brett Davis wrote:
> You are thinking straight-line code; how about:
> if (a > 0)
> b += 1;
> else
> b -= 1;
>
> Two branches and two math, assuming nothing before packs with the IF
> statement you have three 32 bit blocks that contain four instructions
> total. (Still slightly better than RISC, but not the win you thought.)
> With 16+16 this is four short instructions.

It would be pretty horrible code though, unless the branch was very
predictable, or the compiler very good indeed:

For a system with a conditional move I would convert it into something
like this:

lea temp,[b+1] ; temp = b+1
dec b
cmp a,0
cmovg b,temp

With predicates you would instead generate both an increment and a
decrement operation, predicated on opposite results of the test.

Using plain branch prediction means something like

inc b
cmp a,0
jg done
sub b,2
done:

The key here is that the actual instruction encoding isn't interesting
at _all_, compared to how fast the actual code will run.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Brett Davis

unread,
Apr 2, 2009, 11:22:38 PM4/2/09
to
In article <krSdnflBfbqu-07U...@bestweb.net>,
Mayan Moudgill <ma...@bestweb.net> wrote:
> The Sandblaster v2 architecture uses 8bytes (i.e. 64 bits) to encode 3
> 21-bit ops. An op can be combined with the next op to provide an extra
> 12 immediate bits (so an add-immediate op has 4 bit immediate field and
> occupies one 21-bit op and an add-immediate-long has 16 bits of
> immediate, but occupies two ops).
>
> The left-over bit is used to determine if the 3 ops are executed one at
> a time or in parallel.
>
> All jump-targets are 8-byte aligned.
>
> Code density is pretty decent.

I found the correct manual, and the first page confirms what you said:
http://glossner.org/john/papers/moudgill_sdr_forum_2008_sandblaster_v6_submitted.pdf

Brett

Brett Davis

unread,
Apr 3, 2009, 1:10:12 AM4/3/09
to
In article <7zeiwbp...@pc-003.diku.dk>,

tor...@pc-003.diku.dk (Torben AEgidius Mogensen) wrote:

> Brett Davis <gg...@yahoo.com> writes:
> > You are thinking straight-line code; how about:
> > if (a > 0)
> > b += 1;
> > else
{
> > b -= 1;
c += 2;
}
> >
> > Two branches and two math, assuming nothing before packs with the IF
> > statement you have three 32 bit blocks that contain four instructions
> > total. (Still slightly better than RISC, but not the win you thought.)
> > With 16+16 this is four short instructions.
>
> The above can be implemented in two words:
>
> if a>0 goto L1 (20 bits)
> b += 1 (10 bits)
> -----------
> L1: if a<=0 goto L2 (20 bits)
> b -= 1 (10 bits)
> -----------
> L2:
>
> where the lines denote word boundaries. All this needs is that the
> instruction in the then-branch does not affect the condition at L1,
> which is quite common.

I have never seen a compiler do this, and likely never will as it has
two conditional branches, and is thus a bad idea.
You should also note that the two unconditional branches are folded away
and are "free" on a high end design.
In the real world you still have three 32 bit blocks that contain four
instructions. Just add "c += 2" to the else clause so this stays a real
if-then-else, and try again.

> I would rather add an instruction that conditionally executes only one
> of the two following instructions regardless of size or alignment.
> This would make the above code encodable as
>
> if a>0 select (20 bits)
> b += 1 (10 bits)
> b -= 1 (10 bits)

Everyone does conditionals today, I did not think it worth mentioning.
Some CPUs still don't implement it, so it must be somewhat costly...
ARM spent the cost, then went nuts with it. Not a good idea.

> > This is maybe worth a research paper, going to be hard to convince
> > someone to make it a real design.
>
> I doubt it is even worth a research paper. Maybe a student project.

Ah, you are a software instructor and do not appreciate the difficulties
this could impose on a hardware engineer. ;)
I expect there to be some reason no one has tried this design.
The closest is the Sandblaster 2.0 DSP that Mayan Moudgill mentioned:
http://glossner.org/john/papers/moudgill_sdr_forum_2008_sandblaster_v6_submitted.pdf

(Last time I picked the wrong CPU, this is the real Sandblaster.)

Brett

Torben Ægidius Mogensen

unread,
Apr 3, 2009, 4:39:41 AM4/3/09
to
Brett Davis <gg...@yahoo.com> writes:

If you have branch prediction, it wouldn't be bad. But, even so, a
conditional move or skip will give better performance and code
density.

But, as Terje said, code like this can be compiled differently. Since
the two branches use the same variable, you can compile it to

b += 1
if a>0 goto L1
b -= 2
L1:

thus saving a branch. With conditional moves or skips, you can save
branches altogether. Ditto if you have a SLT (set less than)
instruction:

c = (0<a)
b += c
b += c
b -= 1

This assumes (0<a) evaluates to 1 if true and 0 if false.
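
In C the same trick is plain straight-line arithmetic (a sketch,
relying on the C rule that a comparison evaluates to 1 or 0):

int slt_update(int a, int b)
{
    int c = (0 < a);     /* SLT result: 1 if a > 0, else 0 */
    return b + 2*c - 1;  /* same net effect as b += c; b += c; b -= 1 */
}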

> In the real world you still have three 32 bit blocks that contain four
> instructions. Just add "c += 2" to the else clause so this stays a real
> if-then-else, and try again.

b += 1 (10 bits)
if a>0 goto L1 (20 bits)
-----------
b -= 2 (10 bits)
c += 2 (10 bits)
NOP (10 bits)
-----------
L1:

Yes, I did exploit that b is modified in both branches, but in most
places where you could use a CMOV, you can do tricks like this.

If you want worst-case code, this is better:

if (a>0) b += 1; else { c += 2; d += 3; }

With no conditional moves etc. this needs five instructions, two of
which are branches (one conditional, one not). On your 16+16 machine,
this will need 5x16 bits (assuming all instructions fit in 16 bits).
With my proposed encoding, you get

if a<=0 goto L1 (20 bits)
b += 1 (10 bits)
-----------
goto L2 (20 bits)
NOP (10 bits)
-----------
L1: c += 2 (10 bits)
d += 3 (10 bits)
NOP (10 bits)
-----------
L2:

This is 3 words = 6x16 bits, so only 20% worse than 16-bit code with
16-bit branch targets. And if you have a 10-bit instruction (say,
i += 1) at L2, this can be folded into both branches:


if a<=0 goto L1 (20 bits)
b += 1 (10 bits)
-----------
i += 1 (10 bits)
goto L2 (20 bits)
-----------
L1: c += 2 (10 bits)
d += 3 (10 bits)
i += 1 (10 bits)
-----------
L2:

thus saving 10 bits.

>> I would rather add an instruction that conditionally executes only one
>> of the two following instructions regardless of size or alignment.
>> This would make the above code encodable as
>>
>> if a>0 select (20 bits)
>> b += 1 (10 bits)
>> b -= 1 (10 bits)
>
> Everyone does conditionals today, I did not think it worth mentioning.
> Some CPUs still don't implement it, so it must be somewhat costly...
> ARM spent the cost, then went nuts with it. Not a good idea.

The main disadvantage for the ARM was spending instruction bits that
would better have been spent on something else, but the hardware cost
is small. The main reason for omitting conditional non-branches is if
you believe branch prediction will make branches virtually free, since
this will make branch-full code faster than otherwise conditional
code. Even with good prediction, I would include at least a
conditional move, since some conditions are inherently unpredictable.

>> > This is maybe worth a research paper, going to be hard to convince
>> > someone to make it a real design.
>>
>> I doubt it is even worth a research paper. Maybe a student project.
>
> Ah, you are a software instructor and do not appreciate the difficulties
> this could impose on a hardware engineer. ;)

It is just a matter of decoding, so I can't see what is difficult from
a hardware viewpoint. The main issue is (as this discussion has
shown) how well compilers can avoid NOPs.

> I expect there to be some reason no one has tried this design.

There are so many ways of encoding instructions that "not having been
done" doesn't necessarily imply it is a bad idea.

Torben

Mayan Moudgill

unread,
Apr 3, 2009, 9:54:06 AM4/3/09
to

The Sandblaster v2 DSP has 4 units: branch, alu, load-store, SIMD.

In a parallel instruction, the 3x21 bit operations must be against
different units, and not all orderings are valid. (e.g. branch
operations must be in the first operation in the instruction).

In a serial instruction, the 3 operations are sent out one at a time as
instructions with two nops (i.e. if the instruction was {X Y Z}, then
the decoder would see {X NOP NOP}, {Y NOP NOP}, {Z NOP NOP}). Thus the
decoders didn't have to change much; there are 4 decoders (one for each
unit) and they pick up the operation that applies to their unit from
between 1 to 3 positions.

[There are a few details I am abstracting; for instance, what is sent to
the decoder is actually {X Y NOP}, and Y is treated as a NOP if it is NOT
an immediate extender.]

BTW: the SIMD unit has 8x32B registers, and can do 16 16-bit multiplies
per cycle. The load/store path supports 32B load/stores. Peak
performance is 9.6GMACs (where a MAC is a 16 bit multiply followed by a
32 bit add) at 600MHz per core. There are 3 cores per SB3500 chip.
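
For illustration, unpacking such a word could look like the C sketch
below; the exact bit positions of the flag and the three ops are my
assumption, not documented Sandblaster behaviour:

#include <stdint.h>

typedef struct {
    int parallel;       /* the left-over bit: 1 = issue all 3 at once */
    uint32_t op[3];     /* three 21-bit operations */
} sb_bundle;

sb_bundle unpack(uint64_t w)
{
    sb_bundle b;
    b.parallel = (int)(w >> 63);                   /*  1 bit  */
    b.op[0] = (uint32_t)((w >> 42) & 0x1FFFFF);    /* 21 bits */
    b.op[1] = (uint32_t)((w >> 21) & 0x1FFFFF);    /* 21 bits */
    b.op[2] = (uint32_t)( w        & 0x1FFFFF);    /* 21 bits */
    return b;
}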

Jacko

unread,
Apr 3, 2009, 6:56:16 PM4/3/09
to
RISC is essential to reduce core area. This is important for
interleaving core gates into the memory array.

cheers jacko

Mayan Moudgill

unread,
Apr 3, 2009, 8:47:43 PM4/3/09
to

Area-wise, it's not clear that RISC is the way to go. The decoder on a
chip is generally a small fraction of the total area. A simplistic
argument is: RISC requires a smaller decoder. However, the instructions
are bigger, so the instruction-cache has to be larger. The increased area
of the instruction-cache is much larger than the saving due to the decoder.

Of course, it's not that straightforward. But the trade-offs are too
complex for anyone who makes a statement like "RISC is essential to
reduce core-area" to comprehend.

Jacko

unread,
Apr 3, 2009, 10:18:12 PM4/3/09
to

Who says the instruction area is more complicated?

Reduced Instruction Complexity Computer.
===============================
All instructions are removed and logic area simplified till the
instruction stream fits in a nibble or less per instruction.

Implications:
1. The instruction width is small.
2. The CPU area is low
3. Any instruction cache is layered into an instruction stream and a
subroutine address and literal cache
4. High code density is achieved by a compact subroutine calling
instruction cache. The effective area of this cache may be reduced by
noting that only certain values of subroutine entry point and literal
values are ever used. So a cache of used addresses and literals is
complemented by an execution order index into the subroutine and
literal value cache.
5. The instruction cache may be bigger using the above method when
large number of cores all have small instruction caches.
6. The cache values are moved with the register set toward the memory
on which execution next depends.
7. The memory is placed on the chip as the L2 cache, and the small
DRAM L3 memory segment
8. Multiplex decode logic of the memory is reduced due to segmentation
of memory any core will access.
9. Each core executes and latches into the next core's register and
L0,L1 static latches. In this way execution will eventually reach the
required core linked to the required memory. There may be cross
coupling between L2 caches by exchange too. Occasional optional on
address in register loopbacks of state may improve seek performance of
this solid state DPU (Distributed Processing Unit).

If you can make a CISC with fewer gates and registers, please do tell.

cheers jacko

Jacko

unread,
Apr 3, 2009, 10:37:19 PM4/3/09
to
This post constitutes prior art...

Use of a small circle of cores, repeated many times, connected by
exchange core state lines joining these ring nodes, would also be a
suitable topology.

cheers jacko

Jacko

unread,
Apr 3, 2009, 10:52:47 PM4/3/09
to
No need to build a big statue, just a 1cm high one welded onto the
shoulder of Faraday down the embankment.

http://nibz.googlecode.com

Brett Davis

unread,
Apr 4, 2009, 12:25:41 AM4/4/09
to
In article <7z7i22b...@pc-003.diku.dk>,

tor...@pc-003.diku.dk (Torben AEgidius Mogensen) wrote:
> If you want worst-case code, this is better:
>
> if (a<0) b += 1; else { c += 2; d += 3; }
>

How about best case, using select:
if (a<0) b = c; else b = d;
Becomes:
subi.1 a,a,0x0 // subtract immediate, set condition codes.
select.lt b,c,d

The first instruction might be 16 bit two op, the second 32 bit three op.
This saves an instruction versus using two conditional moves.

I have not seen Select used outside of the four operand vector
instructions in PowerPC and MIPS, where you use a mask register.
This instruction seems to be a big win in certain situations.

> > Everyone does conditionals today, I did not think it worth mentioning.
> > Some CPUs still don't implement it, so it must be somewhat costly...
> > ARM spent the cost, then went nuts with it. Not a good idea.
>
> The main disadvantage for the ARM was spending instruction bits that
> would better have been spent on something else, but the hardware cost
> is small. The main reason for omitting conditional non-branches is if
> you believe branch prediction will make branches virtually free, since
> this will make branch-full code faster than otherwise conditional
> code. Even with good prediction, I would include at least a
> conditional move, since some conditions are inherently unpredictable.

Yes, with good branch prediction, predication becomes a net loss unless
you have lots of unused integer units, and even then it's still a wash
unless the branch cannot be predicted well.

Jacko

unread,
Apr 4, 2009, 1:59:41 AM4/4/09
to
hi

Option 1 : to buy outright (sic sarc sucks) => you're an idiot.

Option 2 : 51% => you're a monopolist => idiot

Option 3 : 49% => you're a person with that commodity of money. Maybe
you know another such soul.

Option 4 : 10 % => you're too standard.

Option 5 : You want to pit a workers' co-operative against the half
open market, sounds interesting, but is it capital intensive?

option 6 : Make one with differing case joint Apollo 13 style with
the sticky semi filter. You sound a bit too on the edge for me.

option 7 : open to options.

Freaky non conditionals ....

cheers jacko

Brett Davis

unread,
Apr 4, 2009, 4:26:20 PM4/4/09
to

You are obviously feeling left out, I did look at your NIBZ design.
NIBZ is an angle on the MISC (Minimum Instruction Set Computer) with
some newer hardware angles thrown at it to make the design even smaller.
(Sharing parts between pairs of CPUs, etc.)

The basic problem is that today's level of integration has made all MISC
designs obsolete. You can buy an ATI chip with 800 vector processors with
128 vector registers each. You could easily fit a dozen of your NIBZ
processors in the space of one of those vector processors. But how in
the hell are you going to wire up those 10,000 NIBZ cores and succeed in
getting any useful work out of them? The interconnect design is
impossible, and the software design for the networking is worse.

When it comes to getting work done 800 gas guzzling SUVs will beat the
crap out of 10,000 unicycles every time.

The problem today is an embarrassment of riches, which tends to embolden
bloat, and then the bloat slows you down and costs you cycles, and heat,
and clock, and die area, etc.

You also have obsolete ideas like RISC which are slow to die; once upon
a time RISC made excellent sense for hardware AND software, but time
has moved on.

We have three levels of cache today on a modern CPU, because the first
two cannot be made any bigger without making them slower. Huge bloated
32 bit VLIW instructions that spill out to L2 and L3 are just a bad idea
today.

Today you want to compress your instructions down, fitting almost twice
as many in that limited L1 and L2, giving you a big boost in performance
for almost free. Variable width instructions also let you add those
long instructions that will give you a performance advantage where you
need them.

Anyone who has worked with SSE2 or Altivec knows that CPU design is
trying to advance, but we are being held back by what we can fit within
the limitations of 32 bit VLIW instructions.

We have a new design, we just need an acronym to rule them all. ;)

Brett

VWI - Variable Width Instructions. Too literal, won't sell.
AIEX - Advanced Instruction EXtensions. Much better, but not quite.
AISE - Advanced Instruction Set Extensions. Sounds not right.
MISE - Modern Instruction Set Extensions. I kinda like this one. ;)
AWE - Advanced Wide Extensions. NICE.
MAWE - Minimal And Wide Extensions. GOOD.
WASI - Wide And Small Instructions.

New acronyms welcome. ;)

Brett Davis

unread,
Apr 4, 2009, 7:01:34 PM4/4/09
to
> > Everyone does conditionals today, I did not think it worth mentioning.
> > Some CPUs still don't implement it, so it must be somewhat costly...
> > ARM spent the cost, then went nuts with it. Not a good idea.
>
> The main disadvantage for the ARM was spending instruction bits that
> would better have been spent on something else, but the hardware cost
> is small. The main reason for omitting conditional non-branches is if
> you believe branch prediction will make branches virtually free, since
> this will make branch-full code faster than otherwise conditional
> code. Even with good prediction, I would include at least a
> conditional move, since some conditions are inherently unpredictable.

Actually, over in a different thread here in comp.arch I found some
people who say too many conditionals are poison to OoO designs:

From: an...@mips.complang.tuwien.ac.at (Anton Ertl)
Subject: predicated instructions (was: Intel publishes Larrabee paper)
EricP <ThatWould...@thevillage.com> writes:
>EricP wrote:
>>
>> Hmmm... an OoO Predicate machine might not cost much more
> > snip
>
>More thoughts...
>Now the real problem and where this breaks down
>(well... actually explodes): register rename.
>
>The predicate value must be taken into account
>by register rename. Each instruction can be predicated
>by a value of execute Always, when P=False, when P=True.
>But different paths through the code will cause different
>mappings of the logical to physical register names.
>For example
>
> P=A set P = r1 < r2 // Set predicate
> P=T mov r3, r4
> P=F mov r3, r5
>
>The value of P can be Unknown, False or True.
>It will be Unknown if either r1 or r2 was not
>known because of prior value dependencies.
>
>But even if the value of P is known we don't want to have the
>decoder or scheduler take it into account as that would make
>the early logic stages dependant on later calculated values.
>(Though there would be a great advantage to finding _some_
>way to filter out instructions that can't possibly execute).
>
>So we must issue both mov instructions giving two
>different r3 depending on how the flag P resolves.
>
>But that gives a 2**N growth in the number of physical
>registers for each predicate set in the execution window.
>It also applies to the number of canceled instructions,
>and the instruction window and physical register bank
>quickly wind up largely populated with canceled junk.
>
>So it would seem that unless you are able to recover
>resources as the predicates resolve then you can only
>handle very small amounts of speculative execution.

I guess that's why they don't do speculative execution for predicated
instructions on any CPU I have heard of. Instead, they treat the
predicate and the original value of the target register as data flow
dependencies. I.e., in an OoO design the last two instructions of
your example would be implemented like this:

r3 = if P then r4 else r3
r3 = if !P then r5 else r3

That costs quite a bit more than an OoO machine without predicated
instructions: more register ports and dealing with more dependencies
in the scheduler. In the in-order case the predicates cost relatively
little extra.

- anton
--
M. Anton Ertl                         Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at      Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
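
In C terms, the dataflow treatment Anton describes is just a pair of
explicit selects, with the old destination value as a real input (a
sketch that merely restates his two lines):

#include <stdint.h>

/* rename-time view of "P=T mov r3,r4" followed by "P=F mov r3,r5" */
uint32_t predicated_pair(int P, uint32_t r3, uint32_t r4, uint32_t r5)
{
    r3 = P  ? r4 : r3;   /* r3 = if P then r4 else r3 */
    r3 = !P ? r5 : r3;   /* r3 = if !P then r5 else r3 */
    return r3;
}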

Jacko

unread,
Apr 5, 2009, 4:43:39 PM4/5/09
to
On 4 Apr, 21:26, Brett Davis <gg...@yahoo.com> wrote:

> Jacko <jackokr...@gmail.com> wrote:
> > No need to build a big statue, just a 1cm high one welded onto the
> > sholder of faraday down the embankment.
>
> >http://nibz.googlecode.com
>
> You are obviously feeling left out, I did look at your NIBZ design.
> NIBZ is an angle on the MISC (Minimum Instruction Set Computer) with
> some newer hardware angles thrown at it to make the design even smaller.
> (Sharing parts between pairs of CPUs, etc.)

Yes including droping the snoop bus model of memory.

> The basic problem is that today's level of integration has made all MISC
> designs obsolete. You can buy an ATI chip with 800 vector processors with
> 128 vector registers each. You could easily fit a dozen of your NIBZ
> processors in the space of one of those vector processors. But how in
> the hell are you going to wire up those 10,000 NIBZ cores and succeed in
> getting any useful work out of them? The interconnect design is
> impossible, and the software design for the networking is worse.

A dozen. Each core will latch the execution of a previous core, in
small rings, interconnected into bigger rings. The network routing is
based on address bits. Very multi-threaded I agree.

> When it comes to getting work done 800 gas guzzling SUVs will beat the
> crap out of 10,000 unicycles every time.

Work done? Per metric ton of metal? If you only make 20 unicycles out
of the resources of an SUV you're not going to get a job ... and one
articulated will get an SUV impossible job done. The design of the
nibz DPU does not prevent floating or vector unit SUV cores on the
bus. But placing a chip in the socket plastic of all DIP, PLCC, and
other would be quite impossible if every chip had to be an SUV.

> The problem today is an embarrassment of riches, which tends to embolden
> bloat, and then the bloat slows you down and costs you cycles, and heat,
> and clock, and die area, etc.

It's not known as a die for nothing.

> You also have obsolete ideas like RISC which are slow to die; once upon
> a time RISC made excellent sense for hardware AND software, but time
> has moved on.

and it will go on some more.

> We have three levels of cache today on a modern CPU, because the first
> two cannot be made any bigger without making them slower. Huge bloated
> 32 bit VLIW instructions that spill out to L2 and L3 are just a bad idea
> today.

I agree, spilling out to L2 etc is not good, but registers as L0 I'm
not so sure. Static multiplexed ram versus sequential shift register
logic.

> Today you want to compress your instructions down, fitting almost twice
> as many in that limited L1 and L2, giving you a big boost in performance
> for almost free. Variable width instructions also let you add those
> long instructions that will give you a performance advantage where you
> need them.

I agree the L1 should only contain compact subroutine addresses, and
the L0 should not be RUU etc, but very compact 4 bit instructions

> Anyone who has worked with SSE2 or Altivec knows that CPU design is
> trying to advance, but we are being held back by what we can fit within
> the limitations of 32 bit VLIW instructions.

That's like saying my pots aren't big enough, so give up, instead of
saying use loads of tiny pots and pot interpreters.

> We have a new design, we just need an acronym to rule them all. ;)

> VWI - Variable Width Instructions. Too literal, won't sell.


> AIEX - Advanced Instruction EXtensions. Much better, but not quite.
> AISE - Advanced Instruction Set Extensions. Sounds not right.
> MISE - Modern Instruction Set Extensions. I kinda like this one. ;)
> AWE - Advanced Wide Extensions. NICE.
> MAWE - Minimal And Wide Extensions. GOOD.
> WASI - Wide And Small Instructions.
>
> New acronyms welcome. ;)

CLCI - Central Lunchbox Compressed Instructions.

cheers jacko

p.s. if the imperative tense was called the lazative tense, then the
world would possibly be nicer.

Brett Davis

unread,
Apr 5, 2009, 11:42:38 PM4/5/09
to
In article <7zprfy6...@pc-003.diku.dk>,

tor...@pc-003.diku.dk (Torben AEgidius Mogensen) wrote:
> A simpler idea that I like is to make all jump targets 32-bit
> boundaries and split a 32-bit word into one or two instructions
> depending on the first one or two bits. One possible instantiation
> is:
>
> 00: The remaining 30 bits encode one 30-bit instruction.
> 01: The remaining 30 bits encode one 10-bit instruction and one
> 20-bit instruction.
> 10: As 01, but the 20-bit instruction is before the 10-bit
> instruction.
> 11: The remaining 30 bits encode three 10-bit instructions.
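
For concreteness, one decode step of the quoted format in C (a sketch
only; the packing order inside the 30 payload bits is an assumption,
and the issue helpers are hypothetical):

#include <stdint.h>

void issue30(uint32_t op);   /* hypothetical helpers, one per width */
void issue20(uint32_t op);
void issue10(uint32_t op);

void decode_word(uint32_t w)
{
    uint32_t tag = w >> 30;          /* top two bits pick the format */
    uint32_t p   = w & 0x3FFFFFFF;   /* 30 payload bits */
    switch (tag) {
    case 0: issue30(p); break;
    case 1: issue10(p >> 20); issue20(p & 0xFFFFF); break;
    case 2: issue20(p >> 10); issue10(p & 0x3FF); break;
    case 3: issue10(p >> 20);
            issue10((p >> 10) & 0x3FF);
            issue10(p & 0x3FF); break;
    }
}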

I have one major concern about this design versus 16+16+.

How are you going to support four operand vector instructions on 128+
registers, like Xbox360/PS3/ATI/NVidia/Cray?

You only have 30 bits, so you are hard pressed to support everything a
current (obsolete) RISC chip does.

My whole goal as a software engineer is for the hardware engineers to
design something clean and fast, so I can do my job. That 16+16+ opcodes
give better code density and thus performance is merely a huge bonus.

I want 256 vector registers that are quad vector wide, kinda like
Larrabee but without all the sucky limitations. 32 bit RISC cannot do
that, and 64 bit instruction RISC would be stupid.

We are at the tipping point where RISC dies, and is replaced by the next
new thing that wows everyone because it was such a great idea. Obvious,
but only in hindsight, to most people.

ARM is already post-RISC with Thumb-2. ARM was dragged there kicking and
screaming by software engineers yelling at them that Thumb-1 was CRAP,
and to do a variable width instruction set design instead. Reminds me
about the story that you are not supposed to see how sausage is made,
because its an ugly process. Same for making Laws, and making most
things.

We have a new design, we just need an acronym to rule them all. ;)

Brett

MISE - Modern Instruction Set Extensions.

SWELS - Small and Wide Extended Load Store.
EVILS - Extendable Variable Instruction Load Store. ;)


AWE - Advanced Wide Extensions.

MAWE - Minimal And Wide Extensions.

WAS** - Wide And Small Instructions(?) * *.
SAW** - Small And Wide * *.
***LS - * * * Load Store.
**LAS** - * * Load And Store * *.

Load Store, Advanced, Variable, Instruction Set, eXtensions. Wide,
Minimal, Small, Universal, Optimized, Optimal.
All the vowels and as many other letters as we need. ;)

New acronyms welcome. ;)

Brett Davis

unread,
Apr 10, 2009, 1:10:11 AM4/10/09
to
In article <7zd4c32...@pc-003.diku.dk>,
tor...@pc-003.diku.dk (Torben Ægidius Mogensen) wrote:
> Cortex-M0 has only 12,000 gates, actually.
>
> It doesn't seem to run full Thumb2, though. The information I could
> find (http://www.arm.com/miscPDFs/24481.pdf) says Thumb + a few Thumb2
> instructions.
>
> > http://www.arm.com/products/CPUs/archi-thumb2.html
>
> This says that Thumb2 is backwards compatible with existing ARM and
> Thumb solutions, and the text on
> http://www.arm.com/products/CPUs/architecture.html says
>
> "All ARMv7 architecture profiles implement Thumb® -2 technology which
> is built on the foundation of the ARM industry-leading Thumb code
> compression technology, while retaining complete code compatibility
> with existing ARM solutions."

Marvell has the Sheeva line of ARM compatible CPUs, high end and fully
bus-word compliant:

http://www.linleygroup.com/pdf/WP_MRVL.pdf

"To further boost performance, the CPU includes cache optimizations such
as fetching the critical word first, reading from the cache while a miss
is being processed (hit-under-miss), and reading from the cache while a
store is being processed (nonblocking store).
Important code can be locked into the cache on a per-way (but not
per-line) basis. To speed context switches, each cache line has two
dirty bits, so only the dirty half of the line needs to be flushed."

"Marvell¹s efficient CPU pipeline also reduces power. CPUs with long
pipelines often waste power due to pipeline stalls and mispredicted
branch penalties. Marvell¹s nextgeneration mobile CPU uses a
variable-length pipeline that is 7 stages for basic integer
instructions and up to 10 stages for load instructions. To minimize
time- and powerwasting branch penalties, the CPU implements a complex
prediction methodology, including a Gshare-based branch history table
(BHT), a branch target buffer (BTB), a branch return stack, and when all
else fails, static prediction. Branches that hit in the BTB execute
immediately; other correctly predicted branches require one cycle to load
the target instructions."

"Marvell has extended its license to cover ARM v6 and v7, the most
recent version of the architecture. The company expects to sample its
first ARM v7 CPU in late 2008."

Brett

ranjit_...@yahoo.com

unread,
Apr 10, 2009, 10:43:34 AM4/10/09
to
On Mar 15, 6:30 pm, Brett Davis <gg...@yahoo.com> wrote:
> SPARC went another level of hubris stupidity and added register windows,

That came from Berkeley RISC.
http://en.wikipedia.org/wiki/Berkeley_RISC

> which Itanic not only failed to learn from but actually copied!

What's wrong with the Itanium's (variable size) register windows?

Jacko

unread,
Apr 10, 2009, 2:31:36 PM4/10/09
to
On 10 Apr, 15:43, "ranjit_math...@yahoo.com"

<ranjit_math...@yahoo.com> wrote:
> On Mar 15, 6:30 pm, Brett Davis <gg...@yahoo.com> wrote:
>
> > SPARC went another level of hubris stupidity and added register windows,
>
> That came from Berkeley RISC.http://en.wikipedia.org/wiki/Berkeley_RISC

>
> > which Itanic not only failed to learn from but actually copied!
>
> What's wrong with the Itanium's (variable size) register windows?

register multiplex logic? CR stripline charging? Jelly?

cheers jacko

Brett Davis

unread,
Apr 10, 2009, 11:43:02 PM4/10/09
to
In article
<675a2525-d900-4506...@s21g2000vbb.googlegroups.com>,
"ranjit_...@yahoo.com" <ranjit_...@yahoo.com> wrote:

> On Mar 15, 6:30 pm, Brett Davis <gg...@yahoo.com> wrote:
> > SPARC went another level of hubris stupidity and added register windows,
>
> That came from Berkeley RISC.
> http://en.wikipedia.org/wiki/Berkeley_RISC

I was a little unfair; SUN took a RISC on register windows, to be
different from everyone else that did not. They paid the price for not
doing their homework. SUN was a tiny company at the time, and it was a
seat-of-the-pants decision.

SPARC always had the slowest clock rate and lowest performance of all
the RISC chips. This was blamed on the register windows, which none of
the other RISC chips used. To add insult to injury some code ran faster
if you compiled without register windows.
It was not until the last of the single thread SPARC chips that SUN made
up the difference. The lead CPU designer stated that with all the
enhancements register windows were no longer a performance burden, but
it still ended up being a wash. (Somewhere on the internet should be
some video of his presentation at something like IEEE. It was a Q&A
answer or off the cuff remark, not part of the written presentation.)

Kind of like x86, if you throw enough transistors at a problem you can
overcome some bad design choices.

If register windows had finally become a win with the last SPARC design,
you can bet we all would have heard about it.
It also took a decade for SPARC to catch up, by then SPARC had failed to
grab critical mass and was all but dead.

Let's play what if SPARC did not have register windows: Faster, Cheaper,
more critical mass, maybe SPARC replaces PowerPC in XBox360/PS3/Wii,
maybe SPARC replaces PowerPC in Macs. ;)

Compiler design has changed, instead of "caller save" it's "callee save",
so small functions are no longer causing lots of pointless register
saving and restoring. Also compilers inline more code today, including
at link time, to reduce the number of small functions that impair
performance on big register set designs.

20 years ago stack overhead might be estimated at 30%, now it's down near
5%, in the noise, a non-problem with no real solution.

No one would add register windows to a design today, well, except for
Intel, and they failed. ;)

It is interesting that RISC II included short 16 bit opcodes. "This
simple technique yielded a surprising 30% improvement in code density,
making an otherwise identical program on Blue run faster than on Gold
due to the decreased number of memory accesses."

> > which Itanic not only failed to learn from but actually copied!
>
> What's wrong with the Itanium's (variable size) register windows?

According to sources, late in the design of the Itanic they had to add
two stages to the pipeline to deal with the complications register
windows add to the register re-naming logic. These two stages right in
the middle of where the real work happens killed performance. If those
two stages would have been included in the original paper design then
the chip might never have been made.

Intel could have learned from this mistake and removed register windows
from the next design (forcing those few customers to recompile), but for
some reason they didn't. (I think the compiler could ignore the register
windows, maybe the Itanic design still sucked anyway, for reasons beyond
register windows, lots of stuff/crap in the Itanic design...)

Hubris: No one else could make it work, but we will throw more
people/dollars at it and make it work.

Stupidity: Intel has engaged in so many multi-billion dollar boondoggles
that any other company would have gone bankrupt, several times over.
Nice to have a monopoly to cover over those screw-ups. (Intel is
something like zero for ten in its billion++ dollar non-x86 chip
investments: NOR flash, NAND flash, HDTV chips, ARM chips, network
chips, Itanic, etc.)

Brett

nm...@cam.ac.uk

unread,
Apr 11, 2009, 5:18:40 AM4/11/09
to
In article <ggtgp-85F492....@netnews.asp.att.net>,

Brett Davis <gg...@yahoo.com> wrote:
>In article
><675a2525-d900-4506...@s21g2000vbb.googlegroups.com>,
> "ranjit_...@yahoo.com" <ranjit_...@yahoo.com> wrote:
>
>> On Mar 15, 6:30 pm, Brett Davis <gg...@yahoo.com> wrote:
>> > SPARC went another level of hubris stupidity and added register windows,
>>
>> That came from Berkeley RISC.
>> http://en.wikipedia.org/wiki/Berkeley_RISC
>
>I was a little unfair, SUN took a RISC on register windows, to be
>different from everyone else that did not. They paid the price for not
>doing their homework. SUN was a tiny company at the time, it was a seat
>of the pants decision.

The Hitachi SR2201 had register windows (on a PA-RISC based design),
a really slow clock, and beat the living daylights out of machines
with clocks of up to 10 times faster. Yes, it was a specialist
system, but register windows aren't a stupid idea.

What IS a stupid idea is to jumble half a dozen such techniques
together, with the assumption that their advantages will enhance each
other rather than interfere with each other. And, yes, you can think
of the design I am referring to.


Regards,
Nick Maclaren.

Rob Warnock

unread,
Apr 11, 2009, 6:04:30 AM4/11/09
to
<nm...@cam.ac.uk> wrote:
+---------------
| Brett Davis <gg...@yahoo.com> wrote:

| > "ranjit_...@yahoo.com" <ranjit_...@yahoo.com> wrote:
| >> Brett Davis <gg...@yahoo.com> wrote:
| >> > SPARC went another level of hubris stupidity and added register windows,
| >>
| >> That came from Berkeley RISC.
| >> http://en.wikipedia.org/wiki/Berkeley_RISC
| >
| >I was a little unfair, SUN took a RISC on register windows, to be
| >different from everyone else that did not. They paid the price for not
| >doing their homework. SUN was a tiny company at the time, it was a seat
| >of the pants decision.
|
| The Hitachi SR2201 had register windows (on a PA-RISC based design),
| a really slow clock, and beat the living daylights out of machines
| with clocks of up to 10 times faster. Yes, it was a specialist
| system, but register windows aren't a stupid idea.
+---------------

Indeed. The AMD Am29000 variable-sized register windows
worked really, really well! The 29k was very close to the
same speed as the MIPS R3000 at the same clock speed for
typical Unix workloads.

Unfortunately, AMD had an aversion to being a "systems" house,
and dropped their Unix ports [both System-V & BSD] shortly after
they were completed. (*sigh*)


-Rob

p.s. Obligatory disclosure: I was on the (small) consulting
team which ported 4.3-BSD/Reno to the 29k...

-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607

Stephen Sprunk

unread,
Apr 11, 2009, 2:29:36 PM4/11/09
to
Brett Davis wrote:
> SPARC always had the slowest clock rate and lowest performance of all
> the RISC chips. This was blamed on the register windows, which none of
> the other RISC chips used. To add insult to injury some code ran faster
> if you compiled without register windows.
> It was not until the last of the single thread SPARC chips that SUN made
> up the difference. The lead CPU designer stated that with all the
> enhancements register windows were no longer a performance burden, but
> it still ended up being a wash. (Somewhere on the internet should be
> some video of his presentation at something like IEEE. It was a Q&A
> answer or off the cuff remark, not part of the written presentation.)
>
> Kind of like x86, if you throw enough transistors at a problem you can
> overcome some bad design choices.
...

> Compiler design has changed, instead of "caller save" its "callee save",
> so small functions are no longer causing lots of pointless register
> saving and restoring. Also compilers inline more code today, including
> at link time, to reduce the number of small functions that impair
> performance on a big register set designs.
>
> 20 years ago stack overhead might be estimated at 30%, now its down near
> 5%, in the noise, a non-problem with no real solution.

There's another major change in compilers, too: instead of push/pop
instructions, it's a lot more common for compilers to ADD/SUB the stack
pointer, then move data to/from offsets from the fixed stack pointer.
The dedicated stack unit in modern x86 chips can then translate those
stack loads into renamed registers and reduce latency. This seems to
get much of the benefit of register windows without the architectural
baggage since burning extra silicon on that stack unit is entirely
optional...

S

--
Stephen Sprunk "Stupid people surround themselves with smart
CCIE #3723 people. Smart people surround themselves with
K5SSS smart people who disagree with them." --Isaac Jaffe

Anton Ertl

unread,
Apr 11, 2009, 2:41:00 PM4/11/09
to
Stephen Sprunk <ste...@sprunk.org> writes:
>There's another major change in compilers, too: instead of push/pop
>instructions, it's a lot more common for compilers to ADD/SUB the stack
>pointer, then move data to/from offsets from the fixed stack pointer.
>The dedicated stack unit in modern x86 chips can then translate those
>stack loads into renamed registers and reduce latency.

I never heard about the stack in IA32 or AMD64 implementations being
mapped to renamed registers, and I don't think it has been
implemented. The best that happens AFAIK is fast forwarding from the
store buffer to the loads using address and alias prediction. Hmm, I
should rerun my microbenchmark that measures the overhead of keeping a
variable in the stack.

Also, the dedicated stack units in hardware actually optimize PUSHes
and POPs so that they are no longer as expensive as a store/load plus
a stack pointer change (which implied an AGI) later. So if compilers
avoid PUSHes and POPs, it's to help older hardware.

One thing that has changed is that now a number of arguments are
passed in registers on AMD64 (at least in the calling convention used
in Linux) rather than on the stack in IA32.

>This seems to
>get much of the benefit of register windows without the architectural
>baggage since burning extra silicon on that stack unit is entirely
>optional...

Mapping stack slots to registers would be expensive, because the stack
is in memory and therefore can be accessed indirectly (and in an
unaligned way). That's really heavy architectural baggage if you want
to optimize it, and consequently, it has not been done.

Jacko

unread,
Apr 11, 2009, 5:30:40 PM4/11/09
to
On 11 Apr, 19:41, an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> Stephen Sprunk <step...@sprunk.org> writes:
> >There's another major change in compilers, too: instead of push/pop
> >instructions, it's a lot more common for compilers to ADD/SUB the stack
> >pointer, then move data to/from offsets from the fixed stack pointer.
> >The dedicated stack unit in modern x86 chips can then translate those
> >stack loads into renamed registers and reduce latency.
>
> I never heard about the stack in IA32 or AMD64 implementations being
> mapped to renamed registers, and I don't think it has been
> implemented.  The best that happens AFAIK is fast forwarding from the
> store buffer to the loads using address and alias prediction.  Hmm, I
> should rerun my microbenchmark that measures the overhead of keeping a
> variable in the stack.

Umm, sounds like it, but the unnecessary condition of stack-in-memory
coherence causes an extra write.

> Also, the dedicated stack units in hardware actually optimize PUSHes
> and POPs so that they are no longer as expensive as a store/load plus
> a stack pointer change (which implied an AGI) later.  So if compilers
> avoid PUSHes and POPs, it's to help older hardware.

That's just fixed base rather than variable SP base addressing.

> One thing that has changed is that now a number of arguments are
> passed in registers on AMD64 (at least in the calling convention used
> in Linux) rather than on the stack in IA32.

Could be useful.

> >This seems to
> >get much of the benefit of register windows without the architectural
> >baggage since burning extra silicon on that stack unit is entirely
> >optional...
>
> Mapping stack slots to registers would be expensive, because the stack
> is in memory and therefore can be accessed indirectly (and in an
> unaligned way).  That's really heavy architectural baggage if you want
> to optimize it, and consequently, it has not been done.

Dropping the desire to indirect other than top of stack does allow
dataflow execution, where top of stack at each execution stage becomes
insert for push, and skip feed-through for pop. The memory targeting
writes are a bad idea and destroy this dataflow purity.

This is why http://nibz.googlecode.com and I do not advise using
stack caches which insist on writeback/writethrough. Spills yes, but why
keep popped and used values updated in memory just to be overwritten
again?

cheers jacko

Oh well, maybe in 20 years people will dig up these ideas, call them
new and then discover I was right all along.

Brett Davis

unread,
Apr 11, 2009, 5:56:13 PM4/11/09
to
In article <2009Apr1...@mips.complang.tuwien.ac.at>,
an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> Stephen Sprunk <ste...@sprunk.org> writes:
> >There's another major change in compilers, too: instead of push/pop
> >instructions, it's a lot more common for compilers to ADD/SUB the stack
> >pointer, then move data to/from offsets from the fixed stack pointer.
> >The dedicated stack unit in modern x86 chips can then translate those
> >stack loads into renamed registers and reduce latency.
>
> I never heard about the stack in IA32 or AMD64 implementations being
> mapped to renamed registers, and I don't think it has been
> implemented. The best that happens AFAIK is fast forwarding from the
> store buffer to the loads using address and alias prediction. Hmm, I
> should rerun my microbenchmark that measures the overhead of keeping a
> variable in the stack.

That would be the next logical thing to do beyond "store load
forwarding". (This works for more than just the stack.) Cancel the load
and reactivate the store register. Faster?

Memory aliasing forces the compiler to spill registers with wild
abandon; fixing up at least the reload side is vital to performance.

This would be another reason that register windowing is obsolete.

Intel chips also do special stuff to make recursion almost free, over
half the instructions can go away, giving an impossibly high number for
the number of instructions executed per cycle.

Brett Davis

unread,
Apr 12, 2009, 12:30:18 AM4/12/09
to
In article <7ztz5a6...@pc-003.diku.dk>,
tor...@pc-003.diku.dk (Torben Ægidius Mogensen) wrote:

> Is a complete description of the Thumb-2 ISA
> (including bitpatterns for instructions) available
> anywhere public?
>
> Torben

I finally found a PDF that explains Thumb-2 bit layout.

http://www.ittc.ku.edu/~kulkarni/research/thumb_ax.pdf

It's basically Thumb-1 extended. Five bits spread across two 16 bit
instruction words signify a 32 bit instruction. The end result is you
get some three operand instructions and larger offsets, plus some other
useful stuff. You are still stuck with 8 registers total, though you can
pick which 8 of the 16 total dynamically. Maybe 8 is plenty?
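
For comparison, the width test in the shipping Thumb-2 encoding needs
only the top five bits of the leading halfword; a sketch of the ARMv7
rule (the paper's extension scheme is a separate design):

#include <stdint.h>

/* Prefixes 0b11101, 0b11110 and 0b11111 start a 32-bit instruction;
   any other 5-bit prefix means a complete 16-bit instruction. */
int is_32bit_thumb2(uint16_t first_halfword)
{
    return (first_halfword >> 11) >= 0x1D;
}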

Not in the same league as a real variable width format (one bit, not
five, to signify an extension word, giving more register bits, etc.),
but the results are excellent.

Also found a crazy prof that is experimenting with 8 bit ARM opcodes,
accumulator plus register format. Intriguing.

Pick a register as the accumulator and intermix your 8 bit opcodes with
16 bit+ opcodes. (Would require another incompatible instruction set
extension.) Maybe useful for all that branchy aliased code, the 50% of
your code that uses 5% of your cycles. Maybe useful elsewhere...

http://www.eng.auburn.edu/~vagrawal/E6970/PAPERS/ISA_For_Low_Power.DOC

Instruction set design is getting a mini revival, after decades of
boring fixed width 32 bit RISC.

Brett

Torben Ægidius Mogensen

unread,
Apr 14, 2009, 10:51:12 AM4/14/09
to
Mayan Moudgill <ma...@bestweb.net> writes:

It is, indeed, not so straightforward. It is a common misconception
that the main benefit of RISC is simpler decode. A more important
benefit is simpler exception handling. For example, if an instruction
has at most one TLB lookup, handling of TLB misses is much simpler.

Also, it is not a given that RISC implies bigger instructions. While
_some_ instructions in the x86 instruction set are only one byte, an
average instruction is 3-4 bytes, so a Thumb2-like mixed 16 and 32-bit
ISA will in general have better code density.

Another trait of RISC is the absence of complex call and return
instructions (where CISC call/return instructions typically at least
handle stacking/unstacking of the return address). While this may
seem to increase instruction count in call sequences and
prologues/epilogues of functions, it allows more optimisations: A RISC
processor will often not have to stack the return address, but can
keep it in a register. Tail-call optimisations are also easier on a
RISC.

Indeed, most RISCs are designed to be good compiler targets, where
most CISC designs (at least originally) are designed for ease of
coding assembly by hand.

Torben

Torben Ægidius Mogensen

unread,
Apr 14, 2009, 11:20:26 AM4/14/09
to
Brett Davis <gg...@yahoo.com> writes:


> Compiler design has changed, instead of "caller save" it's "callee save",
> so small functions are no longer causing lots of pointless register
> saving and restoring.

It is not difficult to design register windows for callee-saves:
Instead of integrating window sliding into the call instruction, add
"stack N" and "unstack N" instructions that stack or unstack the first
N callee-saves registers on a register stack, with an additional bit
to specify if the link (return address) register should also be
included in the stacked/unstacked registers.
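
A software model of that pair of instructions might look like this
(a sketch; the register count, stack depth and link-bit convention are
all assumptions):

#include <stdint.h>

#define NCALLEE 16                /* assumed callee-saves registers */

typedef struct {
    uint64_t r[NCALLEE];
    uint64_t link;                /* return-address register */
    uint64_t rstack[256];         /* on-chip register stack */
    int sp;
} cpu_t;

void stack_n(cpu_t *c, int n, int with_link)    /* "stack N" */
{
    for (int i = 0; i < n; i++)
        c->rstack[c->sp++] = c->r[i];
    if (with_link)
        c->rstack[c->sp++] = c->link;
}

void unstack_n(cpu_t *c, int n, int with_link)  /* "unstack N" */
{
    if (with_link)
        c->link = c->rstack[--c->sp];
    for (int i = n - 1; i >= 0; i--)
        c->r[i] = c->rstack[--c->sp];
}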

> 20 years ago stack overhead might be estimated at 30%, now it's down near
> 5%, in the noise, a non-problem with no real solution.

That is debatable. Modern programming styles include a lot of dynamic
method/function calls that are impossible to inline at compile time,
as well as reducing the distance between calls, so you often have a
lot of calls to functions that either return quickly or call another
function using tail calls. A register stack like the above works well
in both situations.

> No one would add register windows to a design today, well, except for
> Intel, and they failed. ;)

I doubt Itanium failed because it used a register stack -- it failed
because it added a lot of complex features that the compiler was
expected to exploit, without making sure the required compiler
technology was in place. It turned out that the compiler technology
did not materialise, so run-time hardware mechanisms were added to
compensate, which removed nearly all benefit of these features.

>> What's wrong with the Itanium's (variable size) register windows?
>
> According to sources, late in the design of the Itanic they had to add
> two stages to the pipeline to deal with the complications register
> windows add to the register re-naming logic. These two stages right in
> the middle of where the real work happens killed performance. If those
> two stages would have been included in the original paper design then
> the chip might never have been made.

Itanium's register model was needlessly complex. In addition to a
register stack, it also had renaming in loop iterations (to allow
scheduling multiple iterations without unrolling the loop).

> Intel could have learned from this mistake and removed register windows
> from the next design, (forcing those few customers to recompile) but for
> some reason they didnt. (I think the compiler could ignore the register
> windows, maybe the Itanic design still sucked anyway, for reasons beyond
> register windows, lots of stuff/crap in the Itanic design...)
>
> Hubris: No one else could make it work, but we will throw more
> people/dollars at it and make it work.

That sounds more likely.

> Stupidity: Intel has engaged in so many multi-billion dollar boondoggles
> that any other company would have gone bankrupt, several times over.
> Nice to have a monopoly to cover over those screw-ups. (Intel is
> something like zero for ten in its billion++ dollar non-x86 chip
> investments: NOR flash, NAND flash, HDTV chips, ARM chips, network
> chips, Itanic, etc.)

It is not the first time that Intel has had overly ambitious plans for
a next-generation processor: The iAPX 432 was a prime example of
throwing vast resources at implementing an overly complex design that
never ran at competitive speeds. 8086 was only meant as a stop-gap
design between Intel's successful 8080 8-bit processor and the 32-bit
iAPX 432, which was planned to make 8086 outmoded (once completed).

What is more amazing is that so many computer vendors believed Intel's
hype about Itanium: Compaq, who had just acquired Digital's Alpha
design, dropped it and jumped on the Itanium bandwagon. Silicon
Graphics dropped their MIPS development and did the same. HP stopped
developing their PA-RISC design and actually (IIRC) lent some hardware
designers to Intel to work on the Itanium. All this for a product
that (even optimistically) was years from being in production.

One would think that Compaq, SGI and HP had learned from the iAPX 432
failure. But I guess the management had changed to people with short
memories and no willingness to listen to people with longer memories.

Torben

nm...@cam.ac.uk

unread,
Apr 14, 2009, 12:18:01 PM4/14/09
to
In article <7zws9n2...@pc-003.diku.dk>,

Torben Ægidius Mogensen <tor...@pc-003.diku.dk> wrote:
>
>I doubt Itanium failed because it used a register stack -- it failed
>because it added a lot of complex features that the compiler was
>expected to exploit, without making sure the required compiler
>technology was in place. It turned out that the compiler technology
>did not materialise, so run-time hardware mechanisms were added to
>compensate, which removed nearly all benefit of these features.

Indeed. And some of us predicted that failure, because we didn't
believe that HP and Intel could pull a rabbit out of a hat that
extensive research by the combined CS and IT communities had failed
to produce in 25 years of trying ....


Regards,
Nick Maclaren.

Robert Myers

unread,
Apr 14, 2009, 1:30:56 PM4/14/09
to
On Apr 14, 12:18 pm, n...@cam.ac.uk wrote:
> In article <7zws9n2ret....@pc-003.diku.dk>,

> Torben Ægidius Mogensen <torb...@pc-003.diku.dk> wrote:
>
>
>
> >I doubt Itanium failed because it used a register stack -- it failed
> >because it added a lot of complex features that the compiler was
> >expected to exploit, without making sure the required compiler
> >technology was in place.  It turned out that the compiler technology
> >did not materialise, so run-time hardware mechanisms were added to
> >compensate, which removed nearly all benefit of these features.
>
> Indeed.  And some of us predicted that failure, because we didn't
> believe that HP and Intel could pull a rabbit out of a hat that
> extensive research by the combined CS and IT communities had failed
> to produce in 25 years of trying ....
>
I imagine a huge stack:

problem conceptualization
algorithm development
choice of language
coding
compilation--run-time software, including the OS
isa
implementation of the isa in hardware.

To locate the problem at the compilation stage (or lower) seems an
indefensible oversimplification. If the top of the stack has been
built with a certain kind of processing in mind (do what you're told
when you're told to do it), it's not surprising that a bottom of a
stack designed to function with a different kind of processing
(anticipating the future as much as possible, which places a high
premium on minimizing uncertainty and ambiguity) wouldn't be very
successful.

Robert.

nm...@cam.ac.uk

unread,
Apr 14, 2009, 1:57:30 PM4/14/09
to
In article <ef2f5057-75b5-4ab5...@21g2000vbk.googlegroups.com>,
Robert Myers <rbmye...@gmail.com> wrote:

>On Apr 14, 12:18 pm, n...@cam.ac.uk wrote:
>> In article <7zws9n2ret....@pc-003.diku.dk>,
>> Torben Ægidius Mogensen <torb...@pc-003.diku.dk> wrote:
>>
>> >I doubt Itanium failed because it used a register stack -- it failed
>> >because it added a lot of complex features that the compiler was
>> >expected to exploit, without making sure the required compiler
>> >technology was in place.  It turned out that the compiler technology
>> >did not materialise, so run-time hardware mechanisms were added to
>> >compensate, which removed nearly all benefit of these features.
>>
>> Indeed.  And some of us predicted that failure, because we didn't
>> believe that HP and Intel could pull a rabbit out of a hat that
>> extensive research by the combined CS and IT communities had failed
>> to produce in 25 years of trying ....
>>
>I imagine a huge stack:
>
>problem conceptualization
>algorithm development
>choice of language
>coding
>compilation--run-time software, including the OS
>isa
>implementation of the isa in hardware.
>
>To locate the problem at the compilation stage (or lower) seems an
>indefensible oversimplification. ...

You have missed the point, possibly because you were not informed of
the original intent of the project. You would need to have known a
fair amount about it before 1995 to be aware of that - by 1996, they
were changing the objectives in a vain attempt to keep the Itanic
afloat.

The original intent of the IA64 design was largely to simplify the
hardware, with the assumption that the compiler could convert even
existing code (Fortran, C etc.) into instructions that would execute
efficiently and correctly. Inter alia, it assumed that compilers
could extract at least 4-way ILP from typical code.

God alone knows why they assumed that, because it had been known
since the 1970s that the 'typical' extractable ILP from a language
like Fortran 77 is more like 2, and that a language like C would
be worse. I assume that anyone who didn't "think positively" was
elbowed out of the project.
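
To make the ILP point concrete, here is a trivial C illustration (my
own, not anything from the project documents):

/* ILP of about 1: each add waits on the previous s, however many
   FP units you build. */
double reduce(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s = s + a[i];
    return s;
}

/* ILP bounded only by hardware resources: every iteration is
   independent of the others. */
void vadd(double *c, const double *a, const double *b, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

Most real code looks a lot more like the first loop than the second,
which is why the measured average stays around 2.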


Regards,
Nick Maclaren.

Robert Myers

unread,
Apr 14, 2009, 2:26:02 PM4/14/09
to
On Apr 14, 1:57 pm, n...@cam.ac.uk wrote:
> In article <ef2f5057-75b5-4ab5-a5ba-2ee46aee4...@21g2000vbk.googlegroups.com>,
> Robert Myers  <rbmyers...@gmail.com> wrote:
>
>

> >I imagine a huge stack:
>
> >problem conceptualization
> >algorithm development
> >choice of language
> >coding
> >compilation--run-time software, including the OS
> >isa
> >implementation of the isa in hardware.
>
> >To locate the problem at the compilation stage (or lower) seems an
> >indefensible oversimplification.  ...
>
> You have missed the point, possibly because you were not informed of
> the original intent of the project.  You would need to have known a
> fair amount about it before 1995 to be aware of that - by 1996, they
> were changing the objectives in a vain attempt to keep the Itanic
> afloat.
>
> The original intent of the IA64 design was largely to simplify the
> hardware, with the assumption that the compiler could convert even
> existing code (Fortran, C etc.) into instructions that would execute
> efficiently and correctly.  Inter alia, it assumed that compilers
> could extract at least 4-way ILP from typical code.

People thought lots of naive things about compilers in the early
nineties.

Most of them have been falsified.

To blame naivete about compilers and languages on Intel-HP management
seems a bit of a stretch.

Where's the compiler that is making all these multicore processors
useful? Is there a ship date?

Robert.

nm...@cam.ac.uk

unread,
Apr 14, 2009, 3:05:33 PM4/14/09
to
In article <bea7f63b-12db-4092...@n4g2000vba.googlegroups.com>,

Robert Myers <rbmye...@gmail.com> wrote:
>
>People thought lots of naive things about compilers in the early
>nineties.
>
>Most of them have been falsified.
>
>To blame naivete about compilers and languages on Intel-HP management
>seems a bit of a stretch.

Eh? You snipped the paragraph where I pointed out that People With
Clue had known that the relevant technology was infeasible since the
1970s. Why do you claim that a failure to appoint enough People With
Clue to high enough levels in the project is NOT a management failure?

>Where's the compiler that is making all these multicore processors
>useful? Is there a ship date?

Why is that relevant?


Regards,
Nick Maclaren.

j...@cix.compulink.co.uk

unread,
Apr 14, 2009, 3:26:43 PM4/14/09
to
In article <gs2iqa$85s$1...@soup.linux.pwf.cam.ac.uk>, nm...@cam.ac.uk ()
wrote:

> You have missed the point, possibly because you were not informed of

> the original intent of the [Itanium] project. You would need to have


> known a fair amount about it before 1995 to be aware of that - by 1996,
> they were changing the objectives in a vain attempt to keep the Itanic
> afloat.
>
> The original intent of the IA64 design was largely to simplify the
> hardware, with the assumption that the compiler could convert even
> existing code (Fortran, C etc.) into instructions that would execute
> efficiently and correctly. Inter alia, it assumed that compilers
> could extract at least 4-way ILP from typical code.

I think you may have the timescale adrift by a year or two here. When we
had a presentation on IA-64 in spring 1998, the simplification of
control hardware and the ability to thus use more of the transistors for
execution units was definitely part of the message. This would have made
sense had the growth in transistor counts stopped at that point, but of
course, it did not. By the time they'd finished fighting obsolete
problems, the hardware was no longer impressively fast with the possible
compilers, and the death spiral had started.

I can't find an English word to replace "bias" in statements like "We
are sufficiently biased against your product that we will not waste our
time listening to presentations about it." I need a word that lacks the
pejorative aspects of "bias", since we consider this view entirely
justified and reasonable, and it would appear that so does most of the
rest of the world.

So when an HP salesman rang me last year to talk about "the successor to
PA-RISC", the conversation went roughly as follows:

"This would be Itanium, I presume?"
"You've heard of Itanium?!?"
"Ported our software to it, shipped it, found nobody wanted it, almost
finished phasing it out again. It cost me, personally, several years of
wasted time. If you still want to talk about Itanium, I have a perfectly
good dial tone that you can talk to. If you have decent prices on used
Alpha boxes, I could be interested."

They didn't, on any system we could accommodate: they only had huge
over-priced servers.

--
John Dallman, j...@cix.co.uk, HTML mail is treated as probable spam.

Robert Myers

unread,
Apr 14, 2009, 3:33:33 PM4/14/09
to
On Apr 14, 3:05 pm, n...@cam.ac.uk wrote:
> In article <bea7f63b-12db-4092-a95b-6edc1b848...@n4g2000vba.googlegroups.com>,

> Robert Myers  <rbmyers...@gmail.com> wrote:
>
>
>
> >People thought lots of naive things about compilers in the early
> >nineties.
>
> >Most of them have been falsified
>
> >To blame naivete about compilers and languages on Intel-HP management
> >seems a bit of a stretch.
>
> Eh?  You snipped the paragraph where I pointed out that People With
> Clue had known that the relevant technology was infeasible since the
> 1970s.  Why do you claim that a failure to appoint enough People With
> Clue to high enough levels in the project is NOT a management failure?
>
I assign neither the truth nor the importance that you do to the
People With Clue claim. I mean no disrespect.

If you freeze a certain set of assumptions about the world of
software, *maybe* your claim is correct. The problem is the frozen
assumptions, not anything fundamental about software.

> >Where's the compiler that is making all these multicore processors
> >useful?  Is there a ship date?
>
> Why is that relevant?
>

The compiler will never arrive. Tons and tons and tons of software
will be rewritten, or the new processors will be useless.

Had "AMD"-64 not arrived, tons and tons and tons of software would
have been rewritten and your frozen-in assumptions falsified.

Robert.

nm...@cam.ac.uk

unread,
Apr 14, 2009, 3:43:58 PM4/14/09
to
In article <cae45889-36f6-43c7...@m24g2000vbp.googlegroups.com>,
Robert Myers <rbmye...@gmail.com> wrote:
>>
>> Eh?  You snipped the paragraph where I pointed out that People With
>> Clue had known that the relevant technology was infeasible since the
>> 1970s.  Why do you claim that a failure to appoint enough People With
>> Clue to high enough levels in the project is NOT a management failure?
>>
>I assign neither the truth nor the importance to the people with a
>clue claim that you make. I mean no disrespect.

Ah. You may have misunderstood what I meant by Clue.

>If you freeze a certain set of assumptions about the world of
>software, *maybe* your claim is correct. The problem is the frozen
>assumptions, not anything fundamental about software.

Hmm. Frozen assumptions. Like the halting problem is insoluble?
No, I am not joking.

>Had "AMD"-64 not arrived, tons and tons and tons of software would
>have been rewritten and your frozen-in assumptions falsified.

You haven't been reading what I have been posting. I said that the
HP/Intel assumption was that such gains could have been made on
EXISTING code by the use of fancier compilers; that is demonstrable
nonsense.

Whether applications would have been redesigned (and it's often more
than mere rewriting) will forever remain a matter of speculation.


Regards,
Nick Maclaren.

nm...@cam.ac.uk

unread,
Apr 14, 2009, 3:48:29 PM4/14/09
to
In article <z7Kdnfgc2_XufnnU...@giganews.com>,

<j...@cix.compulink.co.uk> wrote:
>
>I think you may have the timescale adrift by a year or two here. When we
>had a presentation on IA-64 in spring 1998, the simplification of
>control hardware and the ability to thus use more of the transistors for
>execution units was definitely part of the message. This would have made
>sense had the growth in transistor counts stopped at that point, but of
>course, it did not. By the time they'd finished fighting obsolete
>problems, the hardware was no longer impressively fast with the possible
>compilers, and the death spiral had started.

No. By 1998, the target had changed. It was no longer being claimed
that it would start replacing x86 on desktops in 1999. I agree that
the nonsense remained the same - it was just aimed differently.

>I can't find an English word to replace "bias" in statements like "We
>are sufficiently biased against your product that we will not waste our
>time listening to presentations about it." I need a word that lacks the
>pejorative aspects of "bias", since we consider this view entirely
>justified and reasonable, and it would appear that so does most of the
>rest of the world.

Grin :-) That's roughly what I said in my last procurement! Try
replacing "biased against" by "well informed about".


Regards,
Nick Maclaren.

Robert Myers

unread,
Apr 14, 2009, 4:07:52 PM4/14/09
to
On Apr 14, 3:43 pm, n...@cam.ac.uk wrote:
> In article <cae45889-36f6-43c7-8018-5a2e6d452...@m24g2000vbp.googlegroups.com>,

> Robert Myers  <rbmyers...@gmail.com> wrote:
>
>
>
> >> Eh?  You snipped the paragraph where I pointed out that People With
> >> Clue had known that the relevant technology was infeasible since the
> >> 1970s.  Why do you claim that a failure to appoint enough People With
> >> Clue to high enough levels in the project is NOT a management failure?
>
> >I assign neither the truth nor the importance to the people with a
> >clue claim that you make.  I mean no disrespect.
>
> Ah.  You may have misunderstood what I meant by Clue.
>
> >If you freeze a certain set of assumptions about the world of
> >software, *maybe* your claim is correct.  The problem is the frozen
> >assumptions, not anything fundamental about software.
>
> Hmm.  Frozen assumptions.  Like the halting problem is insoluble?
> No, I am not joking.
>
> >Had "AMD"-64 not arrived, tons and tons and tons of software would
> >have been rewritten and your frozen-in assumptions falsified.
>
> You haven't been reading what I have been posting.  I said that the
> HP/Intel assumption was that such gains could have been made on
> EXISTING code by the use of fancier compilers; that is demonstrable
> nonsense.

No disagreement. It's just that, at this point, it's not a very
interesting claim. In the process of trying to make Itanium work, we
learned lots of things about languages and software. I'm sure you'll
tell me we could have learned those things without Itanium. I don't
disagree. It's just that that's the context in which I've learned
most of this stuff.

> Whether applications would have been redesigned (and it's often more
> than mere rewriting) will forever remain a matter of speculation.

I don't think so. C must die or be so altered as to be beyond
recognition.

Itanium turned out to be a hopeless mess. It's probably just as well
that we're not trying to rethink software with it in mind, but we will
have to rethink software.

What you keep saying about instruction parallelism is roughly the same
as what people who don't understand that the memory wall came and went
keep saying.

If you write code so that it can't be compiled to evade the memory
wall, it will run very slowly on modern processors.

If you write code for which the compiler can't infer which memory
locations might be written to, it's going to be hard to find much more
parallelism of any kind.
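
A minimal C sketch of the aliasing half of that (function names are
mine, purely illustrative):

/* The compiler must assume dst and src may overlap, so it cannot
   freely reorder the loads and stores, let alone vectorise them. */
void scale(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = 2.0f * src[i];
}

/* C99 restrict makes the no-overlap promise explicit; the same loop
   is now free to be software-pipelined or vectorised. */
void scale_r(float *restrict dst, const float *restrict src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = 2.0f * src[i];
}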

Robert.

Jacko

unread,
Apr 15, 2009, 9:16:19 AM4/15/09
to
Hi

> I don't think so.  C must die or be so altered as to be beyond
> recognition.

Moving away from write-back-intensive algorithms, in the threading
sense. Lose the snoop-bus idea, and any code construct that requires
it.

> Itanium turned out to be a hopeless mess.  It's probably just as well
> that we're not trying to rethink software with it in mind, but we will
> have to rethink software.

Itanium sucks bottom ;-)

> What you keep saying about instruction parallelism is roughly the same
> as what people who don't understand that the memory wall came and went
> keep saying.
>
> If you write code so that it can't be compiled to evade the memory
> wall, it will run very slowly on modern processors.

Finding cache-preload opportunities looks kind of useful, but it
effectively reduces the cache size and hence the hit rate, and just
adds an extra load, with the attendant parallelism problem of in-order
completion.

> If you write code for which the compiler can't infer which memory
> locations might be written to, it's going to be hard to find much more
> parallelism of any kind.

That generally indicates a multithreaded split based on which array
locations are written was not used. Like I said, lose the snoop-bus
ideas. They will knot up your core. Try a bottle of cock-knobblers
vi-agro, "the bear for real men."

cheers jacko

j...@cix.compulink.co.uk

unread,
Apr 15, 2009, 9:16:50 AM4/15/09
to
In article <gs2pad$540$1...@soup.linux.pwf.cam.ac.uk>, nm...@cam.ac.uk ()
wrote:

> No. By 1998, the target had changed. It was no longer being claimed
> that it would start replacing x86 on desktops in 1999. I agree that
> the nonsense remained the same - it was just aimed differently.

Hum, yes. They were still saying that it would replace x86 on desktops,
but the timescale was less clear. They were also talking about an x86
emulation facility that would bridge the gap, and whose performance
claims we found surprising-to-implausible. Said claims were not met, but
there are all sorts of possible reasons for that.

Jacko

unread,
Apr 15, 2009, 9:48:49 AM4/15/09
to
On 15 Apr, 14:16, j...@cix.compulink.co.uk wrote:
> In article <gs2pad$54...@soup.linux.pwf.cam.ac.uk>, n...@cam.ac.uk ()

> wrote:
>
> > No.  By 1998, the target had changed.  It was no longer being claimed
> > that it would start replacing x86 on desktops in 1999.  I agree that
> > the nonsense remained the same - it was just aimed differently.
>
> Hum, yes. They were still saying that it would replace x86 on desktops,
> but the timescale was less clear. They were also talking about an x86
> emulation facility that would bridge the gap, and whose performance
> claims we found surprising-to-implausible. Said claims were not met, but
> there are all sorts of possible reasons for that.

Must be the new SH bus feature, which when added to the Itanium
provides the extra power processing with the big mucho wanga fan GHz.
Try a bottle of cock knobblers Sci-Fi FTL (For The Ladies) "A real off
the planet earth mover."

cheers jacko

nm...@cam.ac.uk

unread,
Apr 15, 2009, 9:58:22 AM4/15/09
to
In article <wYqdnTZOJNLfQ3jU...@giganews.com>,

<j...@cix.compulink.co.uk> wrote:
>
>> No. By 1998, the target had changed. It was no longer being claimed
>> that it would start replacing x86 on desktops in 1999. I agree that
>> the nonsense remained the same - it was just aimed differently.
>
>Hum, yes. They were still saying that it would replace x86 on desktops,
>but the timescale was less clear. They were also talking about an x86
>emulation facility that would bridge the gap, and whose performance
>claims we found surprising-to-implausible. Said claims were not met, but
>there are all sorts of possible reasons for that.

Quite. They spent rather a long time trying to find a target for their
turkey that would be plausible, but never really succeeded. Whether
it would have been successful if they had located a fairly plausible
target and stuck to it, we shall never know.


Regards,
Nick Maclaren.

Jacko

unread,
Apr 15, 2009, 10:51:00 AM4/15/09
to
clue: an idea about, or a physical hint to, a solution.

CLUE: Cache Line Unit of Execution, a simplified processor design
where a whole cache line can be executed at once, using dataflow
latching. This allows the leading edge of an execution flow to
coexist with other execution flows in the line-execution barrel, for
maximization of the execution potential of the logic. Kind of a
pipeline.

I just wondered if this was what you were on about, but probably not.

cheers jacko

Christopher

unread,
Apr 15, 2009, 6:00:18 PM4/15/09
to
On Apr 14, 12:18 pm, n...@cam.ac.uk wrote:
> In article <7zws9n2ret....@pc-003.diku.dk>,
> Torben Ægidius Mogensen <torb...@pc-003.diku.dk> wrote:
>
>
>
> >I doubt Itanium failed because it used a register stack -- it failed
> >because it added a lot of complex features that the compiler was
> >expected to exploit, without making sure the required compiler
> >technology was in place.  It turned out that the compiler technology
> >did not materialise, so run-time hardware mechanisms were added to
> >compensate, which removed nearly all benefit of these features.
>
> Indeed.  And some of us predicted that failure, because we didn't
> believe that HP and Intel could pull a rabbit out of a hat that
> extensive research by the combined CS and IT communities had failed
> to produce in 25 years of trying ....

Now that is just silly. No one had flying machines for thousands of
years. It doesn't mean that they didn't try. But eventually there
was a breakthrough. Saying "no one has been able to do this yet"
doesn't mean that it isn't doable or that someone else won't figure
out how to do it. That's how PROGRESS works.

Christopher

unread,
Apr 15, 2009, 6:04:56 PM4/15/09
to

> You haven't been reading what I have been posting.  I said that the
> HP/Intel assumption was that such gains could have been made on
> EXISTING code by the use of fancier compilers; that is demonstrable
> nonsense.

It is neither demonstrable nor nonsense. Modern compilers are vastly,
vastly superior to compilers of a decade ago. Their ultimate goals
were perhaps overly ambitious and a bit lofty, and the marketing hype
machine probably got involved a little too soon, but the architecture
is both sound and elegant.

j...@cix.compulink.co.uk

unread,
Apr 15, 2009, 6:23:15 PM4/15/09
to
In article
<5185676a-92b6-4411...@k41g2000yqh.googlegroups.com>,

You're comparing apples with oranges. Progress happens mostly in small
steps, not vast leaps appearing out of the blue. Solving the problem of
extracting ILP from serial code, to order, on a short timescale, would
be plausible if there had been some progress towards it. Say, if the
position were analogous to flying machines in 1900, when definite
progress was being made, by several sets of people. But the position
isn't and wasn't much like that. Evidence for this includes the actual
performance that you can extract from Itaniums.

Someday someone may solve the problem, via a breakthrough. But you can't
produce breakthroughs to order: they tend to come from ideas outside the
mainstream, and there's no practical way to tell which of those might be
the right one.

HP and Intel didn't have any radical new approaches; they just decided
that they could solve the problem, without much evidence of its
possibility except that they wanted to do it. Doing so would show that
some ideas they had about how to build a processor could be demonstrated
to be useful, and they could make huge quantities of money. The latter
prospect tends to affect calm judgement.

nm...@cam.ac.uk

unread,
Apr 15, 2009, 6:26:40 PM4/15/09
to
In article <5185676a-92b6-4411...@k41g2000yqh.googlegroups.com>,

Christopher <nadias...@gmail.com> wrote:
>
>Now that is just silly. No one had flying machines for thousands of
>years. It doesn't mean that they didn't try. But eventually there
>was a breakthrough. Saying "no one has been able to do this yet"
>doesn't mean that it isn't doable or that someone else won't figure
>out how to do it. That's how PROGRESS works.

Not half as long as people have been trying to turn lead into gold,
create matter or energy out of nothing, achieve immortality etc.

Mathematics is not mocked. I suggest that you do some research.
The relevant phrase is "complexity theory".


Regards,
Nick Maclaren.

j...@cix.compulink.co.uk

unread,
Apr 15, 2009, 6:31:58 PM4/15/09
to
In article
<95ff9162-e850-4933...@37g2000yqp.googlegroups.com>,
nadias...@gmail.com (Christopher) wrote:

> It is not demonstrable or nonsense. Modern compilers are vastly,
> vastly superior to compilers of a decade ago. Their ultimate goals
> were perhaps overly ambitious and a bit lofty, and the marketing hype
> machine probably got involved a little too soon, but the architecture
> is both sound and elegant.

Really? Have you spent much time trying to get software to work
correctly and run fast on it? I believed in Itanium until I'd spent a
year fighting the compilers; the further year-plus I spent on it was
pretty convincing about its deficiencies.

For example, how about the need to re-issue all outstanding
floating-point pre-loads after each function call? That handicaps
performance significantly, and is a basic feature of the architecture.
If the compilers can work around it now, I'd be interested to hear how.
With specifics.
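
For anyone who hasn't fought this personally, a hand-waved C-level
sketch of the problem (illustrative only: helper() is a made-up name,
and this is C, not IA-64 assembler):

void helper(void);   /* an opaque external call */

double dot_with_call(const double *a, const double *b, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        /* To cover memory latency, the compiler wants to pre-load
           a[i+1], a[i+2], ... long before they are used ... */
        helper();  /* ... but this call may store anywhere, a[]
                      included, so every pre-load still outstanding
                      here must be checked and, in the worst case,
                      re-issued after the call returns. */
        sum += a[i] * b[i];
    }
    return sum;
}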

Jacko

unread,
Apr 15, 2009, 6:41:01 PM4/15/09
to
On 15 Apr, 23:26, n...@cam.ac.uk wrote:
> In article <5185676a-92b6-4411-a3c5-1523ccb4b...@k41g2000yqh.googlegroups.com>,

>
> Christopher  <nadiasver...@gmail.com> wrote:
>
> >Now that is just silly.  No one had flying machines for thousands of
> >years.  It doesn't mean that they didn't try.  But eventually there
> >was a breakthrough.  Saying "no one has been able to do this yet"
> >doesn't mean that it isn't doable or that someone else won't figure
> >out how to do it.  That's how PROGRESS works.
>
> Not half as long as people have been trying to turn lead into gold,
> create matter or energy out of nothing, achieve immortality etc.

The lead-into-gold problem was solved, but a three-day half-life to
mercury probably made the three-day clearing time necessary.

> Mathematics is not mocked.  I suggest that you do some research.
> The relevant phrase is "complexity theory".

ILP is a complicated issue, and is mainly useful in DSP code. General
code has too much branching and other things, and so fits a threading
model better in any attempt to parallelize it. Locking the program
counter to one instruction thread, so that many execution units run
just that thread, is generally a bad way of packing instructions in a
word. Hence the multicore approach.

cheers jacko

Larry Elmore

unread,
Apr 15, 2009, 8:41:09 PM4/15/09
to

ELEGANT?

Bernd Paysan

unread,
Apr 16, 2009, 4:32:05 AM4/16/09
to
nm...@cam.ac.uk wrote:
> Not half as long as people have been trying to turn lead into gold,
> create matter or energy out of nothing, achieve immortality etc.

Turning lead into gold is sort-of solved, through nuclear transmutation,
though uninteresting. Matter and energy out of nothing is still
impossible, but the work on immortality looks like it is progressing
seriously, thanks to huge progress in DNA analysis.

Progress on massively parallel execution of programs has advanced, too,
but probably not in the direction Intel imagined. The current
state-of-the-art massively parallel chips are GPUs. They use SIMD
(usually implicit SIMD), multicore, and multithreading, the latter to
"hide" latency. And they have been successful in a specialized area,
while legacy programs are still very sequential.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

nm...@cam.ac.uk

unread,
Apr 16, 2009, 7:31:51 AM4/16/09
to
In article <8345144.9...@elfi.zetex.de>,

Bernd Paysan <bernd....@gmx.de> wrote:
>
>> Not half as long as people have been trying to turn lead into gold,
>> create matter or energy out of nothing, achieve immortality etc.
>
>Turning lead into gold is sort-of solved, through nuclear transmutation,
>though uninteresting. Matter and energy out of nothing is still impossible,
>but the work on immortality looks like progressing seriously thanks to huge
>progress in DNA analysis.

You have completely missed the point. The automatic parallelisation
of existing code is comparable with transmuting lead into gold - i.e.
we know a method, but it is completely impractical for anything except
demonstrating one's virtuosity. Provably so.

And your belief in DNA analysis leading to immortality is much like
the belief of Intel and HP in new methods leading to a breakthrough
in the automatic parallelisation of existing code. A lot of wishful
thinking, but no actual progress.

>Progress on massively parallel execution of programs has advanced, too,
>but probably not in the direction Intel imagined. The current
>state-of-the-art massively parallel chips are GPUs. They use SIMD
>(usually implicit SIMD), multicore, and multithreading, the latter to
>"hide" latency. And they have been successful in a specialized area,
>while legacy programs are still very sequential.

The optimisation of vectorisable codes has been solved since the 1970s.
This thread was entirely about the automatic parallelisation of existing
code.


Regards,
Nick Maclaren.

Bernd Paysan

unread,
Apr 16, 2009, 8:39:52 AM4/16/09
to
nm...@cam.ac.uk wrote:
> And your belief in DNA analysis leading to immortality is much like
> the belief of Intel and HP in new methods leading to a breakthrough
> in the automatic parallelisation of existing code. A lot of wishful
> thinking, but no actual progress.

Ah, they managed to make some simple worms live 10 times as long as before.
Not that these worms live very long in either case ;-)... For humans, it is
wishful thinking, of course.

>>Progress on massively parallel execution of programs has advanced, too,
>>but probably not in the direction Intel imagined. The current
>>state-of-the-art massively parallel chips are GPUs. They use SIMD
>>(usually implicit SIMD), multicore, and multithreading, the latter to
>>"hide" latency. And they have been successful in a specialized area,
>>while legacy programs are still very sequential.
>
> The optimisation of vectorisable codes has been solved since the 1970s.
> This thread was entirely about the automatic parallelisation of existing
> code.

Which usually is way too low-level to extract the actual parallelism out of
it... Take e.g. GCC as an example: it's part of SPEC, so people would try
to create a fast binary from the source. GCC does pattern matching for
instruction selection. Now pattern matching by itself is a highly parallel
problem; GCC's approach, however, is a highly sequential one (through using a
Lisp-ish representation of the patterns). The point is: existing code is
just way too obfuscated. There's only one way to improve the situation:
Rewrite existing code. Since "free lunch is over" (i.e. performance
improvement of current hardware is only in parallelism, not in per-thread
execution), this has to come anyway.

nm...@cam.ac.uk

unread,
Apr 16, 2009, 9:05:54 AM4/16/09
to
In article <1569875.K...@elfi.zetex.de>,

Bernd Paysan <bernd....@gmx.de> wrote:
>
>> And your belief in DNA analysis leading to immortality is much like
>> the belief of Intel and HP in new methods leading to a breakthrough
>> in the automatic parallelisation of existing code. A lot of wishful
>> thinking, but no actual progress.
>
>Ah, they managed to make some simple worms live 10 times as long as before.
>Not that these worms live very long in either case ;-)... For humans, it is
>wishful thinking, of course.

Yes. You can also make many trees live indefinitely by coppicing them.
No advanced technology needed - but applying that to humans might be
tricky :-)

>> This thread was entirely about the automatic parallelisation of existing
>> code.
>
>Which usually is way too low-level to extract the actual parallelism out of
>it... Take e.g. GCC as example, it's part of SPEC, so people would try hard
>to create a fast binary from the source. GCC does pattern matching for
>instruction selection. Now pattern matching by itself is a highly parallel
>problem; GCC's approach however is a highly sequential one (through using a
>Lisp-ish representation of the patterns). The point is: existing code is
>just way too obfuscated. There's only one way to improve the situation:
>Rewrite existing code. Since "free lunch is over" (i.e. performance
>improvement of current hardware is only in parallelism, not in per-thread
>execution), this has to come anyway.

We are in full agreement on that one.


Regards,
Nick Maclaren.

Anton Ertl

unread,
Apr 16, 2009, 9:11:51 AM4/16/09
to
j...@cix.compulink.co.uk writes:
>Solving the problem of
>extracting ILP from serial code, to order, on a short timescale, would
>be plausible if there had been some progress towards it. Say, if the
>position were analogous to flying machines in 1900, when definite
>progress was being made, by several sets of people.

There was progress towards it. There was basic block instruction
scheduling, software pipelining, trace scheduling, superblock
scheduling, if-conversion and reverse if-conversion, and a lot of
other research about dealing with control flow and memory dependencies
in instruction scheduling. Some of these ideas needed or at least
profited from architectural support, and IA-64 provided that. And I
don't think that IA-64 is a failure in that area. If you compare the
Itanium II's IPC or SPECcpu/MHz with that of other in-order machines
such as some UltraSPARCs or the Power 6, I guess the Itanium II is
doing fine (although it's always hard to compare such wildly different
designs).
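
To give one concrete instance, if-conversion turns control dependence
into data dependence. Roughly, in illustrative C (the select stands in
for IA-64's complementary predicate registers):

/* Before if-conversion: a branch the in-order scheduler has to
   work around. */
int absdiff_branch(int a, int b)
{
    if (a > b)
        return a - b;
    return b - a;
}

/* After if-conversion: both arms execute unconditionally and a
   select picks the result, with no branch left to mispredict. */
int absdiff_predicated(int a, int b)
{
    int t1 = a - b;
    int t2 = b - a;
    return (a > b) ? t1 : t2;
}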

And there were a number of limits papers that showed that, if we could
work around this and that dependency and had enough resources, the
available ILP is very large. Looking back, I think where the limits
papers misled us was that they were trace-based and therefore ignored
certain restrictions that apply to instructions executing from real
code on an in-order CPU (e.g., scheduling across dynamically-linked
calls just does not happen in the real world; also, many other
solutions to such problems require extensive code replication, which
sometimes leads to a large number of I-cache misses).

Anyway, where the Itaniums fell down performancewise is in clock rate.
When the project started, most people believed that an in-order CPU
could be made to clock significantly faster than an out-of-order CPU.
But when the Itaniums finally arrived, they had only half the clock
rate of their Pentium 4 cousins and were still clocked significantly
slower than the Athlons and about the same as the out-of-order 21264
and HPPA CPUs that they were supposed to replace.

Nowadays there's Power 6 as an example of in-order clocking faster
(possibly because out-of-order at high clock rates is too
power-intensive), and in the mid-90s there was the 21164, but right
around the introduction of IA-64 implementations, out-of-order was
generally clocked faster, for whatever reason. I think that, if Intel
tried for a few decades, they would finally find out how to make a
fast-clocked high-performance IA-64 implementation, but the complexity
of the architecture probably requires a few more iterations than for
simpler architectures. Similarly, it takes more time for the compiler
to mature on a complex architecture.

> But the position
>isn't and wasn't much like that. Evidence for this includes the actual
>performance that you can extract from Itaniums.

Hindsight is 20/20.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

nm...@cam.ac.uk

unread,
Apr 16, 2009, 10:40:51 AM4/16/09
to
In article <2009Apr1...@mips.complang.tuwien.ac.at>,

Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>j...@cix.compulink.co.uk writes:
>
>>Solving the problem of
>>extracting ILP from serial code, to order, on a short timescale, would
>>be plausible if there had been some progress towards it. Say, if the
>>position were analogous to flying machines in 1900, when definite
>>progress was being made, by several sets of people.
>
>There was progress towards it. There was basic block instruction
>scheduling, software pipelining, trace scheduling, superblock
>scheduling, if-conversion and reverse if-conversion, and a lot of
>other research about dealing with control flow and memory dependencies
>in instruction scheduling. ...

None of those have anything whatsoever to do with extracting ILP.
They are all about how to make use of the ILP you have extracted;
to get good efficiency on existing codes, you need both, and it is
the extraction that is the problem.

>And there were a number of limits papers that showed that, if we could
>work around this and that dependency and had enough resources, the
>available ILP is very large. Looking back, I think where the limits
>papers misled us was that they were trace-based and therefore ignored
>certain restrictions that apply to instructions executing from real
>code on an in-order CPU ....

Well, at least you noticed that they were trace-based.

What you have missed is that most (all?) of them were complete hogwash,
because they were valid only if you were going to rerun exactly the
same code ON EXACTLY THE SAME DATA. Who is interested in that?
Benchmarketeers, that's who.

There is a standard statistical technique for dealing with such issues,
which I have not seen used.

When designing hardware (or even compiling code), you have to guess
what the future will hold, which leads to the old result of an average
ILP of 2. Those papers were based on using an omniscient oracle and,
surprise, surprise, did much better. But it's of no use whatsoever
in practice.


Regards,
Nick Maclaren.

EricP

unread,
Apr 16, 2009, 12:22:22 PM4/16/09
to
Bernd Paysan wrote:
> There's only one way to improve the situation:
> Rewrite existing code. Since "free lunch is over" (i.e. performance
> improvement of current hardware is only in parallelism, not in per-thread
> execution), this has to come anyway.

Which applications do you have in mind?
Because so far I don't see much business
case for doing so outside of a few niche areas.

Eric


Robert Myers

unread,
Apr 16, 2009, 2:41:45 PM4/16/09
to

The "business case" has historically been vendor-driven. Microsoft
and Intel have brought the game to a high art, but they didn't invent
it. Vendors want to create a situation where customers feel they have
to buy stuff.

There being no stuff left to sell that doesn't somehow lean on
parallelism, I expect the business case to materialize one way or
another.

Robert.

Robert Myers

unread,
Apr 16, 2009, 4:05:45 PM4/16/09
to
On Apr 16, 10:40 am, n...@cam.ac.uk wrote:

>
> What you have missed is that most (all?) of them were complete hogwash,
> because they were valid only if you were going to rerun exactly the
> same code ON EXACTLY THE SAME DATA.  Who is interested in that?
> Benchmarketeers, that's who.

What would you propose as a test between a perfect oracle and whatever
imperfect technology exists?

To have a perfect oracle, you have to know *exactly* what will happen,
which means you *have* to rerun on the same data.

I'm not aware of anyone who ever tried even an "approach to a perfect
oracle" on even an artificial test case. Perhaps it's been done, and
I just don't know about it.

People randomly try stuff and show improvement on well-known test
cases (say bzip or gcc). Those random shots in the dark haven't led
to much insight, but I don't know that anyone knows enough to draw the
kinds of emphatic conclusions you are prone to on this subject.

> There is a standard statistical technique for dealing with such issues,
> which I have not seen used.

But you seem to be agreeing that meaningful tests have never been
done, so I don't know how you state your conclusions with such
certainty.

Robert.

Robert Myers

unread,
Apr 16, 2009, 4:16:11 PM4/16/09
to
On Apr 16, 7:31 am, n...@cam.ac.uk wrote:

>
> The optimisation of vectorisable codes has been solved since the 1970s.
> This thread was entirely about the automatic parallelisation of existing
> code.

It may be a mystery to no one else, but the apparent singularity of
"vectorisable" code is a challenging one to me. It's applicable
to a large class of problems, there is little argument as to what it
means, compilers can be taught to recognize it and to diagnose
obstacles to it automatically, it generally leads to clean, readable
code, and the programming style maps neatly to a very efficient
hardware realization.

What is so peculiar about vectorization that nothing else in the world
of software works even remotely the same way? It would be as if the
study of the Riemann zeta function had stopped with the failure of the
series representation to converge for Re s less than or equal to one.
There is no equivalent to analytic continuation in the world of
software, but one of these days someone will discover the universe of
possibilities in which vectorisable loops are embedded.
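
For concreteness, a toy C pair showing the divide (illustrative only):

/* Vectorisable: every iteration is independent, and restrict lets
   the compiler prove it. */
void saxpy(float *restrict y, const float *restrict x, float s, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = y[i] + s * x[i];
}

/* Not vectorisable as written: a first-order recurrence chains
   iteration i to iteration i-1. */
void recurrence(float *restrict a, const float *restrict b, float s,
                int n)
{
    for (int i = 1; i < n; i++)
        a[i] = s * a[i-1] + b[i];
}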

Robert.

Jacko

unread,
Apr 16, 2009, 4:32:26 PM4/16/09
to

The peculiar feature of vectorization is SIMD; regular general code
has little MD for the SI, as if-then-else cascades and case switches
limit the effective use of speculative execution. Such large-scale
speculative execution is bulky, and an inefficient/unrealizable use of
silicon area.

The failure of the series to converge is due to analytical constraints
of series convergence. This in itself is limited by what constitutes a
term of said series, and in no way implies there is no alternate form
of a divisio-product term, and even if sum or product over terms is
required.

Maybe, but this does not solve the if-then-else optional execution of
conditionals problem.

cheers jacko

nm...@cam.ac.uk

unread,
Apr 16, 2009, 4:44:45 PM4/16/09
to
In article <cb01505c-5d4f-498a...@f19g2000yqh.googlegroups.com>,

Robert Myers <rbmye...@gmail.com> wrote:
>>
>> What you have missed is that most (all?) of them were complete hogwash,
>> because they were valid only if you were going to rerun exactly the
>> same code ON EXACTLY THE SAME DATA.  Who is interested in that?
>> Benchmarketeers, that's who.
>
>What would you propose as a test between a perfect oracle and whatever
>imperfect technology exists?

Eh? Why would one bother? A perfect oracle is useless for any practical
purpose - or, indeed, almost any theoretical purpose.

>To have a perfect oracle, you have to know *exactly* what will happen,
>which means you *have* to rerun on the same data.

Precisely.

>I'm not aware of anyone who ever tried even an "approach to a perfect
>oracle" on even an artificial test case. Perhaps it's been done, and
>I just don't know about it.

Look at many of the papers that analyse the extractable ILP. They
run the program, then work out what dependencies actually occurred,
and draw their conclusions from that.

>> There is a standard statistical technique for dealing with such issues,
>> which I have not seen used.
>
>But you seem to be agreeing that meaningful tests have never been
>done, so I don't know how you state your conclusions with such
>certainty.

Do you know any of the relevant mathematics? Because there is plenty.
It's a provably intractable problem. And 25 years of experience has
shown that it's not one of the 'easy' cases of them, such as linear
programming.

Regards,
Nick Maclaren.

Robert Myers

unread,
Apr 16, 2009, 5:58:20 PM4/16/09
to
You should've got in on the "what is asynchronous" argument. The
stupidity window is closed, dude.

Robert.

Robert Myers

unread,
Apr 16, 2009, 6:01:23 PM4/16/09
to
On Apr 16, 4:44 pm, n...@cam.ac.uk wrote:

>
> Do you know any of the relevant mathematics?  Because there is plenty.
> It's a provably intractable problem.  And 25 years of experience has
> shown that it's not one of the 'easy' cases of them, such as linear
> programming.

You make mathematics look bad by trying to use it to patronize
people. I wish you'd stop. I know what NP-hard means.

Robert.

Gavin Scott

unread,
Apr 16, 2009, 7:47:57 PM4/16/09
to
Torben Ægidius Mogensen <tor...@pc-003.diku.dk> wrote:
> What is more amazing is that so many computer vendors believed Intel's
> hype about Itanium: Compaq, who had just aquired Digital's Alpha
> design dropped it and jumped on the Itanium bandwagon. Silicon
> Graphics dropped their MIPS development and did the same. HP stopped
> developing their PA-RISC design and actually (IIRC) lent some hardware
> designers to Intel to work on the Itanium.

Since Itanium was an HP-invented architecture, you might as well ask
why Intel bought into HP's hype rather than the other way round.

G.

Rick Jones

unread,
Apr 16, 2009, 8:01:36 PM4/16/09
to
Gavin Scott <ga...@allegro.com> wrote:
> Since Itanium was an HP-invented architecture, you might as well ask
> why Intel bought into HP's hype rather than the other way round.

It might be more accurate to say that what became Itanium started at
HP. How much of what there is today matches the initial HP scent and
how much is Intel's own scent I've no idea.

rick jones
--
a wide gulf separates "what if" from "if only"
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

Jacko

unread,
Apr 16, 2009, 10:33:16 PM4/16/09
to

Why did it pay cash?

Well, that's lucky 'cos all the idiots were climbing in.

cheers jacko

Jacko

unread,
Apr 16, 2009, 10:37:27 PM4/16/09
to
By the way, not having conditional branching solves some pipeline
stalls, but conditional assignment does not solve data dependency
sufficiently to extract efficient performance from speculative
execution and ILP.
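
A toy C illustration of the point (assuming the compiler turns the
second form into a conditional move; names are made up):

int pick_branch(int p, const int *q, const int *r)
{
    if (p)           /* with a well-predicted branch, only the
                        chosen load is on the critical path */
        return *q;
    return *r;
}

int pick_cmov(int p, const int *q, const int *r)
{
    int x = *q;      /* with conditional assignment both loads must
                        complete before the select can deliver, so
                        the dependence on the slower side remains */
    int y = *r;
    return p ? x : y;
}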

cheers jacko

Jacko

unread,
Apr 16, 2009, 10:58:10 PM4/16/09
to
> a wide gulf separates "what if" from "if only"

Very well put. The "what if" design had to be made, but the "if only
all code was like it" (it isn't) problem came up. Yes, C is a pandemic
problem.

Learning an "if only" from research does not make for a marketable
success.

cheers jacko

Robert Myers

unread,
Apr 17, 2009, 12:33:51 AM4/17/09
to

That's a pretty strong statement. I'd be interested to know whether you
consider it true even leaving power constraints aside.

Robert.

Jacko

unread,
Apr 17, 2009, 1:13:47 AM4/17/09
to

If efficiency is not defined in terms of power, what alternative word
would you suggest, so that I may consider the truth of the inefficient
system suggested?

You know my view, very simple cores, even to the extent of 1
instruction per 2 or 3 cycles, and a simple instruction at that.

Speculative execution is a waste of threading execution potential.
Instruction-level parallelism happens to some extent on a
sub-instruction (e.g. pre/post inc/dec) basis. Whether it is better to
split complex instructions into VLIW and end up clogging the register
file ports with strange multiplex strategies is debatable.

Consider this ... If the utilization of an extra functional unit is
limited by extractable-ILP limits, then surely, by having the unit use
specialist registers whose contents are filled and emptied queue-fashion
into and out of the main register file (lower mux overhead), the opcode
for this functional unit can be smaller, possibly just 1 bit.

cheers jacko

nm...@cam.ac.uk

unread,
Apr 17, 2009, 4:35:51 AM4/17/09
to
In article <gs8gt0$h94$1...@usenet01.boi.hp.com>,

Rick Jones <rick....@hp.com> wrote:
>Gavin Scott <ga...@allegro.com> wrote:
>> Since Itanium was an HP-invented architecture, you might as well ask
>> why Intel bought into HP's hype rather than the other way round.
>
>It might be more accurate to say that what became Itanium started at
>HP. How much of what there is today matches the initial HP scent and
>how much is Intel's own scent I've no idea.

No, but what Gavin Scott says is correct (the "buying in" was in the
early 1990s).


Regards,
Nick Maclaren.

nm...@cam.ac.uk

unread,
Apr 17, 2009, 4:41:30 AM4/17/09
to
In article <ceed9498-72eb-4629...@q16g2000yqg.googlegroups.com>,
Robert Myers <rbmye...@gmail.com> wrote:
>>
>> Do you know any of the relevant mathematics?  Because there is plenty.
>> It's a provably intractable problem.  And 25 years of experience has
>> shown that it's not one of the 'easy' cases of them, such as linear
>> programming.
>
>You make mathematics look bad by trying to use it to patronize
>people. I wish you'd stop. I know what NP-hard means.

Then you don't know any of the RELEVANT mathematics. If you did, you
wouldn't have implied that I didn't base my statements on definite
knowledge. There just isn't any way that I can explain anything like
that in layman's terms, except by making bald statements.

Doubtless a really skilled communicator could, but there typically
aren't more than one or two of those born in a generation. I am not one.


Regards,
Nick Maclaren.

Ken Hagan

unread,
Apr 17, 2009, 5:10:10 AM4/17/09
to
On Wed, 15 Apr 2009 23:00:18 +0100, Christopher <nadias...@gmail.com>
wrote:

> Now that is just silly. No one had flying machines for thousands of
> years.

Balderdash. Birds predate humans on this planet by quite a few million
years, and were used as self-piloting hunting drones long before modern
air forces built mechanical ones.
