
Does CPU ISA matter for 25M+ transistors?


Michael S

Dec 5, 2002, 7:37:28 AM
Of course, I mean other things like the fab and the skills of the implementers
being the same. Can we say that there are better ISAs?
Is it possible that one ISA is inherently better in per-CPU marks
while another one is better in per-watt marks?

glen herrmannsfeldt

Dec 5, 2002, 1:22:08 PM

"Michael S" <already...@yahoo.com> wrote in message
news:f881b862.0212...@posting.google.com...

Is this a homework assignment you want us to do for you?

This topic could go on with endless discussions, arguments
from both sides, etc.

Just remember that MIPS (not the company, though) stands for
Meaningless Indicator of Processor Speed.

There is also the famous saying, "All generalizations are false,
including this one."

Oh, the answers are No, and No.

-- glen


Nick Maclaren

Dec 5, 2002, 1:28:46 PM
In article <kPMH9.249744$P31.97459@rwcrnsc53>,

Actually, I would say that they are Yes and Yes. But, as with all
such assignments, the answer is irrelevant and the only requirement
is to make a plausible case. Any other bids?


Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email: nm...@cam.ac.uk
Tel.: +44 1223 334761 Fax: +44 1223 334679

Brannon Batson

Dec 5, 2002, 7:02:18 PM
already...@yahoo.com (Michael S) wrote in message news:<f881b862.0212...@posting.google.com>...

> Of coarse, I mean other things like fab and skills of implementers
> being the same. Can we say that there are better ISAs ?

Absolutely. There are some real performance differences, but mostly
the consequence is design complexity (which implies schedule, risk for
bugs, etc.).

> Is it possible that one ISA is inherently better in per CPU marks
> while another one is better in per Watt marks ?

Sure. Though this issue is clouded somewhat in that some
microarchitectures do heroic things to make up for bad architectural
decisions.

Brannon
not speaking for Intel, obviously

glen herrmannsfeldt

Dec 5, 2002, 10:38:56 PM

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:aso5su$drb$1...@pegasus.csx.cam.ac.uk...

> In article <kPMH9.249744$P31.97459@rwcrnsc53>,
> glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
> >
> >"Michael S" <already...@yahoo.com> wrote in message
> >news:f881b862.0212...@posting.google.com...
> >> Of coarse, I mean other things like fab and skills of implementers
> >> being the same. Can we say that there are better ISAs ?
> >> Is it possible that one ISA is inherently better in per CPU marks
> >> while another one is better in per Watt marks ?
> >
> >Is this a homework assignment you want us to do for you?
> >
(snip)

> >
> >Oh, the answers are No, and No.
>
> Actually, I would say that they are Yes and Yes. But, as with all
> such assignments, the answer is irrelevant and the only requirement
> is to make a plausible case. Any other bids?
>

Oh well, the reason for the first No was that it was a generalization.
Some ISAs are better at some things, and worse at others. The
x86 ISA is much better for machine-translated 8080 code than most
of the others, because that was a design goal.

-- glen

Ketil Malde

Dec 6, 2002, 3:02:26 AM
nm...@cus.cam.ac.uk (Nick Maclaren) writes:

> In article <kPMH9.249744$P31.97459@rwcrnsc53>,
> glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:

>> This topic could go on with endless discussions, arguments
>> from both sides, etc.

[...]

>> Oh, the answers are No, and No.

> Actually, I would say that they are Yes and Yes.

See? Here we go!

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants

Singh, S.R.

Dec 6, 2002, 4:05:56 AM
Michael S wrote:

> Of coarse, I mean other things like fab and skills of implementers
> being the same. Can we say that there are better ISAs ?

In general, I would have to say 'no.' The natural contradiction to this
is, of course, "what if the ISA is ridiculously complex and requires
significant amounts of silicon dedicated to this and that?" Aside from
this exception (which isn't much of an issue if you're designing a
low-power system, considering the fact that you've already disregarded
the Usual Suspects), I don't believe the ISA makes for a decidedly high-
or low-power system.

At Hot Chips 14 this past summer, Dr. Rajesh Gupta gave an (80 page!)
tutorial on "Low Power Wireless Networked System Design." At the time,
there was a topic on comp.arch regarding "low-power ISAs." I asked him
whether or not there are inherent features of an ISA which make for a
low-power system. In short, the answer was 'no.'

--
Singh, S.R. swaranrajsingh |at| hotmail |dot| com

If someone gives me a job, I can put,
"Opinions not of my employer." here


Bernd Paysan

Dec 6, 2002, 4:42:23 AM
Singh, S.R. wrote:
> At Hotchips 14 this past summer, Dr. Rajesh Gupta gave an (80 page!)
> tutorial on "Low Power Wireless Networked System Design." At the time,
> there was a topic on Comp.Arch regarding 'low-power ISAs." I asked him
> whether or not there are inherent features to an ISA which make for a
> low-power system. In short, the answer was 'no.'

I tend to disagree. The way implementors are attacking ISAs for performance
is to add a level of indirection - here, it's "translation". Instruction
sets get translated before they are executed. When the translation overhead
is small compared to the rest of the chip, the ISA doesn't matter much.

However, when you think about "low-power", people mean very different
things. Some people say a 10W PowerPC is a "low power processor". Other
people want 10 µW processors. And here, adding levels of indirection is not
the way to solve the power consumption problem. Minimalistic ISAs consume
less power and can compute more in one step than interpreted
non-minimalistic ISAs (though the interpreter itself certainly doesn't
consume more power per cycle - but needs more cycles to compute the
result).

Example: The b16 CPU you can find on my homepage takes roughly the same
space (on an ASIC) as a microcoded 8051. However, if you want to do some
calculations (like a PID regulator or other things I call "low-end
DSP tasks"), the 8051 easily needs several hundred times more cycles to do the
same thing (wide multiplications, divisions), and even 10 times more cycles
for controlling stuff (that's because single instructions take 12 to 24
cycles on an 8051).
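
A rough illustration (mine, not from Bernd's post) of where those cycle
counts come from: a 16x16->32 multiply composed only of 8x8->16 partial
products, which is all an 8-bit core without a wide multiplier can do in
one step. In C:

#include <stdint.h>

/* 16x16 -> 32 bit multiply built from four 8x8 -> 16 bit partial
   products plus shifted additions -- the decomposition an 8-bit core
   is forced into; on an 8051 every one of these steps is itself
   several multi-cycle instructions. */
uint32_t mul16_from_8bit(uint16_t a, uint16_t b)
{
    uint8_t al = a & 0xff, ah = a >> 8;
    uint8_t bl = b & 0xff, bh = b >> 8;

    uint16_t ll = (uint16_t)al * bl;   /* low  * low  */
    uint16_t lh = (uint16_t)al * bh;   /* low  * high */
    uint16_t hl = (uint16_t)ah * bl;   /* high * low  */
    uint16_t hh = (uint16_t)ah * bh;   /* high * high */

    return (uint32_t)ll
         + ((uint32_t)lh << 8)
         + ((uint32_t)hl << 8)
         + ((uint32_t)hh << 16);
}

A 16-bit or 32-bit ISA does the same thing in one multiply instruction,
which is the gap Bernd is describing.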

You can make faster 8051s, which then consume more power and die space, and
still are not appropriate for low-end DSP tasks. You can extend the 8051 by
a DSP or 16 bit instructions, which makes the thing even larger, but then
it's appropriate for those low-end DSP tasks.

On the low end, the ISA is quite important. On the high-end, where clever
microarchitectures can level almost anything, the ISA is much less
important. If your low-power requirements are in the order of watts, don't
care. If they are in the order of milliwatts or even microwatts, the
influence of the ISA is quite large. And I don't mean religious issues like
68HC05 vs. 8051, because both are the same sort of crap.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Torben Ægidius Mogensen

Dec 6, 2002, 6:03:33 AM
"Singh, S.R." <s...@below.null> writes:


> At Hotchips 14 this past summer, Dr. Rajesh Gupta gave an (80 page!)
> tutorial on "Low Power Wireless Networked System Design." At the time,
> there was a topic on Comp.Arch regarding 'low-power ISAs." I asked him
> whether or not there are inherent features to an ISA which make for a

> low-power system. In short, the answer was 'no.'

In addition to the points Bernd Paysan brought up, I would say that
compactness of code can matter a good deal. If you need a twice as
large I-cache and twice as much main memory for storing the code, that
hurts power consumption. Also, with smaller instructions, you can use
a narrower bus, which also saves power. Of course, half-size
instructions don't help you if you need twice as many. But you can go
from 32-bit instructions to 16-bit instructions without adding more
than 10-20% more instructions on typical code.
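
To put rough numbers on that last claim (my arithmetic, not Torben's):
N 32-bit instructions occupy 4N bytes, while the same program in 16-bit
instructions with 15% expansion occupies 1.15 x N x 2 = 2.3N bytes --
roughly a 40% reduction in static code size, which is where the smaller
I-cache, main memory and bus-width savings come from.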

Torben Mogensen

Michael S

Dec 6, 2002, 7:24:24 AM
"glen herrmannsfeldt" <g...@ugcs.caltech.edu> wrote in message news:<kPMH9.249744$P31.97459@rwcrnsc53>...

> "Michael S" <already...@yahoo.com> wrote in message
> news:f881b862.0212...@posting.google.com...
> > Of coarse, I mean other things like fab and skills of implementers
> > being the same. Can we say that there are better ISAs ?
> > Is it possible that one ISA is inherently better in per CPU marks
> > while another one is better in per Watt marks ?
>
> Is this a homework assignment you want us to do for you?
>
No. I had finished my last homework 15 years ago. BTW, it had nothing
to do with computers.

> This topic could go on with endless discussions, arguments
> from both sides, etc.
>

It's a primary purpose of this group, isn't it? At least it's better
than the endless chewing of valley rumors, "Yamhill myths" style.

> Just remember that MIPS, (not the company, though) stands for
> Meaningless Indicator of Processor Speed.
>

That's why I used a neutral term - marks. Like it or hate it, some processors
are faster than others.

> There is also the famous saying, "All generalizations are false,
> including this one."
>

Discussions are rarely interesting without generalizations. Besides, we
can reject generalizations philosophically, but in practice we need
them to navigate the complexity of the universe.

> Oh, the answers are No, and No.
>
> -- glen

Too simple, Glen. Too simple.

Dennis M O'Connor

Dec 6, 2002, 8:03:56 AM
"Michael S" <already...@yahoo.com> wrote ...
> "glen herrmannsfeldt" <g...@ugcs.caltech.edu> wrote ...

> > This topic could go on with endless discussions, arguments
> > from both sides, etc.
> >
> It's a primary purpose of this group, isn't it ? At least it's better
> than endless chewing of valley's rumors "Yamhill myths" style.

Back when this group was mainly populated by people who
had at least _some_ clue about computer architecture, and some
who were even prominent in the field, you would have had a point.

However, the group as currently constituted is mainly people
who don't know much about computer architecture, and just
like to mouth off. This is why "Yamhill"-type discussions and
C-language discussions and business-management discussions
flourish, while good solid computer architecture content is rare.

And those people here that are actually professional computer
architects can't really talk very much about it, partly because
too much of what they know is company-proprietary information,
and partly because they can't speak for their company. Which
is a shame for those not in the industry: just the stuff _I_ know
about that's happening at Intel is pretty interesting, and some
of the stuff that was explored but that isn't happening is even more
interesting, and few people outside Intel are aware of any of it.

Tsk tsk, I'm such a tease.
--
Dennis M. O'Connor dm...@primenet.com
Not speaking for Intel Corporation

glen herrmannsfeldt

Dec 6, 2002, 10:11:02 AM

"Bernd Paysan" <bernd....@gmx.de> wrote in message
news:vdrpsa...@miriam.mikron.de...

I think for any ISA there are things it will do well and things it won't.
Once you say which things you want to do, you can say one is better, or
takes less power, or is faster, for a given fab technology and size.

There might be some that are bad at just about everything. In one of
D.E. Knuth's books there is a description of a two instruction ISA.
This is very inefficient for real problems, but is good for solving the
"can you make a two instruction ISA" question.

So the question I see is, is there any ISA that isn't good at anything?
-- glen


Bernd Paysan

Dec 6, 2002, 10:37:11 AM
glen herrmannsfeldt wrote:
> There might be some that are bad at just about everything. In one of
> D.E. Knuth's books there is a description of a two instruction ISA.
> This is very inefficient for real problems, but is good for solving the
> "can you make a two instruction ISA" question.

There are various reduced-to-nothing ISAs with very different efficiency.
Take the one-instruction ISA (move as the only instruction). If you don't
use memory-mapped ALUs - and you should not, as they just turn addresses
into opcodes - you need tables like IBM's CADET, and self-modifying code.
This works relatively efficiently (for the class of single-instruction
computers, sure). Other single-instruction CPUs often use "decrement and
branch conditionally" as their primitive, which means that for every
addition by n, you have to loop quite a lot (n times a loop of 255 times
for an 8-bit architecture).
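
For readers who haven't met these machines, here is a toy sketch (my
example, not Bernd's) of "subleq", a classic one-instruction ISA: every
instruction is three words a, b, c meaning mem[b] -= mem[a], then branch
to c if the result is <= 0 (a negative c halts). Even a plain addition
has to be synthesized from several such instructions, which is exactly
the kind of expansion that makes reduced-to-nothing ISAs so inefficient
on real problems.

#include <stdio.h>

int mem[] = {
    14, 14,  3,     /*  0: Z -= Z   (clear scratch)   */
    13, 14,  6,     /*  3: Z -= A   (Z = -A)          */
    14, 12, -1,     /*  6: B -= Z   (B = B + A)       */
    14, 14, -1,     /*  9: halt                       */
    30,             /* 12: B                          */
    12,             /* 13: A                          */
     0              /* 14: Z (scratch)                */
};

int main(void)
{
    int pc = 0;
    while (pc >= 0) {
        int a = mem[pc], b = mem[pc + 1], c = mem[pc + 2];
        mem[b] -= mem[a];
        pc = (mem[b] <= 0) ? c : pc + 3;
    }
    printf("B + A = %d\n", mem[12]);    /* prints 42 */
    return 0;
}

Three instructions (plus a halt) to do one add, and every one of them is
a read-modify-write of memory.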

> So the question I see is, is there any ISA that isn't good at anything?

Stop it. Someone will start to develop such an ISA! And someone else then
will write an Intercal compiler for it.

Stephen Fuld

Dec 6, 2002, 12:47:36 PM

"Dennis M O'Connor" <dm...@primenet.com> wrote in message
news:10391797...@nnrp2.phx1.gblx.net...

snip

> However, the group as currently constituted is mainly people
> who don't know much about computer architecture, and just
> like to mouth off. This is why "Yamhill"-type discussions and
> C-language discussions and business-management discussions
> flourish, while good solid computer architecture content is rare.

Two things. First, as one of the people who started one of the recent
language discussions, I could argue that they are relevant to computer
architecture. For example, one of the (mis)features of C, its penchant for
not allowing compile-time disambiguation of aliases, has led to at least one
architectural design feature in an existing chip (the load alias table in
IA-64).

Second, has the actual amount of "solid computer architecture content"
declined over time, or only its percentage of the total as others have been
attracted to the group? If the actual amount has declined, there could be
several reasons for that decline (some of which you point out in the next
part of your post that I snipped).

--
- Stephen Fuld
e-mail address disguised to prevent spam


factory

Dec 6, 2002, 7:48:31 PM
In article <10391797...@nnrp2.phx1.gblx.net>, dm...@primenet.com says...

> "Michael S" <already...@yahoo.com> wrote ...
> > "glen herrmannsfeldt" <g...@ugcs.caltech.edu> wrote ...
> > > This topic could go on with endless discussions, arguments
> > > from both sides, etc.
> > >
> > It's a primary purpose of this group, isn't it ? At least it's better
> > than endless chewing of valley's rumors "Yamhill myths" style.
>
> Back when this group was mainly populated by people who
> had at last _some_ clue about computer architecture, and some
> who were even prominent in the field, you would have had a point.
>
> However, the group as currently constituted is mainly people
> who don't know much about computer architecture, and just
> like to mouth off. This is why "Yamhill"-type discussions and
> C-language discussions and business-management discussions
> flourish, while good solid computer architecture content is rare.

Hmm, looking back through the archives, comp.arch was started on Nov 7, 1986. By the 6th of
December, the three largest threads were: "test", "How do you say "byte" in French?", and
"Brain damaged terminal contest."

--
- Factory (there is no X in my email)

del cecchi

Dec 6, 2002, 10:05:27 PM

"Stephen Fuld" <s.f...@PleaseRemove.att.net> wrote in message
news:Yo5I9.40716$vM1.3...@bgtnsc04-news.ops.worldnet.att.net...
Probably neither, but Dennis has gotten pickier. :-)

del cecchi


Dennis M O'Connor

Dec 7, 2002, 12:07:54 AM
"del cecchi" <dce...@msn.com> wrote ...
> "Stephen Fuld" <s.f...@PleaseRemove.att.net> wrote ...
> > "Dennis M O'Connor" <dm...@primenet.com> wrote ...

> >
> > snip
> >
> > > However, the group as currently constituted is mainly people
> > > who don't know much about computer architecture, and just
> > > like to mouth off. This is why "Yamhill"-type discussions and
> > > C-language discussions and business-management discussions
> > > flourish, while good solid computer architecture content is rare.
> >
> > Two things. First, as one of the people who started one of the recent
> > language discussions, I could argue that they are relevant to computer
> > architecture. [...]

And I'd agree. But there have been many others that were not.
Yours, like you, is the exception. ;-)

> > Second, has the actual amount of "solid computer architecture content"
> > declined over time, or only its percentage of the total as others have
> > been attracted to the group?

Hard to say, and I have no hard numbers.
So intelligent people can disagree, I suppose.

> > If hte actual amount has declined, there could be
> > several reasons for that decline (some of which you point out in the
> > next part of your post that I snipped)..
>

> Probably neither, but Dennis has gotten pickier. :-)

Perhaps, and perhaps more frustrated: I _really_ want to refute
some of the marginally idiotic claims, but to do so I'd have to
divulge knowledge that is confidential. In the past that has
meant going out and finding published papers to ensure that
no Intel-proprietary info was revealed. But that's a lot of work,
and sometimes there is disagreement in the published
material, and then there's the 2005-and-later stuff I know
about that I just can't talk about at all.

I can't even talk about stuff that doesn't exist, cause
then people might guess what _did_ exist. It's the
classic "neither confirm nor deny" situation.

But, as consolation, I had a _really_ good week at
work this week: found problems, crafted elegant
solutions to them. Suh-weet ! Maybe in three
or four years I'll be able to talk about it. ;-)


--
Dennis M. O'Connor dm...@primenet.com

"We don't become a rabid dog to destroy a rabid dog,
but we do get a bit rude."


Rupert Pigott

Dec 7, 2002, 11:23:35 AM
"Dennis M O'Connor" <dm...@primenet.com> wrote in message
news:10392375...@nnrp1.phx1.gblx.net...

> But, as consolation, I had a _really_ good week at
> work this week: found problems, crafted elegant
> solutions to them. Suh-weet ! Maybe in three
> or four years I'll be able to talk about it. ;-)

ARGH !

Not only do we have a SMUG Dennis on our hands but
he sounds HAPPY as well. Pleased for you dude, keep
up the good work. :)

Cheers,
Rupert


Michael S

Dec 7, 2002, 12:53:07 PM
tor...@diku.dk (Torben Ægidius Mogensen) wrote in message news:<w5znrj2...@pc-032.diku.dk>...

Essentially, it is my answer too. ISA does matter, but only through
compactness of code. You say a 16-bit instruction word is better than
32-bit? I'll go one step farther - a variable-length
instruction word is better than both of them.

IMO, the best ISA for a heavy OoO 50M+ transistor performance-oriented
processor has to have the following characteristics:

1. Reduced instruction set - PPC+Altivec instructions look just
about right. Maybe throw away some rarely used stuff like
bit-fields and NAND. Full orthogonality is not necessary. E.g.
division and rotate don't have to be allowed on all GP
registers.
2. Expanded addressing modes - the i386 addressing modes look good
to me. Non-compromised immediate support is a must.
3. Small general-purpose register file. My vote goes for 8 GP
registers + SP, i.e. one more register than x86.
4. Variable-length instruction word, 1-byte granularity.
5. Three main instruction formats - register-to-register
2-operand, immediate-to-register 1-operand and memory-to-register
2-operand. These three formats must be encoded with the minimum number of
bytes.
Other possible candidates are the register-to-register 3-operand,
immediate-to-register 2-operand and register-to-memory
2-operand formats. Extensive simulations of existing code are required
to justify inclusion/rejection of these three formats. My gut feeling
says that the register-to-register 3-operand format can be included as a
2nd-class citizen, i.e. via an instruction prefix.
The register-to-memory 2-operand format can be included as a 1st-class
citizen for one and only one addressing mode - SP+immediate.

There are other questions like:
How many sets of flags/condition-code registers are needed?
What instructions would be allowed to modify flags?
What is better - distinct FP and SIMD register files, or FP registers
as aliases over SIMD registers?
What is the right size of the FP/SIMD register file? Sure, the range must be
from 8 to 64, but how many exactly?
All these questions are important, but they are *less* important than
the five I mentioned above.

Stephen Fuld

Dec 7, 2002, 12:59:15 PM

"Dennis M O'Connor" <dm...@primenet.com> wrote in message
news:10392375...@nnrp1.phx1.gblx.net...

> "del cecchi" <dce...@msn.com> wrote ...
> > "Stephen Fuld" <s.f...@PleaseRemove.att.net> wrote ...
> > > "Dennis M O'Connor" <dm...@primenet.com> wrote ...
> > >
> > > snip
> > >
> > > > However, the group as currently constituted is mainly people
> > > > who don't know much about computer architecture, and just
> > > > like to mouth off. This is why "Yamhill"-type discussions and
> > > > C-language discussions and business-management discussions
> > > > flourish, while good solid computer architecture content is rare.
> > >
> > > Two things. First, as one of the people who started one of the recent
> > > language discussions, I could argue that they are relevant to computer
> > > architecture. [...]
>
> And I'd agree. But there have been many others that were not.
> Yours, like you, is the exception. ;-)

Thank you for the compliment, but I don't think it is that exceptional. I
should also say that I do learn things from many of the "officially off
topic" threads, and since I enjoy learning new things, I generally don't
complain about them. Yes, they occasionally get out of hand, but IMO rarely
enough that it isn't a problem to ignore them. Besides, you never know when
such information might be useful! :-)

Michael S

Dec 7, 2002, 6:55:10 PM
Oh, something got mangled in my previous post, so I'll repost. Sorry.

tor...@diku.dk (Torben Ægidius Mogensen) wrote in message news:<w5znrj2...@pc-032.diku.dk>...

Essentially, it is my answer too. ISA does matter, but only through
compactness of code. You say a 16-bit instruction word is better than
32-bit? I'll go one step farther - a variable-length instruction word is
better than both of them.

IMO, the best ISA for a heavy OoO 50M+ transistor performance-oriented
processor has to have the following characteristics:

1. Reduced instruction set - PPC+Altivec instructions look just about
right. Maybe throw away some rarely used stuff like bit-fields and
NAND. Full orthogonality is not necessary. E.g. division and rotate
don't have to be allowed on all GP registers.
2. Expanded addressing modes - the i386 addressing modes look good to me.
Non-compromised immediate support is a must.
3. Small general-purpose register file. My vote goes for 8 GP
registers + SP, i.e. one more register than x86.
4. Variable-length instruction word, 1-byte granularity.
5. Three main instruction formats - register-to-register 2-operand,
immediate-to-register 1-operand and memory-to-register 2-operand.
These three formats must be encoded with the minimum number of bytes.
Other possible candidates are the register-to-register 3-operand,
immediate-to-register 2-operand and register-to-memory 2-operand
formats. Extensive simulations of existing code are required to justify
inclusion/rejection of these three formats. My gut feeling says that
the register-to-register 3-operand format can be included as a 2nd-class
citizen, i.e. via an instruction prefix.
The register-to-memory 2-operand format can be included as a 1st-class
citizen for one and only one addressing mode - SP+immediate.
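
Purely to make points 4 and 5 concrete, here is a hypothetical byte-level
layout (my own invention; none of these field widths appear in Michael's
post) and the boundary-finding problem it creates for the decoder, which
is what the later replies about decode complexity are getting at:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical 1-byte-granularity encoding:
     top 2 bits of the first byte select the format.
     00  reg-reg, 2 operands:  opcode byte + rd/rs byte        (2 bytes)
     01  imm-to-reg:           opcode + reg byte + imm8/imm32  (3 or 6)
     10  mem-to-reg:           opcode + mode byte + 0..3 disp  (2..5)
     11  one-byte forms (ret, nop, ...)                        (1 byte)
   Before the next instruction can even be lined up for decode, its
   start has to be computed from up to two bytes of the current one: */
size_t insn_length(const uint8_t *p)
{
    switch (p[0] >> 6) {
    case 0:  return 2;                   /* reg-reg                     */
    case 1:  return (p[0] & 1) ? 6 : 3;  /* wide or narrow immediate    */
    case 2:  return 2 + (p[1] >> 6);     /* mode byte gives disp length */
    default: return 1;                   /* single-byte instructions    */
    }
}

Finding several instruction starts per cycle under such an encoding is
exactly the serial dependence that a fixed 32-bit (or even 16-bit) format
avoids.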

At350bogomips

Dec 7, 2002, 7:38:51 PM
ISA effectiveness depends on software as well as the hardware
implementation. Some ISA features (e.g., early branch resolution
via condition registers) can have little benefit for some code
(code with few branches with conditions that can be tested early)
and significant benefit for others (code with unpredictable
branches whose conditions can be generated early).

Support for legacy code also influences ISA effectiveness. If
binary code is an issue, must one support the entire old ISA at
maximum efficiency or at merely moderate or even minimal
efficiency? Obviously, the less efficiency is needed for legacy
binaries, the more freedom one has in the ISA. Legacy source
code can also influence ISA effectiveness. Algorithms can be
chosen based on support for specific features (e.g., population
count, FPR-GPR/GPR-FPR moves). Even features that are only
visible to supervisor-level code can be made non-cost effective
by weak OS support.

The expected quality of software is also a factor. E.g., some
ISA features (e.g., static branch prediction) benefit from/depend
on feedback optimization. If such optimization is never done,
the instruction space could be used for other, more beneficial
features. Some ISA features require more compiler/asm-programmer
intelligence (e.g., the benefit of a larger register set can be
minimized by poor register allocation; the benefit of condition
registers, by only testing immediately before a branch; the
benefit of branch delay slots, by always inserting no-ops). If
code is expected to be poorly scheduled, an OoO implementation
might be more attractive (even in a power-constrained product)
which could further urge a smaller register set (and vice versa).

The expected frequency of full and partial context switches also
influences ISA choices, as does the expected availability of
thread-level parallelism. (How large should a context be? What
special support should be provided for interprocess
communication? Is memory synchronization necessary?)

Run-time code modification encourages support for a pipeline
flushing instruction.

The targeted performance/power-consumption/price window can
impact ISA efficiency. At lower performance, a scalar
implementation might be appropriate and delayed branches look
very attractive (assuming all implementations will be shallowish
pipeline, scalar designs). Software TLB management (which
requires some ISA support) can reduce cost and power consumption
at some performance loss for many applications. Production
volume also indicates how much fixed costs affect total costs, so
an unpopular but more design-efficient ISA will be more
expensive. A more narrowly targeted ISA (i.e., one with an
efficiency advantage in a narrower area of performance,
power-consumption, application type, and die size) will tend to
be less popular, so there is some pressure against features that
are not broadly beneficial.

Presumably an ISA should also not be overoptimized toward a
particular implementation since such excessively constrains the
creativity of the implementors. OTOH, it seems an ISA designer
must avoid stealing too much work/freedom from the software (one
of the problems with the "semantic gap" theory); OTOH, the
designer must avoid stealing too much work/freedom from the
implementers. (Thinking of the quote of Pope Pius XII (?) in
_The Mythical Man-Month_ to the effect that higher levels of
organization should not absorb functionality that can be
accomplished by lower levels.)

"Brannon Batson" <Brannon...@yahoo.com> wrote:
>Absolutely. There are some real performance differences, but
>mostly the consequence is design complexity (which implies
>schedule, risk for bugs, etc.).

. . . and low morale. Having to "do heroic things to make up for
bad architectural decisions" would seem to drain motivation.
Bright, creative professional architects might relish a
challenging problem, but being forced to tunnel through a wall
when a doorway stands open mere steps away . . . Furthermore,
such professionals are probably less susceptible to financial
rewards (though capital-flush organizations are more able to fund
exciting projects [not having the tools to do one's job or
watching inferior products win in the marketplace can also be
disheartening]).

Paul A. Clayton
A person who doesn't know much about computer architecture,
and just loves to mouth off! :-)

At350bogomips

Dec 7, 2002, 10:08:59 PM
already...@yahoo.com (Michael S) wrote:
>IMO, best ISA for heavy OoO 50M+ transistors performance-oriented
>processor has to have following characteristics:
>
>1. Reduced instruction set - PPC+Altivec instructions look just about
>right. May be, throw away some rarely used stuff like bit-fields and
>NAND. Full orthogonality is not necessary. E.g. division and rotate
>don't have to be allowed on all GP registers.

Not too bad. It might be worthwhile to provide two
opcodes in the space of one by halving the number of
available destination registers (for rarely used
operations).
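
One way to read that suggestion (my sketch, with made-up field positions,
not Paul's): steal the low bit of the destination-register field as an
extra sub-opcode bit for one rarely-used major opcode, so two operations
share the encoding space of one at the cost of only reaching the
even-numbered registers.

#include <stdint.h>

enum rare_op { RARE_POPCNT, RARE_BITREV };   /* two ops in one opcode slot */

struct decoded {
    enum rare_op op;
    unsigned rd;        /* even registers only for these ops */
    unsigned rs;
};

/* Decode for the shared slot: bit 0 of the 5-bit rd field is reused
   as the sub-opcode, leaving 16 of the 32 registers as destinations. */
struct decoded decode_rare(uint32_t insn)
{
    struct decoded d;
    unsigned rd_field = (insn >> 21) & 0x1f;
    d.op = (rd_field & 1) ? RARE_BITREV : RARE_POPCNT;
    d.rd = rd_field & ~1u;
    d.rs = (insn >> 16) & 0x1f;
    return d;
}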

>2. Expanded addressing modes -i386 addressing modes looks good for me.
>Non-compromised immediate support is must.

Aside from decode/cracking bother, reg-mem ops are
contrary to load scheduling. Their use would put extra
burden on the OoO resources (which one would rather
use for unpredictable delays).

>3. Small general-purpose register file. My vote goes for 8 GP
>registers + SP, i.e. one more register than x86.

Madness! :-) Apparently 16 is minimal for some graph
coloring heuristics. One also wants to avoid accessing
the next (slower) level of the memory hierarchy (L1
Dcache) as often as possible while still being able to
meet cycle times for the register file (and avoid
excessively slow context switches).

Providing, say, four 1KiB linear caches (one tag per cache)
and 64b-only accesses might give something between
a small set of registers and the relatively large L1 Dcache.
But even with extremely simplistic addressing (e.g., 7b
offset from the base/tag; 'guaranteed' hit, single access
width) this would probably be an ugly wart.

>4. Variable-length instruction word, 1 byte granularity.

Mon Dieu! Where are the Defenders of RISC when one
needs them! :-)

Your ISA's front-end must be a nice challenge.
Hint: hardware wants regularity or at least predictability.

>5. Three main instruction formats - register-to-register 2-operands,
>immediate-to-register 1-operand and memory-to-register 2-operands.
>This three format must be encoded with minimum number of bytes.
>Another possible candidates are register-to-register 3-operands,
>immediate-to-register 2-operands and register-to-memory 2-operands
>format. Extensive simulations of existing code are required to justify
>inclusion/rejection of this three formats. My gut feeling says that
>register-to-register 3-operands format can be included as a 2nd class
>citizen, i.e. via instruction prefix.
>Register-to-memory 2-operands format can be included as a 1st class
>citizen for one and only addressing mode - SP+immediate.

Goodbye fmadd. Your FP performance boost will be
missed. :-)
----
For a performance-oriented ISA, more attention should
be paid to pipelining and easy hardware parallelism and
less to code density. Code density may well become
increasingly important, but it would seem more sensible
to design an ISA that is fast in its expanded form and
compresses and decompresses well than an ISA that
offers dense code but is difficult to decode and/or
takes up even more cache in its easy-to-decode
expanded form.

>There are other questions like:
>How many sets of flags/conditional codes registers are needed ?
>What instructions would be allowed to modify flags ?
>What is better - distinct FP and SIMD register files of FP registers
>as aliases over SIMD registers ?
>What is a right size of FP/SIMD register file ? Sure the range must be
>from 8 to 64 but how many exactly ?

42? :-)


>All this questions are important, but they are *less* important than
>the five I mentioned above.

These are more interesting questions and so outside
my meager competence.

Paul A. Clayton
just a former McD's grill worker and technophile

Michael S

Dec 8, 2002, 1:19:51 PM
at350b...@aol.com (At350bogomips) wrote in message news:<20021207220859...@mb-ft.aol.com>...

>
> For a performance-oriented ISA, more attention should
> be paid to pipelining and easy hardware parallelism and
> less to code density. Code density may well become
> increasingly important, but it would seem more sensible
> to design an ISA that is fast in its expanded form and
> compresses and decompresses well than an ISA that
> offers dense code but is difficult to decode and/or
> takes up even more cache in its easy-to-decode
> expanded form.
>
> Paul A. Clayton
> just a former McD's grill worker and technophile

"takes up even more cache in its easy-to-decode expanded form".
Brilliant ! Here you hit one of this new points I wanted to talk
about. The size of the L1 I-cache/I-trace cache is not limited by
transistor budget any more ! It is limited only by access speed. You
can't afford more than 1K or 2K cache lines, but the lines can be as
wide as you like. With 8 or 10 decoded instructions per line it
doesn't really matter if expanded instruction is 60-bits wide or
80-bits wide. The difference is only 2K*10*20=400Kbit, i.e. 5% of the
size of the typical 8Mbit L2 cache.

A hardware decompression is an interesting alternative too, but I
don't like it. The main disadvantage of this approach is a necessity
to explicitly specify a decompressed form of the instructions -
something you would prefer to avoid if you want you ISA to span more
than one-two generations of uArch.

Niels Jørgen Kruse

Dec 9, 2002, 6:32:08 PM
In the article <f881b862.02120...@posting.google.com>,
already...@yahoo.com (Michael S) wrote:

> "takes up even more cache in its easy-to-decode expanded form".
> Brilliant ! Here you hit one of this new points I wanted to talk
> about. The size of the L1 I-cache/I-trace cache is not limited by
> transistor budget any more ! It is limited only by access speed. You
> can't afford more than 1K or 2K cache lines, but the lines can be as
> wide as you like.

Would you care to explain this?

> With 8 or 10 decoded instructions per line it
> doesn't really matter if expanded instruction is 60-bits wide or
> 80-bits wide. The difference is only 2K*10*20=400Kbit, i.e. 5% of the
> size of the typical 8Mbit L2 cache.

Looking at the PPC 970 die, the L1 data cache array (32KB) amounts to just
under 3/4 of a 256KB half of the L2.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

Dennis M O'Connor

Dec 10, 2002, 1:17:08 AM
"Niels Jørgen Kruse" <nj_k...@get2net.dk> wrote ...

> In the article <f881b862.02120...@posting.google.com>,
> already...@yahoo.com (Michael S) wrote:
>
> > "takes up even more cache in its easy-to-decode expanded form".
> > Brilliant ! Here you hit one of this new points I wanted to talk
> > about. The size of the L1 I-cache/I-trace cache is not limited by
> > transistor budget any more ! It is limited only by access speed. You
> > can't afford more than 1K or 2K cache lines, but the lines can be as
> > wide as you like.
>
> Would you care to explain this?

Yeah, I'm sure the circuit guys I work with would love to know
about the new laws of physics "Michael S" must have discovered
that let him ignore word-line RC delays, cross-bit-pitch routing
issues, and/or signal fan-out delays.

Michael S

Dec 10, 2002, 6:57:06 AM
Don't take things literally, Dennis.
All I said was that today there is no reason to squeeze every last bit
out of the decoded form of the instruction. I don't think you would argue
with that. I don't know how wide P-4 decoded instructions are. Aren't
they about 85 bits?

So let's reiterate. I say: "When one implementation has a decoded
instruction width of 60 bits and the other has 80 bits, it's
likely that for the same silicon technology and target cache access
time both implementations would have the same number of lines in the
L1 I-cache/trace cache. It is possible that the 60-bit group would be able
to put more instructions in every cache line, but it's not guaranteed."
Maybe I have to say it even simpler - the first group wouldn't gain a
measurable advantage in the L1-I hit rate due to their narrower words.
Do you agree?


"Dennis M O'Connor" <dm...@primenet.com> wrote in message news:<10395008...@nnrp2.phx1.gblx.net>...

Dennis M O'Connor

Dec 10, 2002, 10:33:43 PM
"Michael S" <already...@yahoo.com> wrote ...
> Don't take things literally, Dennis.

Try meaning what you write, then. There are no mind-readers
on USENET: if what you wrote wasn't what you meant,
only you are to blame.

> All I said that today there are no reasons to squeeze every last bit
> out of decoded form of the instruction. I don't think you would argue
> with that.

It's hard to argue with a sentence that doesn't seem to mean anything.

> So let's reiterate. I say: "When one implementation has decoded
> instruction width equal to 60-bit and the other has 80-bits, it's
> likely that for the same silicon technology and target cache access
> time both implementations would have the same number of lines in the

> L1 I-cache/trace cache. [...] Do you agree ?

Nope. AFAICT, you don't understand the complexity and the
interrelations of the issues you are presuming to make claims about.

Russell Wallace

Dec 13, 2002, 2:44:00 AM
On 7 Dec 2002 15:55:10 -0800, already...@yahoo.com (Michael S)
wrote:

>1. Reduced instruction set - PPC+Altivec instructions look just about
>right. May be, throw away some rarely used stuff like bit-fields and
>NAND.

I don't recall having seen a CPU with a NAND instruction. What do you
mean by bit-fields? If you mean &, |, ~, <<, >> then those are
sometimes useful and surely quite cheap to provide?

>Full orthogonality is not necessary. E.g. division and rotate
>don't have to be allowed on all GP registers.

Division and rotate aren't needed at all. Is anything gained by
putting them in but only on certain registers, other than making the
chip and software tools more likely to contain errors?

>2. Expanded addressing modes -i386 addressing modes looks good for me.

To what purpose? The x86 has to use a load-store architecture
internally anyway.

>Non-compromised immediate support is must.

Why? It's not needed for performance.

>3. Small general-purpose register file. My vote goes for 8 GP
>registers + SP, i.e. one more register than x86.

8 isn't enough, you just end up having to put frequently used values
in memory. You want 16 general purpose integer registers. (Maybe more
if they're cheap, but definitely at least 16.)

>4. Variable-length instruction word, 1 byte granularity.

Why? All that decoding logic costs development money, wastes power and
worsens the penalty for a branch mispredict, to little benefit.

>5. Three main instruction formats

Why? You just end up having to decode them into a nice regular format
internally anyway.

>There are other questions like:
>How many sets of flags/conditional codes registers are needed ?

Is there any good reason to have flag registers at all? Don't some
RISC chips benefit from not having them?

>What is better - distinct FP and SIMD register files of FP registers
>as aliases over SIMD registers ?

If you mean MMX style integer-only SIMD, that's pretty useless, get
rid of it.

If you mean floating point SIMD, that's useful, but should presumably
be integrated with non-SIMD FP.

>What is a right size of FP/SIMD register file ? Sure the range must be
>from 8 to 64 but how many exactly ?

I'll let someone who knows more about numerical computing than I do
answer that one.

--
"Mercy to the guilty is treachery to the innocent."
Remove killer rodent from address to reply.
http://www.esatclear.ie/~rwallace

Terje Mathisen

Dec 13, 2002, 8:47:46 AM
Russell Wallace wrote:
> On 7 Dec 2002 15:55:10 -0800, already...@yahoo.com (Michael S)
> wrote:
>
>
>>1. Reduced instruction set - PPC+Altivec instructions look just about
>>right. May be, throw away some rarely used stuff like bit-fields and
>>NAND.
>
>
> I don't recall having seen a CPU with a NAND instruction. What do you
> mean by bit-fields? If you mean &, |, ~, <<, >> then those are
> sometimes useful and surely quite cheap to provide?

Afaik, most of the current short-vector extensions have a form of
inverted mask operation, simply because it is cheap to do, and saves a
cycle in a critical location for many masked merge operations in graphics.
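
For anyone who hasn't written one of these inner loops, the idiom Terje
means looks like this (my illustration): pick bits from one source where
the mask is 1 and from the other where it is 0.

#include <stdint.h>

/* Written with only AND/OR/NOT this is four operations; an
   and-with-complement (inverted mask) instruction folds the NOT into
   the second AND, which is the cycle saved in graphics merge loops. */
uint32_t merge_naive(uint32_t a, uint32_t b, uint32_t m)
{
    return (a & m) | (b & ~m);      /* and, not, and, or */
}

/* The xor form is the other classic 3-op version, no complement needed. */
uint32_t merge_xor(uint32_t a, uint32_t b, uint32_t m)
{
    return b ^ ((a ^ b) & m);       /* xor, and, xor */
}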

Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Michael S

Dec 13, 2002, 9:03:32 AM
It is obvious that you don't agree with my basic idea: the most
important thing about an ISA is the compactness of its code. The
implementation can compensate for almost everything else *.
I made an attempt to describe a traditional register-based ISA with a
potential for very compact code. I believe that an ISA like this would
do slightly better than the current leaders in 32-bit compactness - i386,
ARM+Thumb, Hitachi SH and Motorola ColdFire.
-----
* Of course, stack-based machines have even more compact code. However,
the second important thing about an ISA for a high-performance computer is
the ability to expose as much parallelism as possible to the hardware.
Stack-based machines aren't very good at that, I believe.

r...@vorpalbunnyeircom.net (Russell Wallace) wrote in message news:<3df98e26....@news.eircom.net>...


> On 7 Dec 2002 15:55:10 -0800, already...@yahoo.com (Michael S)
> wrote:
>
> >1. Reduced instruction set - PPC+Altivec instructions look just about
> >right. May be, throw away some rarely used stuff like bit-fields and
> >NAND.
>
> I don't recall having seen a CPU with a NAND instruction. What do you
> mean by bit-fields? If you mean &, |, ~, <<, >> then those are
> sometimes useful and surely quite cheap to provide?
>

Go read PPC ISA:
http://www-3.ibm.com/chips/techlib/techlib.nsf/techdocs/852569B20050FF778525699600719DF2/$file/6xx_pem.pdf


> >Full orthogonality is not necessary. E.g. division and rotate
> >don't have to be allowed on all GP registers.
>
> Division and rotate aren't needed at all. Is anything gained by
> putting them in but only on certain registers, other than making the
> chip and software tools more likely to contain errors?
>
> >2. Expanded addressing modes -i386 addressing modes looks good for me.
>
> To what purpose? The x86 has to use a load-store architecture
> internally anyway.
>
> >Non-compromised immediate support is must.
>
> Why? It's not needed for performance.
>

It is needed for compactness


> >3. Small general-purpose register file. My vote goes for 8 GP
> >registers + SP, i.e. one more register than x86.
>
> 8 isn't enough, you just end up having to put frequently used values
> in memory. You want 16 general purpose integer registers. (Maybe more
> if they're cheap, but definitely at least 16.)
>

I have never seen proof that 8 is not enough. In my experience 6 is
often not enough. 7 is rarely not enough. 15 is always enough. I
personally have never programmed for an architecture with 8 GP registers.
My feeling - it's enough more than 90% of the time.


> >4. Variable-length instruction word, 1 byte granularity.
>
> Why? All that decoding logic costs development money, wastes power and
> worsens the penalty for a branch mispredict, to little benefit.
>

I don't think you can invest development money and power better when
performance is the ultimate goal.
Your statement about branch mispredicts is only partially correct. You
pay more if the branch target is not in the L1 I-cache/trace cache yet.
Otherwise you pay the same penalty.


> >5. Three main instruction formats
>
> Why? You just end up having to decode them into a nice regular format
> internally anyway.
>

It is needed for compactness


> >There are other questions like:
> >How many sets of flags/conditional codes registers are needed ?
>
> Is there any good reason to have flag registers at all? Don't some
> RISC chips benefit from not having them?

I don't know. That's why I mentioned condition-code registers as an
alternative.


>
> >What is better - distinct FP and SIMD register files of FP registers
> >as aliases over SIMD registers ?
>
> If you mean MMX style integer-only SIMD, that's pretty useless, get
> rid of it.
>

You are wrong. MMX was extremely useful in at least one of my projects
(live video processing). Something better-designed like Altivec is
even more useful. Today almost the only justification for a faster personal
computer is its improved DSP ability. A P-III 500MHz is good enough for
everything else. 16-bit integer SIMD is significantly faster than
32-bit FP SIMD, and 16 bits are often sufficient. Even 8-bit SIMD is
useful for many video applications.


> If you mean floating point SIMD, that's useful, but should presumably
> be integrated with non-SIMD FP.

The opinion of somebody who doesn't know much about numeric computing
can't hold much weight.
I see arguments for both distinct and aliased FP/SIMD register files.

Chris Morgan

Dec 13, 2002, 10:38:33 AM
already...@yahoo.com (Michael S) writes:

> It is obvious that you don't agree with my basic idea: the most
> important thing about ISA is a compactness of code. Implementation can
> compensate for almost everything else *.

Hmmm, isn't the VAX existence proof that this statement is too strong?
--
Chris Morgan

"Not so bad offer to discuss about"

- Best recent email spam subject line

Dennis M O'Connor

Dec 13, 2002, 10:56:10 AM
"Michael S" <already...@yahoo.com> wrote...

> It is obvious that you don't agree with my basic idea: the most
> important thing about ISA is a compactness of code.

And yet, there are no commercially successful (indeed, maybe
none at all) ISAs that have Huffman-encoded, fully-variable-length
instructions, which would be denser than any existing ISA.
Indeed, the trend seems to be away from instructions like
the extremely dense "Evaluate Polynomial" instruction found
on the milestone VAX architecture.
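
(To put toy numbers on the Huffman point -- mine, not Dennis's: with four
opcodes used 50%, 25%, 12.5% and 12.5% of the time, Huffman codes of 1, 2,
3 and 3 bits average 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3 = 1.75 bits
against 2 bits for a fixed field. A modest saving, bought with a
bit-serial, sequentially dependent decode, which goes some way to
explaining why nobody ships such an ISA.)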

> Implementation can compensate for almost everything else *.

Ever done an implementation ? A real one ?

Russell Wallace

Dec 13, 2002, 12:06:46 PM
On 13 Dec 2002 06:03:32 -0800, already...@yahoo.com (Michael S)
wrote:

>It is obvious that you don't agree with my basic idea: the most


>important thing about ISA is a compactness of code. Implementation can
>compensate for almost everything else *.

You're right, I don't :) Why do you believe compactness of code is
important? RAM is cheap, and inner loops normally fit in I-cache. (And
you want the I-cache to store the decoded version anyway unless you
want to have to repeat the decode every time around the inner loop.)

Besides, if you really want compactness of code...

>* Of coarse, stack based machines has even more compact code. However,
>the second important thing about ISA for high-performance computer is
>an ability to expose as much parallelism as possible to the hardware.
>Stack-based machines aren't very good at that, I believe.

Look at Chuck Moore's chip - it's called x25 or something like that -
25 simple stack-based cores in parallel on a tiny chip, and I think
each core is 4-way superscalar. Not much good for numeric computing of
course, but it does an amazing job of fitting your two primary
criteria :)

>Go read PPC ISA:

Oh okay, I agree bit fields in that sense could probably be thrown
away.

>> >Non-compromised immediate support is must.
>>
>> Why? It's not needed for performance.
>>
>It is needed for compactness

It's not even needed for compactness. Large immediates are rare
statically as well as dynamically, so not supporting them won't have a
significant impact on compactness.

>I never seen the proof that 8 is not enough. In my experience 6 is
>often not enough. 7 is rarely not enough. 15 is always enough. I
>personally never programmed for the architecture with 8 GP registers.
>My feeling - it's enough more than 90% of the time.

I've written assembler for architectures with both 8 and 16 integer
registers (386 and 68000). Empirically, I find 16 is near enough to
always sufficient as makes no difference, but 8 definitely isn't. Take
a look at some output from Microsoft's C++ compiler on integer code;
you'll see most of the time it doesn't quite manage to squeeze all the
inner loop variables into registers.

>You statement about branch mispredicts is only partially correct. You
>pay more if the branch target is not in the L1-Icache/Trace cache yet.
>Otherwise you pay the same penalty.

So if the branch target is in I-cache, the pipeline flush doesn't
include the decoder stage? Okay, then I stand corrected on that.
However, the extra decoder logic still costs design complexity and
heat dissipation.

>You are wrong. MMX was extremely usefull in at least one my projects
>(live video processing).

Sure, I can see that a 200 MHz Pentium with MMX would be a lot more
useful for video processing with MMX than without it. But wouldn't a
Pentium 4 be perfectly adequate even without MMX?

>Something better-disigned like Altivec is
>even more usefull. Today almost only justification for faster personal
>computer is its improved DSP ability.

Seems to me the big drivers for faster machines these days are graphic
applications and games (which are mostly simulation, AI and graphics).
Then again I could be wrong - is there something important that
stresses a modern machine hard and for which 16 bit integer SIMD is
useful? (Even some video chips are starting to use floating point
triplets for pixels.)

>> If you mean floating point SIMD, that's useful, but should presumably
>> be integrated with non-SIMD FP.
>The opinion of somebody who don't know much about numeric computing
>can't hold much weight.

Which is why I qualified my statement with "presumably" and ended
with:

>> >What is a right size of FP/SIMD register file ? Sure the range must be
>> >from 8 to 64 but how many exactly ?
>>
>> I'll let someone who knows more about numerical computing than I do
>> answer that one.

^.^

Robert Klute

Dec 13, 2002, 12:44:00 PM
On Fri, 13 Dec 2002 08:56:10 -0700, "Dennis M O'Connor"
<dm...@primenet.com> wrote:

>"Michael S" <already...@yahoo.com> wrote...
>> It is obvious that you don't agree with my basic idea: the most
>> important thing about ISA is a compactness of code.
>
>And yet, their are no commercially-successful (indeed, maybe
>none at all) ISAs that have Huffman-encoded, fully-variable-length
>instructions, which would be denser than any existing ISA.
>Indeed, the trends seems to be away from instructions like
>the extremely dense "Evaluate Polynomial" instruction found
>on the milestone VAX architecture.

The VAX was the last hurrah for the CISC architecture. Not that CISC has
disappeared, but it was the biggest and most complex of its
breed. (Although one could argue that the NS32000 should have that
dubious honor.) CISC evolved during the period when writing and
tuning code in assembly by hand was common. During the 70s the transition to
3rd-generation languages and compilers for almost all code written
spelled the end of CISC evolution and the beginning of RISC.

Compiler writers didn't bother looking for the special cases where
instructions like 'evaluate polynomial' would work. RISC was able to
take advantage of this and reduce the instruction set to only those
instructions the compiler writers of the time were using. The flaw in
this is that some infrequently used instructions cannot be emulated
efficiently with alternate sequences - such as semaphore and control
instructions.
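
As a concrete example of the kind of instruction that fell out of use:
VAX's POLY evaluated a whole polynomial from a coefficient table in one
(long-running) instruction. What compilers actually emitted, then and
now, is the plain Horner loop below (a sketch of the computation only,
not of the VAX operand format):

/* Horner's rule: the work POLY packaged into a single instruction. */
double poly_eval(const double *coeff, int degree, double x)
{
    double r = coeff[degree];              /* highest-order coefficient */
    for (int i = degree - 1; i >= 0; i--)
        r = r * x + coeff[i];              /* one multiply-add per term */
    return r;
}

Since the loop is short, easy to pipeline and easy for a compiler to
generate, the dense special-case instruction bought little in practice.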

Paul DeMone

Dec 13, 2002, 6:40:11 PM

Chris Morgan wrote:
>
> already...@yahoo.com (Michael S) writes:
>
> > It is obvious that you don't agree with my basic idea: the most
> > important thing about ISA is a compactness of code. Implementation can
> > compensate for almost everything else *.
>
> Hmmm, isn't the VAX existence proof that this statement is too strong?

If that doesn't work then trot out the NS32000 family. If not then
go nuclear as a last resort with the example of the iAPX432.

--
Paul W. DeMone The 801 experiment SPARCed an ARMs race of EPIC
Kanata, Ontario proportions to put more PRECISION and POWER into
pde...@igs.net architectures with MIPSed results but ALPHA's well
that ends well.

del cecchi

Dec 13, 2002, 8:03:40 PM

"Paul DeMone" <pde...@igs.net> wrote in message
news:3DFA6FDB...@igs.net...

In fact any RISC processor caused code bloat, if I recall the RISC
wars of days gone by correctly. Wasn't the expansion from VAX to Alpha at least
2X?

del cecchi (msg not worth snipping)


Paul DeMone

Dec 14, 2002, 1:07:13 AM

del cecchi wrote:
[...]


> In fact any risc processor caused a code bloat, if I recall the RISC
> wars of days gone by. Wasn't the expansion from VAX to Alpha at least
> 2X?

Something like that. Keep in mind that people tend to just look
at the size of executable files and call the ratio code expansion.
But there are more things in an executable file than just program
machine code. For example Alpha executables typically used 64 bit
addresses in their linkage tables while VAX was strictly 32 bit.

Greg Lindahl

Dec 14, 2002, 5:20:09 AM
In article <3DFACA91...@igs.net>, Paul DeMone <pde...@igs.net> wrote:

>Something like that. Keep in mind that people tend to just look
>at the size of executable files and call the ratio code expansion.
>But there are more things in an executable file than just program
>machine code. For example Alpha executables typically used 64 bit
>addresses in their linkage tables while VAX was strictly 32 bit.

Newer compilers also are more aggressive at loop unrolling and
software pipelining, both of which produce significantly larger code
and higher performance. It's hard to make an apples to apples
comparison of code size.
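
An illustration (mine, not Greg's) of how much code an aggressive
compiler is happy to spend: the 4x-unrolled version below does the same
work as the simple loop but is roughly four times as many instructions,
plus a cleanup loop, so executable-size comparisons across compiler
generations say as much about optimization settings as about the ISA.

void saxpy(float *y, const float *x, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}

void saxpy_unrolled4(float *y, const float *x, float a, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {        /* unrolled body: ~4x the code  */
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    for (; i < n; i++)                  /* cleanup for the leftover 0-3 */
        y[i] += a * x[i];
}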

-- greg

Anton Ertl

Dec 14, 2002, 5:51:18 AM
Paul DeMone <pde...@igs.net> writes:
>
>
>del cecchi wrote:
>[...]
>> In fact any risc processor caused a code bloat, if I recall the RISC
>> wars of days gone by. Wasn't the expansion from VAX to Alpha at least
>> 2X?

You apparently remember some claims made in that war, but not the
empirical data that some people provided. Here's some data from some
postings I have made (ok, no VAX data, but the 386 should represent
CISCs nicely):

|Here's some data on code size of Gforth-0.5.0 on various systems
|("text" size reported by size; for HP/UX text size-$LIT$ size):
|
| engine-fast.o main-fast.o gcc version
|386 Linux libc5 21184 6523 2.95.1
|386 Linux libc6 21441 8764 2.7.2.3
|386 Linux libc6.1 16744 9214 2.95.2 --enable-force-reg
|SPARC SunOS4 18976 7640 2.5.8
|SPARC Solaris 20560 9446 2.8.1
|MIPS Ultrix 19552 5088 2.4.5
|MIPS Irix6.2 25124 8652 2.95
|Alpha DU 24464 9568 2.95.1
|Alpha Linux 22624 11968 egcs-1.0.3
|HPPA HP/UX 14952 8048 2.95.2
|PowerPC Linux 19168 10204 2.95.2

The only factor >=2 I see here is between MIPS Ultrix and Alpha Linux
on main-fast.o, and that's probably coming from the gcc version or the
library (watch main-fast.o on 386 Linux between the libc versions).

>Something like that. Keep in mind that people tend to just look
>at the size of executable files and call the ratio code expansion.
>But there are more things in an executable file than just program
>machine code. For example Alpha executables typically used 64 bit
>addresses in their linkage tables while VAX was strictly 32 bit.

I have also some figures that include data sizes:

|While I'm at it, let's look how this benchmark turns out currently
|(Emacs-20.3 under Linux):
|
| code insts c/i total size
|386, RedHat-5.2: 890798 318230 2.80 2985749
|PowerPC, YellowDog-1.0: 1074320 267477 4.02 3170313
|Alpha, RedHat-5.2: 1353008 337701 4.01 4890241

Neither a factor >=2 in code nor in data or total size (but the factor
between 32-bit and 64-bit data sizes is about 1.7 in this case).

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Nick Maclaren

Dec 14, 2002, 6:40:06 AM
In article <2002Dec1...@a0.complang.tuwien.ac.at>,

Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>Paul DeMone <pde...@igs.net> writes:
>>del cecchi wrote:
>>[...]
>>> In fact any risc processor caused a code bloat, if I recall the RISC
>>> wars of days gone by. Wasn't the expansion from VAX to Alpha at least
>>> 2X?
>
>You apparently remember some claims made in that war, but not the
>empirical data that some people provided. Here's some data from some
>postings I have made (ok, no VAX data, but the 386 should represent
>CISCs nicely):

NO, NO, NO!

Back in the early 1980s, the Intel 80x86 architecture was notorious
for causing code bloat. It was typically 50% larger than 68K or
VAX. I didn't keep the figures I measured, or the ones that other
people posted, but my recollection matches Del's.

Back then, RISC was worse still. But, as I have just posted, RISC
architectures have been improved a lot since then. And, as someone
else posted, the trend in compilers has been to spend code size to
get performance - for both x86 and RISC.

I am pretty sure that RISC does cause code bloat, but the rejoinder
is "so what?" To which my reply is "so let's not be revisionist."
It is not a CURRENTLY important issue, but WAS one when RISC started
to take over from CISC, so is a matter of HISTORICAL importance.


Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email: nm...@cam.ac.uk
Tel.: +44 1223 334761 Fax: +44 1223 334679

Terje Mathisen

unread,
Dec 14, 2002, 8:19:14 AM12/14/02
to
Nick Maclaren wrote:
> Back in the early 1980s, the Intel 80x86 architecture was notorious
> for causing code bloat. It was typically 50% larger than 68K or
> VAX. I didn't keep the figures I measured, or the ones that other
> people posted, but my recollection matches Del's.

Nick, that's simply wrong.

Well-written 16-bit x86 code is _very_ small, not the least due to the
original 8088, where total execution time was almost directly
proportional to the number of code (not data!) bytes touched.

That there might have been some compilers that generated pretty horrible
code, saving and reloading every variable between every statement line,
doesn't make the architecture itself nearly as badly broken as you seem
to remember.

Anton Ertl

unread,
Dec 14, 2002, 8:02:09 AM12/14/02
to
nm...@cus.cam.ac.uk (Nick Maclaren) writes:
>In article <2002Dec1...@a0.complang.tuwien.ac.at>,
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>Paul DeMone <pde...@igs.net> writes:
>>>del cecchi wrote:
>>>[...]
>>>> In fact any risc processor caused a code bloat, if I recall the RISC
>>>> wars of days gone by. Wasn't the expansion from VAX to Alpha at least
>>>> 2X?
>>
>>You apparently remember some claims made in that war, but not the
>>empirical data that some people provided. Here's some data from some
>>postings I have made (ok, no VAX data, but the 386 should represent
>>CISCs nicely):
>
>NO, NO, NO!
>
>Back in the early 1980s, the Intel 80x86 architecture was notorious
>for causing code bloat. It was typically 50% larger than 68K or
>VAX.

There is no Intel 80x86 architecture. Since you mentioned the early
80s, you might mean the 8086 or the 80286 architecture (with various
memory models for those). The 386 architecture (introduced 1985) is
quite different from these architectures, so any "early 80s 80x86"
data does not transfer to the 386.

>I didn't keep the figures I measured, or the ones that other
>people posted,

If such postings exist, groups.google.com should have kept them.

>Back then, RISC was worse still. But, as I have just posted, RISC
>architectures have been improved a lot since then.

In the early 1980s RISCs existed only as research projects. Anyway,
if I look at the numbers I presented, and compare the earlier RISC
architectures (MIPS, SPARC, HPPA) to the later ones (PowerPC, Alpha),
I don't see a trend toward smaller code sizes.

Nick Maclaren

unread,
Dec 14, 2002, 10:02:52 AM12/14/02
to
In article <atfc2p$ja3$1...@vkhdsu24.hda.hydro.com>,

Terje Mathisen <terje.m...@hda.hydro.com> wrote:
>Nick Maclaren wrote:
>> Back in the early 1980s, the Intel 80x86 architecture was notorious
>> for causing code bloat. It was typically 50% larger than 68K or
>> VAX. I didn't keep the figures I measured, or the ones that other
>> people posted, but my recollection matches Del's.
>
>Nick, that's simply wrong.

No, we are just looking at different aspects.

>Well-written 16-bit x86 code is _very_ small, not the least due to the
>original 8088, where total execution time was almost directly
>proportional to the number of code (not data!) bytes touched.
>
>That there might have been some compilers that generated pretty horrible
>code, saving and reloading every variable between every statement line,
>doesn't make the architecture itself nearly as badly broken as you seem
>to remember.

I should have said that I was referring to the sort of code that was
comparable between that architecture, 68K, SPARC and so on. Not
necessarily true 32-bit but with (say) a data segment of 200K. The
80386 was tolerable, but had the expansion ratio I mentioned relative
to (say) the 68020.

Earlier than that, comparing the 8086 versus the 68000, things were
much worse. And, even on fairly early machines, 256K of memory was
not rare.

Nick Maclaren

unread,
Dec 14, 2002, 10:15:06 AM12/14/02
to
In article <2002Dec1...@a0.complang.tuwien.ac.at>,
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>
>There is no Intel 80x86 architecture. Since you mentioned the early
>80s, you might mean the 8086 or the 80286 architecture (with various
>memory models for those). The 386 architecture (introduced 1985) is
>quite different from these architectures, so any "early 80s 80x86"
>data does not transfer to the 386.

Hmm. That is a matter of terminology, but see my response to Terje
Mathisen for clarification of what I meant.

>>I didn't keep the figures I measured, or the ones that other
>>people posted,
>
>If such postings exist, groups.google.com should have kept them.

Anyone is welcome to look. I am not going to do so.

>>Back then, RISC was worse still. But, as I have just posted, RISC
>>architectures have been improved a lot since then.
>
>In the early 1980s RISCs existed only as research projects. Anyway,
>if I look at the numbers I presented, and compare the earlier RICS
>architectures (MIPS, SPARC, HPPA) to the later ones (PowerPC, Alpha),
>I don't see a trend toward smaller code sizes.

You are right - I should have said mid-1980s. The reason you don't
see a trend is largely because there has been a trend for using more
code to get faster execution, which hides the compactness improvements.
So it is very hard to use current figures to say what would have
happened then.

For example, take the Alpha. In the very first release, any code
that used a lot of objects of less than 8 bytes had a LOT of its
code space taken up with sub-word selection. There are similar
cases with other operations, but that is the most obvious.

Dennis M O'Connor

unread,
Dec 14, 2002, 10:53:35 AM12/14/02
to
"Robert Klute" <robert_klut...@hp.com> wrote in ...

> The VAX was the last hurrah for the CISC architecture.

Indeed, if you read many of the early papers that were
the basis of the "RISC revolution", they are often about
what was wrong with the VAX.

>CISC evolved during the period that humans writing and
> tuning code in assembly was common.

More importantly, I think, it evolved back when the
implementation constraints were significantly different.

Paul DeMone

unread,
Dec 14, 2002, 11:19:23 AM12/14/02
to

Nick Maclaren wrote:
>
> In article <2002Dec1...@a0.complang.tuwien.ac.at>,
> Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> >
> >There is no Intel 80x86 architecture. Since you mentioned the early
> >80s, you might mean the 8086 or the 80286 architecture (with various
> >memory models for those). The 386 architecture (introduced 1985) is
> >quite different from these architectures, so any "early 80s 80x86"
> >data does not transfer to the 386.

Also remember that with the P6 onwards compiler writers have
been encouraged to code "RISC-style" favouring register to
register operations separate from loads and stores. This
would tend to increase code size.

[...]

> For example, take the Alpha. In the very first release, any code
> that used a lot of objects of less than 8 bytes had a LOT of its
> code space taken up with sub-word selection. There are similar
> cases with other operations, but that is the most obvious.

The first Alphas (EV4, EV45, EV5) lacked 8 and 16 bit load
and store instructions but did have 32 bit loads and stores.
Support for 8 and 16 bit loads and stores is in EV56, EV6x,
and EV7.

Programs that did a lot of string manipulation did tend to
expand quite a bit on Alpha compared to x86 and even other
RISCs in EV4x and EV5 days.

Michael S

unread,
Dec 14, 2002, 11:42:09 AM12/14/02
to
r...@vorpalbunnyeircom.net (Russell Wallace) wrote in message news:<3dfa1032....@news.eircom.net>...

> On 13 Dec 2002 06:03:32 -0800, already...@yahoo.com (Michael S)
> wrote:
>
> >It is obvious that you don't agree with my basic idea: the most
> >important thing about ISA is a compactness of code. Implementation can
> >compensate for almost everything else *.
>
> You're right, I don't :) Why do you believe compactness of code is
> important? RAM is cheap, and inner loops normally fit in I-cache. (And
> you want the I-cache to store the decoded version anyway unless you
> want to have to repeat the decode every time around the inner loop.)
>
Offchip access is already very costly and is going to be even more
costly in the future. So it is very important to improve the L2 hit
rate.

> Besides, if you really want compactness of code...
>
> >* Of coarse, stack based machines has even more compact code. However,
> >the second important thing about ISA for high-performance computer is
> >an ability to expose as much parallelism as possible to the hardware.
> >Stack-based machines aren't very good at that, I believe.
>
> Look at Chuck Moore's chip - it's called x25 or something like that -
> 25 simple stack-based cores in parallel on a tiny chip, and I think
> each core is 4-way superscalar. Not much good for numeric computing of
> course, but it does an amazing job of fitting your two primary
> criteria :)
I assume it is a Forth chip. Forth excels at 16 bits. 32-bit Forth is
probably less appealing from the code compactness point of view.
Where can I look at this chip?
...............

> >I never seen the proof that 8 is not enough. In my experience 6 is
> >often not enough. 7 is rarely not enough. 15 is always enough. I
> >personally never programmed for the architecture with 8 GP registers.
> >My feeling - it's enough more than 90% of the time.
>
> I've written assembler for architectures with both 8 and 16 integer
> registers (386 and 68000). Empirically, I find 16 is near enough to
> always sufficient as makes no difference, but 8 definitely isn't. Take
> a look at some output from Microsoft's C++ compiler on integer code;
> you'll see most of the time it doesn't quite manage to squeeze all the
> inner loop variables into registers.
>

i386 has 7 GP registers (ESP is not a GPR). I suggest 8 GPRs. Older MS
C++ compilers were unable to take advantage of EBP, effectively
reducing the number of GPRs to 6.


> >You statement about branch mispredicts is only partially correct. You
> >pay more if the branch target is not in the L1-Icache/Trace cache yet.
> >Otherwise you pay the same penalty.
>
> So if the branch target is in I-cache, the pipeline flush doesn't
> include the decoder stage? Okay, then I stand corrected on that.
> However, the extra decoder logic still costs design complexity and
> heat dissipation.

Of course, the pipeline flush includes the decode stage. However, it
doesn't include the *extra* decoding stages for variable-length
instructions.
IIRC, the expanded I-cache technique was pioneered by NexGen six or
seven years ago. Intel's NetBurst uArch uses this technique in
conjunction with a trace cache. I don't know who invented trace caches
and when. IMO it's a great idea which puts the last nail into the
coffin of fixed-length instruction words.
You're right about complexity, but it can be done. The variable-length
decoder isn't the most complex thing in a modern CPU. The out-of-order
execution engine is far more complex.
I don't know about heat dissipation. My guess is that the multiple
execution units consume an order of magnitude more power than this
additional decoder stage. I would be glad if somebody with numbers in
hand educated me on the issue.


>
> >You are wrong. MMX was extremely usefull in at least one my projects
> >(live video processing).
>
> Sure, I can see that a 200 MHz Pentium with MMX would be a lot more
> useful for video processing with MMX than without it. But wouldn't a
> Pentium 4 be perfectly adequate even without MMX?

No. We were able to process 256x256 areas at 20 fps with a P-II 266. It
is useful, but really we wanted 640x480 at 30 fps. Even a P-IV with MMX
can barely do it. Besides, the algorithm guys are always ready to add
one more filter :) If we were doing it today with the P-IV in mind we
would use the fixed-point capabilities of SSE2.


>
> >Something better-disigned like Altivec is
> >even more usefull. Today almost only justification for faster personal
> >computer is its improved DSP ability.
>
> Seems to me the big drivers for faster machines these days are graphic
> applications and games (which are mostly simulation, AI and graphics).
> Then again I could be wrong - is there something important that
> stresses a modern machine hard and for which 16 bit integer SIMD is
> useful? (Even some video chips are starting to use floating point
> triplets for pixels.)
>

FP is good for 3D. Fixed point is more appropriate for 2D work like
live video processing.

Anton Ertl

unread,
Dec 14, 2002, 12:03:16 PM12/14/02
to
nm...@cus.cam.ac.uk (Nick Maclaren) writes:
>The
>80386 was tolerable, but had the expansion ratio I mentioned relative
>to (say) the 68020.

I should know better, but I could not resist, and two minutes of
searching turned up the first posting containing data on 386 vs 68k
code size: <13...@steinmetz.ge.com>. Here's the relevant data:

| I checked the size of image files compiled from the same source on
|Xenix/386 v2.3.1 and SunOS 3.4 (on a 3/280 if anyone cares). I noted
|the size of the actual (stripped) file, and the values reported by
|"size." I used available source to insure that differences in the
|functions of SysV and BSD programs in bin would not skew the results.
|All compiles with -O.
|
|source CPU file text data bss
|zoo 2.01 386 55368 43560 10312 20904
| 68k 73728 57344 16384 24136
|compress 4.0 386 20300 13236 5964 442596
| 68k 32768 24576 8192 419456
|memacs 3.9n 386 93080 75768 16280 19592
| 68k 114688 90112 24576 18524

Your memory is obviously unreliable.

Stephen Fuld

unread,
Dec 14, 2002, 12:32:08 PM12/14/02
to
"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:atf5am$bdp$1...@pegasus.csx.cam.ac.uk...

snip

> I am pretty sure that RISC does cause code bloat, but the rejoinder
> is "so what?"

ISTM that larger code size could potentially lead to two kinds of problems.
One is the requirement for larger main memory, but given the low cost of
DRAM, that is clearly not significant (except for some embedded
applications). The other is performance, caused by a higher I-cache miss
rate. This could lower performance directly, as well as putting increased
bandwidth requirements on the memory interface, which is made worse by the
fact that each miss would bring in fewer instructions, so more misses would
be required to load the code for a given function.

So the question is, are I cache misses an important performance limiter in
any non-contrived workload today? If so, would having a smaller code
footprint make enough difference in the I cache hit rate to matter?

--
- Stephen Fuld
e-mail address disguised to prevent spam


Anton Ertl

unread,
Dec 14, 2002, 12:42:38 PM12/14/02
to
"Stephen Fuld" <s.f...@PleaseRemove.att.net> writes:
>So the question is, are I cache misses an important performance limiter in
>any non-contrived workload today?

Many papers report that transaction processing workloads have very
high I-cache miss rates, and spend a good part of their time waiting
for instructions.

> If so, would having a smaller code
>footprint make enough difference in the I cache hit rate to matter?

There is one paper on the 386 (IIRC in a Pentium II Xeon incarnation)
vs. Alpha architecture that claims that the Xeon beats some Alpha
(IIRC 21164) on transaction processing because of this. Unfortunately
I can't find the paper, but I remember that the authors were from
Intel, and one had written an NVAX vs. 21064 paper when he was at DEC
(nice backpedalling in the newer paper:-).

Nick Maclaren

unread,
Dec 14, 2002, 1:16:06 PM12/14/02
to
>nm...@cus.cam.ac.uk (Nick Maclaren) writes:
>>The
>>80386 was tolerable, but had the expansion ratio I mentioned relative
>>to (say) the 68020.
>
>I should know better, but I could not resist, and two minutes of
>searching turned up the first posting containing data on 386 vs 68k
>code size: <13...@steinmetz.ge.com>. Here's the relevant data:
>
>Your memory is obviously unreliable.

Perhaps. But I seem to remember that posting (or a quote of it)
starting a "statistics war" with lots of people posting the
converse. Certainly, it wasn't what I got.

Michael S

unread,
Dec 14, 2002, 1:36:02 PM12/14/02
to
"Dennis M O'Connor" <dm...@primenet.com> wrote in message news:<10397948...@nnrp2.phx1.gblx.net>...

> "Michael S" <already...@yahoo.com> wrote...
> > It is obvious that you don't agree with my basic idea: the most
> > important thing about ISA is a compactness of code.
>
> And yet, their are no commercially-successful (indeed, maybe
> none at all) ISAs that have Huffman-encoded, fully-variable-length
> instructions, which would be denser than any existing ISA.
I already described why I don't like the "compress during compilation -
expand at run-time" approach. This approach specifies the expanded form of
the instructions. It is much better for the longevity of an ISA to leave
the expanded form unspecified.
Or maybe you are talking about something else - like a non-compressed
ISA with one-bit granularity of instruction length?

> Indeed, the trends seems to be away from instructions like
> the extremely dense "Evaluate Polynomial" instruction found
> on the milestone VAX architecture.
>
I'm calling for a reduced instruction set too. But all the rest of the
RISC manifesto (load/store architecture, fixed-length instructions,
large architectural register file, 3-op instructions, full
orthogonality, branch delay slots) is IMO outdated.

> > Implementation can compensate for almost everything else *.
>
> Ever done an implementation ? A real one ?
No. So what ? 95% of the posters didn't.
BTW, I implemented special-purpose interpreters in FPGA several times.
I did it in software in almost every project. It's much (o.k.
very-very-very much) simpler activity than specifiing and implementing
genaral-purpose ISA, but defenitly has something in common.

Russell Wallace

unread,
Dec 14, 2002, 5:01:17 PM12/14/02
to
On 14 Dec 2002 08:42:09 -0800, already...@yahoo.com (Michael S)
wrote:

>Offchip access is already very costly and is going to be even more
>costly in the future. So it is very important to improve the L2 hit
>rate.

Yes, but L2 miss is dominated by data access. Code inner loops usually
fit in L1, never mind L2, even if you haven't made any attempt to make
them compact.

If you really want to improve L2 hit rate, concentrate on making the
chip faster so programmers can afford to use data formats that trade
off speed for compactness.

>> Look at Chuck Moore's chip - it's called x25 or something like that -
>> 25 simple stack-based cores in parallel on a tiny chip, and I think
>> each core is 4-way superscalar. Not much good for numeric computing of
>> course, but it does an amazing job of fitting your two primary
>> criteria :)
>I assume it is a Forth chip.

Yep.

>Forth excells at 16 bits. 32-bit Forth is
>probably less appealing from the code compactness point of veiw.

If you design an instruction set with code compactness in mind, you
can get very compact Forth code and still retain the ability to handle
32 bit addresses.

>Where can I look at this chip ?

Don't remember, try a Google search or check on comp.lang.forth.

>i386 has 7 GP registers (ESP is not GPR). I suggest 8 GPRs. Older MS
>C++ compilers were unable to take advantage of EBP, effectively
>reducing number of GPRs to 6.

8 still isn't enough. 12, say, might be; but if you're going to go
past 8, you might as well go to 16.

>You're right about complexity, but it can be done. Variable-length
>decoder isn't the most complex thing in the modern CPU. Out-of-order
>execution engine is far more complex.

True, but you can't scrap out of order execution if you want best
performance on typical object-oriented spaghetti code. You can,
however, scrap variable length instructions if you don't need x86
compatibility.

>I don't know about heat dissipation. My guess, multiple excution
>unites consume order of magnitude more power than this additional
>decoder stage. I would be glad I somebody with numbers in hand
>educates me on the issue.

Me too.



>No. We were able to process 256x256 areas at 20 fps with P-II 266. It
>is usefull, but really we wanted 640x480 at 30 fps. Even P-IV with MMX
>barely can do it. Besides, algorithm guys are always ready to add one
>more filter :) If we were doing it today with P-IV in mind we would
>use fix-point capabilities of SSE2.

So fixed point SIMD is still useful for video, okay, fair enough.

Stephen Fuld

unread,
Dec 15, 2002, 12:21:40 AM12/15/02
to

"Anton Ertl" <an...@mips.complang.tuwien.ac.at> wrote in message
news:2002Dec1...@a0.complang.tuwien.ac.at...

> "Stephen Fuld" <s.f...@PleaseRemove.att.net> writes:
> >So the question is, are I cache misses an important performance limiter
in
> >any non-contrived workload today?
>
> Many papers report that transaction processing workloads have very
> high I-cache miss rates, and spend a good part of their time waiting
> for instructions.
>
> > If so, would having a smaller code
> >footprint make enough difference in the I cache hit rate to matter?
>
> There is one paper on the 386 (IIRC in a Pentium II Xeon incarnation)
> vs. Alpha architecture that claims that the Xeon beats some Alpha
> (IIRC 21164) on transaction processing because of this. Unfortunatley
> I don't find the paper, but I remember that the authors were from
> Intel, and one had written an NVAX vs. 21064 paper when he was at DEC
> (nice backpaddling in the newer paper:-).
>

Thanks Anton. So perhaps there is something to be gained from various code
size reduction techniques, at least for transaction processing. But I
suspect that most transaction processing is so I/O bound that improving
processor utilization isn't much of an issue. Two other related notes.
There may be something to the fact that the two currently produced general
purpose processors with the most presence in transaction processing, x86
and S/390 (aka the Z series), are both CISC machines - or perhaps not.
Also, if that is a factor in x86's success in transaction workloads, then
Intel is giving that up with IA-64. My guess is that it isn't that much of
a factor.

Dennis M O'Connor

unread,
Dec 15, 2002, 1:38:14 AM12/15/02
to
"Michael S" <already...@yahoo.com> wrote ...
> "Dennis M O'Connor" <dm...@primenet.com> wrote ...

> > "Michael S" <already...@yahoo.com> wrote...
> > > It is obvious that you don't agree with my basic idea: the most
> > > important thing about ISA is a compactness of code.
> >
> > And yet, their are no commercially-successful (indeed, maybe
> > none at all) ISAs that have Huffman-encoded, fully-variable-length
> > instructions, which would be denser than any existing ISA.
>
> I already described why I don't like "compress during compilation -
> expand in run-time" approach. [...]

> Or may be you are talking about something else - like non-compressed
> ISA with one-bit granularity of the instructions lenght ?

Yep. I was talking about a bit-variable-length ISA that used static
or dynamic (your choice) frequency of occurrence to select the
length of each instruction, and thereby sought to maximize information
density and therefore minimize code size. You said this was
the most important thing to do. Yet no one AFAIK does it.

I once proposed taking the 32-bit i960 ISA and converting it
to 21 bits per instruction, with 3 instructions per naturally-
aligned 64-bit memory block (and one spare bit). It would have
worked pretty well, and I was not surprised to see the EPIC ISA
adopt a 3-insts-per-128-bits format a few years later.
[I don't know if they got the idea (indirectly) from me or not.]
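
A minimal C sketch of the arithmetic of that kind of format (illustrative
only; the field layout here is an assumption, not the actual i960
proposal):

    #include <stdint.h>

    #define INST_BITS 21u
    #define INST_MASK ((1u << INST_BITS) - 1u)   /* 0x1FFFFF */

    /* Pack three 21-bit instructions into one 64-bit block; bit 63 is spare. */
    uint64_t pack3(uint32_t i0, uint32_t i1, uint32_t i2)
    {
        return  (uint64_t)(i0 & INST_MASK)
              | ((uint64_t)(i1 & INST_MASK) << INST_BITS)
              | ((uint64_t)(i2 & INST_MASK) << (2 * INST_BITS));
    }

    /* Extract instruction 'slot' (0, 1 or 2) from a block. */
    uint32_t unpack(uint64_t block, int slot)
    {
        return (uint32_t)((block >> (slot * INST_BITS)) & INST_MASK);
    }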

> > Indeed, the trends seems to be away from instructions like
> > the extremely dense "Evaluate Polynomial" instruction found
> > on the milestone VAX architecture.
> >
> I'm calling for reduced instruction set too.

But why ? If you take common instruction pairs and
combine them into one new instruction, you'll get
more compact code. You said this was of supreme
importance. So why deviate from that objective just to
simplify the HW decoders or the compiler code generators ?
Unless compactness of code _isn't_ the most important thing ...

> > Ever done an implementation ? A real one ?
> No. So what ? 95% of the posters didn't.

Being in the 5%, I'm painfully aware of that fact.

Michael S

unread,
Dec 15, 2002, 4:25:59 AM12/15/02
to
r...@vorpalbunnyeircom.net (Russell Wallace) wrote in message news:<3dfba7ea....@news.eircom.net>...

> >i386 has 7 GP registers (ESP is not GPR). I suggest 8 GPRs. Older MS
> >C++ compilers were unable to take advantage of EBP, effectively
> >reducing number of GPRs to 6.
>
> 8 still isn't enough. 12, say, might be; but if you're going to go
> past 8, you might as well go to 16.
>
Your previous i386 experience is irrelevant for two reasons:
1. 8 GPRs are supposed to do much better than 7.
2. More importantly, you're probably talking about difficulties with
integer code, where both indices/loop counters and data variables use
the same 7 GPRs. In "my" architecture you use GPRs only for indices,
loop counters etc... Data goes into the SIMD register file, because the
ISA provides proper support for scalar operations on those registers. In
effect, the pressure on GPRs is much closer to what you have on a
pre-68020 68K or with FP code on the i386.
My "intuitive" answer was 12 GPRs, like yours. However, taking into
account the two points mentioned above, I realized that 8 is enough.

Russell Wallace

unread,
Dec 15, 2002, 7:27:56 AM12/15/02
to
On 15 Dec 2002 01:25:59 -0800, already...@yahoo.com (Michael S)
wrote:

>Your previous i386 experience is irrelevant for two reasons:
>1. 8 GPRs are supposed to do much better than 7.

They don't. (Honestly, I would have thought it was obvious that
they're not going to except in rare cases.)

>2. More important, you're probably talking about difficulties with
>integer code, where both indices/loop counters and data variables use
>the same 7 GPRs.

Yes.

>In "my" architecture you use GPRs only for indices,
>loop counters etc... Data goes to SIMD register file, because ISA
>provides proper support for scalar operation on this registers. In
>effect, the pressure on GPRs is much closer to what you have on
>pre-68020 68K or FP code on i386.

Oh okay, fine. That meets the requirement of having more than 8
registers for working with integer values, so that's no problem then
^.^

Nick Maclaren

unread,
Dec 15, 2002, 7:32:01 AM12/15/02
to
In article <f881b862.0212...@posting.google.com>,

Michael S <already...@yahoo.com> wrote:
>>
>I already described why I don't like "compress during compilation -
>expand in run-time" approach. This approach specifies expanded form of
>the instructions. It is much better for longevity of ISA to leave
>expanded form non-specified.

Yes. That is precisely what IBM did on the System/360 and /370. The
systems that were microcoded expanded the published ISA into VLIW
or other unpublished microcode.

I believe that "general purpose" compression is precisely the wrong
approach, both because it is a very poor way of getting special
purpose compression and because it loses most of the other advantages
of an ISA designed to help the compiler and run-time systems writers.
I would favour an ISA (perhaps configurable by the process) of that
form, designed to macro expand into a lower level one designed for
fast, easy, parallel and reliable hardware execution.

Not a new idea, after all.

Michael S

unread,
Dec 15, 2002, 10:59:24 AM12/15/02
to
"Dennis M O'Connor" <dm...@primenet.com> wrote in message news:<10399341...@nnrp1.phx1.gblx.net>...

> "Michael S" <already...@yahoo.com> wrote ...
> > "Dennis M O'Connor" <dm...@primenet.com> wrote ...
> > > "Michael S" <already...@yahoo.com> wrote...
> > > > It is obvious that you don't agree with my basic idea: the most
> > > > important thing about ISA is a compactness of code.
> > >
> > > And yet, their are no commercially-successful (indeed, maybe
> > > none at all) ISAs that have Huffman-encoded, fully-variable-length
> > > instructions, which would be denser than any existing ISA.
> >
> > I already described why I don't like "compress during compilation -
> > expand in run-time" approach. [...]
> > Or may be you are talking about something else - like non-compressed
> > ISA with one-bit granularity of the instructions lenght ?
>
> Yep. I was talking about a bit-variable-length ISA that used static
> or dynamic (your choice) frequency of occurrence to select the
> length of each instruction, and thereby seek to maximize information
> density and therefor minimize code size.
>
It's a very interesting idea. Now, Dennis, I'm asking you as one who has
done an implementation (a real one :). How many pipeline stages
are required to implement a bit-variable-length decoder in a manner
which doesn't compromise the cycle time of an aggressively pipelined
design? I mean really aggressive pipelining - P4-style.

There is another interesting question:
Let's take a branch which was correctly predicted when encountered for
the first time. What happens when this branch goes the "other way"? The
simple solution is to create a brand new trace in the trace cache.
Probably this solution is too simple. More likely, we would want to
find a new target in the trace cache. There is a good chance it's
already here. Even if it's not here, it's desirable to merge the new
trace with an existing one later. I see only one solution for this
problem - a block of CAM holding linear addresses of the original
instructions together with their corresponding addresses in the trace
cache. If there are better solutions, please educate me about them.
The key field of this CAM for the bit-variable-length ISA is 3 bits
wider than for byte-variable-length ISA. I don't know if it is a big
deal or not.


>You said this was the most important thing to do. Yet no one AFAIK
>does it.

I'm not aware of any *new* ISA from a big player which is specifically
targeted at 20M to 200M transistors, other than IA-64. Even IA-64 was
probably originally targeted to perform reasonably well at 10M
transistors. Semiconductor technology only very recently hit the state
in which the size of both the L1 and L2 on-chip caches is limited by
access speed rather than by the transistor budget. Maybe it hasn't even
hit it yet? Up to that point the original RISC idea of investing
transistors in caches instead of decoders was appealing. It doesn't
work any more.


> I once proposed taking the 32-bit i960 ISA and converting it
> to 21 bits per instruction, with 3 instructions per naturally-
> aligned 64-bit memory block (and one spare bit). It would have
> worked pretty well, and I was not surprised to see the EPIC ISA
> adopt a 3-insts-per-128-bits format a few years later.
> [I don't know if they got the idea (indirectly) from me or not.]
>
> > > Indeed, the trends seems to be away from instructions like
> > > the extremely dense "Evaluate Polynomial" instruction found
> > > on the milestone VAX architecture.
> > >
> > I'm calling for reduced instruction set too.
>
> But why ? If you take common instruction pairs and
> combine them into one new instruction, you'll get
> more compact code. You said this was of supreme
> importance. So why deviate from that objective just to
> simplify the HW decoders or the compiler code generators ?
> Unless compactness of code _isn't_ the most important thing ...
>

What pair of instructions is common enough? Remember, "op reg, [mem]"
is already included.
Oh, I understand. You want to say that improving the compactness of code
by 0.1% is less important than simplifying the decoder by a factor
of two. Of course, I agree. But what if you can improve compactness
by 20%? Isn't it worth complicating the decoder by a factor of 5?
I never questioned the fact that engineering is an art of compromises.


> > > Ever done an implementation ? A real one ?
> > No. So what ? 95% of the posters didn't.
>
> Being in the 5%, I'm painfully aware of that fact.

You have to like complex ISAs. It's your bread and butter after all.

Bernd Paysan

unread,
Dec 14, 2002, 4:34:23 PM12/14/02
to
Russell Wallace wrote:
>>* Of coarse, stack based machines has even more compact code. However,
>>the second important thing about ISA for high-performance computer is
>>an ability to expose as much parallelism as possible to the hardware.
>>Stack-based machines aren't very good at that, I believe.
>
> Look at Chuck Moore's chip - it's called x25 or something like that -
> 25 simple stack-based cores in parallel on a tiny chip, and I think
> each core is 4-way superscalar.

The cores are not superscalar. Maybe you mix that up with my 4stack
processor (where the - single - core contains four stacks and ALUs, and
performs four operations in parallel). However, each instruction word of
Chuck's x25 core (the c18) contains four instructions, which are extracted
and executed sequentially. That allows the on-chip SRAM to be up to four
times slower than the core clock frequency.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Robert

unread,
Dec 15, 2002, 12:09:46 PM12/15/02
to

"Stephen Fuld" <s.f...@PleaseRemove.att.net> wrote in message
news:EjUK9.62625$hK4.5...@bgtnsc05-news.ops.worldnet.att.net...

>
>
> Thanks Anton. So perhaps there is something to be gained from various
code
> size reduction techniques, at least for transaction processing. But I
> suspect that most transaction processing is so I/O bound that improving
> processor utilization isn't much of an issue. Tow other related notes.
> There may be something to the two currently produced general purpose
> processors that have the most presence in transaction procesing, X86 and
> S/390 (aka Z series) are both cisc machinew - or perhaps not. Also, if
that
> is a factor in X86's success in transaction workloads, then Intel is
giving
> that up with IA-64. My guess is that it isn't that much of a factor.
>

Check this out:
http://research.microsoft.com/research/pubs/view.aspx?tr_id=609

Seems like it could give the cache benefits, but without a totally new ISA.

Food for thought..


Terje Mathisen

unread,
Dec 15, 2002, 1:47:12 PM12/15/02
to
Nick Maclaren wrote:
> In article <atfc2p$ja3$1...@vkhdsu24.hda.hydro.com>,
> Terje Mathisen <terje.m...@hda.hydro.com> wrote:
>
>>Nick Maclaren wrote:
>>
>>>Back in the early 1980s, the Intel 80x86 architecture was notorious
>>>for causing code bloat. It was typically 50% larger than 68K or
>>>VAX. I didn't keep the figures I measured, or the ones that other
>>>people posted, but my recollection matches Del's.
>>
>>Nick, that's simply wrong.
>
>
> No, we are just looking at different aspects.
>
>
>>Well-written 16-bit x86 code is _very_ small, not the least due to the
>>original 8088, where total execution time was almost directly
>>proportional to the number of code (not data!) bytes touched.
>>
>>That there might have been some compilers that generated pretty horrible
>>code, saving and reloading every variable between every statement line,
>>doesn't make the architecture itself nearly as badly broken as you seem
>>to remember.
>
>
> I should have said that I was referring to the sort of code that was
> comparable between that architecture, 68K, SPARC and so on. Not
> necessarily true 32-bit but with (say) a data segment of 200K. The
> 80386 was tolerable, but had the expansion ratio I mentioned relative
> to (say) the 68020.

Nick, this is still wrong:

The 386 first came on the market in 1986, in the form of a Compaq machine.

1986 is definitely _not_ what I'd call 'early 1980s'.

> Earlier than that, comparing the 8086 versus the 68000, things were
> much worse. And, even on fairly early machines, 256K of memory was
> not rare.

Wrong again:

The only exception to the size rule was for compiled code that needed to
access single data structures larger than the 64K x86 limit.

I.e. I wrote _lots_ of 16-bit programs during the eighties that used and
required much more than 64 K RAM.

The code size explosion only happened for C and Fortran programs
compiled according to the 'Huge model', where every single pointer
modification required multiple segment/offset modifications.

Nick Maclaren

unread,
Dec 15, 2002, 2:04:41 PM12/15/02
to
In article <atiinh$86n$1...@vkhdsu24.hda.hydro.com>,

Terje Mathisen <terje.m...@hda.hydro.com> wrote:
>
>Nick, this is still wrong:
>
>The 386 first came on the market in 1986, in the form of a Compaq machine.
>
>1986 is definitely _not_ what I'd call 'early 1980s'.

True. I agree that I was being monumentally obscure, by referring
to two different periods without saying so! What I posted was indeed
misleading to the point of being wrong.

>> Earlier than that, comparing the 8086 versus the 68000, things were
>> much worse. And, even on fairly early machines, 256K of memory was
>> not rare.
>
>Wrong again:
>
>The only exception to the size rule was for compiled code that needed to
>access single data structures larger than the 64K x86 limit.
>
>I.e. I wrote _lots_ of 16-bit programs during the eighties that used and
>required much more than 64 K RAM.
>
>The code size explosion only happened for C and Fortran programs
>compiled according to the 'Huge model', where every single pointer
>modification required multiple segment/offset modifications.

That's not a fair comparison. To compare the code size of 80286
assembler, or even high-level languages hand-crafted to fit into
its horrible addressing models, with the compilation of clean,
portable high-level codes obviously makes the former look good.
Do you REALLY write significant applications for RISC systems down
at that level?

Yes, of course I was assuming the 'huge' model, because that was
the one that was comparable with what the 68000 had delivered from
day one, most of the other workstation chips used (what were they,
now? I used one ....) and almost all RISC systems used from day
one. I had been writing portable code for 15 years by then, and
had no intention of going back to the practices of the early 1960s.

If you want to compare your hand-crafted code with a RISC system,
then the fair comparison is with ARM with Thumb set support.

Russell Wallace

unread,
Dec 15, 2002, 2:28:13 PM12/15/02
to
On Sat, 14 Dec 2002 22:34:23 +0100, Bernd Paysan <bernd....@gmx.de>
wrote:

>Russell Wallace wrote:
>>
>> Look at Chuck Moore's chip - it's called x25 or something like that -
>> 25 simple stack-based cores in parallel on a tiny chip, and I think
>> each core is 4-way superscalar.
>
>The cores are not superscalar. Maybe you mix that up with my 4stack
>processor (where the - single - core contains four stacks and ALUs, and
>performs four operations in parallel). However, each instruction word of
>Chuck's x25 core (the c18) contains four instructions, which are extracted
>and executed sequentially. That allows the on-chip SRAM to be up to four
>times slower than the core clock frequency.

Oh, okay. I knew the x25 fetched one word and executed its four
instructions per SRAM cycle, didn't know it did them in sequence
rather than in parallel. Same result anyway, I guess :)

Michael S

unread,
Dec 15, 2002, 6:27:18 PM12/15/02
to
already...@yahoo.com (Michael S) wrote in message news:<f881b862.02121...@posting.google.com>...

> "Dennis M O'Connor" <dm...@primenet.com> wrote in message news:<10399341...@nnrp1.phx1.gblx.net>...
> > "Michael S" <already...@yahoo.com> wrote ...
> > > "Dennis M O'Connor" <dm...@primenet.com> wrote ...
> > > > "Michael S" <already...@yahoo.com> wrote...
> > > > > It is obvious that you don't agree with my basic idea: the most
> > > > > important thing about ISA is a compactness of code.
> > > >
> > > > And yet, their are no commercially-successful (indeed, maybe
> > > > none at all) ISAs that have Huffman-encoded, fully-variable-length
> > > > instructions, which would be denser than any existing ISA.
> > >
> > > I already described why I don't like "compress during compilation -
> > > expand in run-time" approach. [...]
> > > Or may be you are talking about something else - like non-compressed
> > > ISA with one-bit granularity of the instructions lenght ?
> >
> > Yep. I was talking about a bit-variable-length ISA that used static
> > or dynamic (your choice) frequency of occurrence to select the
> > length of each instruction, and thereby seek to maximize information
> > density and therefor minimize code size.
> >
> It's very interesting idea. Now, Dennis, I'm asking you as one who had
> done an implementation (a real one :). How many pipeline stages

> are required to implement bit-variable-length decoder in a manner
> which doesn't compromise cycle time of aggressively pipelined design ?
> I mean, really aggressive pipelining- P4-style.
>
I tried to think a little bit more about this decoder. I asked myself
how I would implement it on my current favorite FPGA - Altera's
Stratix. Then I realized that the number of pipeline stages is only a
small part of the problem. The bigger part is that I can't implement a
fully pipelined decoder like this in Stratix at all. Stratix doesn't
support big asynchronous ROMs. It looks like I can't build a fully
pipelined decoder without such a ROM. When I say a big ROM, I mean not
really that big - 8K*8bit is probably enough, but Stratix can't do it.
I don't know nearly enough about custom designs. Maybe a ROM like this
with an access time of 1/2 of the clock cycle is not a problem for a
custom chip? We need to do the ROM lookup + barrel shift sequentially
in one cycle. Is it possible?

glen herrmannsfeldt

unread,
Dec 15, 2002, 11:56:56 PM12/15/02
to

"Terje Mathisen" <terje.m...@hda.hydro.com> wrote in message
news:atiinh$86n$1...@vkhdsu24.hda.hydro.com...
(snip)

>
> The only exception to the size rule was for compiled code that needed to
> access single data structures larger than the 64K x86 limit.
>
> I.e. I wrote _lots_ of 16-bit programs during the eighties that used and
> required much more than 64 K RAM.
>
> The code size explosion only happened for C and Fortran programs
> compiled according to the 'Huge model', where every single pointer
> modification required multiple segment/offset modifications.

The programs I was working on in the 80286 days, and continuing on,
used large two-dimensional arrays. They were easy to allocate as
an array of pointers, each pointing to a column of the array.
8K doubles was plenty big enough, so I never needed to try huge
model. OS/2 had support for huge model, by allocating
multiple segment selectors with the appropriate offset, if this
was needed.
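
A small C sketch of that allocation style (hypothetical code, not glen's):
every column is a separate allocation, so no single object has to grow
past the 64K segment limit.

    #include <stdlib.h>

    /* Allocate an ncols-by-nrows matrix as an array of column pointers.
       Each column is its own (small) allocation; access is m[col][row]. */
    double **alloc_matrix(size_t ncols, size_t nrows)
    {
        double **m = malloc(ncols * sizeof *m);
        if (!m)
            return NULL;
        for (size_t j = 0; j < ncols; j++) {
            m[j] = malloc(nrows * sizeof **m);
            if (!m[j]) {                  /* undo on failure */
                while (j--)
                    free(m[j]);
                free(m);
                return NULL;
            }
        }
        return m;
    }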

64K should be big enough for the code generated by almost
any single function/subroutine.

-- glen

Dennis M O'Connor

unread,
Dec 16, 2002, 4:06:35 AM12/16/02
to
"Michael S" <already...@yahoo.com> wrote ...
> "Dennis M O'Connor" <dm...@primenet.com> wrote in message
news:<10399341...@nnrp1.phx1.gblx.net>...
> > "Michael S" <already...@yahoo.com> wrote ...
> > > "Dennis M O'Connor" <dm...@primenet.com> wrote ...
> > > > "Michael S" <already...@yahoo.com> wrote...
> > > > > It is obvious that you don't agree with my basic idea: the most
> > > > > important thing about ISA is a compactness of code.
> > > >
> > > > And yet, their are no commercially-successful (indeed, maybe
> > > > none at all) ISAs that have Huffman-encoded, fully-variable-length
> > > > instructions, which would be denser than any existing ISA.
> > >
> > > I already described why I don't like "compress during compilation -
> > > expand in run-time" approach. [...]
> > > Or may be you are talking about something else - like non-compressed
> > > ISA with one-bit granularity of the instructions lenght ?
> >
> > Yep. I was talking about a bit-variable-length ISA that used static
> > or dynamic (your choice) frequency of occurrence to select the
> > length of each instruction, and thereby seek to maximize information
> > density and therefor minimize code size.
> >
> It's very interesting idea. Now, Dennis, I'm asking you as one who had
> done an implementation (a real one &#61514;). How many pipeline stages
> are required to implement bit-variable-length decoder in a manner
> which doesn't compromise cycle time of aggressively pipelined design ?

Why decode in the pipeline when you can predecode ?
Then all you have to worry about is the extra fetch latency on
an instruction miss, and the expansion of the I-Cache. As you
said, implementation can handle "everything else".

[...]


> You have to like complex ISAs. It's your brad and butter after all.

I don't do x86 processors, Michael.

Michael S

unread,
Dec 16, 2002, 4:07:13 PM12/16/02
to
"Dennis M O'Connor" <dm...@primenet.com> wrote in message news:<10400294...@nnrp1.phx1.gblx.net>...
I'm talking about a predecoding pipeline. I-Cache misses are bursty - I
would expect an average of 10 instructions in a burst. Due to these
bursts, a predecoding pipeline can significantly reduce the cost of
I-Cache misses. For byte-variable-length instructions you can easily
sustain a 1-op/cycle predecoding rate. It has been demonstrated many
times that even 2 or 3 ops/cycle are perfectly possible. For
bit-variable-length instructions 1 op/cycle is your best hope. I think
0.5 ops/cycle is more realistic. It increases the average cost of a
miss by 1 cycle. To me, that sounds like an affordable price for
improved L2 efficiency. But totally skipping the predecoding pipeline
and predecoding at 0.25 or 0.2 ops/cycle doesn't look like the right
thing to do.

> [...]
> > You have to like complex ISAs. It's your brad and butter after all.
>
> I don't do x86 processors, Michael.
What are you doing ? SA ?

Dennis M O'Connor

unread,
Dec 17, 2002, 12:37:29 AM12/17/02
to
"Michael S" <already...@yahoo.com> wrote

> I'm talking about predecoding pipeline. I-Cache misses are bursty - I
> would expecte an average of 10 instructions in a burst. Due to this
> bursts, predecoding pipeline can significantly reduce the cost of the
> I-Cache misses. For a byte-variable-length instructions you can easily
> sustain 1-op/cycle predecoding rate. It was demonstrated many times
> that even 2 or 3 ops/cycle are perfectly possible. For a
> bit-variable-length instructions 1-op/cycle is your best hope.

Not at all. Proof by counterexample: imagine feeding an ISA
with 1-, 2-, 3- and 4-bit instructions into a 256-entry table as the
address. The table has as its output exactly 2 fixed-length post-decode
instructions, corresponding to the first two bit-VL instructions in
the stream, and also has as its output the amount to shift the
stream for the next cycle. Ta-da: 2 instructions decoded per cycle.
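
A rough C sketch of that counterexample (the table contents and
instruction formats are invented; only the mechanism matters):

    #include <stdint.h>

    /* One table entry: two fixed-length post-decode ops plus the number of
       stream bits the first two instructions consumed. */
    typedef struct {
        uint32_t op0, op1;
        uint8_t  consumed_bits;
    } DecodeEntry;

    /* 256 entries, indexed by the next 8 bits of the instruction stream.
       With 1- to 4-bit instructions, 8 bits always cover at least two of
       them. (Zeroed placeholder here; a real table is generated from the
       encoding.) */
    static const DecodeEntry decode_table[256] = { { 0 } };

    /* Decode two instructions per call, shifting the stream window forward. */
    void decode_two(uint64_t *window, uint32_t *out0, uint32_t *out1)
    {
        const DecodeEntry *e = &decode_table[(uint8_t)(*window & 0xFF)];
        *out0 = e->op0;
        *out1 = e->op1;
        *window >>= e->consumed_bits;
    }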

Of course, a real bit-VL ISA would have longer instructions, and
such a simple decoder might be impractical. But other, more
practical solutions do exist, most using a fast probability-based
decoding strategy coupled with a slower deterministic decoder
that also handled the long, rare instructions. Remember, by design
of this ISA, long instructions are rare, and vice versa. And fills
already have variable return times, so an occasional cycle or two
of extra delay shouldn't present any additional difficulty.

> > > You have to like complex ISAs. It's your brad and butter after all.
> >
> > I don't do x86 processors, Michael.
> What are you doing ? SA ?

ARM ISA based processors, yes.

Terje Mathisen

unread,
Dec 17, 2002, 2:14:29 AM12/17/02
to
Michael S wrote:
> I'm talking about predecoding pipeline. I-Cache misses are bursty - I
> would expecte an average of 10 instructions in a burst. Due to this
> bursts, predecoding pipeline can significantly reduce the cost of the
> I-Cache misses. For a byte-variable-length instructions you can easily
> sustain 1-op/cycle predecoding rate. It was demonstrated many times
> that even 2 or 3 ops/cycle are perfectly possible. For a
> bit-variable-length instructions 1-op/cycle is your best hope. I think
> 0.5 op/cycle is more realistic. It increases an average cost of the

The code I posted a couple of weeks ago makes it perfectly feasible to
decode up to three Huffman-coded instructions/cycle, using a not too
large ROM table.

The barrel shifter can start immediately after the table lookup, getting
ready for the next cycle, while the current block (0 to 3 opcodes) is
pushed into the predecode cache.

Even if you decide that it is too expensive to have multiple tables,
with explicit table-to-table linkage stored in the tables, an exception
mechanism to handle the longer instructions should work, and as Andy
Glew have told us numerous times, hw is much better than sw at handling
multi-way decision trees.

> miss by 1 cycles. For me, it sounds as an affordable price for
> improved L2 effeciency. But totally skipping predecoding pipeline and
> predecoding at 0.25 or 0.2 ops/cycle doesn't look like a right thing
> to do.

If you really wanna go down this path, then you probably need to look
closely into some form of instruction packing, making sure that every N
instructions would be at least byte aligned.

Otherwise you have to designate branch targets at the bit level, which
either reduces the maximum branch range, or requires much wider offsets.

On a 64-bit system you could probably get away with a maximum of 61 bits
worth of relative branch offsets, using a 'far jump' instruction to get
past that. :-)

Terje

(No, I don't really advocate doing this, but some simulation would be
fun. :-)

Jason Watkins

unread,
Dec 17, 2002, 3:49:31 AM12/17/02
to
> It's not even needed for compactness. Large immediates are rare
> statically as well as dynamically, so not supporting them won't have a
> significant impact on compactness.

Last time I saw numbers, it looked like getting up to at least 16bit
immediates was sufficient to get the big win. Can't remember their
data sources, but H&P 3rd ed has some nice pretty graphs on this
subject.



> >I never seen the proof that 8 is not enough. In my experience 6 is
> >often not enough. 7 is rarely not enough. 15 is always enough. I
> >personally never programmed for the architecture with 8 GP registers.
> >My feeling - it's enough more than 90% of the time.
>
> I've written assembler for architectures with both 8 and 16 integer
> registers (386 and 68000). Empirically, I find 16 is near enough to
> always sufficient as makes no difference, but 8 definitely isn't. Take

From what I remember reading, 16 was good enough for nearly all code.
However, there are some graph coloring algorithms that require up to
32 registers to get the inner loop free of needless loads. If these
algorithms are unimportant to you, whoohoo, but I personally would try
to allow for 32 arch registers for safety and foresight.

> Seems to me the big drivers for faster machines these days are graphic
> applications and games (which are mostly simulation, AI and graphics).
> Then again I could be wrong - is there something important that
> stresses a modern machine hard and for which 16 bit integer SIMD is
> useful? (Even some video chips are starting to use floating point
> triplets for pixels.)

Well, maybe not 16bit... but 32bit has been shown to be quite useful
as a target for compiler loop transforms. The 16bit float format
(1:10:5) is incredibly useful for graphics, and it would be keen if
there was more support for it on the host architectures.
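
For reference, a minimal C sketch of widening such a 16-bit float (1 sign,
5 exponent, 10 mantissa bits) to single precision, ignoring zeros,
denormals, infinities and NaNs (a hypothetical helper, not from the post):

    #include <stdint.h>
    #include <string.h>

    float half_to_float(uint16_t h)
    {
        uint32_t sign = (uint32_t)(h >> 15) << 31;
        uint32_t exp  = (h >> 10) & 0x1F;            /* bias 15 */
        uint32_t mant = (uint32_t)(h & 0x3FF);
        /* Rebias to single precision (bias 127) and widen the mantissa. */
        uint32_t bits = sign | ((exp + (127 - 15)) << 23) | (mant << 13);
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }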

> >> >What is a right size of FP/SIMD register file ? Sure the range must be
> >> >from 8 to 64 but how many exactly ?
> >>
> >> I'll let someone who knows more about numerical computing than I do
> >> answer that one.

I'm not that someone... but given the instruction locality of numeric
code, I'd be inclined to say you should lean to maximizing, not
minimizing. Assuming that we're working with byte variable length
tails, an extra byte or 2 for float instructions would seem to me a
small price for more arch registers.

Jason Watkins

unread,
Dec 17, 2002, 3:57:59 AM12/17/02
to
> Offchip access is already very costly and is gooing to be even more
> costly in the future. Then it is very important to improve L2 hit
> rate.

not to mention power costs for driving pins.

> I assume it is a Forth chip. Forth excells at 16 bits. 32-bit Forth is
> probably less appealing from the code compactness point of veiw.
> Where can I look at this chip ?

He pulled info pending possible commercialization. A little birdie
tells me the wayback machine might be worth a look.

> You're right about complexity, but it can be done. Variable-length
> decoder isn't the most complex thing in the modern CPU. Out-of-order
> execution engine is far more complex.

As usual, too lazy to chase the paper, but byte-granularity
variable-length tails have been shown to have little area/speed cost
using certain approaches. (The word tail implies that the length is
determined wholly from the instruction head, hence the stride, hence we
can look ahead along the decode window.)

Jason Watkins

unread,
Dec 17, 2002, 4:13:40 AM12/17/02
to
> Thanks Anton. So perhaps there is something to be gained from various code
> size reduction techniques, at least for transaction processing. But I
> suspect that most transaction processing is so I/O bound that improving
> processor utilization isn't much of an issue. Tow other related notes.

Ahh, but that just means there's more time for the cache to be
polluted while the I/O op is pending. I assume that modern OSes are
able to alias code pages, so that when we run some number of database
backend threads they don't thrash each other. But still, there are
common database-backed workloads where the dataset has very high
locality. CRM comes to mind, as do websites where 90% of traffic is
the front page. I would like to see numbers, but my intuition says
that neglecting icache misses may not be wise for server-oriented
architectures.

> There may be something to the two currently produced general purpose
> processors that have the most presence in transaction procesing, X86 and
> S/390 (aka Z series) are both cisc machinew - or perhaps not. Also, if that
> is a factor in X86's success in transaction workloads, then Intel is giving
> that up with IA-64. My guess is that it isn't that much of a factor.

I can't think of any particular advantage CISC has either. When I
think OLTP, I think of IBM, yes, but also of Sun. I think *large*
shared memories are probably the most important selling feature. It will
be interesting to see how Oracle prospers on x86-64.

Russell Wallace

unread,
Dec 17, 2002, 12:52:59 PM12/17/02
to
On 17 Dec 2002 00:49:31 -0800, jason_...@pobox.com (Jason Watkins)
wrote:

>> It's not even needed for compactness. Large immediates are rare
>> statically as well as dynamically, so not supporting them won't have a
>> significant impact on compactness.
>
>Last time I saw numbers, it looked like getting up to at least 16bit
>immediates was sufficient to get the big win.

Even 8 bit immediates are sufficient, 10 is plenty and to spare.

>From what I remember reading, 16 was good enough for nearly all code.
>However, there are some graph coloring algorithms that require up to
>32 registers to get the inner loop free of needless loads. If these
>algorithms are unimportant to you, whoohoo, but I personally would try
>to allow for 32 arch registers for safety and foresight.

*nod* My claim is that you want at least 16; going to 32 just to make
sure, is a reasonable enough thing to do.

>The 16bit float format
>(1:10:5) is incredibly useful for graphics, and it would be keen if
>there was more support for it on the host architectures.

Maybe useful for some graphic applications, but completely useless for
everything else :P I'd rather see the resources put into making 32 and
64 bit floats go faster.

>> >> I'll let someone who knows more about numerical computing than I do
>> >> answer that one.
>
>I'm not that someone... but given the instruction locality of numeric
>code, I'd be inclined to say you should lean to maximizing, not
>minimizing. Assuming that we're working with byte variable length
>tails, an extra byte or 2 for float instructions would seem to me a
>small price for more arch registers.

Okay.

Robin KAY

unread,
Dec 17, 2002, 1:53:56 PM12/17/02
to
Jason Watkins wrote:

> The 16bit float format
> (1:10:5) is incredibly useful for graphics, and it would be keen if
> there was more support for it on the host architectures.

Really!? Half-precision floating point has been a pet idea of mine for a while
now, but since no one (that I'm aware of) has ever implemented it, I just
concluded that it mustn't be useful for anything. Do any architectures support
it in hardware?

--
Wishing you good fortune,
--Robin Kay-- (komadori)


Sander Vesik

unread,
Dec 17, 2002, 2:54:08 PM12/17/02
to
Russell Wallace <r...@vorpalbunnyeircom.net> wrote:
> On 7 Dec 2002 15:55:10 -0800, already...@yahoo.com (Michael S)
> wrote:
>
>>1. Reduced instruction set - PPC+Altivec instructions look just about
>>right. May be, throw away some rarely used stuff like bit-fields and
>>NAND.
>
> I don't recall having seen a CPU with a NAND instruction. What do you
> mean by bit-fields? If you mean &, |, ~, <<, >> then those are
> sometimes useful and surely quite cheap to provide?

IIRC Sparc has all the "normal" logic instructions (and, or, xor) in both
uninverted and inverted forms.

>
>>Full orthogonality is not necessary. E.g. division and rotate
>>don't have to be allowed on all GP registers.
>
> Division and rotate aren't needed at all. Is anything gained by
> putting them in but only on certain registers, other than making the
> chip and software tools more likely to contain errors?

Rotate is very good for some crypto algorithms.
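
For instance (a toy illustration, not any real cipher -- RC5-style
designs use exactly this kind of data-dependent rotate, which costs two
shifts and an OR on ISAs without a rotate instruction):

#include <stdint.h>
#include <stdio.h>

/* Most compilers recognise this idiom and emit a single rotate
   instruction where the ISA has one. */
static uint32_t rotl32(uint32_t x, unsigned r)
{
    r &= 31;
    return (x << r) | (x >> ((32 - r) & 31));
}

/* Toy mixing round in the style of rotate-based ciphers; the rotate
   amount depends on the data, which is what such algorithms like. */
static void mix(uint32_t *a, uint32_t *b, uint32_t ka, uint32_t kb)
{
    *a = rotl32(*a ^ *b, *b & 31) + ka;
    *b = rotl32(*b ^ *a, *a & 31) + kb;
}

int main(void)
{
    uint32_t a = 0xdeadbeef, b = 0x01234567;
    mix(&a, &b, 0x9e3779b9, 0x7f4a7c15);
    printf("%08x %08x\n", (unsigned)a, (unsigned)b);
    return 0;
}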

--
Sander

+++ Out of cheese error +++

Scott Robinson

unread,
Dec 17, 2002, 4:45:22 PM12/17/02
to
On Tue, 17 Dec 2002 18:53:56 +0000, Robin KAY <koma...@myrealbox.com>
wrote:

>Jason Watkins wrote:
>
>> The 16bit float format
>> (1:10:5) is incredibly useful for graphics, and it would be keen if
>> there was more support for it on the host architectures.
>
>Really!? Half precision floating point has been a pet idea of mine for a while
>now, but since no one (that I'm aware of) had ever implemented it I just
>concluded that it mustn't be useful for anything. Do any architectures support
>it in hardware?

It is supposed to be used by the latest (DX9) compatible graphics
cards for pixel calculations (I don't know if present chips do this
directly or not; they are supposed to also support 32-bit FP pixels,
something useless for any scene computed in real-time).

I would have liked a 20-bit (1:15:4 or 0:16:4) system in MMX. Such a
format should not take much more room, although it may have other
problems (it would likely require a "global offset" stored in an
architecturally visible register). My guess is that the opinion was
that either programmers were real men and used asm and fixed point, or
they would punt to the compiler and use floating point (and need
better accuracy).

Scott


Iain McClatchie

unread,
Dec 17, 2002, 5:22:47 PM12/17/02
to
Jason> The 16bit float format (1:10:5) is incredibly useful for graphics,
Jason> and it would be keen if there was more support for it on the host
Jason> architectures.

Robin> Do any architectures support it in hardware?

Just learn Cg :-).

Inside ATI they joke that "GPU" stands for "General Processor Unit" and
"CPU" stands for "Compatible Processor Unit".

Doesn't DirectX 8.x support 16b FP textures and also FP writes back out
to memory? Once the GPUs can load, operate on, and store FP, it seems to me
there is going to be a lot of pressure to move some of the vectorizable
processing over there.

I know someone inside SGI once tried to get the graphics unit to do some
of the processing for a seismic code -- something called post-stack
migration, which apparently uses single-precision (32b) FP quite heavily.

Now all we need is for Nvidia to sponsor someone to port LAPACK to NForce
under Linux!

Nate D. Tuck

unread,
Dec 17, 2002, 6:12:59 PM12/17/02
to
In article <b36vvu4q45cgi9v9o...@4ax.com>,

Scott Robinson <dsc...@dontincludethis.bellatlantic.net> wrote:
>It is supposed to be used by the latest (DX9) compatible graphics
>cards for pixel calculations (I don't know if present chips do this
>directly or not, they are supposed to also support 32 bit fp pixels,
>something useless for any scene computed in real-time),

Nonsense. Less than double the bandwidth requirement makes something
go from being useful to useless for real time?

These formats are supported by R300 and R350 based chips from ATI
and NVIDIA's nv30/CineFX machines, as well as supposedly by
new cards from S3 and Trident. I would sincerely doubt that they
ever make their way into mainstream CPU architectures because the precision
and range are quite poor.

Even for some rendering applications, 16 bit FP is not good enough.

nate

Paul DeMone

unread,
Dec 17, 2002, 7:34:14 PM12/17/02
to

Terje Mathisen wrote:
>
> Michael S wrote:
> > I'm talking about the predecoding pipeline. I-Cache misses are bursty - I
> > would expect an average of 10 instructions in a burst. Due to these
> > bursts, a predecoding pipeline can significantly reduce the cost of the
> > I-Cache misses. For byte-variable-length instructions you can easily
> > sustain a 1-op/cycle predecoding rate. It was demonstrated many times
> > that even 2 or 3 ops/cycle are perfectly possible. For
> > bit-variable-length instructions 1 op/cycle is your best hope. I think
> > 0.5 op/cycle is more realistic. It increases an average cost of the
>
> The code I posted a couple of weeks ago makes it perfectly feasible to
> decode up to three Huffman-coded instructions/cycle, using a not too
> large ROM table.
>
> The barrel shifter can start immediately after the table lookup, getting
> ready for the next cycle, while the current block (0 to 3 opcodes) is
> pushed into the predecode cache.

Hmmmm, barrel shifter in the instruction path. Sounds like
you are well on your way to re-inventing the iAPX432. George
Santayana is proven right yet again.


--
Paul W. DeMone The 801 experiment SPARCed an ARMs race of EPIC
Kanata, Ontario proportions to put more PRECISION and POWER into
pde...@igs.net architectures with MIPSed results but ALPHA's well
that ends well.

Scott Robinson

unread,
Dec 17, 2002, 10:31:02 PM12/17/02
to

I don't believe the bandwidth is the problem. I call 32-bit useless
in that it won't make a difference with fewer than 30 passes. If you
have a 12-bit mantissa (unsigned, with an assumed one) and a 4-bit
exponent, you can add 30 values together with perfect accuracy in the
first 8 bits*. Since present cards have no extra bits and can easily
support 4-8 textures (I will admit that two of those are often
combined in 16-bit registers before being written to RAM), I expect
that upwards of 60 passes won't be much of a problem.
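
(To spell out the arithmetic: 30 terms of 8 significant bits each sum
to at most 30 * 255 = 7650, which fits in 13 bits, and 12 stored
mantissa bits plus the implied one give exactly 13 -- so the running
sum never has to drop a bit that matters to the top 8.)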

ATI and NVIDIA do show off applications that do need more bits, and
they do that many passes. I simply don't expect them to be done in
real time (for the current generation at least). I also expect that
the size expansion of switching to 32 bits would enlarge the pixel
pipelines far too much (they already dominate these chips), for
something that does nothing for the common case.

Scott

* I seem to have overlooked gamma. I don't know if the floating point
values stay in a linear region or are already gamma-corrected or not.
If they are linear it would change the calculations considerably, but
the actual images shown only seem to change in details created by the
RenderMan-type calculations with many passes.


Jason Watkins

unread,
Dec 18, 2002, 2:58:36 AM12/18/02
to
> Nonsense. Less than double the bandwidth requirement makes something
> go from being useful to useless for real time?

It's more like this. For z-buffered graphics, it's been shown that
over 90% of shaders do not benefit from the additional precision of
32b float. The exceptions tend to center around normal maps. However,
doubling the available bandwidth could be HUGE. An example would be
volume shadows. Doubling the available fill rate could more than double
the number of usable lights at a real-time frame rate. When you're
talking about going from 3 lights to 9 lights, that's pretty dramatic.

I would not advise the 16f format as a "good enough for all". But it
_is_ a valuable tradeoff to have available. It is a bit sad that host
architectures won't support partitioned SIMD for this format. Even
though the hardware additions would be minor, ISA changes are not
taken lightly. I wish it had gotten in on the original specs. As it
didn't, we'll have to live with additional swizzling.



> new cards from S3 and Trident. I would sincerely doubt that they
> ever make their way into mainstream CPU architectures because the precision
> and range are quite poor.
> Even for some rendering applications, 16 bit FP is not good enough.

I agree, I doubt we'll see them. But don't overlook a wonderful opportunity
in the specific case by always applying generalizations. There are some
cases where 16-bit FP is very interesting.

Martin Høyer Kristiansen

unread,
Dec 18, 2002, 4:35:18 AM12/18/02
to
Scott Robinson wrote:

<snip>

> I don't believe the bandwidth is the problem. I call 32-bit useless
> in that it won't make a difference with fewer than 30 passes. If you
> have a 12-bit mantissa (unsigned, with an assumed one) and a 4-bit
> exponent, you can add 30 values together with perfect accuracy in the
> first 8 bits*.


The problem 32-bit FP is meant to address is not error propagation in
blending colours, but rather the need for greater precision and range in
more advanced formats (normal maps, etc.).

BTW, the ATI R300 always uses 24-bit FP internally, but writes to either a
(4x)16-bit or (4x)32-bit FP target framebuffer.

Cheers
Martin

Martin Høyer Kristiansen

unread,
Dec 18, 2002, 4:43:49 AM12/18/02
to
Iain McClatchie wrote:

> Jason> The 16bit float format (1:10:5) is incredibly useful for graphics,
> Jason> and it would be keen if there was more support for it on the host
> Jason> architectures.
>
> Robin> Do any architectures support it in hardware?
>
> Just learn Cg :-).
>
> Inside ATI they joke that "GPU" stands for "General Processor Unit" and
> "CPU" stands for "Compatible Processor Unit".
>
> Doesn't DirectX 8.x support 16b FP textures and also FP writes back out
> to memory? Once the GPUs can load, operate, and store FP, seems to me
> there is going to be a lot of pressure to move some of the vectorizable

> processing over there.


FP framebuffer support is in the upcoming (in a week) DirectX 9. DX8
supports programmable shaders on regular integer values.

> I know someone inside SGI once tried to get the graphics unit to do some
> of the processing for a seismic code -- something called post-stack
> migration, which apparently uses single-precision (32b) FP quite heavily.
>
> Now all we need is for Nvidia to sponsor someone to port LAPACK to NForce
> under Linux!


No kidding! The next version of DirectX, 10, will apparently see a
fusion of pixel and vertex shaders with a Turing-complete instruction
set (DX9 hardware lacks conditional branches, I believe). So we'll have
an array of VLIW SIMD processors (R300 has this but isn't Turing-complete)
sitting on a daughter card with 5-10 times the bandwidth of the host system.

Cheers
Martin

Anton Ertl

unread,
Dec 18, 2002, 11:32:17 AM12/18/02
to
The following claims were recently (follow the References to those
statements) made about code size, among others:

1) RISC code size is more than twice as large as CISC code size.

2) i386 code is 1.5 times larger than m68k code.

I presented some data from old postings on this, but they either did
not include 68k results, or they varied more than just the
architecture. In order to get more reliable results, we need to use
the same environment (same OS, same libraries) on all architectures.

Debian GNU/Linux is such an environment with ports to many
architectures (no VAX port yet, though). It also features pre-built
binaries for all of them, so we can check their sizes without needing
such a machine for building them.

I produced numbers for gzip and grep; for gzip I used the following
commands:

for i in alpha arm hppa i386 ia64 m68k mips mipsel powerpc s390 sparc; do wget http://ftp.<your_mirror>.debian.org/debian/pool/main/g/gzip/gzip_1.3.5-1_$i.deb; done
for i in gzip_1.3.5-1_*.deb; do ar x $i; tar xfz data.tar.gz ./bin/gzip; echo "`size bin/gzip|cut -f 1-3`" $i|grep -v text; done

(GNU size 2.9.5 on i386 did not work on alpha and ia64 binaries, but
GNU size 2.10.91 on alpha did work).

Here are the results:

gzip_1.3.5-1 grep_2.4.2_3
text data bss text data bss
73691 7760 329088 67542 3913 2152 alpha
55384 3012 332960 52012 1224 1896 arm
60045 2832 329000 60643 988 1956 hppa
44258 3100 329264 43772 1212 2068 i386
112417 7344 329144 109004 3768 2200 ia64
40566 3012 328956 39030 1212 1916 m68k
86846 3432 329088 78918 1212 2004 mips
86766 3432 329088 78886 1212 2004 mipsel
58312 3356 328992 55886 1504 1940 powerpc
54419 3012 329000 53238 1224 1948 s390
58228 2884 329024 58144 1036 1960 sparc

And here's the same data sorted by gzip text size:

40566 3012 328956 39030 1212 1916 m68k
44258 3100 329264 43772 1212 2068 i386
54419 3012 329000 53238 1224 1948 s390
55384 3012 332960 52012 1224 1896 arm
58228 2884 329024 58144 1036 1960 sparc
58312 3356 328992 55886 1504 1940 powerpc
60045 2832 329000 60643 988 1956 hppa
73691 7760 329088 67542 3913 2152 alpha
86766 3432 329088 78886 1212 2004 mipsel
86846 3432 329088 78918 1212 2004 mips
112417 7344 329144 109004 3768 2200 ia64

So, we get some answers to the question of whether the claims are
true:

1) The largest RISC code size is >2 times as large as the smallest
CISC code size, but the code size of many RISCs is <1.5 times the
smallest CISC code size; the smallest RISC code size is about the same
size as the largest CISC code size.

2) i386 code is about 1.1 times larger than m68k code.

We also get some interesting additional results:

- IA64 code is quite a bit larger than anything else: not quite a
factor of 3 over the smallest code size, a factor of 2.5 over i386 and
a factor of 2 over the smallest RISC code, and a factor of 1.5-1.6
over Alpha (which, being 64-bit, is the most comparable; however, IIRC
x86-64 64-bit code size has been claimed to be smaller than i386
code).

- There is no obvious relation between code size and RISC ISA age.

- The 64-bit data size can be more than twice as large as the 32-bit
size (compare the grep data size on alpha to others). OTOH, the bss
for these programs seems to contain no pointers at all.

- I am quite surprised that MIPS code is so large. Does anyone have
an idea why this is the case?

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Terje Mathisen

unread,
Dec 18, 2002, 1:32:35 PM12/18/02
to
Anton Ertl wrote:
> The following claims were recently (follow the References to those
> statements) made about code size, among others:
>
> 1) RISC code size is more than twice as large as CISC code size.
>
> 2) i386 code is 1.5 times larger than m68k code.
>
> I presented some data from old postings on this, but they either did
> not include 68k results, or they varied more than just the
> architecture. In order to get more reliable results, we need to use
> the same environment (same OS, same libraries) on all architectures.
>
> Debian GNU/Linux is such an environment with ports to many
> architectures (no VAX port yet, though). It also features pre-built
> binaries for all of them, so we can check their sizes without needing
> such a machine for building them.

Great!

Thanks for inserting some facts into c.arch; they have been sorely missing
in many threads lately. :-(

This [the IA-64 code size] is _really_ bad.

The only way to ameliorate this would be for FP kernels, where the
rotating registers, plus the automatic startup/cleanup handling, can probably
gain back the 2.5x factor, and possibly do even better, when compared
to a software-unrolled loop on Alpha or other 64-bit CPUs.

Terje

Andi Kleen

unread,
Dec 18, 2002, 1:45:21 PM12/18/02
to
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:

> - IA64 code is quite a bit larger than anything else: not quite a
> factor of 3 over the smallest code size, a factor of 2.5 over i386 and

That's probably because gcc isn't a very efficient IA64 compiler.
When you look at the assembly generated by the IA64 gcc you see
a ;; (bundle stop) after nearly every instruction. This means it doesn't
manage to fill the bundles well.

Of course I would expect the Intel compiler to still generate much
larger executables on IA64 than on i386, but with a smaller factor.

On the other hand gcc is a very stable IA64 compiler and tends to
generate working code for large code bases, unlike some other compilers ;)

> a factor of 2 over the smallest RISC code, and a factor of 1.5-1.6
> over Alpha (which, being 64-bit, is the most comparable; however, IIRC
> x86-64 64-bit code size has been claimed to be smaller than i386
> code).

It depends on the executable. There are cases where it is bigger, cases
where it is smaller.

It also depends on the compiler.

E.g. this is a comparable gzip version from SuSE 8.1.
I don't have comparable numbers for grep, because this release uses
a newer grep version which seems to be much bigger on both i386 and
x86-64 than the one from Debian.

This is gcc 3.2 for i386 and x86-64:

i386: gzip: 46966 3560 329232 379758 5cb6e /bin/gzip
x86-64: gzip: 45124 9512 329392 384028 5dc1c /bin/gzip

You see the results are mixed. First, code compiled with gcc 3.2 at -O2 is
generally larger than code compiled with gcc 2.95 at -O2. This is expected;
gcc 3.2 often generates faster code. It has dramatic code size savings with
-Os, but that tends to generate slower code (surprisingly, with gcc 2.95 it
used to be the case that -Os often generated the fastest code - but that has
changed).

The .text size for this example is slightly smaller than on i386
with the same compiler.
But the .data size is much bigger on the 64-bit system. It is not clear
if that is due to use of pointers or careless use of "long" where "int"
would be sufficient.

The comparison is slightly unfair: IIRC gzip contains assembly
functions hand-optimized for some architectures like i386.
It doesn't have hand-optimized x86-64 functions as far as I know.
The hand-written assembly may vary a lot in code size.
It is probably only a small fraction of the full code size, but
should not be forgotten.

Note the x86-64 gcc is still in development, the numbers on a
production system may vary.

-Andi

Greg Lindahl

unread,
Dec 18, 2002, 2:16:13 PM12/18/02
to
In article <2002Dec1...@a0.complang.tuwien.ac.at>,
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:

>I produced numbers for gzip and grep; for gzip I used the following
>commands:

But you should ask yourself: are these numbers any good?

1) The ARM result probably isn't using the Thumb instruction set.
2) The MIPS result probably isn't using MIPS16.
3) Any or all of the compilers might have unrolled loops.
4) gcc on MIPS is especially bad, because it uses lots of macro
instructions and can't use common subexpression elimination
in them.
5) The Alpha binaries might not be compiled to use byte instructions,
so there are lots of shifts and ands. Often binary distributions
are meant to run on all versions of a CPU.

In general, using a set of binaries not built for the metric you're
examining is a bad idea.

-- greg

Nick Maclaren

unread,
Dec 18, 2002, 3:14:45 PM12/18/02
to
>So, we get some answers to the question of whether the claims are
>true:
>
>1) The largest RISC code size is >2 times as large as the smallest
>CISC code size, but the code size of many RISCs is <1.5 times the
>smallest CISC code size; the smallest RISC code size is about the same
>size as the largest CISC code size.

Which makes sense.

>2) i386 code is about 1.1 times larger than m68k code.

On those binaries. Doing the same check on Exim, I get nearer 18%.
I can't remember which compilers and tests I saw (and, no, I didn't
do all of the tests), but they can't all have been gcc as some were
in Fortran.

>We also get some interesting additional results:
>
>- IA64 code is quite a bit larger than anything else: not quite a
>factor of 3 over the smallest code size, a factor of 2.5 over i386 and
>a factor of 2 over the smallest RISC code, and a factor of 1.5-1.6
>over Alpha (which, being 64-bit, is the most comparable; however, IIRC
>x86-64 64-bit code size has been claimed to be smaller than i386
>code).

That is deceptive. IA-64 code relies quite heavily on separate
exceptional code, so the amount normally executed could be much
less.

>- There is no obvious relation between code size and RISC ISA age.

There certainly can be if you compile for different versions of
the ISA on the same machine, though the performance difference is
larger.

>- The 64-bit data size can be more than twice as large as the 32-bit
>size (compare the grep data size on alpha to others). OTOH, the bss
>for these programs seems to contain no pointers at all.

Again, that is not my experience on systems which support both
sizes (e.g. MIPS and SPARC). The Alpha always was very data hungry,
though I don't know why.

>- I am quite surprised that MIPS code is so large. Does anyone have
>an idea why this is the case?

Yes. The MIPS version of gcc has had relatively little attention
(there used to be a statement in the comments about it). Like the
Alpha, it requires more work in the optimisation, but unlike the
Alpha it didn't have a large workstation base among the appropriate
hackers.


Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email: nm...@cam.ac.uk
Tel.: +44 1223 334761 Fax: +44 1223 334679

Douglas Siebert

unread,
Dec 18, 2002, 5:45:41 PM12/18/02
to
Martin =?ISO-8859-1?Q?H=F8yer?= Kristiansen <mhkris...@yahoo.dk> writes:

>No kidding! The next version of DirectX, 10, will apparently see a
>fusion of pixel and vertex shaders with a Turing complete instruction
>set (DX 9 hardware lacks Conditional branches I believe). So we'll have
>an array of VLIW SIMD (R300 has this but isn't TC) processors sitting on
>a daughter card with 5-10 times bandwidth of the host system.


So will "render farms" in a year or two be systems with whatever generic
CPU with all slots filled with PCI versions of the latest DX10 graphics
cards running software uploaded into the card, instead of worrying about
whether Athlon XP, P4 or Transmeta systems are faster/more cost effective
in racks to render the latest blockbuster?

This might kill the "digital video editing" killer app some people believe
will breathe life into the PC market (I don't buy that the set of people who
want to do this is large enough to really matter, but I could be wrong).
If your video editing app just uploads software into your graphics card
to do the various transforms and effects, a better graphics card would
be the thing to get to speed things up. A new CPU wouldn't do squat, and
it's much easier to upgrade video cards than to upgrade CPUs (for Intel
systems, and to some degree AMD as well, a 2x speed increase always
requires a new motherboard).

--
Douglas Siebert dsie...@excisethis.khamsin.net

"Suppose you were an idiot. And suppose you were a member of Congress.
But I repeat myself." -- Mark Twain

Terje Mathisen

unread,
Dec 18, 2002, 2:33:28 AM12/18/02
to
Robin KAY wrote:
> Jason Watkins wrote:
>
>
>>The 16bit float format
>>(1:10:5) is incredibly useful for graphics, and it would be keen if
>>there was more support for it on the host architectures.
>
>
> Really!? Half precision floating point has been a pet idea of mine for a while
> now, but since no one (that I'm aware of) had ever implemented it I just
> concluded that it mustn't be useful for anything. Do any architectures support
> it in hardware?

John Carmack has been a big advocate of this for a few years; he'd
like 16 FP bits per color component.

OTOH, AFAIK he'd be perfectly happy with this as an 'inside the graphics
chip only' format, and 32 bits/component would be fine as long as you
had the bandwidth to support it.

Terje

PS. I assume that Jason meant 1:5:10 (sign:exponent:mantissa) instead of
just 5 bits for the mantissa? It would be hard to show a believable blue
gradient sky with just 5 bits per binade, right?

Ketil Malde

unread,
Dec 19, 2002, 3:48:18 AM12/19/02
to
lin...@pbm.com (Greg Lindahl) writes:

>> I produced numbers for gzip and grep; for gzip I used the following
>> commands:

> But you should ask yourself: are these numbers any good?

[..]


> In general, using a set of binaries not built for the metric you're
> examining is a bad idea.

Well - it is quite realistic, IMHO -- these are *actual* sizes, after
all. But if the question was with which architecture is it *possible*
to get the smallest code size, I suppose it's not a good test.

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants

Nick Maclaren

unread,
Dec 19, 2002, 4:36:36 AM12/19/02
to

In article <egof7i8...@sefirot.ii.uib.no>,

"Ketil Malde" <ket...@ii.uib.no> writes:
|> lin...@pbm.com (Greg Lindahl) writes:
|> > In article <2002Dec1...@a0.complang.tuwien.ac.at>,
|> > Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
|>
|> >> I produced numbers for gzip and grep; for gzip I used the following
|> >> commands:
|>
|> > But you should ask yourself: are these numbers any good?
|> [..]
|> > In general, using a set of binaries not built for the metric you're
|> > examining is a bad idea.
|>
|> Well - it is quite realistic, IMHO -- these are *actual* sizes, after
|> all. But if the question was with which architecture is it *possible*
|> to get the smallest code size, I suppose it's not a good test.

That's the problem. For the past 10 years, the primary development
platform for gcc has been x86-based PCs. There was a lot of effort
put into the IA-64 version, though probably much less than put into
the x86 one, but I don't know if any significant effort has been
put into the 68K one. There was one stage at which the x86 version
was so far ahead of all others that there was a proposal to split
gcc into two projects - x86 and portable.

But a far worse issue is that a lot of modern optimisation is based
on spending code size to gain performance. You see that with IA-64,
but an even clearer example is FFTW. I tried looking at that, as
the only example of floating-point code I could think of in the
Debian distribution, and it made no sense at all. Given how it
works, that is not surprising :-)

Consider loop unrolling and inlining. Those techniques don't work
very well if you are seriously register-starved, so they are quite
likely to be used less in such cases. But, all other things being
equal, having more registers leads to more compact code. Ugh.
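
To make the trade-off concrete, compare the two C versions of a simple
reduction below: the unrolled one needs four live accumulators (hence
more registers) and roughly four times the loop-body code, in exchange
for a quarter of the branches and adds that can be overlapped. (A
made-up illustration, nothing to do with the binaries measured above.)

#include <stddef.h>
#include <stdio.h>

/* Straightforward version: small code, one add per branch. */
double sum(const double *a, size_t n)
{
    double s = 0.0;
    size_t i;
    for (i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4 with separate accumulators: about 4x the loop-body
   code and four live temporaries, but a quarter of the branches and
   independent adds.  (The summation order differs, so the FP result
   can differ in the last bits.) */
double sum_unrolled(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}

int main(void)
{
    double a[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
    printf("%g %g\n", sum(a, 10), sum_unrolled(a, 10));
    return 0;
}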

Jakob Engblom

unread,
Dec 19, 2002, 7:58:04 AM12/19/02
to
an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote in message news:<2002Dec1...@a0.complang.tuwien.ac.at>...

> The following claims were recently (follow the References to those
> statements) made about code size, among others:
>
> 1) RISC code size is more than twice as large as CISC code size.
>
> 2) i386 code is 1.5 times larger than m68k code.

Some other data about instruction sets, compilers, and code size can
be found at:

http://user.it.uu.se/~jakob/presentations/engblom-darkII-ht2001.pdf

There are two slides: one quoting MPR reports from 1995 about the ARM
vs Thumb code sizes (ARM-selected data), and one about results using
several different compilers for the same target architecture (8-bit
embedded).

The compiler can make a difference of a factor of 3 on the same program,
and an instruction set like Thumb also has a huge influence. But it
also depends on the program.

Overall, experience indicates that ARM Thumb is a very successful
way to get compact programs.

CISC vs RISC is a stupid debate.

Variable vs. fixed instruction length is more interesting, and note
that most recent designs use variable-length instructions, like the TI
C55 (1-6 bytes), NEC V850 (16-64 bits), TI C64 VLIW DSPs with "end
instruction here" bits. Variable length is so obviously beneficial
that little debate should be needed as to what gives the smallest
code...

Michael S

unread,
Dec 19, 2002, 10:58:54 AM12/19/02
to
nm...@cus.cam.ac.uk (Nick Maclaren) wrote in message news:<ats3v4$b1b$1...@pegasus.csx.cam.ac.uk>...
These techniques are quite likely to be used less on any heavily OoO
machine with good register renaming. P-II, P-4, all the latest
generations of PPC, and the MIPS R10000 as well as their older "supercomputer"
chip (R8000?) all qualify despite differences in # of registers.
The 21164 and IA-64 don't qualify. I don't know nearly enough about
UltraSparc and the 21264 to make a statement.
Besides, integer codes like gzip and grep are unlikely to use these
techniques in the first place.

> But, all other things being equal, having more registers leads to more
> compact code. Ugh.
>
This "other things being equal" comment doesn't make sense.
In case of variable-length instructions more registers mean more bytes
per instruction.
In case of fixed-length instructions more registers mean less
addressing modes and/or shorter offsets in base+immediate addressing
and less support for immediates.
Stack-based machines have only two registers and are renown for their
good code density.
For register-based machines there is an optimum number of registers
which would produce most compact code. Argue about the exact number
but don't question the existence of the optimum. A comparison between
existent architectures shows that 16-reg ISAs consistently beat 32-reg
ones on code size. 64-reg and more ISAs never come even close to
making a contest unless they use some sort of windowing scheme. Of
coarse it doesn't prove anything in scientific sense of the word…

Andy Glew

unread,
Dec 19, 2002, 12:41:22 PM12/19/02
to
> But you should ask yourself: are these numbers any good?
>
> 1) The ARM result probably isn't using the Thumb instruction set.
> 2) The MIPS result probably isn't using MIPS16.
> 3) Any or all of the compilers might have unrolled loops.
> 4) gcc on MIPS is especially bad, because it uses lots of macro
> instructions and can't use common subexpression elimination
> in them.
> 5) The Alpha binaries might not be compiled to use byte instructions,
> so there are lots of shifts and ands. Often binary distributions
> are meant to run on all versions of a CPU.

I'll add to the list that GCC produces rather large x86 code,
whereas Microsoft compilers, for many years, optimized
for one thing and one thing only: minimal code size.

Nevertheless, I think Anton's numbers are worth thinking about.


Jason Watkins

unread,
Dec 19, 2002, 1:40:44 PM12/19/02
to
> So will "render farms" in a year or two be systems with whatever generic
> CPU with all slots filled with PCI versions of the latest DX10 graphics

It's already happening. The 3D animation world is making a jump
similar to what the editing/compositing world did a number of years
ago: going from offline to online. The vision of the future is that
the realtime graphics are close enough to the offline render that the
principals can sit with the artists and try things interactively.

When these cards are on the market, I think you'll see a lot of other
markets figure out how to use them to accelerate intensive computing.
It's a shame that AGP's single-card assumption really puts a leash on
us.

This is starting to get off topic, but generalized graphics
architectures have been something I've wanted for a long time now. If
anyone remembers the sort of games that appeared in the post-Pentium
but pre-3dfx gap, there was a lot of algorithmic exploration, some
really interesting work. These days, the fixed portions of the
rendering pipeline really force a regularity on everyone in the
market, which is disappointing. Hopefully in the DX10 era we'll see
variety appear again.

Jason Watkins

unread,
Dec 19, 2002, 1:51:26 PM12/19/02
to
> John Carmack have been a big advocate for this for a few years, he'd
> like 16 fp bits per color component.

I'm sure his influence has been a factor in where the IHV's went. I'm
an idealist by nature, but it's nice to have a benign dictator
sometimes :)



> OTOH, afaik he'd be perfectly happy with this as a 'inside the graphics
> chip only' format, and 32 bits/component would be fine as long as you
> had the bandwidth to support it.

Yeah... for a certain class of algorithms (dynamic texturing) it would
be useful to avoid the conversion. This is the same class of
algorithms that is penalized by AGP readback-to-host speeds anyhow,
though. I still would have liked to have seen it.

Another factor is that it complicates virtualized graphics memory. You
end up losing the ability for the CPU and GPU to work on the same
pages transparently. I guess it really goes back to IEEE or the like
not thinking about the utility of small floats.

Thankfully, the principle behind Moore's law will eventually bring us
to a point in a few years where 32-bit floats are fast, and we'll be
happy.

> PS. I assume that Jason meant 1:5:10 (sign:exponent:mantissa) instead of
> just 5 bits for the mantissa? It would be hard to show a believable blue
> gradient sky with just 5 bits per binade, right?

Yes, sorry for any confusion.

Nick Maclaren

unread,
Dec 19, 2002, 1:58:38 PM12/19/02
to

In article <34c75ec8.02121...@posting.google.com>,

jason_...@pobox.com (Jason Watkins) writes:
|> > John Carmack have been a big advocate for this for a few years, he'd
|> > like 16 fp bits per color component.
|>
|> > OTOH, afaik he'd be perfectly happy with this as a 'inside the graphics
|> > chip only' format, and 32 bits/component would be fine as long as you
|> > had the bandwidth to support it.
|>
|> Yeah... for a certain class of algorithms (dynamic texturing) it would
|> be useful to avoid the conversion. This is the same class of
|> algorithms that is penalized by AGP readback to host speeds anyhow
|> though. I still would have liked to have seen it.

Can you explain why the conversion is an issue? In hardware,
it is an absolute doddle to expand such formats to larger ones,
provided that they use the same model. It is a pain in software
ONLY because there seems to be some reluctance to provide the right
primitives.
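
For instance, expanding the 1:5:10 format discussed above to a 32-bit
IEEE float is just a re-bias and a shift for normal numbers; a rough C
sketch (denormals handled by renormalising, an illustration rather
than a tuned library routine):

#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Expand a 1:5:10 half-precision value to a 32-bit IEEE float.
   Normal numbers: re-bias the exponent (15 -> 127) and shift the
   mantissa up by 13 bits; zeros, denormals and infinities/NaNs get
   the special cases below. */
float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1f;
    uint32_t man  = h & 0x3ff;
    uint32_t bits;
    float f;

    if (exp == 0) {
        if (man == 0) {
            bits = sign;                            /* +/- zero */
        } else {                                    /* denormal */
            exp = 127 - 15 + 1;
            while ((man & 0x400) == 0) {            /* renormalise */
                man <<= 1;
                exp--;
            }
            bits = sign | (exp << 23) | ((man & 0x3ff) << 13);
        }
    } else if (exp == 0x1f) {
        bits = sign | 0x7f800000 | (man << 13);     /* inf or NaN */
    } else {
        bits = sign | ((exp - 15 + 127) << 23) | (man << 13);
    }

    memcpy(&f, &bits, sizeof f);                    /* reinterpret */
    return f;
}

int main(void)
{
    printf("%g %g %g\n",
           half_to_float(0x3c00),    /* 1.0 */
           half_to_float(0xc000),    /* -2.0 */
           half_to_float(0x0001));   /* smallest denormal, 2^-24 */
    return 0;
}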
