1. Is it true?
2. If it is true, what is it about ARM that provides this? Mitch Alsup
has estimated the more complex instruction decoding for X86 costs
perhaps 5%, but even if true, that can't be the whole difference. Is it
perhaps that neither Intel nor AMD has really focused on mips per watt
and could do a lot better if they wanted to? It would seem that at
least Intel would benefit from its better fabs, but that doesn't seem
to compensate. Is it the case that mips per watt is a function of mips?
That is, it costs more watts per mip to provide higher mips and the
difference in mips per watt is just an artifact of the different target
market segments?
3. If it isn't true, what is the reason for the conventional wisdom?
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
Very likely. ARM Cortex-M3 datasheet promises 0.19 mW/MIPS at 180nm.
No actual M3 chip (at least from big vendors) comes close to that
figure. However there are several M3 chips that deliver ~1 mW/MIPS
which is much (at least 10x) better than the best 180nm x86.
Now the comparison is not exactly fair since in absolute performance
180nm x86 was ~10-20 times faster than the fastest Cortex M3.
ARM Cortex A9 vs Intel Atom is more fair since top absolute
performance is pretty close. It is hard to get precise numbers
since A9 SoCs tend to pack more non-CPU functionality than Atom
counterparts, but it seems that at equal performance the A9 consumes
less than half the power under full load. Until recently it also
consumed enormously less (more than 100x less) power when idle, but Intel
mostly fixed that with their latest Moorestown (Z6xx) SoC.
One thing we should take into account when comparing Atom vs Cortex-A9
is that, due to OoO, the A9 is more "robust". In other words, it sucks
less when running ARM Cortex code not specifically optimized for it.
On the other hand, Atom sucks rather badly when running code
optimized for "major" x86 cores from Intel and AMD, less so for
integer, more so for FP/SIMD code. So comparisons based on SPEC CPU or
some other benchmarks compiled specifically for the target
microarchitecture show Atom in a better light than the typical experience
of the majority of users would suggest. Much less so for the Cortex-A9.
I think that 2.5-3 years down the road we will be in a better position for
comparison, since then we will have AMD Bobcat vs ARM Cortex A15. Both
target more or less the same performance point and both feature
similarly capable OoO.
For anyone interested in trying to guess the future direction of HPC,
how far Atom can be pushed in terms of performance/watt relative to
ARM is an important question.
At the risk of starting an AMD fanboy flame war, I wonder about the
relevance/usefulness/motivation of the "information" you have
provided.
What possible use is information from a 180 nm generation of Intel
chips unless, as I suspect, you are looking to support an Intel-will-
never-get-there bias (immovable object?) on your part? That suspicion
is further deepened for me by your urging us to wait to look at Bobcat
for a real answer. The suspicion is deepened still further by your
"Atom sucks at code not specifically compiled for it" potshot. For
whatever it's worth, Fedora distributions are being compiled with Atom
as a target.
In the end, I'm as clueless about the prospects for Atom
relative to ARM as I was before you started your post, but it's clear
you don't like Intel.
Robert.
Methinks, both irrelevant.
> At the risk of starting an AMD fanboy flame war, I wonder about the
> relevance/usefulness/motivation of the "information" you have
> provided.
>
> What possible use is information from a 180 nm generation of Intel
> chips unless, as I suspect, you are looking to support an Intel-will-
> never-get-there bias (immovable object?) on your part?
What can I do with the fact that no serious chip maker makes Cortex-M3
at geometries finer than 180nm?
I would like to compare Cortex-M3 against Intel Banias, but Banias is
130nm.
> That suspicion
> is further deepened for me by your urging us to wait to look at Bobcat
> for a real answer. The suspicion is deepened still further by your
> "Atom sucks at code not specifically compiled for it" potshot. For
> whatever it's worth, Fedora distributions are being compiled with Atom
> as a target.
>
I think that it is wrong to compare an in-order chip against an OoO chip on
metrics that do not emphasize one of the more significant advantages of
OoO in the real world - its ability to achieve decent results on
legacy binaries.
> In the end, I'm as clueless about the prospects for Atom
> relative to ARM as I was before you started your post, but it's clear
> you don't like Intel.
>
> Robert.
You are wrong. I like Intel. I like Dothan, Yonah, Merom and Penryn
cores. Nehalem core is not bad either, although I'd prefer Penryn core/
L2 coupled with Nehalem's awesome uncore stuff. I am a bit skeptical
about Sandy Bridge, but I would be glad if the real life numbers
disprove my skepticism.
But I don't like Atom. See nothing to like about it.
> > At the risk of starting an AMD fanboy flame war, I wonder about the
> > relevance/usefulness/motivation of the "information" you have
> > provided.
>
> > What possible use is information from a 180 nm generation of Intel
> > chips unless, as I suspect, you are looking to support an Intel-will-
> > never-get-there bias (immovable object?) on your part?
>
> What can I do with the fact that no serious chip maker makes Cortex-M3
> at geometries finer than 180nm?
> I would like to compare Cortex-M3 against Intel Banias, but Banias is
> 130nm.
>
You could have done a lot better, or your posts aren't worth the
bother. Intel had no reason to care about power until 90nm. What's
next, you'll hold up 90nm P4 as evidence that ARM will win?
> > That suspicion
> > is further deepened for me by your urging us to wait to look at Bobcat
> > for a real answer. The suspicion is deepened still further by your
> > "Atom sucks at code not specifically compiled for it" potshot. For
> > whatever it's worth, Fedora distributions are being compiled with Atom
> > as a target.
>
> I think that it is wrong to compare an in-order chip against an OoO chip on
> metrics that do not emphasize one of the more significant advantages of
> OoO in the real world - its ability to achieve decent results on
> legacy binaries.
>
Oh, dear God.
Legacy binaries.
Has Wall Street bought, not just politics, but computer architecture?
> > In the end, I'm as clueless about the prospects for Atom
> > relative to ARM as I was before you started your post, but it's clear
> > you don't like Intel.
>
> You are wrong. I like Intel. I like Dothan, Yonah, Merom and Penryn
> cores. Nehalem core is not bad either, although I'd prefer Penryn core/
> L2 coupled with Nehalem's awesome uncore stuff. I am a bit skeptical
> about Sandy Bridge, but I would be glad if the real life numbers
> disprove my skepticism.
> But I don't like Atom. See nothing to like about it.
I'm pretty clueless about Atom so far. Lots of heat. Very little
light. Don't see much light from your contribution, but I'll accept
your representation that you are not an AMD fanboy. You only look
like one. Lots of Opterons still running at the home office? Did it
affect your bonus?
Robert.
The ARM chip has no MMU; that alone likely accounts for roughly a third of
the difference. The huge x86 decoder costs you another third.
Intel/AMD indifference to the microcontroller market costs another third.
(No designs for super low power.)
No conspiracy theories needed.
Bobcat will crush ARM's netbook ambitions like a zit.
All the PC folk will jump on Bobcat for their "iPad killers".
The phone hardware guys will just laugh at the PC folk for this.
Apple will NOT use Bobcat in any iPad ever.
The ARM instruction set was decidedly NOT designed for low power,
that just happened to be the only place they could make sales.
(Opcodes with three sources are not a good way to reduce power. ;)
ARM has no magic pixie dust, just a handful of engineers.
Brett
Afaik, yes. At least for the ARM models designed for cell phones.
>
> 2. If it is true, what is it about ARM that provides this? Mitch Alsup
> has estimated the more complex instruction decoding for X86 costs
> perhaps 5%, but even if true, that can't be the whole difference. Is it
> perhaps that neither Intel nor AMD has really focused on mips per watt
> and could do a lot better if they wanted to? It would seem that, at
> least Intel would benefit from its better fabs, but that doesn't seem
> to compensate. Is it the case that mips per watt is a function of mips?
Yes.
> That is, it costs more watts per mip to provide higher mips and the
> difference in mips per watt is just an artifact of the different target
> market segments?
I believe this is correct.
>
> 3. If it isn't true, what is the reason for the conventional wisdom?
Even if Mitch _can_ design an x86 cpu running on 5-10% of the watts of
current low-power versions, neither he nor anyone else has been allowed
to actually build one.
This can change as form factors keep decreasing, particularly if/when
the display becomes embedded in glasses or even contact lenses.
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Hi, my name is Robert and I'm a third year EECS student. Just started
to follow this newsgroup yesterday. I find computer architecture and
programming languages very interesting and I plan to pursue a master's in
the subject later.
I have never programmed an ARM, only smaller MCUs like the PIC, AVR and
MSP430.
It would be interesting if you Brett, or someone else could explain a
little bit more about the statement above;
"Opcodes with three sources"
What does that mean? (I know what Opcodes are, I've done a very simple
CPU in VHDL, but it's the three sources part I don't get).
Regards, Robert
"Stephen Fuld" <SF...@alumni.cmu.edu.invalid> wrote in message
news:i7t69e$ap$1...@news.eternal-september.org...
> Is it the case that mips per watt is a function of mips? That is, it costs
> more watts per mip to provide higher mips and the difference in mips per
> watt is just an artifact of the different target market segments?
Yes, you can think of this in terms of information theory. Power is work
over time: if you decrease the amount of time available to do the same
amount of work, it takes more energy to get it done (you need higher
frequency and voltage). Overhead like speculation, schedulers, buffers,
pipeline latches, etc. adds even more on top of this.
That is why mips/watt is a poor metric. It optimizes for min power (no
incentive to produce performance). You need to weight performance by 2 or 3
(mips^2/w, or mips^3/w).
Ned
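To make the weighting concrete (the numbers below are purely hypothetical
design points, not measurements of any real chip), here is a small C sketch
of how mips/W and mips^2/W rank a slow, frugal core against a faster,
hungrier one:

    #include <stdio.h>

    /* Hypothetical design points: a slow, very low-power core and a
       faster core whose power grows faster than linearly with speed. */
    struct chip { const char *name; double mips; double watts; };

    int main(void)
    {
        struct chip chips[2] = {
            { "frugal core", 100.0, 0.1 },  /* 1000 mips/W */
            { "fast core",  1000.0, 2.0 },  /*  500 mips/W */
        };
        for (int i = 0; i < 2; i++) {
            struct chip c = chips[i];
            printf("%-12s mips/W = %7.0f   mips^2/W = %10.0f\n",
                   c.name, c.mips / c.watts, c.mips * c.mips / c.watts);
        }
        /* mips/W picks the frugal core; mips^2/W (akin to energy-delay
           product) picks the core that also delivers performance. */
        return 0;
    }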
The current HPCS (which replaced the HPCF) at Cambridge is a large MPI
machine made out of Dell x86 boxen. However there are lots of people
using CUDA on Nvidia cards for more specialised compute requirements.
There is also CamGrid which is a Condor based distributed computing
service based on flocking lab machines and people's desktops into
clusters overnight.
> Oh, dear God.
> Legacy binaries.
> Has Wall Street bought, not just politics, but computer architecture?
I'm afraid being able to execute existing code (and code optimised for
previous microarchitectures) fast is extremely important. There are
various special cases, such as HPC, where people are willing and able
to recompile binaries to extract maximum performance but this is very
much the exception rather than the rule.
Intel has a problem with Atom here: there are lots of existing binaries
which execute perfectly well on an aggressive OoO implementation but are
not performant on the in-order Atom without being recompiled. ARM, on
the other hand, is going from binaries compiled and scheduled for
existing in-order implementations to running them on new out-of-order
implementations.
-p
--
Paul Gotch
--------------------------------------------------------------------
> That is why mips/watt is a poor metric. It optimizes for min power (no
> incentive to produce performance). You need to weight performance by 2
> or 3 (mips^2/w, or mips^3/w).
Obviously one chooses the weighting factor in order to get the answer that
the marketing department wanted.
More seriously, your analysis simply indicates that one should decide when
one needs the result and choose the power budget that will make the
calculation finish just in time. Based on what I've read here over the
years, I'd be really surprised if dafter things have not been attempted in
OS schedulers before now.
I believe he is referring to the fact that classic ARM
instructions have three input operands: the condition code
register and two general purpose register inputs. (Of course,
Brett probably also dislikes requiring three encoded operands
as in MIPS--one destination register name and two source/input
register names--since code density can also improve power
efficiency and often the destination register name can be the
same as one of the source register names.)
Paul A. Clayton
just a technophile
>On Wed, 29 Sep 2010 12:26:58 +0100, nedbrek <ned...@yahoo.com> wrote:
>
>> That is why mips/watt is a poor metric. It optimizes for min power (no
>> incentive to produce performance). You need to weight performance by 2
>> or 3 (mips^2/w, or mips^3/w).
>
>Obviously one chooses the weighting factor in order to get the answer that
>the marketing department wanted.
>
>More seriously, your analysis simply indicates that one should decide when
>one needs the result and choose the power budget that will make the
>calculation finish just in time.
Often enough the opposite is true, particularly in embedded devices ...
the power budget is fixed and you have to select (or worse, create)
components that will do the job within that budget.
>Based on what I've read here over the years, I'd be really surprised if
>dafter things have not been attempted in OS schedulers before now.
There is a lot of research being done on this now. I'm not aware of
any real-world system that tries to minimize CPU watts through
scheduling other than to slow down/power down if all code is pending
on interrupts. Thorny problems involving conditional branches make
doing much more in software either impossibly hard or very
inefficient. Thread scheduling done in the CPU itself looks to be a
possible solution and I have read about some research CPU cores that
aim to do that.
But operating systems that schedule to control device usage have been
around for ages: 2-level core vs swap process scheduling began back in
the 60's although originally it was to enhance performance rather than
reduce power usage.
George
What's so special about machine code? Just take the RTL of the code
(generator) and make a new compile of the code. This should be a
reasonably easy software product. Adding a checksum feature to the
binary loader would be good too, to prevent infection via virus.
That's exactly how one selects an embedded processor.
--
As we enjoy great advantages from the inventions of others, we should
be glad of an opportunity to serve others by any invention of ours;
and this we should do freely and generously. (Benjamin Franklin)
Okay, I see, thanks.
Emphasis in my post on the word *buying*
http://www.sgi.com/company_info/newsroom/press_releases/2010/august/cosmos.html
<quote>
FREMONT, Calif., and Reading, England — August 11, 2010 — SGI (NASDAQ:
SGI), a global leader in HPC and data center solutions, today
announced that the UK Computational Cosmology Consortium (COSMOS),
based at the University of Cambridge, has selected SGI® Altix® UV 1000
to support its research. Altix UV will help cosmologists answer
questions at the foundation of our understanding of how the universe
came to be, of what it is made, how it has evolved and what the future
holds.
</quote>
http://www.sgi.com/products/servers/altix/uv/
> > Oh, dear God.
> > Legacy binaries.
> > Has Wall Street bought, not just politics, but computer architecture?
>
> I'm afraid being able to execute existing code (and code optimised for
> previous microarchitectures) fast is extremely important. There are
> various special cases, such as HPC, where people are willing and able
> to recompile binaries to extract maximum performance but this is very
> much the exception rather than the rule.
>
> Intel has a problem with Atom here: there are lots of existing binaries
> which execute perfectly well on an aggressive OoO implementation but are
> not performant on the in-order Atom without being recompiled. ARM, on
> the other hand, is going from binaries compiled and scheduled for
> existing in-order implementations to running them on new out-of-order
> implementations.
>
So.
If you need the latest in cutting-edge performance (including
performance/watt), you bite the bullet and recompile.
If you want to run ancient binaries, you buy expensive, ancient
hardware or very expensive hardware designed to be binary-compatible
with expensive, ancient hardware. Selling such hardware is IBM's
schtick, not Intel's.
My strong suspicion with Intel and Atom is that there is no margin
there to be had. Once again, they are reaching for high margin
business (as they tried to with Itanium) and the market is pushing the
other way. Intel has the luxury of being some unknown mixture of
cagey and stubborn.
Given what came out of Haifa, the idea that Intel couldn't do whatever
is technologically possible with Atom if it wanted to defies belief.
Robert.
> There is a lot of research being done on this now. I'm not aware of
> any real-world system that tries to minimize CPU watts through
> scheduling other than to slow down/power down if all code is pending
> on interrupts.
There are cell phones out there that attempt to run at the lowest
possible battery voltage and operating frequency until some real time
schedule is not met and then up the voltage and frequency only to the
point that the real time schedule is met.
Mitch
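A minimal sketch of that policy, assuming hypothetical platform hooks
(set_dvfs_level() and missed_deadline() are made up for illustration, not
any real phone API):

    #include <stdbool.h>

    #define MIN_LEVEL 0
    #define MAX_LEVEL 7

    extern void set_dvfs_level(int level);   /* hypothetical V/f setter */
    extern bool missed_deadline(void);       /* hypothetical RT monitor */

    /* Called periodically: creep down toward the lowest voltage and
       frequency, and step back up only when a real-time deadline was
       actually missed. */
    void dvfs_tick(void)
    {
        static int level = MIN_LEVEL;

        if (missed_deadline()) {
            if (level < MAX_LEVEL)
                set_dvfs_level(++level);
        } else if (level > MIN_LEVEL) {
            set_dvfs_level(--level);
        }
    }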
> What's so special about machine code? Just take the RTL of the code
> (generator) and make a new compile of the code. This should be a
> reasonably easy software product. Adding a checksum feature to the
> binary loader would be good too, to prevent infection via virus.
Given a symbol table, what you say is easily possible.
Without one it is fraught with peril and disaster looming at every
turn.
Now consider how much software is shipped WITH a symbol table.
Mitch
I seem to recall that one must also consider the variability
of power-delivery efficiency at different power levels. (I
know I have read that such applies to typical line-power
converters, but I thought that I read somewhere that similar
factors apply to batteries.) As if the trade-offs were not
complicated enough already!
>The ARM chip has no MMU,
Well, that does depend on the chip. The ones I used not only had an MMU, but
you couldn't enable the caches unless the MMU was enabled. I admit, I'm not
familiar with the M3.
- Tim
It's not a marketing problem, but an engineering problem. In the general
purpose CPU space, there is no one target application. You're trying to
enable the next iPhone App Market, or whatever. You have to pick some
tradeoff between power and performance. People used to always want more
performance (at whatever power), now power is a big factor.
It's not obvious where to aim...
Ned
Large parts of all Java code? :-)
We decompiled all 200 MB of SAP's web portal code in order to locate a
threading bug: Their code ships with complete debug info, to the point
where even function-local variables have their names and attributes. :-)
For anything written in C(++) we are back in the Alpha x86 binary
translator era: Yes, it mostly works, and no, even being almost
completely compatible with real hardware isn't nearly good enough.
You do need something like the Crusoe hw support for detecting
funky/self-modifying code without giving up all the performance you were
supposed to gain.
What are the economics of such a project? Compared to the original
capital cost of the software, for example?
It's off-topic, but I claim incidents like this as evidence that
software is a hopeless mess.
Moving on,...
Why *not* ship complete symbol tables and debug information with
binaries?
Somewhat more on-topic, I consider the CMOV and Atom target issues for
Linux as evidence that we either...
a. Have to abandon backward binary-compatibility, or
b. Change the standards for what is acceptable as object code so that
software is not so locked into past architectural choices that are no
longer optimal.
Robert.
> Why *not* ship complete symbol tables and debug information with
> binaries?
It makes reverse engineering so much easier. But I digress.....
> a. Have to abandon backward binary-compatibility, or
The problem is that it is less expensive to build cores that put up
with the binaries that are already out there than to change the
software in any meaningful way. There is a long track record of binary-
compatibility.
> b. Change the standards for what is acceptable as object code so that
> software is not so locked into past architectural choices that are no
> longer optimal.
Many of these choices are mandated at the OS level, so your argument
gets restated as "make all the OS interfaces identical". This falls
into the category of "nada gonna' 'appen".
Similarly, why bundle the runtime library with the application? Install
time is the ideal place to split the library out of the application and
put an optimized library on the target machine where the application will
be installed. This also enables the libraries to be shared. Yet when I
bother to look, MS Office, Photoshop, Adobe Reader, and Corel Draw do not
even use the same *.dlls -- NIH is rampant. Resistance to this will be
fierce unless there is some way to guarantee all libraries contain the
same semantic gyrations.
In essence, Robert has put his finger on the problem. Software
development has become an "Idiocracy" for a whole lot of defensible
reasons and rationales, but "Idiocracy" it remains.
Mitch
But it would make copyright infringement cases so much easier to
bring and to prevail at.
I can imagine the advice of legal departments everywhere: Don't even
*look* at object code. Or, at least, don't send any emails admitting
that you have.
Robert.
> http://www.sgi.com/company_info/newsroom/press_releases/2010/august/cosmos.html
That's COSMOS, a consortium that DAMTP (the Department of Applied
Mathematics and Theoretical Physics) is part of.
The machine (it's gone through a few since the consortium started in
1997) is physically located in Cambridge, but it's owned by the
consortium, not by the university, and it is only available for cosmology
jobs. The HPCS and CamGrid are in principle open to anyone in the
university, obviously it's a bit more complicated than that in
practice.
The previous COSMOS machine was an Itanium based Altix 4700 which they
naturally replaced/augmented with an x86-based Altix UV1000. Before that
they were running MIPS/IRIX based Origin machines. They need a large
single system image shared memory machine and SGI are the only game in
town whatever processors they currently use.
Far as I can tell, Rackable Systems, aka SGI, will cook you up
anything you want from among Itanium, Xeon, and Opteron.
If you're an AMD fanboy with a big checkbook, I'm sure you can order
an SSI system with your favorite processor inside.
Robert.
> Far as I can tell, Rackable Systems, aka SGI, will cook you up
> anything you want from among Itanium, Xeon, and Opteron.
They still seem to offer the 4700, but not with the latest 93XX
Itaniums (aka Tukwila). The Opteron stuff I came across in an
(admittedly cursory) check of their website seems to be "scale-out" -
which I take to mean cluster. I've yet to see anything about
"NUMAflex" for Opteron.
> If you're an AMD fanboy with a big checkbook, I'm sure you can order
> an SSI system with your favorite processor inside.
I was going to ask if that was what Cray were offering with their XE6
and Gemini interconnect chip, but I never saw the term SSI used in the
marketing blurbiages I found and so wonder if it is indeed SSI?
rick jones
--
The glass is neither half-empty nor half-full. The glass has a leak.
The real question is "Can it be patched?"
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
You know we really are overdue for a Cray(Tera) SGI(Rackable) merger. I
mean it's been 14 years since the last one now.
http://www.sgi.com/products/servers/altix/450/index.html
Since it's a blade-based product, I'm sure that, if you wanted enough
of them, you could have them in any flavor you like.
> > If you're an AMD fanboy with a big checkbook, I'm sure you can order
> > an SSI system with your favorite processor inside.
>
> I was going to ask if that was what Cray were offering with their XE6
> and Gemini interconnect chip, but I never saw the term SSI used in the
> marketing blurbiages I found and so wonder if it is indeed SSI?
>
I don't think it is. Cray does talk about the use of vsmp from
ScaleMP to present a virtual single system image to the end user.
Robert.
Perhaps, but not significantly. Perhaps 5%.
Actually, the x86 tax as a fraction varies with the microarchitecture.
With the simplest, in-order, microarchitectures, the x86 penalty may be
higher, 10-15%. But as you get to the levels of complexity of modern
OOO processors such as the ARM Cortex-A9 or A15, it is probably less than 5%.
> 2. If it is true, what is it about ARM that provides this? Mitch Alsup
> has estimated the more complex instruction decoding for X86 costs
> perhaps 5%, but even if true, that can't be the whole difference. Is it
> perhaps that neither Intel nor AMD has really focused on mips per watt
> and could do a lot better if they wanted to? It would seem that, at
> least Intel would benefit from its better fabs, but that doesn't seem
> to compensate. Is it the case that mips per watt is a function of mips?
> That is, it costs more watts per mip to provide higher mips and the
> difference in mips per watt is just an artifact of the different target
> market segments?
>
> 3. If it isn't true, what is the reason for the conventional wisdom?
Because ARM failed to make it in the general purpose desktop PC market -
remember the BBC Micro - they had to make their name in small mobile
devices, beginning with the Newton.
Simply put, they've been doing low power for longer than Intel has taken
low power seriously.
Folks like me and Bob Dreyer tried to get Intel into low power circa
1995, but they only got serious with Atom. Earlier, the Intel 486SL,
etc. pioneered, but were held back because, although there was demand,
the cost structure was not in Intel's favor. Intel simply made more
money selling power-hungry CPUs with separate chipsets than in selling
high-integration low-power parts.
It did not help that up until a few years ago Intel's own technical
leaders, like Shekhar Borkar of the Circuits Research Lab and the folks
who did Itanium, felt that x86 could never be power efficient.
In some ways what has happened is that the performance demands of mobile
CPUs have increased to the point where simple in-order is no longer
sufficient. And as your cell phone becomes an OOO processor, the x86
tax becomes smaller and smaller.
The big question, however, is whether ARM has already become so
established in the mobile marketplace that it cannot be displaced. x86
tools have some advantage, but then so do ARM tools. Microsoft is not
currently a leading SW/OS player in this space. And without Wintel,
Intel must compete not on compatibility but on other factors.
>The previous COSMOS machine was an Itanium based Altix 4700 which they
>naturally replaced/augmented with an x86-based Altix UV1000. Before that
>they were running MIPS/IRIX based Origin machines. They need a large
>single system image shared memory machine and SGI are the only game in
>town whatever processors they currently use.
Is there a good reference for what they're doing that requires the
SSI-nature? Max Planck Institute at Garching seems to be doing a lot
of fairly exciting cosmology (10^10-point octree-accelerated N-body
simulation, most notably the 'Millennium Run') on their pile-of-Power6
machine.
Tom
We didn't even consider any potential problems from decompiling their
code: We definitely weren't about to patch it ourselves, only use it to
help us (and them) figure out where the bug(s) were located.
> large sums of money and prestige that protesting efforts to actually
> fix problems would land them in court and out so large amounts that
> even the largest SW companies on the planet would have a large dent
> in their fortunes. Better let the customer fix the problem, silently.
Not really an option: As a customer you pretty much have to get the
vendor to acknowledge the problem and implement the fix. OTOH, at least
in my experience having a customer pointing directly to your "dirty
linen" tends to get any needed work on the fast track.
>
> If you grossly violate a contract you cannot expect the contract
> provisions protecting you to hold. And customer efforts to fix what
> the vendor should have fixed, to save the customer's business, will
> be easily defended.
>
> What I don't understand is that such large corporations don't insist
> on access to source. Always.
I agree in principle, in reality you often have to settle for
third-party escrow, which is better than nothing.
Even getting that out of an Oracle or Microsoft is a big uphill struggle
unless you're someone like the DoD. :-(
The most common reason is that programs usually need to be redesigned
(and not just edited) to move from a shared-memory model to a
distributed-memory one. At least at one stage, that was true for at
least some of the COSMOS applications.
Whether the people who said "we MUST have a shared memory system" were
correct in saying that it was infeasible to convert or not, I can't
say. There are people who could make an educated guess, but I am
not one.
Regards,
Nick Maclaren.
> Intel has a problem with Atom here: there are lots of existing binaries
> which execute perfectly well on an aggressive OoO implementation but are
> not performant on the in-order Atom without being recompiled.
If just the recompilation (i.e. different scheduling) is enough
to fix that, then why don't they re-schedule it at run-time, by
providing an x86->x86 dynamic binary translator? It would be doable
and even quite simple in most cases, because when not sure, it could just
use the already available and executable "input" code.
A business opportunity?
Best regards
Piotr Wyderski
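A rough, purely illustrative sketch of that idea in C; translate_block()
and the cache layout are hypothetical, and the fallback is exactly the
"if not sure, use the input code" rule above. (The reply further down
points out why caching such translations is harder than it looks.)

    #include <stdint.h>

    typedef void (*code_ptr)(void);

    #define TCACHE_SIZE 4096
    struct tc_entry { code_ptr orig; code_ptr translated; };
    static struct tc_entry tcache[TCACHE_SIZE];   /* direct-mapped cache */

    extern code_ptr translate_block(code_ptr orig);  /* hypothetical; NULL = unsure */

    /* Return rescheduled code for a block, translating it at most once. */
    code_ptr lookup_block(code_ptr orig)
    {
        uintptr_t idx = ((uintptr_t)orig >> 4) % TCACHE_SIZE;

        if (tcache[idx].orig != orig) {
            code_ptr t = translate_block(orig);
            tcache[idx].orig = orig;
            tcache[idx].translated = t ? t : orig;  /* fall back to input code */
        }
        return tcache[idx].translated;
    }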
> Because ARM failed to make it in the general purpose desktop PC
> market - remember the BBC Micro
Actually the Archimedes, the BBC Micro had a 6502. Acorn had a virtual
monopoly of school computer sales in the UK for reasons I never
bothered to find out. ARM IIRC started as a subsidiary of Acorn, and when
the schools finally went over to the PC, ARM became an independent
company and had to find other markets for their chips.
Ken Young
Because they were damn good computers. That's also why there were
a fair number still in use in research laboratories (as data loggers
etc.) 20 years later.
>ARM IIRC started as a subsidiary of Acorn and when
>the schools finally went over to the PC ARM became an independent
>company and had to find other markets for their chips.
Nope. ARM was merely the name created for Acorn's chip design team
when it was spun off - if I recall, the term ARM was used before then
but there was definitely no separate organisation.
Regards,
Nick Maclaren.
It started off as the Acorn RISC Machine, but had become Advanced RISC
Machine before the spinoff, and is now an acronym that officially
doesn't stand for anything.
Tom
Because unless you cache the results of the translation you have to pay
the cost of doing it every time you load the code, and this is not
particularly feasible in terms of load time, battery concerns and
memory. If you do cache the results of the translations you can run
into trouble with copyright, I'd have to dig them out but there certainly
have been court cases which treated the translated and cached binary as
an unauthorised copy.
> http://www.sgi.com/products/servers/altix/450/index.html
> Since it's a blade-based product, I'm sure that, if you wanted
> enough of them, you could have them in any flavor you like.
Looking at http://www.sgi.com/products/servers/altix/450/configs.html
suggests it is an Itanium-based system, using dual-core Itaniums. They
even use the code name Montvale. Doesn't show Opteron, nor "Tukwila"
(Itanium 93XX).
SGI would still need to create an Opteron-oriented NUMAflex bit of
kit. I don't think anyone has done or at least shipped (commercially)
anything like that for Opteron - I can recall seeing the hype about
Horus oh so many years ago, but near as I can recall, neither it, nor
anything like it, ever shipped for Opteron processors. I don't know
if that is indicative of a technical limitation, or simply the size
of, as you put it, the checkbooks of AMD fanboys.
rick jones
--
web2.0 n, the dot.com reunion tour...
>
> Looking at http://www.sgi.com/products/servers/altix/450/configs.html
> suggests it is an Itanium-based system, using dual-core Itaniums. They
> even use the code name Montvale. Doesn't show Opteron, nor "Tukwila"
> (Itanium 93XX).
>
> SGI would still need to create an Opteron-oriented NUMAflex bit of
> kit. I don't think anyone has done or at least shipped (commercially)
> anything like that for Opteron - I can recall seeing the hype about
> Horus oh so many years ago, but near as I can recall, neither it, nor
> anything like it, ever shipped for Opteron processors. I don't know
> if that is indicative of a technical limitation, or simply the size
> of, as you put it, the checkbooks of AMD fanboys.
>
SGI apparently hardly missed a beat in adapting NUMAflex to QPI and
x86. I thought, perhaps naively, that adapting to HyperTransport
would be a modest step by comparison.
Robert.
I'm already completely out of my depth, being SW not HW, but while I
would readily agree that at least at the physical layer the
difficulties should be minor, I would guess that the devil would
be in the details of the coherence protocol(s). A paper from 2004 or
2005 on Horus had a bunch of coherence transaction diagrams. It also
had an expected scaling chart that didn't really look all that
impressive. I'm sure there is more there than is dreamt of in my
philosophy.
rick jones
--
oxymoron n, commuter in a gas-guzzling luxury SUV with an American flag
They dropped Itanium; it would require a major redesign to do a
Tukwila-based design because the memory controller is on the processor.
Rather than doing that, they moved to Xeon, the current high-end box is
a honking big NUMA Xeon box (see
http://www.sgi.com/products/servers/altix/uv/)
>
> SGI would still need to create an Opteron-oriented NUMAflex bit of
> kit.
They did, but it's Xeon EX-based.
> I don't think anyone has done or at least shipped (commercially)
> anything like that for Opteron - I can recall seeing the hype about
> Horus os so many years ago, but near as I can recall, neither it, nor
> anything like it, ever shipped for Opteron processors.
Technically, you could do it with Opteron, but if you're paying for a
box that size, you may as well use the Xeon, it's faster.
--
Nik Simpson
> On Sep 29, 5:22 am, Scorpiion <robert....@gmail.com> wrote:
> [snip]
> > It would be interesting if you Brett, or someone else could explain a
> > little bit more about the statement above;
> >
> > "Opcodes with three sources"
> >
> > What does that mean? (I know what Opcodes are, I've done a very simple
> > CPU in VHDL, but it's the three sources part I don't get).
>
> I believe he is referring to the fact that classic ARM
> instructions have three input operands: the condition code
> register and two general purpose register inputs.
Wrong. ;)
Read the ARM7 manual.
I am talking about shifted Adds, and in particular register shifted Adds.
That makes four inputs, forgot about the condition codes.
Add with Shift:
ADD{S}<c> <Rd>,<Rn>,<Rm>{,<shift>}
Good for calculating the address in an array, the hard way.
A compiler will turn "for i=0 to 50{ float[i] = b[i]}" into simple
pointer incrementing.
ADD (register-shifted register):
ADD{S}<c> <Rd>,<Rn>,<Rm>,<type> <Rs>
Simply crazy?
Three register inputs, the only major CPU design to do this.
Unfortunately no double Add, or Add Minus, etc.
Nothing that could have made the pain worth it.
(~1% of opcodes today are shifts.)
Don't know if ARM cracks the instruction in two, or pays the
heavy price of a third register port.
I keep hearing from people here that adding a third
register read port would cause: famine, floods, pestilence,
the world to end, etc. ;)
I had a new instruction set all ready to go based off of this. ;(
> (Of course,
> Brett probably also dislikes requiring three encoded operands
> as in MIPS--one destination register name and two source/input
> register names--since code density can also improve power
> efficiency and often the destination register name can be the
> same as one of the source register names.)
To which I can only reply, OF COURSE!
Any good variable width instruction set will use one of the sources
for the dest in the 90% of cases where it matches.
x86 does this, the MC68000 did it, everyone smart did.
The RISC people were not smart. ;)
Fixed width opcodes are stupid. ;)
Brett
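As an illustration of the trade-off Brett describes (a sketch only; the
array name "a" is assumed where the original wrote "float[i]"), the same
copy loop can be written so the address is either recomputed with a scaled
index each iteration or strength-reduced to pointer increments:

    /* Two shapes a compiler can give the copy loop discussed above:
       indexed addressing (a candidate for ARM's shifted add, or x86's
       scaled addressing mode) versus strength-reduced pointer increments. */
    void copy_indexed(float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = b[i];        /* address = base + (i << 2) each time */
    }

    void copy_pointers(float *a, const float *b, int n)
    {
        const float *end = b + n;
        while (b < end)
            *a++ = *b++;        /* no shift needed: just bump the pointers */
    }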
I did not know that ARM supported variable shift-and-operate!
[snip]
> Dont know if ARM cracks the instruction in two, or pays the
> heavy price of a third register port.
Not orthogonal, but the third port only needs to
read the 5 LSbits
> I keep hearing from people here that adding a third
> register read port would cause: famine, floods, pestilence,
> the world to end, etc. ;)
Not the end of the world, I thought, just a significant
nuisance when few instructions would use three
register operands. There seems to be increased
interest in three register inputs (MADD and cascading
ALUs perhaps most prominently).
[snip]
> Any good variable width instruction set will use one of the sources
> for the dest in the 90% of cases where it matches.
> x86 does this, the MC68000 did it, everyone smart did.
> The RISC people were not smart. ;)
> Fixed width opcodes are stupid. ;)
x86 does not provide a good means for preserving
all inputs. Perhaps a PRESERVE prefix could have
been useful (for x86-64, this could be used to
avoid a REX prefix by providing a 4-bit destination
register ID and bits to extend the source registers
in the same two bytes required by a MOV). MOV is
perhaps a bit wasteful for code size but is also at
least a little more difficult to fuse into the
following instruction.
Fixed width instructions are probably not stupid
any more than delayed branches are stupid--tradeoffs
change over time and knowledge tends to increase.
Paul A. Clayton
just a technophile
The most permanent reason for favouring fixed-width instructions
is that it makes diagnostics, debuggers and first-level interrupt
handlers rather easier to write and more reliable in practice.
That will remain true as long as systems are programmed in unsafe
languages, like assembler or C/C++.
Regards,
Nick Maclaren.
Is it fair then to infer that the "X86 tax" is limited to instruction
decode, and this is the reason that as you add other complexities such
as OOO that the percentage of that tax decreases? That would mean that
other differences between X86 and ARM (let me call them instruction set
semantics as opposed to syntax; things like two versus three operand
instructions, embedded shifts in ARM, etc.) make no difference to performance?
>> 2. If it is true, what is it about ARM that provides this? Mitch Alsup
>> has estimated the more complex instruction decoding for X86 costs
>> perhaps 5%, but even if true, that can't be the whole difference. Is it
>> perhaps that neither Intel nor AMD has really focused on mips per watt
>> and could do a lot better if they wanted to? It would seem that, at
>> least Intel would benefit from its better fabs, but that doesn't seem
>> to compensate. Is it the case that mips per watt is a function of mips?
>> That is, it costs more watts per mip to provide higher mips and the
>> difference in mips per watt is just an artifact of the different target
>> market segments?
>>
>> 3. If it isn't true, what is the reason for the conventional wisdom?
>
> Because ARM failed to make it in the general purpose desktop PC market -
> remember the BBC Micro - they had to make their name in small mobile
> devices, beginning with the Newton.
>
> Simply put, they've been doing low power for longer than Intel has taken
> low power seriously.
That is certainly true.
> Folks like me and Bob Dreyer tried to get Intel into low power circa
> 1995, but they only got serious with Atom. Earlier, the Intel 486SL,
> etc. pioneered, but were held back because, although there was demand,
> the cost structure was not in Intel's favor. Intel simply made more
> money selling power-hungry CPUs with separate chipsets than in selling
> high-integration low-power parts.
And I am sure it still does. But see below.
> It did not help that up until a few years ago Intel's own technical
> leaders, like Shekhar Borkar of the Circuits Research Lab and the folks
> who did Itanium, felt that x86 could never be power efficient.
>
> In some ways what has happened is that the performance demands of mobile
> CPUs have increased to the point where simple in-order is no longer
> sufficient. And as your cell phone becomes an OOO processor, the x86 tax
> becomes smaller and smaller.
>
> The big question, however, is whether ARM has already become so
> established in the mobile marketplace that it cannot be displaced. x86
> tools have some advantage, but then so do ARM tools. Microsoft is not
> currently a leading SW/OS player in this space. And without Wintel,
> Intel must compete not on compatibility but on other factors.
You are getting to the key issues here. It seems that even ATOM is not
power competitive with ARM, though it could be that with more effort,
Intel (and AMD)'s next chips will be. But I suspect Intel is going to
suffer an "Innovator's Dilemma" type problem of being eaten from below
by things like ARM.
Another advantage of ARM is their business model. There are a variety
of SOC's available from different vendors but all based on the ARM
architecture. Intel has not competed in licensing its cores to other
vendors. It would really hurt their business model. So they win due to
their superior fabs if what they produce is exactly what you need, but
if you require an extra chip for some additional function, they probably
lose.
And, as you said above, without the Windows lock in, they have no
defensible competitive advantage. But it does mean that saying X86 has
won the ISA war may be a temporary situation, or winning may be
redefined to a smaller market segment.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
There is also a cost in terms of memory model: x86 is fairly strongly
ordered and doesn't always require explicit operations to ensure
instruction-to-data-side coherency. ARM is weakly ordered for normal
memory and does require explicit operations.
Actually, I think of all of the things you mention - nondestructive
A=B+C, embedded shifts, etc - as being related to instruction decode,
or at least things related to instruction fetch, decode, and rename.
Here are aspects of the "x86 tax":
* instruction encoding, variable length, etc.
==> more and more evidence that some degree of variable length
encoding is a good idea, but x86 is past the point of benefit.
* not enough registers
==> made better by x86-64 and the REX byte, at the cost of more
complicated instruction decode
* nondestructive A=B+C versus destructive A+=B
==> some evidence even from RISCs that the destructive form is more
efficient in many cases. (But not in others)
==> Many OOO microarchitects have proposed flavors of MOVE
ELIMINATION - stuff that converts the idioms used to do A=B+C on an
A+=B ISA:
MOV regD := regS1
ADD regD += regS2
into
ADD regD := regS1 + regS2
This is long known technology. I don't know of anyone who has shipped a
product that does this, but, once you have PRF rather than separate
ROB/RRF OOO microarchitectures, it is much easier to do. So I would
expect it to happen soon.
The fact that you might really need this on an x86 and not on an ARM
might be considered x86 tax. However, you may eventually want this on
a RISC OOO, because it helps even there (although not as much as on x86).
The above are really just front end and renamer stuff.
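A toy sketch of how move elimination might look in a PRF-style renamer
(heavily simplified: no free list or reference counting of physical
registers, and the function names are made up for illustration):

    #define NUM_ARCH_REGS 16

    /* Rename table: architectural register -> physical register. */
    static int rename_table[NUM_ARCH_REGS];
    static int next_phys_reg = NUM_ARCH_REGS;

    /* Call once: each architectural register initially maps to itself. */
    void rename_init(void)
    {
        for (int i = 0; i < NUM_ARCH_REGS; i++)
            rename_table[i] = i;
    }

    /* MOV regD := regS is "executed" at rename time by aliasing the
       destination to the source's physical register - no uop needed. */
    void rename_mov(int regD, int regS)
    {
        rename_table[regD] = rename_table[regS];
    }

    /* ADD regD := regS1 + regS2 reads the current mappings of its sources
       and allocates a fresh physical register for its result. */
    void rename_add(int regD, int regS1, int regS2,
                    int *psrc1, int *psrc2, int *pdst)
    {
        *psrc1 = rename_table[regS1];
        *psrc2 = rename_table[regS2];
        *pdst  = next_phys_reg++;
        rename_table[regD] = *pdst;
    }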
* Embedded shifts have not really been on my radar for a long time, not
since the HP shift-and-add paper, before multipliers became fast.
Moreover, note that x86 has scaled addressing modes, which remove much
of the need for embedded shifts.
And, frankly, if embedded shifts were hugely advantageous, either
x86 would use a renamer trick, or extend the ISA to support them via a
prefix. So, again, it becomes a front end issue.
* addressing modes
One of the biggest aspects of the x86 tax is addressing modes:
SegBase + BaseReg + IndexReg<<Scale + immediate
Big, mostly not used. Lots of chips have optimized - e.g. AMD
optimizes for SegBase=0, and some Intel chips have optimized for not
having BaseReg and IndexReg at the same time. But, that's complexity.
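For illustration only (struct and field names made up), the kind of source
line that exercises most of that mode at once: a base register for the
array pointer, an index register scaled by the element size, and a constant
displacement for the field, with the segment base normally 0:

    /* Hypothetical example: p[i].y maps onto base + index*scale + disp. */
    struct point { int x; int y; };   /* 8 bytes per element */

    int get_y(struct point *p, long i)
    {
        return p[i].y;   /* roughly: mov eax, [rdi + rsi*8 + 4] */
    }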
* segments
The x86 segment registers are a clump of state, mostly unused.
As a result, mostly unoptimized, not really renamed very aggressively.
And on the average that's okay, but they get modified during events like
interrupts and system calls.
I.e. system events like interrupts and system calls are slowed
down quite a bit by segments, and other examples of the x86 system
architecture.
* x86 system programming
The x86 system programming architecture - interrupt flows, etc. -
is a big part of the x86 tax. Probably the biggest part that is not
instruction decode and rename related.
* page tables
Not fundamentally hard, but the x86 page tables are a big
constraint. ARM's page tables have a bit more room for extension.
* memory ordering
x86 memory ordering gets in the way of a lot of optimizations.
On the other hand, there is a trend in other architectures to
go to stronger memory ordering. Weak ordering is so passe... that I am
thinking about it, e.g. in bitmask consistency, because I am contrarian.
ARM doesn't have the x86 memory ordering model. Indeed, ARM
hasn't had much of a memory ordering model until recently. And, I am
told by friends who work with ARM systems, this has caused compatibility
problems.
ARM's biggest advantage: They have less legacy. They have a lot of
legacy, but nowhere near so much as Intel. They might be able to get
things right when they have to implement them the first time. Thumb?
>> The big question, however, is whether ARM has already become so
>> established in the mobile marketplace that it cannot be displaced. x86
>> tools have some advantage, but then so do ARM tools. Microsoft is not
>> currently a leading SW/OS player in this space. And without Wintel,
>> Intel must compete not on compatibility but on other factors.
>
> You are getting to the key issues here. It seems that even ATOM is not
> power competitive with ARM, though it could be that with more effort,
> Intel (and AMD)'s next chips will be. But I suspect Intel is going to
> suffer an "Innovator's Dilemma" type problem of being eaten from below
> by things like ARM.
Yep.
> Another advantage of ARM is their business model. There are a variety of
> SOC's available from different vendors but all based on the ARM
> architecture. Intel has not competed in licensing its cores to other
> vendors. It would really hurt their business model. So they win due to
> their superior fabs if what they produce is exactly what you need, but
> if you require an extra chip for some additional function, they probably
> lose.
>
> And, as you said above, without the Windows lock in, they have no
> defensible competitive advantage. But it does mean that saying X86 has
> won the ISA war may be a temporary situation, or winning may be
> redefined to a smaller market segment.
x86 has by no means won the ISA war.
It is entirely possible that ARM may win in the long run.
The more interesting question is whether the ISA "war" contestants are
limited to ARM and Intel, or whether there is room for others. E.g.
MIPS or SPARC may rise from the ashes, or somebody new.
Certainly, there is opportunity for new ISAs, as we see several factors
converge:
* low power
* the x86 tax (not a big factor, IMHO)
* more important, Intel has not realigned itself
to ARM levels of low power
* non-Microsoft systems
* e.g. Android in mobile
* gpus
* datacenters
Datacenters are a threat to Intel and Microsoft and x86 because
datacenters often run a smaller set of applications than consumer
devices. And they are centrally administered. So, if the datacenter
can reduce the number of servers required by 10% by switching ISAs, to
an ISA where the small set of software they care about runs like it does
on x86, they may.
And even if they don't switch ISAs, datacenters hurt profits: better
administered systems require fewer chips, having a higher utilization.
It is a question whether the sales lost to consolidation will be
outpaced by sales due to growth of market. (My bet is that
consolidation won't shrink Intel)
Finally, it's not clear that it is an "ISA war" any more. The GPUs and
Java (and Python, and Perl, and C#, and ...) have shown that code that
is independent of binary machine representation is possible. It is
quite likely that the amount of code that is ISA-specific, whether x86
or ARM, will shrink. Certainly as a fraction of total code; it will
take longer to shrink in absolute terms.
But Intel and Microsoft have by no means lost the battle. They may yet
get serious and win; or, rather, they are probably already seriously
competing, and they may seriously win. Remember Microsoft versus
Netscape? Remember Intel versus the RISCs?
There is an opportunity for non-x86, non-Wintel.
In particular, there is an opportunity for ARM.
But I'm not even sure that the odds are below 1:1 for Intel being
supplanted.
===
By the way, I predict that eventually Intel will become a fab, fabbing
other companies' SOC designs. But I dare not predict when. >10 years out?
Either Intel will become a fab, or Intel will have lost, or the computer
industry will have stopped innovating.
Us armchair CPU architects had assumed that AMD/Intel had been doing
this for the past decade...
> * Embedded shifts have not really been on my radar for a long time, not
> since the HP shift-and-add paper, before multipliers became fast.
> Moreover, note that x86 has scaled addressing modes, which remove much
> of the need for embedded shifts.
> And, frankly, if embedded shifts were hugely advantageous, either
> x86 would use a renamer trick, or extend the ISA to support them via a
> prefix. So, again, it becomes a front end issue.
Someone want to spill the beans and post a URL or title of the HP paper.
ARM 7 has Shift And Add, ARM 5 does not, Cortex does not.
I assumed it was legacy and was being phased out.
This of course may turn out to be wrong-headed. Why not expand out that
third port and do double Adds, etc?
> * addressing modes
> One of the biggest aspects of the x86 tax is addressing modes:
> SegBase + BaseReg + IndexReg<<Scale + immediate
> Big, mostly not used. Lots of chips have optimized - e.g. AMD
> optimizes for SegBase=0, and some Intel chips have optimized for not
> having BaseReg and IndexReg at the same time. But, that's complexity.
I view this as an x86 advantage. More work per cycle as opposed to less
like RISC. Complexity that pays for itself in performance is a win.
> * segments
> The x86 segment registers are a clump of state, mostly unused.
> As a result, mostly unoptimized, not really renamed very aggressively.
> And on the average that's okay, but they get modified during events like
> interrupts and system calls.
> I.e. system events like interrupts and system calls are slowed
> down quite a bit by segments, and other examples of the x86 system
> architecture.
AMD64 has all but eliminated segments, right?
> * x86 system programming
> The x86 system programming architecture - interrupt flows, etc. -
> is a big part of the x86 tax. Probably the biggest part that is not
> instruction decode and rename related.
>
> * page tables
> Not fundamentally hard, but the x86 page tables are a big
> constraint. ARM's page tables have a bit more room for extension.
>
> * memory ordering
> x86 memory ordering gets in the way of a lot of optimizations.
> On the other hand, there is a trend in other architectures to
> go to stronger memory ordering. Weak ordering is so passe... that I am
> thinking about it, e.g. in bitmask consistency, because I am contrarian.
>
>
> x86 has by no means won the ISA war.
A decade from now we may be debating how long x86 can hold on in fringe
demand.
> It is entirely possible that ARM may win in the long run.
ARM the company maybe, ARM the architecture has been a dead man walking
since Thumb2 and Cortex, a half decade ago.
> The more interesting question is whether the ISA "war" contestants are
> limited to ARM and Intel, or whether there is room for others. E.g.
> MIPS or SPARC may rise froom the ashes, or somebody new.
Dinosaurs are not going to revive from their 65 million year slumber
and re-conquer the earth.
The sooner MIPS and the others stop trying to revive their youth and
instead design a new architecture for the future, the sooner they will
have a future.
> There is an opportunity for non-x86, non-Wintel.
>
> In particular, there is an opportunity for ARM.
ARM is a flash in the pan, they could be gone tomorrow.
Just another RISC chip, replaceable overnight.
> But I'm not even sure that the odds are below 1:1 for Intel being
> supplanted.
>
>
> By the way, I predict that eventually Intel will become a fab, fabbing
> other companies' SOC designs. But I dare not predict when. >10 years out?
>
> Either Intel will become a fab, or Intel will have lost, or the computer
> industry will have stopped innovating.
Intel would want to extract monopoly style profits from its fab advantage,
no one will sign on for long, instead as the money dwindles the fab
advantage will evaporate.
The Intel empire will die, and what's left will drift around rudderless and
nearly profitless as a ghost of its former self.
Wall Street will install a new CEO to turn around profits, but that
turn around may be just old fashioned Enron accounting.
Intel has lots of assets that could be stripped, and what's left might be
worthless.
A Dot Bomb style bubble so Wall Street can flip Intel for a quick buck,
based off the name recognition.
Cynical, aren't I? ;)
Brett
> On Oct 2, 3:43 am, Brett Davis <gg...@yahoo.com> wrote:
> [snip]
> > Wrong. ;)
> > Read the ARM7 manual.
> > I am talking about shifted Adds, and in particular register shifted Adds.
> > That makes four inputs, forgot about the condition codes.
>
> I did not know that ARM supported variable shift-and-operate!
>
> [snip]
> > Don't know if ARM cracks the instruction in two, or pays the
> > heavy price of a third register port.
>
> Not orthogonal, but the third port only needs to
> read the 5 LSbits
Point taken, that solves the wire density problem, which is a
non-problem for embedded anyway; embedded never sees 2GHz.
The real complaint I hear is about OoO register scheduling - have hard
limits been hit?
> > I keep hearing from people here that adding a third
> > register read port would cause: famine, floods, pestilence,
> > the world to end, etc. ;)
>
> Not the end of the world, I thought, just a significant
> nuisance when few instructions would use three
> register operands. There seems to be increased
> interest in three register inputs (MADD and cascading
> ALUs perhaps most prominently).
>
> Fixed width instructions are probably not stupid
> any more than delayed branches are stupid--tradeoffs
> change over time and knowledge tends to increase.
Fixed width made sense for about one process node, and
even that is a stretch. In short it was STUPID.
If Cortex or MicroMIPS had shipped back in the 1980's
instead of ARM or MIPS, we might be complaining about
the ARM/MIPS monopolies instead of Intel's monopoly.
ARM and MIPS designs do not scale down market, and
the up market was too small to grow the companies fast
enough to become truly successful.
Cortex or MicroMIPS would have crushed MC68000 in short
order, giving the volumes and profits to become major
forces in the industry.
I am being KIND in only calling fixed width STUPID, my
contempt for the design approach is absolute.
History, 20/20 hindsight, and the facts, make this choice clear.
Brett
When the modern flavour of RISC designs started to be trendy, I was a
little disgusted at the way several of them introduced other forms
of 'tax' (generally via avoidable register dependencies). Yes,
instruction decode was a serious problem for x86 at that time, but
semantic dependency had also been a known problem for a LONG time.
So I believe that there is some 'tax' that is not just in the decode,
but the same is probably true for ARM - I don't know it well enough
to even guess how much. Is that fair?
Regards,
Nick Maclaren.
Brett, you impress me a lot: Not only are you insightful and
understanding, you are also a very humble guy who fully groks the exact
environment former cpu architects worked in and the limitations they had
to consider.
I would think that, at least in the cases where developers care about
performance, the common x86 idiom for non-destructive OP is
MOV regS2, regD
OP regD, regS1
That would be harder to handle at the front end.
That isn't just the "common idiom", it is the _only_ idiom!
All the way back to the original Pentium, this has been the default
setup, sometimes followed by one or two more pairs of MOV+OP just like
the first one, for those situations where a single variable is involved
in multiple different operations.
> That would be harder to handle at the front end.
Would it really?
The only new requirement is that a reg-reg MOV must be identified as
setting up an identity between the two involved registers. This should
be simple in a P6 style OoO architecture where you always allocate a new
virtual register as the target for every operation: Since the sources
are always read-only, it is easy to reuse a single source reg, i.e. use
the MOV to define them to be the same.
With a physical register file this would seem to be a bit harder to do?
Given that my Samsung "Galaxy S" phone[1] already has a
1 GHz CPU[2] in it, 2 GHz cellphones can't be far behind... ;-}
-Rob
[1] AT&T "Captivate" flavor.
[2] Samsung S5PC110 "Hummingbird": Cortex-A8 CPU + PowerVR SGX540 GPU.
-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607
Even the ancient ARM5 variant that Intel used in the 80219 XScale
had an MMU...
-Rob
The problem is that this requires a rename of two architectural
registers - something which is itself uncommon in other instructions.
Now, its possible that you could build logic to detect this case and
fuse them post-decoder, but you're obviously not getting all of the
benefits here.
Never mind that it adds more complexity to the decoder, which isn't
exactly advantageous for increasing the decode rate and certainly
increases power consumption.
I certainly don't think that a new architecture should force 3 operands
for all instructions - but I also think it should introduce a variable
length encoding where the most common operations come in both 2- and
3-operand forms.
Thumb-2 is very good in this respect; its just that many of the
instruction bits are not pulling as much weight as they ought to.
- Owen
(1) You may have missed this, since I am seldom as vociferous as you,
Brett, but: I think that having a third read port on the register file
per instruction is a net stupidity, worse than fixed width instructions.
Note that I said "a third RF read port per instruction".
I am certainly not talking about so-called three operand form, A=B+C
- 2 inputs, 1 output.
I am not even totally dissing three input forms such as FMA, floating
point multiply add, A=B*C+D.
However, having a third RF port per instruction, when you are 4 or more
instructions wide - that's inefficient use of resources.
Heck: for 2-input-operand forms, there's pretty good evidence that
having 2 RF read ports per instruction is suboptimal.
Note: a PRF instruction window rather than ROB/RRF has nothing to do
with the portedness of the PRF.
Most operands get picked up on a bypass path.
If you have a capture RS, you can similarly save RF read ports.
In MultiStar I propose evolving the capture RS to be a first level RF,
PRF1 - a small structure that may have full, or at least many, ports,
with a larger main PRF that has far fewer ports.
(2) By "double Adds", I assume you mean cascaded Adds:
R1 := R2+R2
R4 := R1+R5
I'm all for that (and even deeper logic trees). Although I tend to
think of it as being done dynamically, e.g. separate instructions that
are scheduled together, rather than having a single instruction that
does two adds.
>> * addressing modes
>> One of the biggest aspects of the x86 tax is addressing modes:
>> SegBase + BaseReg + IndexReg<<Scale + immediate
>> Big, mostly not used. Lots of chips have optimized - e.g. AMD
>> optimizes for SegBase=0, and some Intel chips have optimized for not
>> having BaseReg and IndexReg at the same time. But, that's complexity.
>
> I view this as an x86 advantage. More work per cycle as opposed to less
> like RISC. Complexity that pays for itself in performance is a win.
It is an advantage if the work is useful.
SegBase is nearly always not useful.
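For concreteness, the quoted addressing mode written out as plain
arithmetic - a minimal C sketch, with function and field names of my
own choosing, not how any particular AGU is built:

#include <stdint.h>
#include <stdio.h>

/* SegBase + BaseReg + (IndexReg << Scale) + immediate */
static uint64_t effective_address(uint64_t seg_base, uint64_t base_reg,
                                  uint64_t index_reg, unsigned scale_log2,
                                  int64_t imm)
{
    return seg_base + base_reg + (index_reg << scale_log2) + (uint64_t)imm;
}

int main(void)
{
    /* flat segment, element 5 of an 8-byte array at 0x1000, +16 */
    printf("%llu\n",
           (unsigned long long)effective_address(0, 0x1000, 5, 3, 16));
    return 0;
}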
> Cynical aren't I. ;)
Cynical presumes wisdom, knowledge, or insight.
Uninformed cynicism is just ...
No.
Move elimination essentially converts all register to register MOVes
into NOPs. Not even NOPs.
Heck, many folks, such as my friend Andreas Moshovos, have proposed move
elimination through memory.
There are academic papers on this. I recommend reading them. And then
bear in mind that industry thought on this is about 5-10 years ahead of
academia.
With a PRF, MOVe elimination is *easier* to do.
MOVe elimination basically involves pointing two renamer entries at the
same PRF entry, and then doing some form of reference counting or
garbage collection to allow the preg to be deallocated when all copies
are overwritten.
The ROB/RRF makes MOVe elimination more complex, because midway through
the lifetime of a value the ROB entry associated with the older lreg
mapped to that value gets deallocated, and the value is moved to the RRF.
The younger lreg then has nowhere to point.
Having a PRF makes all sorts of register renamer tricks easier.
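To make the reference counting concrete, here is a toy C sketch of the
bookkeeping - the structures and names are mine, not taken from any real
design, and a real machine would of course tie freeing to retirement:

#include <stdio.h>

#define NLOG  8              /* logical registers              */
#define NPHYS 64             /* physical registers (plenty)    */

static int rat[NLOG];        /* renamer: logical -> physical   */
static int refcnt[NPHYS];    /* logical aliases per preg       */
static int next_free = NLOG; /* toy free "list": just a cursor */

static void release(int preg)
{
    if (--refcnt[preg] == 0)
        printf("  preg %d can be reclaimed\n", preg);
}

/* Ordinary op writing logical reg d: allocate a fresh preg. */
static void rename_dest(int d)
{
    release(rat[d]);
    rat[d] = next_free++;
    refcnt[rat[d]] = 1;
}

/* Eliminated MOV d := s: no ALU op, no new preg - just point d's
   renamer entry at s's preg and bump the reference count.       */
static void eliminate_mov(int d, int s)
{
    release(rat[d]);
    rat[d] = rat[s];
    refcnt[rat[d]]++;
}

int main(void)
{
    for (int i = 0; i < NLOG; i++) { rat[i] = i; refcnt[i] = 1; }

    rename_dest(2);      /* r2 := ...                                */
    eliminate_mov(4, 2); /* MOV r4, r2 - eliminated                  */
    rename_dest(2);      /* r2 overwritten: shared preg still live   */
    rename_dest(4);      /* r4 overwritten: shared preg now reclaims */
    return 0;
}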
The only reason P6 had the ROB/RRF was that it meshed well with Glenn
Hinton's branch misprediction recovery mechanism. My UIUC design,
HaRRM, was PRF based.
> Brett Davis wrote:
> > I am being KIND in only calling fixed width STUPID, my
> > contempt for the design approach is absolute.
> > History, 20/20 hindsight, and the facts, make this choice clear.
>
> Brett, you impress me a lot: Not only are you insightful and
> understanding, you are also a very humble guy who fully groks the exact
> environment former cpu architects worked in and the limitations they had
> to consider.
>
> Terje
;) ;) ;)
I took the extremist position opposite to that of a founder of MIPS.
During the hiring interview he would ask the prospect if they
would add anything to the CPU that would hurt performance,
and if so the candidate was shown the door.
Adding variable width instructions can add a decode stage,
turning a 5 cycle pipeline into a 6, costing - it was thought -
5 to 10% in performance. (It turned out to cost ~2%.)
In addition to being slower, variable width adds a potential
maintenance headache, design time, die cost, etc. (True)
Instruction fetch ahead is quite good, so instruction size is
irrelevant, small instructions are not fetched any faster. (True)
This should make the greater instruction cache re-use of small
instructions unimportant. (False)
So yes, I grok your general environment and the limitations.
A staff that was selected for "drinking the Kool-aid", and
a bunch of unimportant reasons for not doing what the CUSTOMERS
WERE DEMANDING.
In the embedded space 90% of ROM and RAM space is taken up by
code, not data. So if one picks MIPS suddenly your product is
going to cost 10 cents more, and when Wal-Mart and the other
big buyers call for price quotes, you lose.
Wal-Mart buys the old product with the MC68000 in it.
The engineer DEMANDS smaller opcodes, and he does not care
about even a 10% performance drop. He had actually planned to
underclock the CPU by half compared to the MC68000 design
while still getting better performance, and save 10 cents in
the power supply.
---
So were you one of the people proposing variable width and
getting shot down?
Or were you doing the shooting? ;) ;)
And now have switched to the winning side. ;) ;)
It took ~two decades for MIPS/ARM to add variable width
instructions, clearly none of them wanted to do it, they
were dragged kicking and screaming by their customers to
add variable width instructions.
Your reply has told me that I have won this argument and
should move on to the next battle. You tire of me beating
your old dead horse. ;) ;) ;)
Brett
snipped really enlightening discussion of the "X86 tax". Thanks Andy!
> ARM's biggest advantage: They have less legacy. They have a lot of
> legacy, but nowhere near so much as Intel. They might be able to get
> things right when they have to implement them the first time. Thumb?
According to Wikipedia, work started on the 8086 in 1976 (though they
did have the 8080 semi compatibility legacy before that). It also gives
the start date for ARM architecture as 1093. So yes, there is a
difference, but I don't think that alone accounts for ARM's success.
MIPS and SPARC have far less legacy than X86, but they don't seem to be
the MIPS-per-watt winners that ARM is.
>>> The big question, however, is whether ARM has already become so
>>> established in the mobile marketplace that it cannot be displaced.
Absolutely agreed. That is the big question.
I don't see MIPS or SPARC, but I guess it is possible. The question to
ask is whether there is some compelling feature/spec that a putative
competitor could offer that would justify the cost of conversion.
I would guess that, say, a 30% improvement in MIPS per watt might be that,
but from your earlier comments, I don't know if that is possible. If
you were to start out with a clean slate with the goal of maximizing
MIPS per watt for an iPad-class system, what percentage improvement over
ARM could you achieve?
> Certainly, there is opportunity for new ISAs, as we see several factors
> converge:
>
> * low power
> * the x86 tax (not a big factor, IMHO)
> * more important, Intel has not realigned itself
> to ARM levels of low power
> * non-Microsoft systems
> * e.g. Android in mobile
> * gpus
> * datacenters
>
> Datacenters are a threat to Intel and Microsoft and x86 because
> datacenters often run a smaller set of applications than consumer
> devices. And they are centrally administered. So, if the datacenter can
> reduce the number of servers required by 10% by switching ISAs, to an
> ISA where the small set of software they care about runs like it does on
> x86, they may.
Interesting speculation. But I thought the server market was something
like 1% of the desktop market. If that is so, and the mobile market is
say 25% of the desktop market, I don't see a 4% market being worth a
great deal. But if it came for almost no cost in a new architecture,
then perhaps.
> And even if they don't switch ISAs, datacenters hurt profits: better
> administered systems require fewer chips, having a higher utilization.
> It is a question whether the sales lost to consolidation will be
> outpaced by sales due to growth of market. (My bet is that consolidation
> won't shrink Intel)
Agreed, though it may slow its growth. I think the server compute power
will increase faster than consolidation will decrease it.
> Finally, it's not clear that it is an "ISA war" any more. The GPUs and
> Java (and Python, and Perl, and C#, and ...) have shown that code that
> is independent of binary machine representation is possible. It is quite
> likely that the amount of code that is ISA-specific, whether x86 or ARM,
> will shrink. Certainly as a fraction of total code; it will take longer
> to shrink in absolute terms.
Good point. But then we are back to the discussion of what could make it
economically worthwhile for someone to convert to a different ISA. It
costs something, so there has to be some advantage to make it worthwhile.
> But Intel and Microsoft have by no means lost the battle. They may yet
> get serious and win; or, rather, they are probably already seriously
> competing, and they may seriously win. Remember Microsoft versus
> Netscape? Remember Intel versus the RISCs?
>
> There is an opportunity for non-x86, non-Wintel.
>
> In particular, there is an opportunity for ARM.
>
> But I'm not even sure that the odds are below 1:1 for Intel being
> supplanted.
Agreed.
>
> ===
>
>
> By the way, I predict that eventually Intel will become a fab, fabbing
> other companies' SOC designs. But I dare not predict when. >10 years out?
Didn't they try that once, with almost no success? But they may do
better the next time. Their fabs are certainly a big strength and they
could leverage that, but I suspect it would take a big mind set change.
But you know the corporate culture much better than I do.
> Either Intel will become a fab, or Intel will have lost, or the computer
> industry will have stopped innovating.
By the way, I think I may see a misunderstanding:
The register renaming technique is orthogonal to the ROB/RRF vs PRF issue.
In the Andy Glew UIUC HaRRM (Hardware Register Renaming Mechanism) style
of OOO, you always allocate a new virtual register as the target for
every operation. (By the way, there may be earlier examples of this
technique, although I don't know of them; I am just saying "Andy Glew
UIUC HaRRRM" as a way of full identification, not as a claim to priority.)
My original HaRRM was PRF based.
P6 was an example of HaRRM, with read before schedule and a capture RS
and ROB/RRF.
Wmt was an example of HaRRM, with a PRF read after schedule.
Sandybridge is an example of HaRRM, with a PRF and read after schedule.
===
There have been examples of other, non-HaRRM, register renaming
mechanisms, although they are not so common now as they once were.
For example, my advisor's HPSM (Hwu, Patt, Shebanow, and Melvin's High
Performance Substrate for Microprocessors) used register renaming via
CAMs in its RAT. HPSM's RAT was CAM-based; HaRRM's RAT is not. IIRC
HPSM's RAT actually was both a PRF and the register renamer.
Postulate #1: desktops are dead!?
Postulate #2: from the end-user point of view, mobile is all that matters.
Postulate #3: much end-user computation and data storage is moving on to
the cloud.
Personally, I have not bought a desktop in 10 years.
I don't think that I have used a desktop, at home or at work, in 9 years.
In the past 4 years, more and more of my family's stuff has been moving
onto the cloud.
My wife - a rather extreme computer luddite - just loves the cloud. She
loves it that her Google docs are accessible to her from home or work,
wherever she is.
If this trend sticks - and surveys seem to say that it is - then
"personal" computation gets done in cloud datacenters, except when it is
latency sensitive, in which case it gets done in your phone or laptop or
tablet.
All this being said: I am typing this while looking at 3 external LCD
panels plugged into my tablet PC. The rise of cheap LCD panels may
lead to a revival of the desktop. I may consider getting a cheap
desktop, not because I want to store anything on it, but so that I don't
have the hassle of having to dock and undock my external displays.
I.e. I may consider getting a desktop as a cheap accessory to my tablet
PC. Not vice versa. With as much as possible on the cloud, for
ubiquitous access.
> If you do cache the results of the translations you can run
> into trouble with copyright, I'd have to dig them out but there certainly
> have been court cases which treated the translated and cached binary as
> an unauthorised copy.
Interesting, I didn't even consider legal issues.
Anyway, in Poland one is allowed to make a copy
of legally possessed programs and multimedia, e.g.
to be used as a backup carrier and to me it seems
to be fair.
Best regards
Piotr Wyderski
> ARM is a flash in the pan, they could be gone tomorrow.
> Just another RISC chip, replaceable overnight.
An interesting but probably wrong opinion. ARM sell more than processor
designs: they sell integrated solutions with various peripheral functions
built into one chip. As a result ARM has a lock on most of the high end
embedded market. The low end embedded market is mainly 8 bit processors;
the 8051 is still in use. In addition there is no way outside ARM of
even guessing how many ARM processors are in use. ARM breaks its income
into that from direct sales and that from licensing. Licence terms are
commercially confidential and probably vary from company to company,
therefore there is no way of estimating processor use just from licence
information.
In addition nobody using ARM is required to tell the world that they
are using it. While I can guess that all smart phones use ARM there is
no way I can confirm that.
Ken Young
A definitely wrong opinion. 20 years of success cannot be classed
as a flash in the pan, even if ARM closes shop later today. Which,
as you say, is implausible.
Regards,
Nick Maclaren.
Are you talking about move elimination
1) At the in-order front end, as part of instruction decoders
2) At the in-order front end, around the sink end of instruction queue
3) During [still in-order] allocation/renaming stage
4) Within OoO engine.
If (4), doesn't it provide too little benefit to be implemented?
If (2), (3) or (4), could such a technique really be considered "a
solution" for the x86 2-op tax? After all, our n-wide front end still
delivers less work per cycle than would be possible if there were
"native" non-destructive OPs.
ARM do occasionally produce figures for the number of cores shipped
cumulatively and per quarter. The last time this was done would have
been
http://www.arm.com/about/newsroom/24403.php
You can guess the run rate by looking at earnings reports for example:
"Q2 [2010] processor royalty revenue came from the sales of 1.4 billion ARM
technology-based chips."
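(Simple arithmetic on that figure: 1.4 billion a quarter is a run rate
of roughly 5.6 billion ARM-based chips a year.)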
I have always liked variable width, based simply on an information
density argument. I.e. a very quick&dirty form of Huffman encoding.
Afair, at one particular point in time you could make a particular x86
code run faster by compiling to byte code and then interpret that: The
significantly smaller working set for this specific application meant
that it would now fit in cache.
OTOH, I have also been quite willing to accept it when the RISC faithful
stated that instruction decode was such a bottleneck that it was better
to make them all the same size, since that made the entire cpu faster.
I have even been willing to believe that the balance point for this
particular choice could move back and forth, just like pretty much all
other such engineering decisions.
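A back-of-the-envelope way to see both sides of that density argument -
all the numbers below are invented purely for illustration, they are not
measurements of any real ISA:

#include <stdio.h>

/* Frequency-weighted average instruction size, in bytes. */
static double expected_bytes(const double freq[3], const double len[3])
{
    double e = 0.0;
    for (int i = 0; i < 3; i++)
        e += freq[i] * len[i];
    return e;
}

int main(void)
{
    /* classes: [0] short/common ops, [1] medium, [2] long/rare ops */
    double var_len[3]   = { 2.0, 4.0, 6.0 };  /* variable encoding  */
    double fixed_len[3] = { 4.0, 4.0, 4.0 };  /* fixed 4B encoding  */

    double matched[3]    = { 0.70, 0.25, 0.05 }; /* mix the encoder assumed */
    double mismatched[3] = { 0.20, 0.30, 0.50 }; /* a very different mix    */

    printf("matched mix:    variable %.2fB, fixed %.2fB\n",
           expected_bytes(matched, var_len),
           expected_bytes(matched, fixed_len));
    printf("mismatched mix: variable %.2fB, fixed %.2fB\n",
           expected_bytes(mismatched, var_len),
           expected_bytes(mismatched, fixed_len));
    return 0;
}

With the assumed mix the short encodings win easily (about 2.7 vs 4
bytes per instruction); with the very different mix the fixed encoding
comes out ahead - the case where the encoder's assumed statistics were
wrong.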
1) If done during decode, you are restricted to adjacent instructions -
but you can probably treat a MOV followed by a use as a single instruction.
4) If done during rename, you can handle MOVes and uses that are far apart.
How about 4), caching the results in a decoded uop cache? So you don't
suffer the n-wide uop limitation while executing out of the uop cache,
only while executing the code for the first time.
By the way, most big codes that I am familiar with get nowhere close to
achieving N=4 or wider instructions per cycle. That is one reason why
x86's variable width instructions tend to have advantages over RISCy
fixed width instructions - smaller code.
You only get close to the instruction decode limit when executing fairly
tight loops. Which happens to be when a decoded uop cache works best.
For big codes, the motivation for MOVe elimination is to eliminate the
latency of MOVes - historically single cycle. In fact, some proposals
still leave the MOVe in the instruction stream, since we are not so
often instruction bandwidth limited: they just make subsequent
instructions not depend on it.
That would cost an additional field in the decoded instruction format - the
2nd destination register.
And I would think still some additional processing would be required
during allocation/rename.
>
> 4) If done during rename, you can handle MOVes and uses that are far apart.
>
> How about 4), caching the results in a decoded uop cache? So you don't
> suffer the n-wide uop limitation while executing out of the uop cache,
> only while executing the code for the first time.
If the uop cache is not tiny then widening it would cost non-trivial
area.
>
> By the way, most big codes that I am familiar with get nowhere close to
> achieving N=4 or wider instructions per cycle. That is one reason why
> x86's variable width instructions tend to have advantages over RISCy
> fixed width instructions - smaller code.
True, but with 16B fetch you quite often get fewer than 4 instructions
out of the instruction length detector. I'd think that, on SSE code,
fewer than 3 instructions also happens more often than we would like.
Hopefully, the new encoding introduced with AVX is going to reduce the
severity of the problem.
>
> You only get close to the instruction decode limit when executing fairly
> tight loops. Which happens to be when a decoded uop cache works best.
>
> For big codes, the motivation for MOVe elimination is to eliminate the
> latency of MOVes - historically single cycle. In fact, some proposals
> still leave the MOVe in the instruction stream, since we are not so
> often instruction bandwidth limited: they just make subsequent
> instructions not depend on it.
The idiom, mentioned above, exists exactly for the purpose of minimizing
the significance of the latency of MOVes.
I think that displacing ARM there doesn't just require N% better MIPS
per watt from your processor, but also at least parity with ARM's MIPS
per watt and equal or greater performance in order to be viable.
The important thing is that, for the consumer to accept it as a genuine
upgrade, then their existing apps must continue to work at least as
well. Because we cannot expect apps to magically shift to the new
architecture overnight, you need some form of compatibility layer (All
major OS transitions have demonstrated that; perhaps the most extreme
being PalmOS' 68k to ARM transition, in which 68k machine code basically
became the system's standard interpreted bytecode)
- Owen
> It also gives
> the start date for ARM architecture as 1093.
If that is a typo for 1993 it seems late. Acorn developed ARM for the
Archimedes which was IIRC late 80s. ARM as a company started up when
Acorn folded which could have been 93. IIRC it was not until after ARM
was spun off that the processor became available to other companies.
Ken Young
Sorry about that. It was indeed a typo; the correct date from Wikipedia
is 1983.
The only issue there being that Huffman codes are optimal when you get
the relative frequencies right, and really badly pessimal when you try to
encode something whose statistics are different from the ones used to
drive the encoding.
(Which may be one of the reasons behind this:)
>
> Afair, at one particular point in time you could make a particular x86
> code run faster by compiling to byte code and then interpret that: The
> significantly smaller working set for this specific application meant
> that it would now fit in cache.
Looking from a distance, I never really got the sense that x86 or other
instruction-length choices were based on frequency statistics, rather
than on historical cruft and (somewhat more justifiably) rough minimum
requirements to pack all the needed bits. I could well be wrong about
that, though.
paul
Yes, that comment I made was unduly harsh, simply wrong.
I am just so used to tech firms booming and busting.
One surviving mainframe company.
No surviving Mini-frame companies.
Intel with a ~12 year growth rate of 0%.
ARM is on a seriously upward slope right now; even the loss of
Apple and its iPhone/iPad would merely be a setback, and not
affect the general trajectory.
Longer term, ARM faces serious competition from Renesas, which
is a monopoly-like consolidation of three or more of the old
8 and 16 bit companies that dominate the low end.
Then there is Atmel with its AVR, built by two students...
http://en.wikipedia.org/wiki/Atmel_AVR
Brett
> On 10/2/2010 2:00 PM, Brett Davis wrote:
> > In article<5NGdnZHsUZE77TrR...@giganews.com>,
> > Andy Glew<"newsgroup at comp-arch.net"> wrote:
> > This of course may turn out to be wrong headed. Why not expand out that
> > third port and do double Adds, etc?
>
> (2) By "double Adds", I assume you mean cascaded Adds:
> R1 := R2+R2
> R4 := R1+R5
>
> I'm all for that (and even deeper logic trees). Although I tend to
> think of it as being done dynamically, e.g. separate instructions that
> are scheduled together, rather than having a single instruction that
> does two adds.
No, that also adds an extra output, which seems crazed to me,
as one can barely track all the inputs and outputs now.
But you are crazy like a fox. ;)
This gives a more balanced in/out system.
Three 2->1 ALUs is six inputs, but only 3 outputs, so three values
have to come from somewhere else, and the register file only has 2 ports.
Two 3->2 ALUs is also six inputs, but four outputs/results.
An extra net instruction executed per cycle at a wiring cost of just
one more output port, and ALU.
Hard to schedule for though, and worst case is bad.
Average ALU power use is kinda horrible, but perhaps solvable.
(ALU power use is almost lost in the noise today.)
You are fundamentally adding an ALU, so wires get longer, and clocks
can drop. (Same as my suggestion, but less so.)
> > Cynical aren't I. ;)
>
> Cynical presumes wisdom, knowledge, or insight.
>
> Uninformed cynicism is just ...
I realize I may have sounded crazed by saying that Intel could
be anything but the continued dominant monopoly until the end of time.
So I gave an explanation of sorts when I retracted my unfair
flash in the pan remark.
Brett
PS: Does anyone have a URL to an early MIPS die micrograph, preferably
with some annotation? Any 32+ bit core, OpenRISC for example.
> I am just so used to tech firms booming and busting.
> One surviving mainframe company.
You forgot Unisys.
> No surviving Mini-frame companies.
HP. Mini customers who followed the path of least resistance have been
herded to Itanium, e.g. PDP-11 -> VAX -> Alpha -> Itanium.
--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
IANAL
Under strict interpretation, the DMCA prevents caching copies.
The original language was so strict that an ISP could be charged with
infringement for buffering a communication stream that included
copyrighted material. The law was quickly amended to exclude ISPs,
however there still exists a provision whereby the memory image of a
digital document can be considered a copy for infringement purposes.
The current version of the law still contains language that - strictly
interpreted - prevents most exchanges of digital documents and also
technically makes using a general purpose computer to read any
copyrighted entertainment CD or DVD illegal.
Of course, the law was so badly written that it contradicts itself in
several places and still contains provisions that violate both patent
and consumer protection laws. So - more than a decade later - nobody
really is certain what all the law does and does not allow.
I remember seeing an interview with Rep. Rick Boucher of VA right
after the House passed the bill - I'm paraphrasing here because I
don't have the original quote, but he said something like ["it's a bad
law, but we had to pass something ... the courts will have to figure
it out."] Boucher worked against the law and was successful in
getting at least some Fair Use provisions into it (the original bill
had no Fair Use provisions at all).
IANAL
George
rick jones
--
denial, anger, bargaining, depression, acceptance, rebirth...
where do you want to be today?
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
> Another advantage of ARM is their business model. There are a variety of
> SOC's available from different vendors but all based on the ARM
> architecture. Intel has not competed in licensing its cores to other
> vendors. It would really hurt their business model. So they win due to
> their superior fabs if what they produce is exactly what you need, but
> if you require an extra chip for some additional function, they probably
> lose.
They realized their mistake/weakness and have been opening up a bit. If
you're an actual device vendor selling to end users, that isn't a problem.
Intel seems to be trying to walk a very thin line here.
http://www.xbitlabs.com/news/cpu/display/20090331204301_Intel_Fabless_Semiconductor_Developers_Will_Not_Be_Able_to_License_Atom_Cores_for_System_on_Chip_Products.html
Fine, don't use an extra output:
R1 := R2+R3+R4
but that's a new instruction.
Or just make it work with idioms of existing instructions.
R1 := R2+R3
R1 := R1+R4
> But you are crazy like a fox. ;)
>
> This gives a more balanced in/out system.
> Three 2->1 ALUs is six inputs, but only 3 outputs, so three values
> have to come from somewhere else, and the register file only has 2 ports.
> Two 3->2 ALUs is also six inputs, but four outputs/results.
>
> An extra net instruction executed per cycle at a wiring cost of just
> one more output port, and ALU.
>
> Hard to schedule for though, and worst case is bad.
Sun did this in one of their SPARCs, although, done badly, it may
explain some of the poor results of their systems.
Myself, I would actually just do it on a regular two wide superscalar
pipeline:
R1 := R2+R3
R4 := R5+R6
and give that pipeline (2 ops, each with 2 inputs and 1 output)
the ability to handle dependent ops in the same cycle
R1 := R2+R3
R4 := R1+R5
This is pretty easy to schedule for in a 2-wide superscalar machine.
If you want to do it in 2 sets of 2 at a time
R1 := R2+R3
R4 := R1+R5
R1' := R2'+R3'
R4' := R1'+R5'
It is a bit more of a pain - although equivalent to having a bypass cluster.
The motivation: it's pretty easy to do, circuit wise. I.e. it is low
hanging fruit.
> You are fundamentally adding an ALU, so wires get longer, and clocks
> can drop. (Same as my suggestion, but less so.)
Or, just adding a bypass to an existing set of superscalar ALUs. making
them work in a few more cases.
By the way, GPUs do this.
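A toy pairing rule, just to spell out what the cascade buys; the names
and structures below are mine, not from any real design:

#include <stdbool.h>
#include <stdio.h>

typedef struct { int dst, src1, src2; } op_t;

/* May ops a and b issue in the same cycle on a 2-wide machine?
   Without a cascade bypass, b must not consume a's result;
   with cascaded ALUs the dependent pair is allowed too.        */
static bool can_pair(op_t a, op_t b, bool cascade_bypass)
{
    bool dependent = (b.src1 == a.dst) || (b.src2 == a.dst);
    return !dependent || cascade_bypass;
}

int main(void)
{
    op_t a = { 1, 2, 3 };   /* R1 := R2 + R3 */
    op_t b = { 4, 1, 5 };   /* R4 := R1 + R5 */
    printf("no cascade: %s\n", can_pair(a, b, false) ? "pair" : "split");
    printf("cascade:    %s\n", can_pair(a, b, true)  ? "pair" : "split");
    return 0;
}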
Would prohibiting instructions crossing some moderate-sized
alignment point be helpful (without having to sacrifice much
of the code density advantage of variable width)? (Instruction
translation also becomes significantly easier.)
Even just alignment of procedure entry points might have
some benefit. (Is it ironic that higher-performance x86
use aligned instruction fetch--with highly variable length
instructions--, but take significant effort to support
unaligned data access? The MIPS R10000 could fetch
up to four instructions starting at any valid address--
admittedly without the ability to cross cache lines.)
Paul A. Clayton
just a technophile
Only if they had also provided a crippled 8-bit
memory interface version for the IBM PC. :-/
(I have no clue, but I would not be surprised, based on
what little I have read on the matter that the MIPS
and ARM developers were much more academics/engineers
than business executives. It might have taken them
too long to accept that quality does not sell [by
itself].)
This is news, and breaks my world view of what has been done on
major systems.
> Myself, I would actually just do it on a regular two wide superscalar
> pipeline:
>
> R1 := R2+R3
> R4 := R5+R6
>
> and give that pipeline (2 ops, each with 2 inputs and 1 output)
> the ability to handle dependent ops in the same cycle
>
> R1 := R2+R3
> R4 := R1+R5
>
> This is pretty easy to schedule for in a 2-wide superscalar machine.
> The motivation: it's pretty easy to do, circuit wise. I.e. it is low
> hanging fruit.
> By the way, GPUs do this.
I had considered adding a second output and math unit on a 2 in 1 out ALU.
If the accumulator started with a value you needed, then each of
the next two inputs can generate their own result.
This however violates my view that an ALU is not an accumulator, an ALU
is a math unit with two inputs and one output.
While true at one level this may be irrelevant, as at the scheduling
and bus level the ALU output is so local as to be "free".
This brings up a related question on instruction cracking for RISC
and/or x86.
Are the 2 in 1 out RISC instructions cracked into ACC load arg 1,
ACC operate arg 2, and ACC store?
Tracking registers and dependencies is far more important than tracking
the original instructions. There are more things you can do potentially,
and more efficiently as accumulator ops.
Or not?
Brett
Although accumulator architectures are interesting, you are being too hasty:
Cascaded ALUs are not accumulator-based ALUs.
They are ALUs with inter-ALU bypassing.
Different things.
> Brett Davis <gg...@yahoo.com> wrote:
>
> > I am just so used to tech firms booming and busting.
> > One surviving mainframe company.
>
> You forgot Unisys.
88% services, most of the rest Xeon.
Two mainframe lines appear to still be available,
a 36 bit system, and a stack based processor.
http://en.wikipedia.org/wiki/Unisys
http://en.wikipedia.org/wiki/UNIVAC_1100/2200_series
http://en.wikipedia.org/wiki/Burroughs_large_systems
> > No surviving Mini-frame companies.
>
> HP. Mini customers who followed the path of least resistance have been
> herded to Itanium, e.g. PDP-11 -> VAX -> Alpha -> Itanium.
No surviving mini-computer lines?
Except for the PDP-11, I think someone still makes a clone of it.
Wrong, Mentec Inc no longer exists.
http://fixunix.com/dec/98198-mentec-us-gone.html
Brett
Cascaded adds don't need to handle carry propagation until the very end,
so adding one more input is just a couple of gate delays, right?
Going from 2 to 3 inputs is even simpler, since the difference between a
half and a full adder is quite small.
This also means that having horizontal adds makes a lot of sense for a
vector/SIMD cpu, simply because they can be handled so easily in hw
compared to sw.
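A minimal carry-save (3:2 compressor) sketch to show why the third
input is nearly free - plain C, purely illustrative:

#include <stdint.h>
#include <stdio.h>

/* Reduce three addends to a sum word and a carry word with no carry
   propagation at all; only the final add has to ripple carries.     */
static void csa(uint32_t a, uint32_t b, uint32_t c,
                uint32_t *sum, uint32_t *carry)
{
    *sum   = a ^ b ^ c;                           /* per-bit sum        */
    *carry = ((a & b) | (a & c) | (b & c)) << 1;  /* per-bit carry, <<1 */
}

int main(void)
{
    uint32_t a = 12345, b = 67890, c = 2468, s, k;
    csa(a, b, c, &s, &k);
    printf("csa: %u, plain: %u\n", s + k, a + b + c);  /* identical */
    return 0;
}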
> Under strict interpretation, the DMCA prevents caching copies.
Fortunately, I am not covered by the DMCA.
Anyway, does that act forbid sane implementations of Java and C#? :-)
Best regards
Piotr Wyderski
>George Neuner wrote:
>
>> Under strict interpretation, the DMCA prevents caching copies.
>
>Fortunately, I am not covered by the DMCA.
You are if any of your products end up in the US or its territories.
If you aren't, count your blessings!
>Anyway, does that act forbid sane implementations of Java and C#? :-)
I don't know what you mean by "sane" in this context. I'm not
certain that smiley is meant to be ironic.
> Best regards
> Piotr Wyderski
George