"Super-Cheap Supercomputing"
"A small supercomputing firm in Utah is building relatively cheap
machines that harness chains of field-programmable gate arrays
(FPGAs). Star Bridge Systems founder and chief technologist Kent
Gilson is regarded with skepticism in the general computer science
community, but has strong advocates among his few customers."
<snip>
"Gilson says the fastest system performs 400 billion floating point
calculations per second, but the claim has not been verified by
industry benchmark tests."
information taken from an article in Forbes:
http://www.forbes.com/2003/03/25/cz_dl_0325star2.html
Star Bridge Systems predictably got trashed here pretty thoroughly in
1999. Brief mention in 2001, less hostile.
Who knows what the reaction will be in 2003, but I say this is a
glimpse of the future. Since you can only jam so many tricks into a
hunk of silicon, you might as well make sure they're the right tricks
for the job at hand.
Flame away.
RM
Flame? Not on the tundra. Too cold.
I see two concerns. First, it seems as if there is about a factor of 5
difference in frequency between custom logic and FPGA. Second, the
major obstacle is how one programs the thing. These kinds of devices
have been used for a number of years as logic simulation engines, where
the programming is sort of natural. But can one program it well enough
to be superior to a general purpose device?
IBM makes such a device as a logic simulation engine, the latest of many
incarnations of such. Quickturn also makes such a device. See
http://www.quickturn.com (they got bought by Cadence).
I am having a little trouble envisioning how one runs Oracle on one
however. Or DB2.
del cecchi
>I see two concerns. First, it seems as if there is about a factor of 5
>difference in frequency between custom logic and FPGA.
*Apparently*, at least for some tasks, you can swallow that factor of
five. If you were willing to settle for a one-program computer, of
course...
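As a back-of-the-envelope sketch of what swallowing that factor means: if the FPGA clocks about five times slower, it only comes out ahead when the configured datapath completes more than five times as many useful operations per cycle as the processor does. The function name and every number below are illustrative assumptions, not measurements of any real system.

```python
def fpga_speedup(clock_penalty, fpga_ops_per_cycle, cpu_ops_per_cycle=1.0):
    """Throughput of the FPGA relative to the CPU (above 1.0 favors the FPGA).

    clock_penalty: how many times slower the FPGA clock is (about 5 in the
        estimate quoted above).
    fpga_ops_per_cycle, cpu_ops_per_cycle: useful operations each design
        completes per clock on the one program being run (assumed figures).
    """
    return (fpga_ops_per_cycle / clock_penalty) / cpu_ops_per_cycle


if __name__ == "__main__":
    # Sweep a few hypothetical datapath widths against a 1 op/cycle CPU.
    for width in (1, 5, 40):
        ratio = fpga_speedup(5, width)
        print(f"{width:>3} ops/cycle on the FPGA -> {ratio:.1f}x the CPU")
```

The break-even point is simply a datapath as wide as the clock penalty; anything narrower loses to the general-purpose part.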
>Second, the
>major obstacle is how one programs the thing. These kind of devices
>have been used for a number of years as logic simulation engines, where
>the programming is sort of natural. But can one program it well enough
>to be superior to a general purpose device?
>
Some people might say that the only way to get anywhere with parallel
computing is to make it look more like logic programming, anyway. I
think I would be one of the ones to say so. Easy to say, I know.
>IBM makes such a device as a logic simulation engine, the latest of many
>incarnations of such. Quickturn also makes such a device. see
>http://www.quickturn.com (they got bought by Cadence).
>
>I am having a little trouble envisioning how one runs Oracle on one
>however. Or DB2.
>
You couldn't envision building a database server out of ASIC's? Del,
you disappoint me.
I *think* your answer would be something like: sure, you can do it,
but the cost would be prohibitive.
I think the problem here is one of custom, training, and allocation of
resources. At the moment, huge amounts of resources are being shoved
into dead-end approaches. As to custom and training, most of us
software types would have to forget an awful lot and be ready to start
over.
RM
Let's just say I can't envision building a database server out of ASICs
that doesn't end up looking a lot like a von Neumann processor, although
perhaps with a different ISA. Could one design a hardwired device that
would perform all the operations of a general purpose processor that
processes SQL commands? I guess so, but I bet it would be a lot of
work.
I know of a large company that chose to build its file serving machines
out of general purpose hardware building blocks even though it had
access to every kind of technology.
It will certainly be fun to watch. Hey programmers, ready to write a
SQL server in Verilog?
del cecchi
It certainly wouldn't be my choice as a starting exercise.
If there is real content to the Star Bridge system, it is mostly in
the Viva system that exists to do the programming. There are some
personal testimonials on the net (if you include NASA reports as
personal testimonials, and I would), but if there is anything but the
most general description of Viva available, I can't find it.
Presumably programming for a Star Bridge system using Viva is easier
than programming in Verilog. The only way to find out for sure would
be to become a customer, and that costs big bucks.
There are three possibilities here:
1. Star Bridge founder Kent Gilson and his co-workers have come up
with something genuinely new in how to approach the programming of
FPGA's.
2. Star Bridge founder Kent Gilson and his co-workers have done
something fairly obvious but still worthwhile and want to make headway
before they get squashed by someone much bigger benefitting from their
experience.
3. There is very little real content to what Star Bridge Systems has
done.
The fact that one cannot make much headway on deciding among the three
from what's easily available does make one suspicious.
That doesn't change my opinion that the direction they're taking,
whether they are doing it competently or not, is a glimpse of the
future.
A more plausible mid-way step would be a computer that uses a
general-purpose microprocessor to offload specialized tasks to FPGA's
that it eventually would have the capability to program on the fly.
RM
...
> Let's just say I can't envision building a database server out of ASICs
> that doesn't end up looking a lot like a Von Neuman processor, although
> perhaps with a different ISA. Could one design a hardwired device that
> would perform all the operations of a general purpose processor that
> processes SQL commands? I guess so, but I bet it would be a lot of
> work.
This sounds a bit like the ambitious 'database machine' efforts of the
mid-to-late-'80s: specialized parallel hardware to do things like
select/project operations on many database pages at once (which itself is a
bit reminiscent of keyed track-based look-ups from an earlier era). AFAIK
none of them ever came to anything, possibly because (unlike the earlier
track-based keyed look-ups) general-purpose processors were improving their
price/performance so quickly that any specialized hardware would have become
uncompetitive at just about the time it could be released.
- bill
The point is that it's not specialized hardware (which would be
faster, but less flexible). A fairly old article (1996) gives more
detail than the recent puff pieces:
http://www.e-insite.net/ednmag/archives/1996/032896/07dfcov.htm
"Reconfigurable logic: hardware speed with software flexibility"
"Kent Gilson, president of Metalithic Systems, describes the heart of
the system as two Xilinx XC3090 FPGAs. The FPGAs implement a RISC CPU
and instruction-set extensions. The RISC processor quickly motors
through simple operations and uses instruction-set extensions to
perform complex processing operations that would otherwise slow the
system. The instruction-set extensions are loaded into the
reconfigurable logic as needed, using logic gates only when the
function is required."
At some point, reconfigurable logic will become a part of most
computers.
RM
> At some point, reconfigurable logic will become a part of most
> computers.
I doubt that FPGA computing will hit the shelves for
general purpose use for the reasons already given.
Price.
Price/Performance.
I think that you are forgetting that Microprocessors
are amazingly versatile programmable parts themselves.
Sure, not as versatile as a bunch of FPGAs, but they
gain that versatility at the cost of interconnect
(lots of it which contributes to the cost but doesn't
do 'useful' work).
Cheers,
Rupert
The other scenario is that the universities weren't turning out people
that could program them. There have been some discussions of
perfectly good stuff .... that had to be discarded because there were
no longer any skills for supporting it ... also somewhat related to
COTS.
Note that a lot of the parallel machine stuff would look at
everything. There was somewhat of a paradigm switch with much larger
random access memories containing indexes ... reasonably conditioned
data with efficient indexes ... a single processor could find
something faster than large parallel search of everything.
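A minimal sketch of that trade-off, assuming sorted keys held in memory stand in for the index and a full pass over the records stands in for the search-everything hardware. All names here are hypothetical, and the break-even figure is a crude model that ignores constant factors and memory-hierarchy effects.

```python
import bisect
import math


def scan_lookup(records, key):
    """Look at every (key, value) record, as a search-everything engine would."""
    touched = 0
    hit = None
    for k, v in records:
        touched += 1
        if k == key:
            hit = v
    return hit, touched


def indexed_lookup(sorted_keys, values, key):
    """Binary search over an in-memory sorted index: about log2(n) probes."""
    i = bisect.bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return values[i]
    return None


def break_even_processors(n):
    """Scan units needed before n/p records per unit drops to log2(n) probes."""
    return n / math.log2(n)


if __name__ == "__main__":
    records = [(k, k * k) for k in range(1000)]
    keys = [k for k, _ in records]
    values = [v for _, v in records]
    print(scan_lookup(records, 617))
    print(indexed_lookup(keys, values, 617))
    print(break_even_processors(2 ** 20))
```

Under this model the number of parallel scan units needed to match one indexed processor grows almost linearly with the data size, which is one way to read the paradigm switch described above.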
Luther got his tree instructions into the 390 ... and there were some
custom machines that did some pretty effective and fancy stuff.
http://www.garlic.com/~lynn/98.html#19 S/360 operating systems geneaology
http://www.garlic.com/~lynn/98.html#20 Reviving the OS/360 thread (Questions about OS/360)
http://www.garlic.com/~lynn/2001h.html#73 Most complex instructions
http://www.garlic.com/~lynn/2002.html#14 index searching
http://www.garlic.com/~lynn/2002d.html#18 Mainframers: Take back the light (spotlight, that is)
http://www.garlic.com/~lynn/2002q.html#10 radix sort
--
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/
Internet trivia 20th anv http://www.garlic.com/~lynn/rfcietff.htm
Bunch of FPGA's only for specialized tasks.
FPGA as front-end/coprocessor for RISC General Purpose uP for general
purpose computing.
RM
> "Kent Gilson, president of Metalithic Systems, describes the heart of
> the system as two Xilinx XC3090 FPGAs. The FPGAs implement a RISC CPU
> and instruction-set extensions. The RISC processor quickly motors
> through simple operations and uses instruction-set extensions to
> perform complex processing operations that would otherwise slow the
> system. The instruction-set extensions are loaded into the
> reconfigurable logic as needed, using logic gates only when the
> function is required."
How is this different, really, from loadable microcode?
--tim
...or the days when you could economically do custom ISA co-processors
from bit-slice chips.
Even when those new funky RISC chips first started arising, you could do
a lot of work in custom hardware which would beat them.
The cost was a lack of generality and the NRE involved.
I guess the same balances are there with this solution: you pay
Metalithic for the NRE, pay for engineers with odd skills, and suffer
even more for lack of integration - when interconnect costs are a much
larger contribution to system performance.
Be nice to do hardware again though....
--
Rob Friedlander
Depending on how the FPGA programming biz goes and how FPGA size and
performance track Moore's Law, could things Be Different This Time? My
impression is that the chips are getting bigger and faster at a rate
that puts them at most a constant factor behind real CPUs. That's
potentially a much better place for custom-machine designers to be than
they were in the 80s, when they had to spend ridiculous amounts of money
and engineering effort every year to keep up with general-purpose
hardware. If new FPGA design software lets you port to the next
generation relatively cheaply, the cost of staying competitive goes way
down. That, in turn, could invite enough investment to improve the tool
sets, etcetera.
The RISC-plus-FPGA arrangement seems more powerful to me than the old
writable control store stuff because you have, in effect, a much larger
set of functional units to choose from for the same silicon, but I'm
probably pretty obviously wrong.
paul
The same thought occurred to me, and here's how I answered it to
myself: the loadable microcode still has to choose from a finite set
of predefined operations possible with preconfigured logic.
*Maybe* most of what is interesting is already possible and people
just don't exploit the possibilities. If that's not yet true, maybe
you could get most of what's possible with fairly small changes to
existing processor designs.
Somebody other than me would have to step out onto that ice, but I
sure would like it if someone else would venture an opinion.
RM
>In article <ey3pto8...@cley.com>, Tim Bradshaw <t...@cley.com>
>writes
<snip>
>>
>>How is this different, really, from loadable microcode?
>>
>>--tim
>
>...or the days when you could economically do custom ISA co-processors
>from bit-slice chips.
>
>Even when those new funky RISC chips first started arising, you could do
>a lot of work in custom hardware which would beat them.
>
>The cost was a lack of generality and the NRE envolved.
>
>I guess the same balances are there with this solution: you pay
>Metalithic for the NRE, pay for engineers with odd skills, and suffer
>even more for lack of integration - when interconnect costs are much
>larger contribution to system performance.
>
>Be nice to do hardware again though....
If, when, this happens, it will be one of those, ummm, really small
print, barely above a whisper, dare I say it, ummm, paradigm shifts.
Different skill set, different programming style, different mind set,
managers who have to learn a whole new bunch of factoids so they don't
look as clueless as they really are. Those are the real obstacles.
IMHO.
RM
FWIW, I think this is where Transmeta is headed. If they're not, they
should be.
RM
Benjamin
>As a veteran of not one, but two reconfigurable computing companies
>that no longer exist (and I'm only 1.5 years out of undergrad), I can
>assure you that the commercial prospects for reconfigurable computing
>are a bit bleak right now.
Two companies in 1.5 years? The word that comes to mind is
undercapitalized. A really neat technical idea is not the same thing
as a sound business plan--but I guess you've figured that out by now.
RM
Hi,
My understanding of reconfigurable computing is that this type of
dynamically extensible ISA (DE-ISA... sounds like a product for your
car) has not been realisable on any of the current FPGA architectures
because the overheads required to reconfigure the devices are so huge
that you lose out on performance on all but the most carefully tuned of
applications. This is made worse by the tendency of FPGA manufacturers
to produce larger and larger devices without comparable improvements in
the reconfiguration mechanisms. A number of techniques to overcome this
have been suggested (such as multiple configuration memories), but thus
far no one has implemented them...
JonM
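The overhead argument above can be sketched as a simple amortization model: loading a configuration costs a fixed time, and that cost only pays off if the loaded function is called often enough before the logic is needed for something else. The function names and the time units are assumptions for illustration; real reconfiguration times depend on the device and are not given in this thread.

```python
import math


def net_speedup(cpu_time, fpga_time, reconfig_time, calls_per_config=1):
    """Speedup from offloading when one configuration serves several calls.

    cpu_time, fpga_time: time per call on each side (same arbitrary unit).
    reconfig_time: one-off cost of loading the configuration.
    """
    cpu_total = calls_per_config * cpu_time
    fpga_total = reconfig_time + calls_per_config * fpga_time
    return cpu_total / fpga_total


def break_even_calls(cpu_time, fpga_time, reconfig_time):
    """Smallest call count at which offloading is no slower than the CPU."""
    if cpu_time <= fpga_time:
        return None  # the FPGA never catches up
    return math.ceil(reconfig_time / (cpu_time - fpga_time))


if __name__ == "__main__":
    # Hypothetical figures: 10x faster per call, reconfiguration costs 1000 units.
    print(net_speedup(10, 1, 1000))
    print(break_even_calls(10, 1, 1000))
```

With these made-up figures a tenfold per-call advantage still needs on the order of a hundred calls per configuration to break even, which is the "carefully tuned applications" problem in miniature.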
The problem with this stuff is that you are producing a machine that is very far
away from a general purpose computer. It reminds me of the PDP-11s with
programmable microcode -- nobody but the guys at DEC (remember DEC?) who
originally designed the hardware could make working microcode. And that was in
the days when lots of people actually programmed assembly code (remember that?).
These days, nobody would even try. It's a dead end.
--
-------------------------------------------------------------------------
= Jeff Kenton Consulting and software development =
= http://home.attbi.com/~jeffkenton =
-------------------------------------------------------------------------
> [...] If there is real content to the Star Bridge system, it is
> mostly in the Viva system that exists to do the programming. [...]
> if there is anything but the most general description of Viva
> available, I can't find it.
Look at <http://www.starbridgesystems.com/VivaHelp.htm>. It looks
like a strongly LabView-flavored dataflow visual programming language.
I wonder if it's telling that National Instruments hasn't seen fit
to do an FPGA-backend port themselves.
- FChE
>
>The problem with this stuff is that you are producing a machine that is very far
>away from a general purpose computer.
A so-called general purpose computer is inevitably a compromise. The
only performance metric that really matters is how it performs on your
application, not how it performs on applications in general.
Reconfigurable logic offers the promise of making the computer match
the software, rather than the other way around.
> It reminds me of the PDP-11s with
>programmable microcode -- nobody but the guys at DEC (remember DEC?) who
>originally designed the hardware could make working microcode.
Yes I remember DEC all too well. They're no longer in business. Had
a long discussion about that here not too far back. If they had
*wanted* to make it easy for others to write working microcode, they
could have, but they didn't *want* to--and they're no longer in
business.
>And that was in
>the days when lots of people actually programmed assembly code (remember that?).
People don't have to program in assembly code because tools have been
created that make it (mostly) unnecessary. There is no fundamental
reason that comparable tools cannot be created to manage
reconfigurable logic.
>These days, nobody would even try. It's a dead end.
Apparently *somebody* is trying, and claiming considerable success, at
that.
People have a pretty set way of thinking about computers, but there is
nothing fundamental or inevitable about the way the work is currently
divided up. Hardware could be compiled, just like software.
RM
[SNIP]
> Yes I remember DEC all too well. They're no longer in business. Had
> a long discussion about that here not too far back. If they had
> *wanted* to make it easy for others to write working microcode, they
> could have, but they didn't *want* to--and they're no longer in
> business.
Seriously dude : It's the wrong approach. Why do you think
RISC worked when it did ? To quote Glew : What's changed
now ?
In essence we're seeing the lacklustre performance from
Transmeta, their product is still being hosed on performance
per Watt by fairly ancient* RISC cores. In terms of absolute
performance it is also hosed by ancient* RISC cores.
The transmeta/microcode/FPGA front end approach strikes me as
yielding results where you want to minimise time to market
with a minimal investment in R&D. The flip side is that if
you want to emulate something that already exists : you have
an additional (and substantial) development and verification
effort. Why bother ? Why not stick with a relatively easy
to develop and optimise for RISC core ? :)
I've tried to see a way that something like WCS could really
make a valid contribution above and beyond patching broken
microcode. I just can't see it though. If you put it into
a multitasking environment it's just more context to switch,
to a certain extent I think even classic VLIW's suffer from
this.
*= in Microcomputing terms (ie : > 3 years).
Cheers,
Rupert
...
> > It reminds me of the PDP-11s with
> >programmable microcode -- nobody but the guys at DEC (remember DEC?) who
> >originally designed the hardware could make working microcode.
>
> Yes I remember DEC all too well. They're no longer in business. Had
> a long discussion about that here not too far back. If they had
> *wanted* to make it easy for others to write working microcode, they
> could have, but they didn't *want* to--and they're no longer in
> business.
If that's supposed to imply a cause and effect relationship, you clearly
haven't a clue about *why* DEC is no longer in business: it had nothing to
do with any lack of desire (and ability) to produce useful products, but
rather with lack of competence in marketing them and other high-level
corporate incompetence.
The PDP-11/60 (there may have been another - perhaps the 11/44 - that
supported programmable microcode as well) was mostly a solution in search of
a problem. Richie Lary used its 'writable control store' to create the
world's fastest PDP-8 as a hack, but I can't off-hand remember any actually
significant use of it, nor any indication that improved development tools
would have created any - which kind of put a damper on enthusiasm for
pouring major additional resources into such tools.
- bill
Reconfigurable logic, at least as realized today in FPGAs, is also a
compromise. While interconnect delay is becoming the dominant factor
in custom logic, FPGAs are all about interconnect. Programmable logic
is slow as a turtle compared with custom logic. I don't see a 3GHz
64bit adder being done in FPGAs, even though they use the same
technology as high-end processors.
> > It reminds me of the PDP-11s with
> >programmable microcode -- nobody but the guys at DEC (remember DEC?) who
> >originally designed the hardware could make working microcode.
>
> Yes I remember DEC all too well. They're no longer in business. Had
> a long discussion about that here not too far back. If they had
> *wanted* to make it easy for others to write working microcode, they
> could have, but they didn't *want* to--and they're no longer in
> business.
>
> >And that was in
> >the days when lots of people actually programmed assembly code (remember that?).
>
> People don't have to program in assembly code because tools have been
> created that make it (mostly) unnecessary. There is no fundamental
> reason that comparable tools cannot be created to manage
> reconfigurable logic.
It's been tried many times and just as many times it's failed. There
is a rather large difference between programming and logic design
(which is often lost on PHBs and HDL tool salespeople). "Programming"
reconfigurable logic *is* logic design, with little programming.
--
Keith
I can and, based on some of the work in the 1980s, I think that it
would work. Of course, its main benefits would be in functionality
rather than the rawest of performance, and benchmarketing rules :-(
But my approach to WCS would be that of the Pentium 5 and not that
of traditional microcode - i.e. a cache for the microcode in memory,
rather than something that had to be saved on context switch. For
example:
The underlying engine is an extreme RISC machine, with low-level
primitives designed for building instructions up from, rather than
direct use as an ISA. I.e. very much like traditional microcode.
The ISA has a fairly fixed set of formats and basic control operations,
but the remainder are handled as patterns to map into the underlying
actions and/or FAST emulation routine calls. All established stuff.
Languages and libraries can load their own sets of patterns and
emulation routines and, in particular, there can be more than one
debugging level for a single ISA (e.g. overflow checking).
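A toy sketch of that scheme, purely for illustration: each opcode is a pattern that expands into a sequence of low-level primitives, and a language or library can load a different pattern table to get, say, overflow checking without changing the instruction stream. The class, the primitive names, and the 32-bit word size are all assumptions, not a description of any real machine.

```python
INT_MIN = -2 ** 31
INT_MAX = 2 ** 31 - 1


def u_add(regs, dst, a, b):
    """Primitive: 32-bit add that wraps around silently."""
    total = regs[a] + regs[b]
    regs[dst] = (total - INT_MIN) % 2 ** 32 + INT_MIN


def u_trap_on_overflow(regs, dst, a, b):
    """Primitive: raise if the true sum does not fit in 32 bits."""
    if not INT_MIN <= regs[a] + regs[b] <= INT_MAX:
        raise OverflowError(f"{a} + {b} overflows 32 bits")


# Two pattern tables for the same opcode: one fast, one for debugging.
FAST_PATTERNS = {"add": [u_add]}
CHECKED_PATTERNS = {"add": [u_trap_on_overflow, u_add]}


class Machine:
    """Maps each opcode onto whatever primitive sequence is currently loaded."""

    def __init__(self, patterns):
        self.regs = {}
        self.patterns = dict(patterns)

    def load_patterns(self, patterns):
        # A language runtime or library swaps in its own expansions here.
        self.patterns.update(patterns)

    def execute(self, opcode, *operands):
        for primitive in self.patterns[opcode]:
            primitive(self.regs, *operands)


if __name__ == "__main__":
    m = Machine(FAST_PATTERNS)
    m.regs.update(a=INT_MAX, b=1)
    m.execute("add", "c", "a", "b")   # wraps silently
    print(m.regs["c"])
    m.load_patterns(CHECKED_PATTERNS)
    try:
        m.execute("add", "c", "a", "b")
    except OverflowError as exc:
        print("trapped:", exc)
```

The same "add" instruction behaves differently depending on which table is loaded, which is the flexibility being argued for; nothing in the sketch says anything about speed.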
Now, two decades ago on mainframes and a decade ago on workstations,
the consequent reduction in application code size (perhaps 3 to 1)
could have had performance benefits. But, with modern Icache sizes,
what the hell?
So I just can't see any performance coming out of this, though I and
others would LOVE the flexibility!
Regards,
Nick Maclaren.
>
>"Robert Myers" <rmyer...@attbi.com> wrote in message
>news:fv1h8v0ernfkkg5lu...@4ax.com...
>>
>> Yes I remember DEC all too well. They're no longer in business. Had
>> a long discussion about that here not too far back. If they had
>> *wanted* to make it easy for others to write working microcode, they
>> could have, but they didn't *want* to--and they're no longer in
>> business.
>
>If that's supposed to imply a cause and effect relationship, you clearly
>haven't a clue about *why* DEC is no longer in business: it had nothing to
>do with any lack of desire (and ability) to produce useful products, but
>rather with lack of competence in marketing them and other high-level
>corporate incompetence.
>
I'll try to avoid taking gratuitous swipes at DEC. Try to remember
though, that the part of DEC *I* saw was the part for which you had
the least respect.
>The PDP-11/60 (there may have been another - perhaps the 11/44 - that
>supported programmable microcode as well) was mostly a solution in search of
>a problem. Richie Lary used its 'writable control store' to create the
>world's fastest PDP-8 as a hack, but I can't off-hand remember any actually
>significant use of it, nor any indication that improved development tools
>would have created any - which kind of put a damper on enthusiasm for
>pouring major additional resources into such tools.
>
Well, that begs the question: What did they *think* it was going to be
used for when they designed it in? Enquiring minds want to know.
RM
>Now, two decades ago on mainframes and a decade ago on workstations,
>the consequent reduction in application code size (perhaps 3 to 1)
>could have had performance benefits. But, with modern Icache sizes,
>what the hell?
>
In article <b2vi1i$2dc$1...@vkhdsu24.hda.hydro.com>,
Terje Mathisen <terje.m...@hda.hydro.com> writes:
|> The problem is that code size is already a problem, to the point where
|> many applications can run faster when optimized for size instead of speed.
To which one Nick Maclaren (no relation, I assume) replied:
>Yes, definitely.
RM
>In article <fv1h8v0ernfkkg5lu...@4ax.com>, rmyers1400
>@attbi.com says...
>>
>> People don't have to program in assembly code because tools have been
>> created that make it (mostly) unnecessary. There is no fundamental
>> reason that comparable tools cannot be created to manage
>> reconfigurable logic.
>
>It's been tried many times and just as many times it's failed. There
>is a rather large difference between programming and logic design
>(which is often lost on PHBs and HDL tool salespeople). "Programming"
>reconfigurable logic *is* logic design, with little programming.
Be interested to hear your comments on
http://www.ece.cmu.edu/~blevine/fccm02_queue.pdf
cited by Benjamin Ylvisaker in this thread.
Also, do you or anyone else know of TLA+, as in _Specifying_Systems_
by Leslie Lamport? It sure makes logic design look like programming
and vice versa to me.
RM
To which I can only quote Walt Whitman:
Do I contradict myself?
Very well then I contradict myself,
(I am large, I contain multitudes).
Regards,
Nick Maclaren.
...
> The PDP-11/60 (there may have been another - perhaps the 11/44 - that
> supported programmable microcode as well) was mostly a solution in search of
> a problem. Richie Lary used its 'writable control store' to create the
> world's fastest PDP-8 as a hack, but I can't off-hand remember any actually
> significant use of it, nor any indication that improved development tools
> would have created any - which kind of put a damper on enthusiasm for
> pouring major additional resources into such tools.
>
Certainly, no tools available 30 years ago would have helped. And I doubt any
of today's tools would do much better on *that* microcode design. Hopefully,
what we're designing today is a bit more regular than microcode was then.
Still, as a software guy who has looked at FPGA solutions recently, FPGAs are
still hardware, and the market for people willing to code up a special purpose
implementation just to gain a year or two on Moore's Law seems to be small. As
Bill Todd said above, it's a question of where people want to put their resources.
> Well, that begs the question: What did they *think* it was going to be
> used for when they designed it in? Enquiring minds want to know.
The Series 64 in the Hewlett-Packard 3000 line of proprietary
minicomputers had writable control store which was used to great
effect in allowing upgrades to the model 68 and 70 to be software-only
changes to WCS rather than board swaps (as equivalent upgrades to
other models in the line were).
The 64->68 was an increase in fundamental architectural limits, and
the 68->70 included the migration of common OS primitives into the
microcode for a significant performance improvement.
The Series 37 also had WCS, though I don't recall this leading to
any great advantage. The 37 and the 70 were the smallest and largest
CISC (aka "Classic") HP 3000 systems ever made. All the in-between
models had ROM-based microcode.
G.
You forget : The verification nightmare, sure the hardware
might be easier to verify, but you've just added a large
chunk of software to verify too. :/
> But my approach to WCS would be that of the Pentium 5 and not that
> of traditional microcode - i.e. a cache for the microcode in memory,
> rather than something that had to be saved on context switch. For
> example:
To make microcode work and give advantages over a normal
RISC approach I strongly suspect that you'd need rather
huge quantities of registers, would 128 do you think ? :)
Oh look, IA64 here we come ! :)
> The underlying engine is an extreme RISC machine, with low-level
> primitives designed for building instructions up from, rather than
> direct use as an ISA. I.e. very much like traditional microcode.
>
> The ISA has a fairly fixed set of formats and basic control operations,
> but the remainder are handled as patterns to map into the underlying
> actions and/or FAST eumulation routine calls. All established stuff.
Err, and that really is just RISC, why bother with mucking
about ? :)
All that's happening is that you're moving a bunch of
nasty hardware-dependent guff into the software.
Cheers,
Rupert
FPGAs change that a little, by being slower than custom silicon, but
reducing the lead time a bit.
The history of the last six or seven years seems to have proven him
right, but I wonder whether that will always be the case (as more hurdles
get in the way of Moore's law). I suspect that the answer will be seas
of processor+memory units, transputer-style, rather than seas of FPGAs,
though. A few highly-optimized general purpose ALUs and FPUs per chunk
of memory are pretty useful things. The IBM/Sony Cell architecture seems
to be heading that way.
Then there's PicoChip, which has 430 processor+memory systems tessellated
on a chip. Mobile phone base station (the target market) benchmarks that
I've seen recently still only put it into the same order of magnitude as
the latest/fastest single-core VLIW DSP from TI, at about the same cost,
but requiring quite a different view of programming it.
--
Andrew
> On Mon, 31 Mar 2003 14:26:12 -0500, Keith R. Williams
> <k...@attglobal.net> wrote:
>
> >In article <fv1h8v0ernfkkg5lu...@4ax.com>, rmyers1400
> >@attbi.com says...
>
> >>
> >> People don't have to program in assembly code because tools have
> >> been created that make it (mostly) unnecessary. There is no
> >> fundamental reason that comparable tools cannot be created to
> >> manage reconfigurable logic.
> >
> >It's been tried many times and just as many times it's failed.
> >There is a rather large difference between programming and logic
> >design (which is often lost on PHBs and HDL tool
> >salespeople). "Programming" reconfigurable logic *is* logic design,
> >with little programming.
>
> Be interested to hear your comments on
>
> http://www.ece.cmu.edu/~blevine/fccm02_queue.pdf
>
> cited by Benjamin Ylvisaker in this thread.
I should point out that there is a pretty big difference between what
we want to do and the many CPU+FPGA hybrid projects that have existed
over the years. The big drawback to our approach is that in the
highly parallel piece of the architecture we don't give the programmer
anywhere near the flexibility of an FPGA or any other more general
purpose circuit technology. I prefer to think about what we're trying
to do as making a much more flexible vector architecture, as opposed
to making a more structured CPU+FPGA hybrid.
> Also, do you or anyone else know of TLA+, as in _Specifying_Systems_
> by Leslie Lamport? It sure makes logic design look like programming
> and vice versa to me.
I only briefly looked at some introductory TLA paper, but the
existence of a language that is something like a typical programming
language and something like a digital logic design language doesn't
mean that programming and circuit design are similar activities. I
happen to think that those two endeavors are usually different enough
that intentionally blurring the line between them is not a good
idea (I absolutely loathe the ``[C/C++/Java] To Gates!'' projects).
Benjamin
Then there is Blue Gene/L
http://www-916.ibm.com/press/prnews.nsf/jan/911819B507AA5EE485256AFF004EEC0D
If anyone on the group would like to buy one, I think there might be a
slot available.
Just think, you too could have 200 TeraFlops in your garage. (if it's a
big garage)
del cecchi
>To make microcode work and give advantages over a normal
>RISC approach I strongly suspect that you'd need rather
>huge quantities of registers, would 128 do you think ? :)
>
>Oh look, IA64 here we come ! :)
It depends on what you're trying to maximize. See:
Jack Liu, Timothy Kong, Fred C. Chow: _Effective Compilation Support
for Variable Instruction Set Architecture_, 2002 International
Conference on Parallel Architectures and Compilation Techniques (PACT
2002), 22-25 September 2002, Charlottesville, VA, USA. IEEE Computer
Society 2002, ISBN 0-7695-1620-3
for a modern example.
-- greg
Three answers:
1. It doesn't matter. No architectural feature is ever used the way
the designers anticipated.
2. Another way to provide updates, feature enhancements, etc.
3. I don't know, but some guys used the analogous feature on the old
360/145 (it had microcode in main memory) to do a really bang-up APL
interpreter.
--
Greg Pfister
Sure - fpgas right away
or mail josh fisher + co at hp labs your measurements
and they'll mail you back a customized chip
that their tools have compiled to be optimal
for your problem.
which compile time loop wins?
Imagine that you had some nanomachines that could
walk out over a die, a die probably pre-implanted with
transistors, and that the nano-machines could deposit
on any of several layers of metal.
And imagine the nanomachines could do this configuration
overnight.
Then I think you might have a chance for configurable logic.
You would not be paying the switching overhead,
so your logic would be about 3X faster and denser than
FPGAs, although it would probably be a factor of 3X
slower and less dense than fully custom logic,
because of the pre-laid out pattern of transistors.
Not at all. What you are doing is splitting the software into two
parts (modularity is good), which you can verify separately. Unless
this technique simplifies the code generated by the compiler, I agree
that it is a disaster and should be binned. Exactly the same remark
applies to all forms of introduced modularity.
|> > But my approach to WCS would be that of the Pentium 5 and not that
|> > of traditional microcode - i.e. a cache for the microcode in memory,
|> > rather than something that had to be saved on context switch. For
|> > example:
|>
|> To make microcode work and give advantages over a normal
|> RISC approach I strongly suspect that you'd need rather
|> huge quantities of registers, would 128 do you think ? :)
Why? Even if you want work registers, you can keep things clean
by ensuring that the extra ones are not visible to the program
and preferably are not preserved between instructions.
|> > The underlying engine is an extreme RISC machine, with low-level
|> > primitives designed for building instructions up from, rather than
|> > direct use as an ISA. I.e. very much like traditional microcode.
|> >
|> > The ISA has a fairly fixed set of formats and basic control operations,
|> > but the remainder are handled as patterns to map into the underlying
|> > actions and/or FAST emulation routine calls. All established stuff.
|>
|> Err, and that really is just RISC, why bother with mucking
|> about ? :)
|>
|> All that's happening is that you're moving a bunch of
|> nasty hardware dependent guff into the software.
Modularity is one aspect. More importantly, it permits a higher
level, cleaner, more flexible interface between the ISA known to
the compiler and that used to drive the hardware.
Secondarily, it provides a clean and flexible way to handle modes
(which are NOT always bad). My primary example is that you could
switch on array bound and/or scope checking at program load time,
so you are debugging EXACTLY your production binary. As I am sure
that you know, having to add debugging as a compiler switch often
distorts the problem to insolubility in hard cases.
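
The load-time mode switch described above can be sketched as follows. This is a toy model only, not any real WCS interface: the INDEX operation, the two microroutines, and load_program are all invented for illustration. The point it tries to show is that the program itself is identical in both modes; only the binding of the ISA-level operation to its microroutine changes when the program is loaded.

```python
# Hypothetical illustration: every name and the one-instruction "ISA" below
# are invented for this sketch.

def index_unchecked(memory, base, length, i):
    # Production behaviour: trust the program, no bounds test.
    return memory[base + i]

def index_checked(memory, base, length, i):
    # Debug behaviour: the same operation plus an array bound test.
    if not 0 <= i < length:
        raise IndexError(f"index {i} outside array of length {length}")
    return memory[base + i]

def load_program(program, bounds_check=False):
    """Bind the ISA-level INDEX operation to a microroutine at load time."""
    microcode = {"INDEX": index_checked if bounds_check else index_unchecked}

    def run(memory):
        # The program is interpreted unchanged; only the binding differs.
        return [microcode[op](memory, *args) for op, *args in program]

    return run

# One "binary": a 3-element array at base 0, followed by unrelated data.
# The second access runs off the end of the array (a bug).
PROGRAM = [("INDEX", 0, 3, 1), ("INDEX", 0, 3, 4)]
MEMORY = [10, 20, 30, 99, 98]
```

Loaded without checking, the buggy access silently reads the unrelated data; loaded with checking, the same PROGRAM raises at the faulty access, with no recompilation involved.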
Regards,
Nick Maclaren.
This is the crucial part:
There's a quote to the effect that 'any (software) tool that can only be
used for what the designer intended, is a lousy product'.
I've had people sending me email quite recently about how they've used a
Q&D program I wrote about 15 years ago to automate tasks that didn't
even exist at the time I wrote the program.
Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Aren't some embedded processors providing FPGA functional
units as part of a standard RISC? This seems to be a middle
ground between ideal power-consumption for target performance
of an ASIC and the ideal price of commodity components. For
smaller volume parts (e.g., market entries, prototypes) this
might be especially attractive. An FPGA extension presumably
does allow faster time-to-market [fixes can be made much
later].
Two 'trends' might be instructive: the move away from mask
ROMs toward FLASH and the addition of SIMD instructions.
FLASH provides the field-upgradability (for bug-fixes or
enhancements) like FPGAs. SIMD provides a significant
performance benefit for a constrained set of code with a
significant (though diminishing) programming barrier (there
is also more context for a process).
For high-volume parts, the limited advantage of FPGAs
would seem to push their integration somewhat distantly
into the future (if ever)--i.e., I suspect that die area is not
nearly cheap enough yet and that other enhancements
can provide more benefit for the same cost (for more
users).
Unfortunately for desktop-processor FPGAs, software
tends to be developed on existing systems for existing
systems. (E.g., ISTR reading that cryptographic standards
moved away from not-commonly-implemented bit
manipulations.)
Another hindrance to such inclusion of FPGAs is the
factor that really common functions can be implemented
in hardware at lower cost or at least primitives can be
implemented that can provide 'performance' equal to or
better than an FPGA.
Yet another hindrance might be the memory barrier.
The FPGA could benefit compute-constrained
operations, but even then might encounter a memory
bottleneck before showing enough benefit to justify
added costs.
Are there enough performance-limited functions that
are sufficiently unconstrained by memory performance
(or I/O performance) to make an FPGA co-processor
attractive for non-embedded systems?
Paul A. Clayton
just a former McD's grill worker and technophile
>> A lecturer of mine at Uni, who had been involved with the "FIRST"
>> silicon compiler at Cambridge, I believe, wrote a paper several
>> years ago that pretty much argued that point. For numerical
>> algorithms, he argued, there was no way that a small team building
>> small-runs of custom silicon could match the leading-edge general
>> purpose micros (exemplified by 150MHz Alpha at that time), even
>> given the ability to make the hardware do "exactly" what they
>> wanted. Moore's law and competitive demand (for general purpose
>> CPU performance) would beat them handily.
I'm slightly surprised at this assertion, given the existence of GRAPE
(astrophysical dynamics engine), QCDOC (quantum chromo-dynamics) and
Blue Gene/L. I suspect it depends on the definition of "small team"
and "small runs"; the latter two have a team with IBM on it, and all
of them seem to want to pile together of the order of ten thousand
custom chips.
It may be that the demands of those sorts of computations, where you
might well be prepared to fill a room with graduate students to figure
out how to use all 32 floating-point units on your chip
simultaneously, are different from the demands of general-purpose
CPUs. What I don't have is a particularly good estimate of the costs
of the research projects; I don't know how they compete with wardrobes
full of dual-Xeons, which are currently about $1 million per _peak_
teraflop. I think GRAPE and QCDOC are $3-10 million efforts; I'm not
sure Blue Gene isn't a $500 million project.
> Then there is Blue Gene/L
> http://www-916.ibm.com/press/prnews.nsf/jan/911819B507AA5EE485256AFF004EEC0D
> If anyone on the group would like to buy one, I think there might be
> a slot available. Just think, you too could have 200 TeraFlops in
> your garage. (if it's a big garage)
Is there a plan to use the same technology to offer 20 gigaflops by
the side of your desk at one ten-thousandth the price? Eight times
the total peak computrons of the November 1997 Top500 list is rather
a large chunk to buy equipment in ...
You're right: Speed on a 200 MHz PentiumPro was one of the key (not
crypto key :-) selection criteria for the AES standard.
Of course that was the 370/145, but you knew that. :-)
--
Del Cecchi
cec...@us.ibm.com
Personal Opinions Only
Today you could do it with FIB, to a limited degree. The 7 layer metal could be
a little tricky, even for nanomachines.
>This thread on reconfigurable computing got me thinking
>about what sort of (re)configurable logic I would feel
>fairly comfortable saying would have a definite advantage
>over non configurable approaches:
>
You have said (in effect) that you admire the microarchitecture of
Itanium but loathe the ISA. Why wouldn't WCS offer a magic bullet to
turn Itanium into a chip you could admire without reservation (other
than the fact that you now work for AMD, of course)?
RM
Because it was basically "safe microcode"?
--
`-_-' Ar rug tú barróg ar do mhactíre inniu?
'U` Peter da Silva
> And imagine the nanomachines could do this configuration
> overnight.
Then we'd be doing our computing using the nanomachines instead
and pointing out to "NPGA" fans that the fastest devices they can
build are still 50 times slower than buckytube-rod logic.
> Itanium but loathe the ISA. Why wouldn't WCS offer a magic bullet to
> turn Itanium into a chip you could admire without reservation (other
> than the fact that you now work for AMD, of course)?
Why can't you just use search and replace to turn Barbara Cartland novels
into Shakespeare scripts?
-Peter
The latter two projects are using an existing general purpose CPU core
as most of their ASIC, but at a fairly low clock. So they don't make
very good examples.
-- greg
You misunderstand me.
I admit that the McKinley implementation of the Itanium instruction
set does pretty well.
But, I frankly think that it is a microarchitectural dead end. It is
solving the wrong problem, execution unit parallelism, and not solving
the real problem, memory latency, toleration thereof.
I constantly try to re-evaluate the bases of my reasoning, but I have
not yet figured out any reality in which the McKinley direction makes
sense.
When I talk admiringly about McKinley, it is rather like saying
"Damn! Who would have thought that they could have made that old
clunker go that fast! As antiques go, it ain't bad."
The biggest thing about McKinley is the cache size and speed,
and that's orthogonal to their CPU execution microarchitecture.
I expect that we will see out-of-order Itania soon,
if, that is, Itanic survives that long.
> Why wouldn't WCS offer a magic bullet to
> turn Itanium into a chip you could admire without reservation
Why would WCS help McKinley in any way?
Pretty damned close to every McKinley operation is fully exposed.
You are either talking about having multiple Itanic cores, and having
WCS control that, or you are talking about having more intimate
control of the logic that is in such a pipeline, to make the execution
units do what they are capable of, instead of what they are pre-wired
to do.
Let's talk about the latter, which is most interesting.
Yes, EUs can do one hell of a lot more than the instruction sets allow.
But, WCS will not help that. At least not WCS in the sense of
microcode words stored somewhere else that have to run control signals
into the EU. The wires to do that running will cost you too much.
The only way to get such intimate control is to store the control signals
right at the EUs - a wide horizontal control word near the EU,
with a narrow, encoded, vertical control number indexing which
set of horizontal controls you want. But, again, the wires to support
the multiple uses, and the storage for the horizontal control words,
get in the way and make your circuit bigger and slower.
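
As a rough picture of the vertical/horizontal split described above, here is a minimal sketch. The control fields (shift, invert_b, carry_in) and the three stored words are invented for illustration and are not taken from any real execution unit; the only idea borrowed from the post is that a narrow encoded number selects one of a few wide control words held next to the unit.

```python
# Hypothetical control store sitting "right at the EU". Each entry is a wide
# horizontal control word; the field names are invented for this sketch.
HORIZONTAL_STORE = [
    {"shift": 0, "invert_b": False, "carry_in": 0},  # a + b
    {"shift": 0, "invert_b": True,  "carry_in": 1},  # a - b (two's complement)
    {"shift": 1, "invert_b": False, "carry_in": 0},  # a + 2*b
]

def execute(vertical_index, a, b, width=32):
    """Drive a toy adder datapath from the selected horizontal word."""
    ctl = HORIZONTAL_STORE[vertical_index]  # narrow index picks a wide word
    mask = (1 << width) - 1
    b = (b << ctl["shift"]) & mask
    if ctl["invert_b"]:
        b = ~b & mask
    return (a + b + ctl["carry_in"]) & mask
```

With three stored words the vertical number needs only two bits, while each horizontal word carries one bit or more per control point. That storage, and the wires fanning out from it, is the overhead the post is pointing at.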
Apparently.
<snip>
> I admit that the McKinley implementation of the Itanium instruction
> set does pretty well.
>
> But, I frankly think that it is a microarchitectural dead end. It is
> solving the wrong problem, execution unit parallelism, and not solving
> the real problem, memory latency, toleration thereof.
>
Data starvation is the main thing holding Itanium back? Or am I
misunderstanding you again?
<snip>
>
> Yes, EUs can do one hell of a lot more than the instruction sets allow.
> But, WCS will not help that. At least not WCS in the sense of
> microcode words stored somewhere else that have to run control signals
> into the EU. The wires to do that running will cost you too much.
> The only way to get such intimate control is to store the control signals
> right at the EUs - a wide horizontal control word near the EU,
> with a narrow, encoded, vertical control number indexing which
> set of horizontal controls you want. But, again, the wires to support
> the multiple uses, and the storage for the horizontal control words,
> get in the way and make your circuit bigger and slower.
>
I think I understand that.
RM
With the possible exception of the asynchronous Register Stack Engine.
Oops. That's the bit that isn't there yet, isn't it? :-)
>I constantly try to re-evaluate the bases of my reasoning, but I have
>not yet figured out any reality in which the McKinley direction makes
>sense.
Though there are realities in which it does appear to make sense until
you look at it closely.
>When I talk admiringly about McKinley, it is rather like saying
>"Damn! Who would have thought that they could have made that old
>clunker go that fast! As antiques go, it ain't bad."
I think that's a bit unfair. It is far more the sort of custom
vehicle that used to be built out of every interesting component
that its owners could lay their hands on. But where you can safely
say "They will be damn good engineers when they have matured a bit."
>I expect that we will see out-of-order Itania soon,
>if, that is, Itanic survives that long.
With or without the asynchronous RSE? And how on earth are you going
to validate the compilers and interrupt handling for such a beast?
If I were given that task and a tenfold increase in salary, I would
retire immediately ....
Regards,
Nick Maclaren.
I don't know the plans. Perhaps the powers that be have plans to offer somewhat
smaller versions, who knows. But since we here are now part of Engineering
Services, if someone with money were to want such a device, I bet they could be
accommodated. :-)
According to a paper from SC-2002, it will have 64K compute nodes with 1024
compute nodes per rack. 256MB/node. So a single rack will do 2.8 teraflops.
Half a rack, or one populated midplane is 1.4 teraflops.
What do you think one of those would go for? Maybe we could put it on EBAY?
:-)
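
For anyone who wants to check the arithmetic in the figures just quoted, a small sketch follows. It uses only the numbers given in the post (64K nodes, 1024 nodes per rack, 256MB per node, 2.8 teraflops per rack); the variable names are made up, and the derived values are simple division and multiplication, not measured results.

```python
# Figures as quoted from the SC-2002 paper in the post above.
TOTAL_NODES = 64 * 1024
NODES_PER_RACK = 1024
RACK_TFLOPS = 2.8
NODE_MEMORY_MB = 256

racks = TOTAL_NODES // NODES_PER_RACK                    # number of racks
system_tflops = racks * RACK_TFLOPS                      # whole machine
midplane_tflops = RACK_TFLOPS / 2                        # half a rack
node_gflops = RACK_TFLOPS * 1000 / NODES_PER_RACK        # per compute node
rack_memory_gb = NODES_PER_RACK * NODE_MEMORY_MB / 1024  # memory per rack
```

That works out to 64 racks and roughly 179 teraflops for the full machine, which is in the neighbourhood of the 200 teraflops mentioned earlier in the thread, at a little under 3 gigaflops per node.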
...
> > I admit that the McKinley implementation of the Itanium instruction
> > set does pretty well.
> >
> > But, I frankly think that it is a microarchitectural dead end. It is
> > solving the wrong problem, execution unit parallelism, and not solving
> > the real problem, memory latency, toleration thereof.
> >
>
> Data starvation is the main thing holding Itanium back?
Data starvation has become the main thing, period. Itanic's problem is that
it doesn't take steps to address it (other than its large, fast on-chip
cache, which is hardly specific to that ISA) that are as effective as OoO
approaches are.
Of course, if you take a somewhat wider view then efficient use of power and
chip area for a given level of performance also becomes significant - but
Itanic doesn't fare well on those metrics either.
- bill
While I cannot say what Andy Glew means, you are clearly misunderstanding
the issue. There are two main things that hold execution unit parallelism
back: data starvation and lack of available ILP. This was as true on
the vector processors as it is now.
Sweating blood to increase execution unit parallelism works only if the
ILP is there to support it; you can push the data in as fast as you like
with no useful effect if the code is too entangled to parallelise well.
Memory latency is more of an ILP problem than anything else, though it
is not strictly ILP. In order to alleviate it, you need to be able to
predict a long way ahead of time what data you will need. There are
many solutions that work well for 'vectorisable' codes and few that
work well for a tangle of spaghetti.
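
One way to put a number on "the ILP is there to support it" is to compare the operation count with the length of the longest dependency chain. The sketch below does that under strong simplifying assumptions that are mine, not the post's: every operation takes one cycle, there are unlimited execution units, and memory is ignored entirely. The function name and input format are invented for illustration.

```python
def available_ilp(deps):
    """Upper bound on ILP for a list of operations in program order.

    deps[i] lists the indices of earlier operations that operation i must
    wait for. Assumes unit latency and unlimited execution units.
    """
    depth = []
    for d in deps:
        # An operation can start one step after its deepest predecessor.
        depth.append(1 + max((depth[j] for j in d), default=0))
    critical_path = max(depth, default=0)
    return len(deps) / critical_path if critical_path else 0.0

independent = [[] for _ in range(8)]                 # vectorisable
chain = [[]] + [[i - 1] for i in range(1, 8)]        # fully entangled
tree = [[], [], [], [], [0, 1], [2, 3], [4, 5]]      # pairwise reduction
```

Eight independent operations give a bound of 8, the eight-long chain gives 1 however many execution units are added, and the reduction tree sits in between.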
Regards,
Nick Maclaren.
In this context, I think it's interesting to consider modern graphics chips.
They started with actual hardware pipelines, where each stage in the pipeline
was a little state machine and some datapath buzzing through vertexes or scan
lines or pixels or texels or whatever, and handing them off to the next stage
in the pipeline.
In the last few years they've become more programmable. I would argue that
the number of stages in the pipeline has decreased, and each stage has become
more like a regular CPU. Granted, these are CPU-like objects with awesome
memory operation pipelines that do SIMD texture-cache lookups.
What I think is interesting is that the stages of the pipeline have become
similar. Vertex processing has a very similar instruction set to pixel
processing. But the queues that connect these stages are a very valuable
control flow mechanism, which I think is unlikely to go away.
I think there is a very interesting tension on the motherboard between the
nForce chip and the CPU. When they both want to get bigger, which subsumes
the other? Think another way: could you put SIMD texture lookups, queues,
and start-at-this-PC-for-each-element-in-this-queue doodads into a CPU without
screwing it up as a CPU?
Last year there was a joke about nVidia acquiring AMD. There's a germ of
truth in that idea.
>
>"Robert Myers" <rmyer...@attbi.com> wrote in message
>news:b3kia.53179$Zo.13840@sccrnsc03...
>> Andy Glew wrote:
>
>...
>
>> > I admit that the McKinley implementation of the Itanium instruction
>> > set does pretty well.
>> >
>> > But, I frankly think that it is a microarchitectural dead end. It is
>> > solving the wrong problem, execution unit parallelism, and not solving
>> > the real problem, memory latency, toleration thereof.
>> >
>>
>> Data starvation is the main thing holding Itanium back?
>
>Data starvation has become the main thing, period. Itanic's problem is that
>it doesn't take steps to address it (other than its large, fast on-chip
>cache, which is hardly specific to that ISA) that are as effective as OoO
>approaches are.
>
In terms of relieving data starvation (as opposed to contention for
execution units) OoO only permits opportunistic exploitation of data
that happen to be in cache. It makes *that* much of a difference? Or
am I still missing something?
RM
Like most academic articles it's irrelevant.
Have you ever done an FPGA design? I thought not. It's not
SMOP! There is this thing called "timing" that one has to worry
about that programmers don't (at least on the micro scale). Can
you imagine a *compiler* figuring out if a CASE statement will
fit within a clock cycle? Well, that's what FPGA tools try to
do. In fact they may do well, if the CASE is designed by an
engineer. The fact is that programmers do not think like
engineers. The difference can be orders of magnitude in speed
here.
> cited by Benjamin Ylvisaker in this thread.
>
> Also, do you or anyone else know of TLA+, as in _Specifying_Systems_
> by Leslie Lamport? It sure makes logic design look like programming
> and vice versa to me.
It may look like it to *you*. But anyone who has designed FPGAs,
indeed any modern hardware, laughs at such "revolutions" in logic
design. All the hardware types I know have a pile of stories
about the tool marketeers (yes, even the captive ones) promising
life everlasting and delivering Satan, after great pain. If the
good folks at Intel/HP can't do a perfect Itanic compiler what
makes you think a *decent* hardware compiler is available?
Abstraction is just that.
--
Keith
Dang. You're of course correct.
--
GP
>In article <1jeh8vc46jnpvv9tl...@4ax.com>,
>rmyer...@attbi.com says...
>
>Have you ever done an FPGA design? I thought not. It's not
>SMOP! There is this thing called "timing" that one has to worry
>about that programmers don't (at least on the micro scale). Can
>you imagine a *compiler* figuring out if a CASE statement will
>fit within a clock cycle? Well, that's what FPGA tools try to
>do. In fact they may do well, if the CASE is designed by an
>engineer. The fact is that programmers do not think like
>engineers. The difference can be orders of magnitude in speed
>here.
>
<snip>
>>
>> Also, do you or anyone else know of TLA+, as in _Specifying_Systems_
>> by Leslie Lamport? It sure makes logic design look like programming
>> and vice versa to me.
>
>It may look like it to *you*. But anyone who has designed FPGAs,
>indeed any modern hardware, laughs at such "revolutions" in logic
>design. All the hardware types I know have a pile of stories
>about the tool marketeers (yes, even the captive ones) promising
>life everlasting and delivering Satan, after great pain. If the
>good folks at Intel/HP can't do a perfect Itanic compiler what
>makes you think a *decent* hardware compiler is available?
>Abstraction is just that.
>
Mr. Lamport's work arises out of worrying about concurrent systems,
where timing does matter, and a timing constraint is a timing
constraint. If things are done out of the expected order, whether by
a second or a nanosecond, problems arise.
If you're going to describe a system where real-time constraints and
synchronization of independent processes are important, you'd better
have a formalism that can include those effects, and I believe Mr.
Lamport has described such a formalism.
Describing a formalism and implementing it in a way that can actually
be used and that accurately describes reality, of course, are two
different things, and software engineering has a long way to go in
that respect.
The unfortunate reality of software engineering is that mistakes
appear to be less expensive than in hardware engineering. They
aren't, of course. To the extent that "programmers" think differently
from "engineers", it is largely because the ease with which you can
cobble together a piece of ad hoc software and fix it if something
goes wrong has lulled those who work entirely with software into a
false sense of security.
RM
as per post about vs/repack ... cambridge science center had taken
apl\360 and did some number of things to it, including sensitizing it
for virtual memory environment (especially the storage allocation &
garbage collector) with the help of vs/repack.
http://www.garlic.com/~lynn/2003f.html#15 Alpha performance, why
cambridge also put in the support for making system calls ... and it
was released as cms\apl. the system call stuff caused quite a bit of
consternation among the apl purists ... since it violated some number
of the original apl principles.
palo alto science center then took cms\apl and did some number of
things to it, including revamping the system call stuff into shared
variable paradigm ... as well as doing the apl microcode assist for
the 370/145. this was released as apl\cms and then apl\sv. A lot of apl
applications ran as fast on 370/145 with the apl microcode assist as
they did on a 370/168 w/o apl microcode assist (not quite ten times).
Across the back parking lot from the palo alto science center was
hone, probably for a time, the largest single system cluster in the
world. It had something like 40,000 userids and supported all the
branch and field people in the US. In addition, HONE system was cloned
and deployed in a number of other countries (in a couple cases, I hand
carried it) around the world supporting branch and field people all
over the world.
The major environment for the branch and field people was a large
subsystem environment written in APL called sequoia (possibly one of
the most used APL applications of all time) ... and within sequoia ran
a lot of support tools ... like machine configurators (allow branch
office people to configure and order machines for customers). A lot of
sequoia would have run as fast on 370/145 with apl m'code assist as on
370/168s .... but there was some amount of sequoia which wasn't
addressed by the apl m'code assist.
some amount of discussions w/regard to hone & apl
http://www.garlic.com/~lynn/subtopic.html#hone
note that the person that was primarily responsible for the 145 apl
microcode assist was also fundamentally responsible for FORTQ
... which became FORTHX.
http://www.garlic.com/~lynn/2003b.html#52 Disk drives as commodities. Was Re: Yamhill
when we were working on ECPS ... a kernel microcode assist for the
138/148 (follow-on to the 135/145), he did a special microcode
PSW/instruction-address sampler for us on the 145 ... that helped
identify where the CP kernel was spending its time (there were actually
two technologies ... one was the microcode psw sampler ... the other
was some software kernel instrumentation):
http://www.garlic.com/~lynn/94.html#21 370 ECPS VM microcode assist
http://www.garlic.com/~lynn/94.html#27 370 ECPS VM microcode assist
http://www.garlic.com/~lynn/94.html#28 370 ECPS VM microcode assist
and
http://www.garlic.com/~lynn/subtopic.html#mcode
--
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/
Internet trivia 20th anv http://www.garlic.com/~lynn/rfcietff.htm
What IA64 does do well, is to have enough registers to hide memory
latency completely for many regular fp codes.
OTOH, as John M, Toon M and others have been telling us, such codes are
increasingly being replaced by more sophisticated/sparse versions, where
memory access is much less regular.
I.e. I don't know if the 'have a huge number of visible registers' will
always be better than simply having a very fast L1 cache, and support
for streaming loads as well as suitably big number of outstanding misses
which the prefetch mechanism can take advantage of.
The fact is that almost all really good programmers I've met/known are
engineers.
A preponderance have been electrical engineers, not comp.sci, but that
might be self-selection since computer science wasn't a separate degree
field until relatively recently.
Terje
(MSEE)
maybe just write a translator C/C++ -> verilog
SAAT
> What IA64 does do well, is to have enough registers to hide memory
> latency completely for many regular fp codes.
Exactly.
Also, the register set is big enough to express the parallelism for
really wide issue with a long pipeline. Renaming can only go so far.
> OTOH, as John M, Toon M and others have been telling us, such codes are
> increasingly being replaced by more sophisticated/sparse versions, where
> memory access is much less regular.
Yes, though IMHO this is a compiler and code idiom problem rather than
inherent to sparse matrices or the hardware.
The huge number of visible registers _could_ be used to alleviate latency
in exactly the same way you mention above in sparse matrix codes. You
prefetch the lookup table 2 ahead, fetch the data 1 ahead while operating
on the current data, and write out the previous results. With so many regs
and predication it's a piece of cake.
On other chips W/O predication I have to swizzle pointers to bit buckets
on the stack at loop drain to avoid end of array effects, and have to
just cache touch rather than load in the "fetch stage" due to register
pressure.
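
The staging described above (index two ahead, data one ahead, operate on the current element) can be modelled in a few lines. This is only a sketch of the loop structure on a sparse dot product, written in Python for readability; the function and variable names are invented, the `if` guards stand in for what predication or loop-drain fix-ups would do in real generated code, and nothing here measures or hides actual memory latency.

```python
def gather_dot_naive(val, col, x):
    # Reference version: index load, data load and multiply all in one step.
    return sum(v * x[c] for v, c in zip(val, col))

def gather_dot_pipelined(val, col, x):
    """Same result, with the three stages skewed across iterations."""
    n = len(val)
    if n == 0:
        return 0.0
    total = 0.0
    # Prologue: fill the pipeline before the first multiply.
    data_next = x[col[0]]                 # data for iteration 0
    idx_next = col[1] if n > 1 else None  # index for iteration 1
    for k in range(n):
        data_cur = data_next              # operand fetched one iteration ago
        if k + 1 < n:
            data_next = x[idx_next]       # stage 2: fetch data 1 ahead
        if k + 2 < n:
            idx_next = col[k + 2]         # stage 1: fetch index 2 ahead
        total += val[k] * data_cur        # stage 3: operate on current data
    return total
```

Each of the three in-flight values (idx_next, data_next, data_cur) corresponds to a live register per pipeline stage, which is where the large visible register file comes in once the loop is unrolled or the stages are made deeper.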
The only problem is that the simple sparse matrix idiom is hidden in these
pesky objects, aliasing and all!
IMHO, it "only" requires HPC compiler writers to implement
the aggressive prefetch and sfw pipelining needed for at least some simple
cases of the indirected idiom. (both the explicit multi-dimensional arrays
with index table and obvious loops over structs/classes through a pointer
lookup table).
For "C" some contract needs to be entered by the programmer to
guarantee non-aliasing. Then it requires education of the writers of
sparse matrix codes to match whatever supported idiom is least painful to
use.
Peter
> I.e. I don't know if the 'have a huge number of visible registers' will
> always be better than simply having a very fast L1 cache, and support
> for streaming loads as well as suitably big number of outstanding misses
> which the prefetch mechanism can take advantage of.
>
> Terje
> --
> - <Terje.M...@hda.hydro.com>
> "almost all programming can be viewed as an exercise in caching"
>
>
Peter Boyle pbo...@physics.gla.ac.uk
>The huge number of visible registers _could_ be used to alleviate latency
>in exactly the same way you mention above in sparse matrix codes. You
>prefetch the lookup table 2 ahead, fetch the data 1 ahead while operating
>on the current data, and write out the previous results. With so many regs
>and predication it's a piece of cake.
Given that the ATLAS guys say they're working on generating
sparse-matrix code, you may discover lots of different ways to
efficiently compute with sparse matrices on other cpus.
greg
> Given that the ATLAS guys say they're working on generating
> sparse-matrix code, you may discover lots of different ways to
> efficiently compute with sparse matrices on other cpus.
>
> greg
Absolutely.
My point is that sparse matrix application is entirely
predictable, and repetitive in most algorithms. This is in-principle
easy to take advantage of, has been in the past, but sadly isn't
currently!
Peter
Code idiom, partially, but I don't agree about compiler. People have
been trying to resolve it in compilers for 25+ years and have got
effectively nowhere.
>The huge number of visible registers _could_ be used to alleviate latency
>in exactly the same way you mention above in sparse matrix codes. You
>prefetch the lookup table 2 ahead, fetch the data 1 ahead while operating
>on the current data, and write out the previous results. With so many regs
>and predication it's a piece of cake.
Yes, indeed, it could. What is unclear is whether that statement is
equivalent to saying that cracking RSA is easy - all you need to do
is run the inverse transformation.
>The only problem is that the simple sparse matrix idiom is hidden in these
>pesky objects, aliasing and all!
Grrk. Yes and no. In the cases you are thinking of, perhaps. More
generally, it is less clear whether there IS an appropriate idiom
under all the mess.
>IMHO, it "only" requires HPC compiler writers to implement
>the aggressive prefetch and sfw pipelining needed for at least some simple
>cases of the indirected idiom. (both the explicit multi-dimensional arrays
>with index table and obvious loops over structs/classes through a pointer
>lookup table).
That has been done, and worked pretty well on several of the vector
systems (e.g. the Hitachi S-3600 and, I believe, the Cray YMP).
Unfortunately, it is really only a solution to the problems posed in
the 1960s and 1970s, and doesn't address those of the 1980s to now.
>For "C" some contract needs to be entered by the programmer to
>guarantee non-aliasing. Then it requires education of the writers of
>sparse matrix codes to match whatever supported idiom is least painful to
>use.
For those very simple cases, yes.
Let me be more explicit. The original sparse matrix problems were
classic dense rectangular linear algebra problems, where the sparsity
was and is used to avoid storing the necessarily zero elements. In
that case, some relatively minor language enhancements and standard
compilation techniques should do a decent job. The only real problems
are (a) deciding on the idiom and (b) actually doing the coding (in
the compiler and applications). All very routine - for suitable
experts :-)
It is a bit trickier to extend them to closed topologies, such as the
surface of a sphere, but not enough to get worried about. It does
rather increase the number of options and the chances of mistakes,
though.
Modern codes are often operating over more complex structures, such as
Dirichlet tessellations, or even non-geometric data structures (such as
more general graphs). I have spent some time thinking of how I would
describe the aliasing constraints for some of the simpler of these,
and have retired hurt. I don't think that it is impossible, but it
needs someone with more mental flexibility than I have.
One of the really nasty issues is that I can't think of a way of
binding the properties sufficiently close to the language that there
is a significant chance of errors being detected. Because there isn't
any obvious way of telling whether a particular program is using
indirection for rectangular sparsity or not. And that is critical to
get optimisation right.
For example, in an NxN rectangle, you can follow successors for N-2
steps and guarantee not to have aliasing problems. In a Dirichlet
tessellation, you can't. In a general graph, you can't even guarantee
that each element is referred to only once.
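A minimal Python sketch of why the last case matters (not from the original post; the function names and the two index tables are made up for illustration). When an index table refers to some element more than once, a loop that stores through it cannot safely be reordered: issuing all the loads before the stores, as a naively vectorised loop might, gives a different answer from program order.

```python
from collections import Counter

def scatter_sequential(values, index, size):
    """Apply out[index[i]] += values[i] one element at a time, in program order."""
    out = [0] * size
    for i, v in enumerate(values):
        out[index[i]] += v
    return out

def scatter_loads_first(values, index, size):
    """Mimic a naively reordered loop: issue every load, then every store."""
    out = [0] * size
    loaded = [out[j] for j in index]        # all loads issued up front
    for i, v in enumerate(values):
        out[index[i]] = loaded[i] + v       # stale value if a target repeats
    return out

def repeated_targets(index):
    """Return the targets that the index table refers to more than once."""
    return {j for j, n in Counter(index).items() if n > 1}

values = [10, 20, 30, 40]
permutation = [2, 0, 3, 1]   # each element referred to exactly once
graph_like = [2, 0, 2, 1]    # element 2 is referred to twice

print(scatter_sequential(values, permutation, 4) == scatter_loads_first(values, permutation, 4))
print(scatter_sequential(values, graph_like, 4) == scatter_loads_first(values, graph_like, 4))
print(repeated_targets(graph_like))
```

A run-time check like repeated_targets() costs a pass over the table, which is the kind of property one would rather state once in the language than rediscover on every call.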
Regards,
Nick Maclaren.
>
>On Wed, 2 Apr 2003, Terje Mathisen wrote:
>
<snip>
>
>IMHO, it "only" requires HPC compiler writers to implement
>the aggressive prefetch and software pipelining needed for at least some simple
>cases of the indirected idiom. (both the explicit multi-dimensional arrays
>with index table and obvious loops over structs/classes through a pointer
>lookup table).
>
>For "C" some contract needs to be entered by the programmer to
>guarantee non-aliasing. Then it requires education of the writers of
>sparse matrix codes to match whatever supported idiom is least painful to
>use.
>
I wanted to say Amen! and Amen! again, but I know that someone will
mutter to themselves if not post, well, then *you* write the compiler
and convince a good part of the world of programming to think
differently.
The social and financial dynamics underlying this problem are
complicated. The microarchitectures of the P4 and of Itanium are not
transparent to the programmer, and they make compilers and programmers
work a lot harder. The literature is rich with clever and aggressive
proposals for prefetch and prediction, but sorting through them and
working with them is a daunting task--a task that is possible only if
you are a compiler designer, working with an open source compiler and
know what to do with it, or are willing to optimize by hand.
I don't yet understand the microarchitecture of Itanium well enough to
criticize it, but I have been critical, and repeatedly so, of the P4.
The three biggest problems with both appear to be latency, latency,
and, er, latency. I am not yet convinced that the P4 cannot be made
to perform much better and that Itanium may not yet prove to be the
chip it set out to be through steps that can be taken entirely in
software.
RM
I agree that memory latency toleration is the "big issue", but clearly the
IA-64 architects put in a lot of things designed to help in that area. The
large register set, the speculative loads, even the ALAT seem designed to
minimize the number of memory operations and to allow overlapping them as
much as possible with other things. They obviously felt (and stated)
that the complexity of OOO logic wouldn't scale. Now they
may be wrong about that, but I don't think it is fair to imply that they
ignored the real problem. Other than OOO, what things should they have done
that they didn't? (Bearing in mind your earlier comment that they put in
almost every idea, good and bad that anyone ever had :-)
> I constantly try to re-evaluate the bases of my reasoning, but I have
> not yet figured out any reality in which the McKinley direction makes
> sense.
Is it just the lack of OOO capabilities or something else that you think
doesn't make sense?
> When I talk admiringly about McKinley, it is rather like saying
> "Damn! Who would have thought that they could have made that old
> clunker go that fast! As antiques go, it ain't bad."
:-) Of course, many people said the same thing about P6 :-)
--
- Stephen Fuld
e-mail address disguised to prevent spam
> Modern codes are often operating over more complex structures, such as
> Dirichlet tessellations, or even non-geometric data structures (such as
> more general graphs). I have spent some time thinking of how I would
> describe the aliasing constraints for some of the simpler of these,
> and have retired hurt. I don't think that it is impossible, but it
> needs someone with more mental flexibility than I have.
Could you post the inner loop?
Every Krylov solver I have ever seen necessarily uses a "result" vector
in the sparse matrix which is guaranteed not to overlap either the
"guess/source" vector or the elements of the matrix.
The source vector and elements are therefore known to be unaffected by the
stores in the matrix application, and the pointers for the
RHS of the equation can alias all you want since it's all read only,
and reloads are not required in the sort of pipelining I described.
That's for an arbitrarily complex pointer table on the RHS (i.e. encoding
of the topology) since all the pointers in that table point
into a vector of "constant" values as far as the sparse matrix multiply
sector of the code is concerned.
Communicating this knowledge to the compiler is what is currently hard.
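A minimal sketch of the kind of inner loop meant here, in Python rather than the Fortran or C such codes are normally written in (the compressed-sparse-row layout and all names are assumptions for illustration, not taken from any particular solver). The matrix arrays and the source vector are only ever read, and the only store goes to a freshly allocated result vector.

```python
def csr_matvec(row_ptr, col_idx, vals, src):
    """Return result = M * src for M held in compressed sparse row form."""
    result = [0.0] * (len(row_ptr) - 1)      # new storage: cannot overlap src or M
    for row in range(len(result)):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += vals[k] * src[col_idx[k]]  # indirect, read-only gather
        result[row] = acc                     # the only store in the loop
    return result

# 3x3 example matrix: [[2, 0, 1], [0, 3, 0], [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [2.0, 1.0, 3.0, 4.0, 5.0]
src = (1.0, 2.0, 3.0)   # a tuple, so it cannot be modified by the multiply

print(csr_matvec(row_ptr, col_idx, vals, src))
```

However tangled col_idx is, entries may repeat or point anywhere in src, because nothing reached through it is written.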
Peter
> One of the really nasty issues is that I can't think of a way of
> binding the properties sufficiently close to the language that there
> is a significant chance of errors being detected. Because there isn't
> any obvious way of telling whether a particular program is using
> indirection for rectangular sparsity or not. And that is critical to
> get optimisation right.
>
> For example, in an NxN rectangle, you can follow successors for N-2
> steps and guarantee not to have aliasing problems. In a Dirichlet
> tessellation, you can't. In a general graph, you can't even guarantee
> that each element is referred to only once.
>
>
> Regards,
> Nick Maclaren.
>
Peter Boyle pbo...@physics.gla.ac.uk
Not easily, because I don't have one to hand. However, I am not thinking
of a problem with a particular code, but the generic issue.
>Every Krylov solver I have ever seen necessarily uses a "result" vector
>in the sparse matrix which is guaranteed not to overlap either the
>"guess/source" vector or the elements of the matrix.
OK. That approach works, no problem. It means that you have to have
two copies of every matrix that is being operated on at once, but
that isn't unreasonable. It just isn't the hard parts of the problem.
>The source vector and elements are therefore known to be unaffected by the
>stores in the matrix application, and the pointers for the
>RHS of the equation can alias all you want since it's all read only,
>and reloads are not required in the sort of pipelining I described.
Right. But you can specify that perfectly well in Fortran 90 and
even C99, by the judicious use of INTENT(IN) and INTENT(OUT) or
restrict and const. For some meaning of "perfectly well" :-)
>Communicating this knowledge to the compiler is what is currently hard.
I don't see that. Why don't the methods I am describing work? Yes,
I know that their use in optimisation is rarer than it should be, but
surely they do the job?
The issues I was thinking of were structures that are updated in place,
and autoparallelisation (with the false sharing problems, even for
output-only, no overlap data). Now, those are foul.
Regards,
Nick Maclaren.
Hi Nick,
On 2 Apr 2003, Nick Maclaren wrote:
PAB>Every Krylov solver I have ever seen necessarily uses a "result" vector
PAB>in the sparse matrix which is guaranteed not to overlap either the
PAB>"guess/source" vector or the elements of the matrix.
>
> OK. That approach works, no problem. It means that you have to have
> two copies of every matrix that is being operated on at once, but
> that isn't unreasonable. It just isn't the hard parts of the problem.
Multiple _vectors_. The point about the sparse matrix, is you store
it in a compressed way (1 copy) and write an operator to apply it on a
vector, without modifying the matrix, and usually not the vector.
The reason is that typical algorithms use it in an iterative solver,
and the input vector is the current guess (expensive since it is the
fruit of the history of the entire solve!). This guess naturally
needs to be kept around so it can be refined for the next iteration,
so you don't overwrite it.
The simplest form of solver still has three vectors, source, current guess
and temporary to hold both what I call the result and residual vectors
and looks something like:
set_up_guess()
do {
    tmp = M * guess        // tmp is result
    tmp = tmp - source     // tmp is residual
    update_guess(tmp)      // refine guess based on error vector
} while ( check_stopping_cond(tmp) == keep_going )
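For concreteness, the loop above can be written as a runnable Python sketch. The update rule here is a plain Richardson step (guess minus a fixed multiple of the residual), which is an assumption chosen for brevity and not necessarily what any production solver does; the step size, tolerance and the 2x2 test matrix are likewise made up.

```python
def solve(apply_m, source, step=0.2, tol=1e-10, max_iter=10_000):
    """Iterate guess -= step * (M*guess - source) until the residual is small."""
    guess = [0.0] * len(source)                             # set_up_guess()
    for _ in range(max_iter):
        tmp = apply_m(guess)                                # tmp is result
        tmp = [t - s for t, s in zip(tmp, source)]          # tmp is residual
        guess = [g - step * r for g, r in zip(guess, tmp)]  # update_guess(tmp)
        if max(abs(r) for r in tmp) < tol:                  # check_stopping_cond(tmp)
            break
    return guess

def apply_m(v):
    """Apply the fixed matrix [[4, 1], [1, 3]] to v without modifying v."""
    return [4.0 * v[0] + 1.0 * v[1], 1.0 * v[0] + 3.0 * v[1]]

x = solve(apply_m, [1.0, 2.0])
print(x)   # close to [1/11, 7/11]
```

Only three vectors are live at any time (source, guess, tmp), and the matrix is touched only through apply_m, which never writes to it.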
> The issues I was thinking of were structures that are updated in place,
> and autoparallelisation (with the false sharing problems, even for
> output-only, no overlap data). Now, those are foul.
Every sparse matrix solver I've seen doesn't update in place, by nature
of the algorithms.
Peter Boyle
You are missing something.
OoO is not simply about loads that hit the cache. OoO allows the
processor to get *cache-missing* loads out to external memory sooner,
without necessarily being blocked waiting for earlier instructions
to finish. In other words, OoO is a mechanism for hiding memory
latency.
There are different mechanisms of implementing OoO. The classic
method is dynamic scheduling a-la the P6--execute instructions when
their operands become available.
You can also use "run-ahead execution"--execute the program in-order,
but continue running ahead past cache-missing instructions to
discover and issue more loads. Then when the original cache-missing
load finally gets its data, go back and re-execute the program
starting from that load.
In the limit as processor speed approaches infinity, the performance
of run-ahead is indistinguishable from that of dynamic scheduling.
Neither run-ahead nor dynamic scheduling will get you very far
unless you also have register renaming.
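A toy timing model of the difference, in Python, under loud assumptions that are not in the post above: one instruction per cycle, every load address can be computed during run-ahead (no pointer chasing), and memory accepts any number of outstanding misses. The trace format and function names are invented for the sketch.

```python
def blocking_cycles(trace, miss_latency):
    """In-order core that stalls for the full latency on every cache miss."""
    return sum(miss_latency if op == "miss" else 1 for op in trace)

def runahead_cycles(trace, miss_latency):
    """In-order core that runs ahead while stalled, issuing later misses early."""
    ready = {}   # index of a missing load -> cycle at which its data arrives
    now = 0
    for i, op in enumerate(trace):
        if op != "miss":
            now += 1
            continue
        ready.setdefault(i, now + miss_latency)   # issue the miss if not prefetched
        stall = ready[i] - now
        for ahead in range(1, stall):             # run ahead one instruction per cycle
            j = i + ahead
            if j >= len(trace):
                break
            if trace[j] == "miss":
                ready.setdefault(j, now + ahead + miss_latency)
        now = max(now + 1, ready[i])              # resume from the load once data is back
    return now

trace = ["miss", "alu", "miss", "alu", "miss"]
print(blocking_cycles(trace, 10), runahead_cycles(trace, 10))
```

In this model the three misses overlap almost completely; with dependent loads, where the address of one miss comes from the data of the previous one, run-ahead would gain nothing and the two functions would have to agree.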
The explicit speculation available in the Itanic instruction set
is not as good at "getting the loads out" as OoO execution. OoO
will discover opportunities to issue loads that the compiler cannot
know about, since it doesn't always know in advance which path
program execution will follow.
In addition Itanic is further limited by the number of available
register names. The original Pentium 4 has 128 internal register
names--the same as the Itanic ISA. The next generation Pentium 4
(Prescott) has 256 internal register names. The main purpose of
the additional register names is to allow OoO execution to explore
further ahead in hopes of finding more useful work to cover for the
(relatively) higher memory latency.
The Itanic ISA does not even have enough registers to allow its
explicit speculation to hide the memory latency as well as Prescott.
If future implementations of Itanic want to remain competitive,
they will have to include register renaming and probably some form
of dynamic scheduling or run-ahead to get the loads out. Then all
the Itanic features intended to support statically scheduled in-order
execution will be at best irrelevant, and at worst serious baggage.
Interesting... I didn't realize that 128 could already become a limit,
since this would mean that more than 128 destination registers for
waiting instructions must be active. Compared to the 8 real regs, 256 is
a _very_ big number.
> The Itanic ISA does not even have enough registers to allow its
> explicit speculation to hide the memory latency as well as Prescott.
>
> If future implementations of Itanic want to remain competitive,
> they will have to include register renaming and probably some form
> of dynamic scheduling or run-ahead to get the loads out. Then all
> the Itanic features intended to support statically scheduled in-order
> execution will be at best irrelevant, and at worst serious baggage.
I wouldn't be quite so harsh:
On the one hand, having an infinitely large OoO buffer will effectively
generate the exact same memory traffic as having explicit speculative
loads and predicated use of the results.
OTOH, some of the features, specifically the branch avoidance, and the
huge visible register set, will allow several algorithms to treat the
register set as an "infinitely fast" L0, suitable for the innermost
level of cache blocking. It seems to me that you'll need more resources
in the form of a humongous forwarding network to get the same effect
from a similar or larger set of renaming resources.
On the gripping hand, I still believe in interesting times while waiting
to see how the 64-bit market turns out.
I also believe that my skills in getting the maximum performance out of
x86 architecture cpus are going to stay worthwhile for quite a bit
longer. :-)
That's not how it started :-)
Yes, I agree that the use of out-of-order execution for keeping the
ALUs balanced is yesterday's solution to the day before's problem,
but that is where it came in. Let's stick to the memory latency
issue, which is the one we are stuck on today.
One of the oldest forms of out-of-order execution is touching cache
and preloading registers as soon as the address can be calculated;
the advantage of doing only this is that it vastly simplifies
interrupt handling and pipeline draining. With care, you can ensure
that any pre-executed instructions can simply be repeated if they
have been cancelled.
Now, 'modern' out-of-order execution allows the 'pre-execution' of
the instructions that are needed to calculate those addresses, but
I wonder how useful that really is? It clearly adds a lot of
complexity to the interrupt handling and pipeline draining, which is
normally tackled by a very complicated register renaming scheme, and
the removal of fancy interrupt handling facilities from the ISA (e.g.
precise interrupt handling of arbitrary, fully optimised, code).
Are there any real figures on how much the latter gains you? I know
that it does, because the first form can largely be done statically
(and has been), and works extremely well on predictable codes, but
fails horribly on spaghetti. The latter method is comparable for the
predictable codes, but gives at least some benefit on spaghetti. But
how much?
If it is really beneficial, then I think that the mistake is trying
to keep it out of the ISA. I would much rather go for a well-designed
out-of-order ISA, using concepts like a sticky 'bad value' flag (as
in IEEE floating-point) and consistency barriers (as in BSP), though
I agree that the average programmer would call upon Von Neumann to
exorcise me as a sulphur-breathing, horned apparition :-)
Regards,
Nick Maclaren.
> In article <e8dk8v0mpdjdet77h...@4ax.com>, Robert Myers wrote:
> > In terms of relieving data starvation (as opposed to contention for
> > execution units) OoO only permits opportunistic exploitation of data
> > that happen to be in cache. It makes *that* much of a difference? Or
> > am I still missing something?
>
> You are missing something.
>
> OoO is not simply about loads that hit the cache. OoO allows the
> processor to get *cache-missing* loads out to external memory sooner,
> without necessarily being blocked waiting for earlier instructions
> to finish. In other words, OoO is a mechanism for hiding memory
> latency.
>
> There are different mechanisms of implementing OoO. The classic
> method is dynamic scheduling a-la the P6--execute instructions when
> their operands become available.
>
> You can also use "run-ahead execution"--execute the program in-order,
> but continue running ahead past cache-missing instructions to
> discover and issue more loads. Then when the original cache-missing
> load finally gets its data, go back and re-execute the program
> starting from that load.
If I understand correctly, this "run-ahead execution" approach is the
one used by the VIA C3 'Nehemiah' core. It is a single issue, in-order
core, but with the new "StepAhead(TM)" technique it gathers data
needed by instructions ahead in the instruction stream. They claim
this approach increases the performance of "mainstream productivity"
applications (mostly integer, I guess) by 20% over the old VIA C3
'Ezra T' core. For the "3D graphics applications" the speedup is 73%
compared to the old core, but it is difficult to see if this is really
the result of "StepAhead", since, at the same time, they have doubled
the (clock) speed of the FPU. This is the only x86 CPU I'm aware
of doing it this way. The architecture is different enough from the
other x86 CPUs to make it difficult to compare this approach to full
OOO execution.
http://www.via.com.tw/en/viac3/pb.jsp
>
> In the limit as processor speed approaches infinity, the performance
> of run-ahead is indistinguishable from that of dynamic scheduling.
This is, on the other hand, very far from the VIA C3 approach... :-)
--
Daniel Spångberg
http://www.teoroo.mkem.uu.se/daniels/
Linux user #47682,
http://counter.li.org
<snip>
>
>You can also use "run-ahead execution"--execute the program in-order,
>but continue running ahead past cache-missing instructions to
>discover and issue more loads. Then when the original cache-missing
>load finally gets its data, go back and re-execute the program
>starting from that load.
>
>In the limit as processor speed approaches infinity, the performance
>of run-ahead is indistinguishable from that of dynamic scheduling.
>
>Neither run-ahead nor dynamic scheduling will get you very far
>unless you also have register renaming.
>
>The explicit speculation available in the Itanic instruction set
>is not as good at "getting the loads out" as OoO execution. OoO
>will discover opportunities to issue loads that the compiler cannot
>know about, since it doesn't always know in advance which path
>program execution will follow.
>
<snip>
The methods you mention for getting the loads out all go through the
usual Von Neumann bottleneck: one very fast processor struggling to
look far enough ahead to keep the pipeline fed by slow memory.
Has anyone thought of or tried speculatively creating threads on the
fly? CPU identifies possible speculative threads from run-time
behavior, compiler hints, or even random speculation, off-loading the
tasks to other processors that maybe don't even have to be cache
coherent because they are executing in the run-ahead mode you
describe?
The idea is a little short on details :-), but it seems like one
possible way to create an arbitrarily fast processor (albeit involving
more than one CPU) to do run-ahead.
RM
Yes, this is a focus of active research.
Simplest low-hanging fruit: fork-on-call. While the "non-speculative"
thread follows a procedure call, fork a speculative thread to follow
the code after the return point.
Interesting problem: stitching the threads back together so the results
appear to the executed program as if there was only one thread. Treating
the speculative threads as run-ahead is, in fact, one good way to do this.
There is a published Ph.D. dissertation by Haitham Akkary, of Portland
State University, that gives some algorithms and speedups for various
forms of implicit multithreading. It looks quite promising. Akkary
reported average speedups of 20-30% and peak speedups of over 80% on
some benchmarks.
Robert Myers wrote:
>
> Has anyone thought of or tried speculatively creating threads on the
> fly? CPU identifies possible speculative threads from run-time
> behavior, compiler hints, or even random speculation, off-loading the
> tasks to other processors that maybe don't even have to be cache
> coherent because they are executing in the run-ahead mode you
> describe?
>
> The idea is a little short on details :-), but it seems like one
> possible way to create an arbitrarily fast processor (albeit involving
> more than one CPU) to do run-ahead.
>
Yes, but I don't know if it's in the way you are thinking. I was thinking
of being able to do in software what hardware does for virtual address
translation: it looks in the TLB and the page/segment tables
at the same time. There are lookup algorithms that have different performance
tradeoffs. It would be nice to combine them to get the best of both.
For example, you may have a linear lookup that has a good LRU hit ratio, e.g.
O(1) 50% of the time, and O(n/2) the other 50%. It would be nice to combine
that with an O(log n) lookup so that you get 50% O(1) and 50% O(log n).
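A sequential Python sketch of that combination (the class name, the front_size parameter and the hit counters are all invented for illustration). It tries a short most-recently-used list first and falls back to binary search over a sorted array; a speculative version would start both lookups at once and take whichever finishes first, which this sketch does not attempt.

```python
from bisect import bisect_left

class TwoTierLookup:
    """Small move-to-front list in front of a sorted array searched by bisection."""

    def __init__(self, items, front_size=4):
        pairs = sorted(items.items())
        self._keys = [k for k, _ in pairs]
        self._values = [v for _, v in pairs]
        self._front = []              # recently used (key, value) pairs, newest first
        self._front_size = front_size
        self.front_hits = 0
        self.slow_lookups = 0

    def get(self, key):
        for pos, (k, v) in enumerate(self._front):
            if k == key:
                self._front.insert(0, self._front.pop(pos))   # move to front
                self.front_hits += 1
                return v
        self.slow_lookups += 1
        i = bisect_left(self._keys, key)                      # O(log n) fallback
        if i == len(self._keys) or self._keys[i] != key:
            raise KeyError(key)
        value = self._values[i]
        self._front.insert(0, (key, value))
        del self._front[self._front_size:]                    # keep the front list short
        return value

table = TwoTierLookup({n: n * n for n in range(1000)})
first = table.get(7)    # found by binary search
second = table.get(7)   # found in the front list
print(first, second, table.front_hits, table.slow_lookups)
```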
Joe Seigh
Intel had a paper on co-thread speculative prefetching.
Of course, they didn't actually implement it, but mimicked it
using regular threads to get an estimate of the effect.
Peter Boyle
I may be wrong, but AFAIR the stated goal of the first aggressive OoO
CPU on the market was even less ambitious than that.
MIPS R8000 was intended primarily for FP-intensive supercomputing. It
featured a slow clock rate and a short pipeline, so neither unit latency
nor load latency was much of a concern. The problem was that
they didn't want it to be embarrassingly slower than the R4000 on existing
R4000 binaries. OoO came as their answer.
Once again, I don't know what the real goal was, but at least this was
a stated goal as expressed in an IEEE Micro article.
Many people at comp.arch tend to ignore it, but the best thing about
aggressive OoO CPUs is their ability to produce decent performance
without recompilation. All other benefits are of lower importance.
Here's the intel link.
http://www.intel.com/technology/itj/2002/volume06issue01/art03_specprecomp/p01_abstract.htm
Copyright intel I guess yadda yadda:
Peter
----- Snipped from page
ABSTRACT
Speculative Precomputation (SP) is a technique to improve the latency of
single-threaded applications by utilizing idle multithreading hardware
resources to perform aggressive long-range data prefetches. Instead of
trying to explicitly parallelize a single-threaded application, SP does
the following:
* Targets only a small set of static load instructions, called
delinquent loads, which incur the most performance degrading cache miss
penalties.
* Identifies the dependent instruction slice leading to each
delinquent load.
* Dynamically spawns the slice on a spare hardware thread to
speculatively precompute the load address and perform data prefetch.
----Snipped
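The second step in that list, finding the instruction slice that feeds a delinquent load, can be sketched in Python for straight-line code. This is an illustration only, not the method of the Intel paper: the (destination, sources) instruction encoding and the register names are assumptions, and control flow and memory dependences are ignored.

```python
def backward_slice(program, target):
    """Return indices of earlier instructions that program[target]'s inputs depend on."""
    needed = set(program[target][1])     # registers the delinquent load reads
    slice_idx = []
    for i in range(target - 1, -1, -1):  # walk backwards from the load
        dest, srcs = program[i]
        if dest in needed:
            needed.discard(dest)         # this instruction produces a needed value
            needed.update(srcs)          # so its own inputs are needed in turn
            slice_idx.append(i)
    slice_idx.reverse()
    return slice_idx

# Each entry is (destination register, source registers).
program = [
    ("r1", ("r0",)),        # 0: r1 = r0 + 8
    ("r5", ("r6", "r7")),   # 1: unrelated arithmetic
    ("r2", ("r1",)),        # 2: r2 = load [r1]
    ("r8", ("r5",)),        # 3: unrelated arithmetic
    ("r3", ("r2", "r4")),   # 4: r3 = r2 + r4
    ("r9", ("r3",)),        # 5: delinquent load [r3]
]

print(backward_slice(program, 5))
```

Only the instructions in the slice would need to be run on the spare hardware thread to compute the load address; the unrelated arithmetic is left out.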
If you are still flogging the Alpha-FP-exceptions horse, note that the
in-order CPUs have the performance penalties for precise exceptions
(see
http://www.complang.tuwien.ac.at/anton/spec95/21164a-600/CFP95.013.asc
and
http://www.complang.tuwien.ac.at/anton/spec95/21164a-600/ieee/CFP95.025.asc),
whereas in the OoO, it is barely present (see
http://www.complang.tuwien.ac.at/anton/spec95/21264-500/CFP95.003.asc
and
http://www.complang.tuwien.ac.at/anton/spec95/21264-500/ieee/CFP95.009.asc).
I actually expect the 21264 to produce precise FP exceptions even if
you leave the trapb away. OoO as typically implemented nowadays gives
you precise exceptions almost for free as a side-effect of being able
to undo speculative execution.
- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
> In addition Itanic is further limited by the number of available
> register names. The original Pentium 4 has 128 internal register
> names--the same as the Itanic ISA. The next generation Pentium 4
> (Prescott) has 256 internal register names. The main purpose of
> the additional register names is to allow OoO execution to explore
> further ahead in hopes of finding more useful work to cover for the
> (relatively) higher memory latency.
How are TLB misses handled? Hit under miss, miss under miss?
--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
For heaven's sake! Not merely was I not referring to the Alpha
specifically - you will find that MOST out-of-order CPUs, from the
360/91 onwards, place restrictions on the handling or precision of
exceptions - I have no idea what your examples are meant to show.
Are you saying that the 21164 is an in-order CPU and the 21264 is
not? That is an, er, interesting viewpoint.
If you are saying that the performance differences between the 21164
in 'normal' mode and with IEEE support switched on is solely due to
the support of precise exceptions, then you are seriously mistaken.
At least according to DEC.
>I actually expect the 21264 to produce precise FP exceptions even if
>you leave the trapb away. OoO as typically implemented nowadays gives
>you precise exceptions almost for free as a side-effect of being able
>to undo speculative execution.
An interesting meaning of the word "typical"! Yes, the ability to
undo speculative execution of more-or-less arbitrary instruction
streams does give you the ability to make exceptions precise, at
least in principle, but few "out of order" CPUs actually implement
that sort of speculative execution. It is a VERY complex thing to
get right.
Regards,
Nick Maclaren.
See the ISCA 2001 proceedings to get started. Collins and Zilles both
had papers in this general area.
There are probably at least 30 papers on this topic published in the last
5 years. That doesn't even take into account old stuff.
nate
>The PDP-11/60 (there may have been another - perhaps the 11/44 - that
>supported programmable microcode as well) was mostly a solution in search of
>a problem. Richie Lary used its 'writable control store' to create the
>world's fastest PDP-8 as a hack
ISTR hearing that the 11/60's main value was as a full-dress rehearsal
in WCS for DEC Field Circus before the 11/780. The 11/60 marketing
looked *really* quaint when I saw it in the mid-80s (superhero-style
overmuscled limbs coming out of an '11 cabinet labeled ``Control Store Power!'').
You are mixing up pieces of R8000 and R10000.
>Many people at comp.arch tend to ignore it, but the best thing about
>aggressive OoO CPUs is their ability to produce decent performance
>without recompilation. All other benefits are of lower importance.
Transmeta also achieves decent performance without recompilation, see
also numerous other binary-binary translation projects.
The best thing about aggressive OoO machines is that they are the fastest
things we currently know how to build for a very wide range of applications,
both those existing and those to be written/compiled.
nate
Well, FPGAs that will make for great compromises towards even near-general
computing will have to be different in several ways, one of which being that
the amount of RAM and how you can handle it is right now way too anemic.
Reconfigurable logic that came with a very large amount of RAM on chip that
you could creatively organize would be much more interesting than the present
'lots of registers and a little bit of scratch RAM/cache' achievable.
>
> RM
--
Sander
+++ Out of cheese error +++
Is there anything more published online beyond Akkary's thesis and the
1998 "A dynamic multithreaded processor"?
[Earlier]
Mike> In the limit as processor speed approaches infinity, the performance
Mike> of run-ahead is indistinguishable from that of dynamic scheduling.
Mike> Interesting problem: stitching the threads back together so the results
Mike> appear to the executed program as if there was only one thread. Treating
Mike> the speculative threads as run-ahead is, in fact, one good way to do this.
So the idea is to fork-on-call and maybe on loops too, and let those
threads go where they may, spewing I$, D$, and branch mispredicts
along the way. The main thread then catches up because it suffers
far fewer cache misses of any sort. When well over 50% of your CPI
is from cache misses, this gets a lot of speedup.
One issue that comes up, which has to be solved for any DMT processor, is
how you deal with cache pollution from misspeculated misses. To get those
forked threads moving, you have to speculate that the register data and
loaded data they start with is correct, or come up with some better
speculation. Either way, you'll be wrong some of the time, and that'll
lead inevitably to cache misses at the wrong locations.
So what do you do with the incoming fill data? Do you keep it in no-added-
latency fill buffers, and write it into the cache only when a real miss
validates the speculated miss? Or do you make a bigger victim cache and
pray?
Once the main thread sees many fewer cache misses, it's going to see a
lot fewer replays, and the instruction fetch unit is going to have a lot
fewer opportunities to catch up to drained issue queues. I can imagine
that the speculated threads might save their PCs or even fetched instruction
streams to buffers which are then used to accelerate instruction fetch,
sort of a better trace cache.
This actually seems like a pretty good design point to me, because we
can fairly naturally build cores which are sized to deal with the ILP
actually seen in real programs. Of course, the machine has to be able
to execute the speculative threads too, which roughly doubles the
instruction bandwidth required -- maybe more with branch mispredicts
and the resulting pipe flushes.
So maybe you even have a dual-core approach, where the main thread core
handles perhaps less instruction bandwidth, but has minimal latencies.
The speculating core handles maybe four threads, switching on cache misses,
and has more bandwidth but somewhat more latency -- and maybe gets wrong
answers every now and again.
Akkary's thesis pushes way past this point, though, and claims just 30%
speedup for it. And he wades through the muck of dual-level issue queues
and attempting to splice threads back together again at commit time.
Rename splicing -- yuk.
If you wanted to recover some work from the speculative core, you could
even run the speculated stream in a mode where some of the input data
(incoming register values and load values) is deemed unreliable. The
stream is then split into two streams. Instructions operating on reliable
data get to issue real cache misses, and are saved to one buffer with their
results. Instructions operating on unreliable data do not issue cache
misses, and are saved to an another buffer. When the main thread catches
the speculated thread, the unreliable instructions are replayed through the
main execution core, and the reliable instructions merely inject their
results into that core and check that their inputs really were reliable.
Sort of like partial evaluation, but speculatively, and at runtime.
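The split described in the paragraph above amounts to forward taint propagation, which can be sketched in Python (the instruction encoding as (destination, sources) pairs, the register names and the function name are all assumptions for illustration; the replay and checking machinery is not modelled).

```python
def split_stream(program, unreliable_inputs):
    """Split instruction indices into (reliable, unreliable) buffers."""
    tainted = set(unreliable_inputs)
    reliable, unreliable = [], []
    for i, (dest, srcs) in enumerate(program):
        if any(s in tainted for s in srcs):
            tainted.add(dest)        # result depends on unreliable data
            unreliable.append(i)     # saved for replay on the main core
        else:
            tainted.discard(dest)    # register overwritten with a reliable value
            reliable.append(i)       # may issue real cache misses
    return reliable, unreliable

# Each entry is (destination register, source registers).
program = [
    ("r1", ("r0",)),        # 0
    ("r2", ("r9",)),        # 1: reads the unreliable input r9
    ("r3", ("r1",)),        # 2
    ("r4", ("r2", "r3")),   # 3: reads r2, which is now unreliable
    ("r2", ("r1",)),        # 4: overwrites r2 with reliable data
    ("r5", ("r2",)),        # 5
]

print(split_stream(program, {"r9"}))
```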
The speculative threads end up doing your data structure traversals,
and the main thread ends up doing your actual computation. This is
only really interesting if the execution stream has a lot of ILP,
enough to keep both the main datapath and the speculative datapath
busy.
> I was reading about this on Andy Glew's home page just a few days ago,
> which was quite disappointing, since I've been chewing over these ideas
> since 1992, and thought I had a big leg up on you folks. [Any ex-SGI'ers
> remember Dead Chicken Mode?]
>
> Is there anything more published online beyond Akkary's thesis and the
> 1998 "A dynamic multithreaded processor"?
>
> [Earlier]
> Mike> In the limit as processor speed approaches infinity, the performance
> Mike> of run-ahead is indistinguishable from that of dynamic scheduling.
>
> Mike> Interesting problem: stitching the threads back together so the results
> Mike> appear to the executed program as if there was only one thread. Treating
> Mike> the speculative threads as run-ahead is, in fact, one good way to do this.
>
> So the idea is to fork-on-call and maybe on loops too, and let those
> threads go where they may, spewing I$, D$, and branch mispredicts
> along the way. The main thread then catches up because it suffers
> far fewer cache misses of any sort. When well over 50% of your CPI
> is from cache misses, this gets a lot of speedup.
Some citations for you which evaluate many of the ideas you mentioned:
http://citeseer.nj.nec.com/purser00study.html
http://citeseer.nj.nec.com/luk01tolerating.html
http://citeseer.nj.nec.com/collins01dynamic.html
http://citeseer.nj.nec.com/zilles01executionbased.html
http://citeseer.nj.nec.com/reinhardt00transient.html
Also, if you wish to treat the speculative threads as real work, you
get into the realm of thread level speculation (TLS), which happens to
be what I work on. :-) I think most of the TLS work has focussed on
extracting additional parallelism out of code, and has not put a great
deal of emphasis on using TLS as a souped-up prefetching technique.
Some random TLS citations:
http://citeseer.nj.nec.com/steffan00scalable.html
http://citeseer.nj.nec.com/85835.html
http://citeseer.nj.nec.com/sohi95multiscalar.html
http://citeseer.nj.nec.com/rauchwerger95lrpd.html
You can also flip through the proceedings of recent HPCA, ASPLOS,
MICRO, and ISCA conferences to get all sorts of other papers exploring
these ideas...
Chris
--
Chris Colohan Email: ch...@colohan.ca PGP: finger col...@cs.cmu.edu
Web: www.cs.cmu.edu/~colohan Phone: (412)268-4751
In the limit, as processor speed approaches infinity, the performance
of any thread is indistinguishable from that of the latency caused by
main memory accesses.
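The limit can be put in numbers with a simple two-term model, sketched here under stated assumptions: run time is compute time, which shrinks with clock rate, plus miss time, which does not. The function names, the 100 ns miss latency, and the assumption that misses are fully serialized with no overlap are illustrative, not measurements.

```python
def run_time_ns(instructions, misses, clock_ghz, ipc=1.0, miss_latency_ns=100.0):
    """Run time in ns: compute term plus a fixed memory-latency term.

    Assumes every miss stalls the thread for the full latency and that
    no misses overlap with each other or with computation.
    """
    compute_ns = instructions / (ipc * clock_ghz)  # shrinks as the clock rises
    memory_ns = misses * miss_latency_ns           # independent of the clock
    return compute_ns + memory_ns


def memory_floor_ns(misses, miss_latency_ns=100.0):
    """The value run_time_ns approaches as clock_ghz goes to infinity."""
    return misses * miss_latency_ns
```

For a million instructions with a thousand misses, this model gives 1,100,000 ns at 1 GHz and 101,000 ns at 1000 GHz, against a floor of 100,000 ns. A thousandfold faster clock buys roughly an elevenfold speedup.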
<snip>
Mitch Alsup
That may indeed be the case and it doesn't surprise me at all.
However those that I know (all engineers, BTW) have a totally
different outlook on life than their hardware brothers. The
software types think in data structures, the hardware types think
in physical structures. There is a wide gap here!
> A preponderance have been electrical engineers, not comp.sci, but that
> might be self-selection since computer science wasn't a separate degree
> field until relatively recently.
I don't doubt you at all. But that doesn't change the facts.
FPGAs aren't reasonable for generalized computing. A *good* FPGA
designer may get a part working at 1/10 the clock of a reasonable
CPU. To get there the designer will instantiate every cell in
the target FPGA's architecture. The wuss that uses the synthesis
tools will do well to be another order of magnitude behind this.
No, compilers do not work here. Physical structure is far more
important than data structures here.
Now if you're talking about something more coarse grained than an
FPGA, perhaps arrays of multipliers, we're in the realm of
microcode, perhaps vector processors. This ain't FPGAs.
>
> Terje
> (MSEE)
CompSci was in the math department when I was in college
(actually it had just moved from the graduate college). I had
the choice of a BSEE or BSCsE, but the latter was simply too new
to bother with and the BSEE was more general. I don't see the
relevance here though.
--
Keith
Yawn.
Why do FPGA designers instantiate logic cells in the target
technology? They do so because they'll pay a 10x penalty for not
designing to the array at hand. They're already paying a 10x
speed penalty (and 10-100X unit cost penalty) for using
programmable logic.
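Taking the two 10x figures above at face value, the penalties compound. A minimal sketch of the arithmetic follows; the 2000 MHz CPU clock and the function names are assumptions for illustration only, and the factors are the rough ones claimed in this post, not benchmarks.

```python
PROGRAMMABLE_PENALTY = 10  # claimed clock penalty for using programmable logic
SYNTHESIS_PENALTY = 10     # claimed extra penalty for not hand-instantiating cells


def fpga_clock_mhz(cpu_clock_mhz, hand_instantiated):
    """Estimated FPGA clock under the claimed penalty factors."""
    clock = cpu_clock_mhz / PROGRAMMABLE_PENALTY
    if not hand_instantiated:
        clock /= SYNTHESIS_PENALTY  # design left to the synthesis tools
    return clock


def break_even_parallelism(cpu_clock_mhz, hand_instantiated):
    """How much more work per clock the FPGA needs just to match the CPU."""
    return cpu_clock_mhz / fpga_clock_mhz(cpu_clock_mhz, hand_instantiated)
```

Against an assumed 2000 MHz CPU, a hand-instantiated design would run at 200 MHz and need 10x the work per clock to break even; a synthesized one would run at 20 MHz and need 100x.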
You go figure the numbers. Indeed no one cares enough about
programmable computing to pull the cash out of their pockets.
Hype sells advertising space. It doesn't make many chips.
>
> If you're going to describe a system where real-time constraints and
> synchronization of independent processes are important, you'd better
> have a formalism that can include those effects, and I believe Mr.
> Lamport has described such a formalism.
Formalism? I have history on my side. Academics looking for VC
are a dime a dozen. Show me the silicon!
> Describing a formalism and implementing it in a way that can actually
> be used and that accurately describes reality, of course, are two
> different things, and software engineering has a long way to go in
> that respect.
Formalism is just so sweet. Try the real world. Again, history
is on my side. Many smart people have tried to make hardware
design simple; all have failed miserably.
> The unfortunate reality of software engineering is that mistakes
> appear to be less expensive than in hardware engineering. They
> aren't, of course. To the extent that "programmers" think differently
> from "engineers", it is largely because the ease with which you can
> cobble together a piece of ad hoc software and fix it if something
> goes wrong has lulled those who work entirely with software into a
> false sense of security.
Hardware engineers naturally think in "parallel processes", since
hardware is concurrent. Programmers have been spoiled by linear
processes. The contrast between Verilog and VHDL demonstrates
the gap. Indeed all the weenies trying to make 'C' into
a hardware description language underscore the issue.
Many of the heavyweights in FPGA design still use schematics.
The others use HDLs as if they were schematics. Indeed the
products I work on use HDLs as little more than markup languages.
If the real hardware work is done at the "gate" level, just how
do you expect a "silicon compiler" to beat a gate level design?
No, I think an FPGA in the datapath of a processor is a horrible
idea. Who's going to do the design/verification?
--
Keith
>You go figure the numbers. Indeed no one cares enough about
>programmable computing to pull the cash out of their pockets.
>Hype sells advertising space. It doesn't make many chips.
Your generalization exceeds reality by a bit. Several companies have
tried to build a systems company around machines with FPGAs in them
(SRC, with an FPGA in the datapath, and StarBridge), and there are
several small companies making PCI cards. None of these companies have
had great success, either in making programming an FPGA easier or in
the financial arena, but there is modest funding going on.
I think it's a losing strategy, personally, but I won't claim that no
one cares.
greg