
600 transistor CPU

Charles Perkins

Dec 10, 1992, 1:44:09 PM
Actually From: Michael Taht <mt...@kc.org>

Your comments on bunches of CPUs per chip caught my eye. I've been
thinking along similar lines, but on a smaller scale. The argument:

RISC machines are being taken superscalar now, and we've hit new
barriers in superscalar execution, out-of-order scheduling, and
compiler technology (and code) in general that look pretty insoluble
from the standpoint of those of us maintaining normally written C code
as opposed to benchmarks. Each new generation of chip offers 2-5 times
the performance of the previous, at a cost of 3-4 times the transistor
budget. As an architecture ages, the average performance improvement
per revision drops closer to 2, while the transistor budget continues
to grow at 3-4 - in other words, the performance you get per
transistor keeps falling with every revision.

We've seen this in the 68000, SPARC, and 80x86 series. It starts to
make sense to stick multiple copies of an old model CPU into a single
housing and improve the memory interface. Think of how fast MIPS could
have come to market with a fourfold performance improvement if they
had taken their R3000 architecture and grafted four processors onto a
16K onboard cache, rather than inventing super-pipelining.

CPU design suddenly becomes an interesting problem again - we get
efficient CPI, nice short pipelines for branchy code, and not a lot of
space wasted on multiple register ports, ALUs, et al. - so just about
every part of the CPU is in use 100% of the time, which is a
reasonably decent measure of a CPU's efficiency. All you need is to
run four programs at a time, which is quite common even on
workstations and PCs: most compilers run in three or more passes, word
processors need a print spooler, networked OSes need to monitor the
network, and so on. Also, I'm a populist - if MP processors become
common and cheap, we'll invent uses for them.

We suddenly start spending a lot of time on the cache design and
(especially) the MMU design. A simple two-level on-chip cache plus a
stack cache would probably be enough (1K per processor, a 16-32 entry
stack cache per processor, 16K unified) for a 100 MHz four-processor
MIPS box.
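
To make that concrete, here is a toy model of the lookup path - a
private 1K direct-mapped L1 per CPU in front of a shared 16K unified
L2. The 32-byte line size and everything else in the code are my own
assumptions for illustration (it's just host C, not hardware):

#include <stdio.h>

#define LINE       32                 /* assumed 32-byte cache line */
#define L1_LINES   (1024 / LINE)      /* 32 lines per CPU           */
#define L2_LINES   (16384 / LINE)     /* 512 lines, shared          */

static unsigned long l1_tag[4][L1_LINES];   /* one private L1 per CPU */
static unsigned long l2_tag[L2_LINES];      /* unified second level   */

/* returns 1 on an L1 hit, 2 on an L2 hit, 0 on a miss to memory;
   tags start at 0, so the demo stays away from address 0 */
static int lookup(int cpu, unsigned long addr)
{
    unsigned long line = addr / LINE;

    if (l1_tag[cpu][line % L1_LINES] == line)
        return 1;
    l1_tag[cpu][line % L1_LINES] = line;    /* fill the private L1 */

    if (l2_tag[line % L2_LINES] == line)
        return 2;
    l2_tag[line % L2_LINES] = line;         /* fill the shared L2  */
    return 0;
}

int main(void)
{
    /* two CPUs touching the same line: the second misses its own L1
       but hits the shared L2 */
    printf("%d\n", lookup(0, 0x1000));      /* 0: memory */
    printf("%d\n", lookup(0, 0x1000));      /* 1: L1 hit */
    printf("%d\n", lookup(1, 0x1000));      /* 2: L2 hit */
    return 0;
}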

The MMU/cache controller becomes a central resource. It gets the
multiple read ports that are wasted on a stack register machine. It
gets a nice large virtual memory map (so often skimpy on RISC
designs), and it gets to schedule reads and determine write policy
(for example, the MMU could notice that several consecutive memory
locations were dirty and flush them in sequence via a bursty write, an
external memory read for an instruction fetch should run as long as
possible, etc.). Think of it as an additional CPU that uses spare
memory cycles to optimize memory use, tightly coupled to the normal
CPUs.
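
A rough sketch of the "spot a run of consecutive dirty lines and flush
it as one burst" idea - the dirty[] bitmap and burst_write() are
invented stand-ins for illustration, not any real controller
interface:

#include <stdio.h>

#define LINES 512
#define LINE  32

static int dirty[LINES];              /* one flag per cache line */

/* stand-in for the real burst write transaction */
static void burst_write(int first, int count)
{
    printf("burst write: lines %d..%d (%d bytes)\n",
           first, first + count - 1, count * LINE);
}

static void flush_dirty(void)
{
    int i = 0;
    while (i < LINES) {
        if (dirty[i]) {
            int start = i;
            while (i < LINES && dirty[i])    /* extend the dirty run */
                dirty[i++] = 0;
            burst_write(start, i - start);   /* one burst per run    */
        } else {
            i++;
        }
    }
}

int main(void)
{
    dirty[10] = dirty[11] = dirty[12] = 1;   /* consecutive run */
    dirty[40] = 1;                           /* isolated line   */
    flush_dirty();                           /* 2 bursts, not 4 writes */
    return 0;
}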

We can also save chip area on the CPU by having only a single FPU,
which could be anything from your simple R3000 companion (or simpler)
to an R4000-alike - we're already doing interlocked pipelined access,
so we can probably multiplex the incoming floating point work for the
worst case (four processes doing heavy math). However:

99.99% of all CPUs sold spend 99.99% of their time not doing floating
point. Floating point isn't important in the non-scientific market.
Once you can do a five-year projection of a company like Borland on a
spreadsheet in a few seconds, the FPU is fast enough. (Ideally it
should be around 1/10 of a second, but spreadsheets do their math with
very integer-intensive code - sparse recalculations - so once you are
in the right order of magnitude, less than 40 clocks per multiply, FP
really has no impact. But that is another story.)
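
For the curious, here is the flavor of integer-only arithmetic I mean:
a 16.16 fixed-point multiply doing a toy five-year projection. The
format, the growth rate, and all the names are made up for the sketch;
the point is that the whole recalculation is shifts and integer
multiplies, no FPU required:

#include <stdio.h>

typedef long fixed;                    /* 16.16 fixed point          */
#define TO_FIXED(x)   ((fixed)((x) * 65536.0))
#define TO_DOUBLE(x)  ((double)(x) / 65536.0)

static fixed fmul(fixed a, fixed b)
{
    /* widen so the intermediate product cannot overflow */
    return (fixed)(((long long)a * (long long)b) >> 16);
}

int main(void)
{
    fixed rate  = TO_FIXED(1.07);      /* assumed 7% annual growth   */
    fixed sales = TO_FIXED(1000.0);    /* assumed starting figure    */
    int year;

    for (year = 1; year <= 5; year++) {
        sales = fmul(sales, rate);     /* one integer multiply/cell  */
        printf("year %d: %.2f\n", year, TO_DOUBLE(sales));
    }
    return 0;
}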

And it gets better. DMA is getting tougher to do right on modern,
15 ns copy-back caches. Put the DMA controller in with the on-chip MMU
(or get rid of the DMA controller entirely and let a CPU just transfer
the data back and forth). Now we can do bursty reads and bursty writes
for block I/O and avoid thrashing the cache. (OK, so you can't do bus
master style DMA - you lose half your theoretical bus master
performance. At CPU speeds, big deal.)
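
Something like this, in C - a CPU standing in for the DMA controller,
moving block I/O in line-sized bursts so the accesses stay sequential
and only a handful of cache lines are ever touched. BURST and
cpu_dma_copy() are just names I made up for the sketch:

#include <stdio.h>
#include <string.h>

#define BURST 32                       /* one cache line per burst   */

static void cpu_dma_copy(char *dst, const char *src, size_t len)
{
    while (len >= BURST) {             /* whole line-sized bursts    */
        memcpy(dst, src, BURST);
        dst += BURST;
        src += BURST;
        len -= BURST;
    }
    if (len)                           /* trailing partial burst     */
        memcpy(dst, src, len);
}

int main(void)
{
    char in[100], out[100];

    memset(in, 'x', sizeof in);
    cpu_dma_copy(out, in, sizeof in);
    printf("%s\n", memcmp(in, out, sizeof in) == 0 ? "copied" : "error");
    return 0;
}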

Mainframe design principles applied to RISC.

Anyway, I haven't been up on comp.arch in ages, so all this may be old
hat. I can't post from this site, so if you enjoyed, could you post
this for me?

Certainly.
_____________
Michael Taht
Borland/Interbase

<mt...@kc.org>


--
.8Chuck .2Socrates
per...@ursa11.law.utah.edu
"I drank what?" "Pepsi." "Oh."
