Michael Bringle
The first problem is that you would need a very wide data path... i.e.
1024 wires. To get all the performance out of it, you would need 1024
pins just for data. Current chips have 200 - 300 pins. This is just a
mechanical/manufacturing problem in one sense. As soon as you multiplex
the data lines or have a smaller path from L1 cache to memory, etc. you
are giving up some performance.
Another problem is that not all applications scale usefully with
larger ALU sizes, so you would be wasting a large fraction of the
transistors much of the time.
I'm sure there are other problems that make such a CPU not economical
for general purpose computing. I'm sure that someone has their special
application that this would be perfect for, though.
Mike Schmit
-------------------------------------------------------------------
msc...@ix.netcom.com author:
408-244-6826 Pentium Processor Programming Tools
800-765-8086 ISBN: 0-12-627230-1
-------------------------------------------------------------------
Michael> I was wondering, barring compatibility, what are the
Michael> manufacturing or performance restraints that would keep us
Michael> from jumping, say, to a 1024-bit processor? Or more
Michael> generally, why don't we just start using very wide data
Michael> pathways to speed up our computers?
There are several problems. First, 1024 bit addresses and integers
take up lots more space in both memory and CPU. It's easy to put 2048
bits of registers on a chip - that's enough for 32 64-bit registers or
two 1024-bit registers. DRAM usage also goes up, since all pointers
take up more space.
Beyond those issues, there's also the problem of fitting the ALU on
the chip. It's not all that hard to design a 1024-bit ALU (well, the
carry lookahead logic could be difficult...), but it's hard to fit it
onto a die we can build today. We're also pin-limited - modern chips
have a few hundred pins, not the few thousand you'd need for a
1024-bit processor.
I'm sure it'll be possible eventually, but it's not feasible right
now.
ethan
--
( Prof. Ethan L. Miller voice: +1 410 455-3972 fax: 455-3969 )
( U of Maryland Baltimore County email: e...@cs.umbc.edu )
( CSEE Dept, 5401 Wilkens Ave URL: http://www.cs.umbc.edu/~elm/ )
( Baltimore, MD 21228-5398 USA These opinions are mine, all mine! )
( PGP key fingerprint: 86 45 87 10 D7 18 35 6A 3E ED 0F B7 99 53 4E 4A )
I don't think the problem is merely technological.
It's much more likely to be:
what would you do with those 1024 bits?
what's the point?
if you can find a way to take advantage of a 1024-bit datapath in more than a
few special cases (like cryptography), maybe you could convince people to start
thinking about it.
Stefan
|> I don't think the problem is merely technological.
|> It's much more likely to be:
|>
|> what would you do with those 1024bits ?
|> what's the point ?
|>
|> if you can find a way to take advantage of a 1024bits datapath in more than a
|> few special cases (like cryptography), maybe you could convince people to start
|> thinking about it.
Note: I think an N-bit CPU has N-bit wide integer registers and datapath.
1) In an R10000, the 64-bit-wide integer datapath is about 20% of the width.
1024 is 16X larger. To get a 1024-bit datapath to be the same fraction of
the width of a same-size chip, you only need about 10 shrinks, i.e. 10 chip
generations; assuming 3 years apiece, that's 30 years before you'd even want
to think about this. If you were willing for it to be a larger fraction of
a chip, you might save 2 generations.
(You can jiggle the numbers, but that's the idea.)
2) Of course, since wires often don't shrink as fast as transistors, there is
the "minor" issue of running a lot of 1024-bit-wide busses around the chip.
3) Besides the space issue, consider that designers are fighting hard to
reduce the delays caused by long wires on chips ... and are not likely to
be thrilled by needing to do 1024-bit-wide adders and shifters.
4) And as Stefan notes, you need a *good reason* to even think about
it:
I've lost the posting, but we went through this last year, discussing
when *128*-bit could come in. Since we're right on the 32/64-bit boundary,
and 64-bit addressing has added 32 more bits, and you can argue that we
consume 2 bits every 3 years (to track DRAM), that's 3*32/2, or 48 years.
For various reasons, I've predicted that somebody would do it earlier,
maybe around 2020 or 2030, assuming current growth rates.
Put another way, about the time you *might* consider 1024 to be possible,
you'll be *thinking* about doing 128.
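Replayed as a few lines of Python (a sketch only; the growth rates are the assumptions stated above, and the variable names are mine):

```python
import math

# Address-bit consumption: assume ~2 bits every 3 years (tracking DRAM
# growth), so the 32 bits of headroom added by the 32->64 transition last:
headroom_bits = 64 - 32
years_until_64_exhausted = 3 * headroom_bits / 2   # 3*32/2 = 48 years

# Datapath width: 1024/64 = 16X wider. If linear features shrink by
# ~1/sqrt(2) per chip generation (one density doubling), the generations
# needed before a 16X-wider datapath fits in the same chip fraction:
widening = 1024 // 64
generations = math.log(widening) / math.log(math.sqrt(2))   # ~8

print(years_until_64_exhausted, generations)
```

The "about 10 shrinks" above corresponds to a slightly slower per-generation shrink than 1/sqrt(2); either way the answer is measured in decades.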
BOTTOM LINE:
a) 32-bit CPUs are already insufficient for some uses.
b) 64-bit CPUs are likely to be sufficient for a *long time*;
Note that there are already <$35 64-bit micros available, so this
is not exotica.
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: ma...@sgi.com
DDD: 415-933-3090 FAX: 415-967-8496
USPS: Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311
I wonder how much stuff that now requires floating point operations
could be done using 1024-bit integers? If you could do most of it
that way, you could save chip real estate and execution time, it seems
to me. Floating point was, after all, an expedient devised to overcome
the limitations of short register operations, and still has the disadvantage
that "It's like moving piles of sand around: every time you do, you lose
a little sand, and pick up a little dirt."
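As a toy illustration of the trade (hypothetical, using Python's arbitrary-precision integers to stand in for a 1024-bit ALU): scaled-integer arithmetic keeps the "sand," and any residue is explicit rather than silently lost.

```python
# Fixed-point arithmetic on wide integers: exact where doubles round.
SCALE = 10**30   # scale factor; values this size fit easily in 1024 bits

def fx(num, den):
    """Represent the rational num/den as a scaled integer (truncating)."""
    return num * SCALE // den

third = fx(1, 3)
total = third + third + third
# With doubles, 0.1 + 0.2 != 0.3; here the rounding residue is explicit:
print(SCALE - total)   # exactly 1 unit in the last place (10**30 % 3 == 1)
```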
ed
Hey, I *like* thinking about "obviously" silly ideas. You never know
where they might lead.
I'm pretty sure I read some articles 3-4 years ago about a machine built at
a university in the UK which had something on the order of a 1024-bit ALU,
and yes, it was principally interesting for crypto factoring problems, on
which it was remarkably fast, even given its very slow clock rate. Does
anyone remember the name of the machine or its designers?
Several other respondents to this thread have pointed out that 1024 bits
of addressing are excessive, and that 1024-bit data paths are presently
very difficult to do on-chip and impossible to do off-chip. However, I
reflect that a classic Cray vector register *could* be thought of as a
1024-bit register. Already, with 32 or 64-bit registers, microprocessor
designers are looking at ways, sometimes called "visual" or "graphics"
instructions, to perform vector-like operations on aggregates of 8- or 16-bit
data. If we can map such hardware into useful languages - and it should be
possible, given that vectorisation algorithms already exist - we might see
a trend toward larger registers on the 1024-bit order, feeding configurable
ALUs that can perform some number of 8/16/32/64/128(?)-bit operations in
parallel. As the efficiency of such a scheme would be a function of the
regularity of the structure of the computation to be performed, and as
numerical codes show the most regularity, it would be more valuable to
support ganged floating-point operations than ganged integer operations.
I suppose that the principal difference between what I describe and a Cray
is that, in the Cray designs, the vector registers were really piped to
functional units one element at a time, whereas in a neo-vector machine,
one would be processing multiple elements in parallel on the same clock.
The data memory bandwidth requirements would be correspondingly higher.
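The ganged sub-word idea can be sketched in software - SWAR, "SIMD within a register" - here with four hypothetical 16-bit lanes packed into one 64-bit word (a toy model, not any real instruction set):

```python
# Four 16-bit lanes in a 64-bit word, added in one "wide" operation.
H = 0x8000800080008000            # top bit of each 16-bit lane
M64 = 0xFFFFFFFFFFFFFFFF          # 64-bit mask

def pack(lanes):
    """Pack four 16-bit values into a single 64-bit integer."""
    v = 0
    for i, lane in enumerate(lanes):
        v |= (lane & 0xFFFF) << (16 * i)
    return v

def unpack(v):
    """Split a 64-bit integer back into its four 16-bit lanes."""
    return [(v >> (16 * i)) & 0xFFFF for i in range(4)]

def swar_add16(x, y):
    """Lane-wise 16-bit add: sum the low 15 bits of each lane (so carries
    cannot cross lanes), then restore each lane's top bit with an XOR."""
    return (((x & ~H & M64) + (y & ~H & M64)) ^ ((x ^ y) & H)) & M64

a = pack([1, 2, 3, 0xFFFF])
b = pack([5, 6, 7, 1])
print(unpack(swar_add16(a, b)))   # [6, 8, 10, 0] -- last lane wraps mod 2**16
```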
--
Opinions expressed may not be Kevin D. Kissell
those of the author, let alone Silicon Graphics Core Technology Group
those of Silicon Graphics. Cortaillod, Switzerland
It's been done before - while not quite with 1024-bit registers,
there have been systems that used huge "super-accumulators" that
allowed lower-precision subresults to be accumulated in a very
high-precision super-accumulator. I'm not entirely sure how much of this
was hardware-supported.
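In software the idea looks something like this (a hypothetical sketch in the spirit of a Kulisch-style exact accumulator: every double becomes a wide fixed-point integer, sums are exact, and rounding happens only once at the end):

```python
from math import frexp

SHIFT = 1200   # headroom so every IEEE double scales to an integer
               # (doubles span roughly 2**-1074 .. 2**1024)

def to_fixed(x):
    """Convert a double exactly to a wide fixed-point integer."""
    m, e = frexp(x)                  # x == m * 2**e, 0.5 <= |m| < 1
    return int(m * 2**53) << (e - 53 + SHIFT)

xs = [1e16, 1.0, -1e16]
print(sum(xs))                       # 0.0: the 1.0 is lost to rounding
acc = sum(to_fixed(x) for x in xs)   # exact wide-integer accumulation
print(acc / 2**SHIFT)                # 1.0: rounded only at the end
```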
--
--------------------------------------------------------------------------
david shepherd
SGS-THOMSON Microelectronics Ltd, 1000 aztec west, bristol bs12 4sq, u.k.
tel/fax: +44 1454 611638/617910 email: d...@bristol.st.com
"whatever you don't want, you don't want negative advertising"
>In <310ead63...@newsstand.cit.cornell.edu> mp...@cornell.edu
>(Michael Bringle) writes:
>>I was wondering, barring compatibility, what are the manufacturing or
>>performance restraints that would keep us from jumping, say, to a 1024
>>bit processor? Or more generally, why don't we just start using very
>>wide data pathways to speed up our computers?
>The first problem is that you would need a very wide data path... i.e.
>1024 wires. To get all the performance out of it, you would need 1024
>pins just for data. Current chips have 200 - 300 pins. This is just a
>mechanical/manufacturing problem in one sense. As soon as you multiplex
>the data lines or have a smaller path from L1 cache to memory, etc. you
>are giving up some performance.
Assuming packaging issues can be negotiated, a 1024-bit CPU would be a
very attractive component of a parallel machine, say, a hypercube.
Forget slow memory, put in as much static RAM as possible, and implement
a few parallel operations in hardware, for example, Guy Blelloch's
scans. Even better if the processor chip could include a
reconfiguration coprocessor a la Maresca and Li. Such a machine would run
circles around today's single-CPU systems.
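For reference, the scan primitive mentioned above - Blelloch's work-efficient exclusive prefix sum - looks like this written out sequentially; on a parallel machine, every operation within a level of the two sweeps runs simultaneously:

```python
def exclusive_scan(xs):
    """Blelloch exclusive prefix sum (length must be a power of two)."""
    a = list(xs)
    n = len(a)
    # up-sweep (reduce): build partial sums in a binary tree
    d = 1
    while d < n:
        for i in range(0, n, 2 * d):
            a[i + 2 * d - 1] += a[i + d - 1]
        d *= 2
    # down-sweep: push prefixes back down the tree
    a[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(0, n, 2 * d):
            t = a[i + d - 1]
            a[i + d - 1] = a[i + 2 * d - 1]
            a[i + 2 * d - 1] += t
        d //= 2
    return a

print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # [0, 3, 4, 11, 11, 15, 16, 22]
```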
>Another problem is that not all applications scale, usefully with
>larger ALU sizes, thus you would be wasting lots of the transistors a large
>percentage of the time.
Parallel programming has evolved a lot in recent years. We now have
lots of very fast algorithms to do all sorts of things; you can start
with JaJa or Leighton, and go from there to more modern papers.
>I'm sure there are other problems that make such a CPU not economical
>for general purpose computing. I'm sure that someone has their special
>application that this would be perfect for, though.
The limitation as of today isn't scientific, but technological. My
company's video chip, for example, has a 128-bit bus to its VRAMs.
Somebody else is talking about a 192-bit bus. As technology advances,
packaging will necessarily follow the greater miniaturization trends.
A few years ago an 80-pin chip was a tour-de-force; today a 304-pin
chip is commonplace, and 500-or-so isn't far away. As technology
evolves, computing is bound to move towards greater parallelism at
chip level, and slowly evolve out of one-cpu systems.
_alberto_
I don't know of any applications that use the feature; you'd have to find out
from somebody else. (My interest lies in machine architecture.)
Michel.
|> Several other respondents to this thread have pointed out that 1024 bits
|> of addressing are excessive, and that 1024-bit data paths are presently
|> very difficult to do on-chip and impossible to do off-chip. However, I
Internal DRAM buses (bitlines->sense amps) are on this order, or even
larger. Lots of potential bandwidth lurking about inside.
|> reflect that a classic Cray vector register *could* be thought of as a
|> 1024-bit register. Already, with 32 or 64-bit registers, microprocessor
|> designers are looking at ways, sometimes called "visual" or "graphics"
|> instructions, to perform vector-like operations on agregates of 8 or 16-bit
|> data. If we can map such hardware into useful languages - and it should be
|> possible, given that vectorisation algorithms already exist - we might see
|> a trend toward larger registers on the 1024-bit order, feeding configurable
|> ALUs that can perform some number of 8/16/32/64/128(?)-bit operations in
|> parallel. As the efficiency of such a scheme would be a function of the
The media processor from Microunity already does this up to 128b,
including 1/2/4b ops too. See URL: http://www.microunity.com/
|> regularity of the structure of the computation to be performed, and as
|> numerical codes show the most regularity, it would be more valuable to
|> support ganged floating-point operations than ganged integer operations.
Although there's a lot other than numerical codes that shows regularity.
Graphics, perhaps? :-)
|> I suppose that the principal difference between what I describe and a Cray
|> is that, in the Cray designs, the vector registers were really piped to
|> functional units one element at a time, whereas in a neo-vector machine,
|> one would be processing multiple elements in parallel on the same clock.
No difference really, except that the instruction issue rate you need
to saturate your machine when operating on long vectors, in
instructions per clock cycle, is:
N_ipc = F * W / L
where F is the number of parallel functional units (say mul, add,
memory pipe), W is the width of each functional unit (say 2-8
parallel pipelines per functional unit), and L is the maximum vector
length.
For example, a Cray C90 has F = ~11 (not sure on exact number here,
but I'd imagine it's rare to use more than 5-6 vector functional units
simultaneously (2 load, 1 store, 1 mul, 1 add) on a C90), W = 2, L =
128 so
N_ipc = 11 * 2 / 128 = 0.17
T0 (http://www.icsi.berkeley.edu/real/spert/t0-intro.html) has F = 3,
W = 8, L = 32, giving a minimum instruction issue rate of
N_ipc = 3 * 8 / 32 = 0.75.
So, roughly 3/4 the rate of a single-issue machine to keep all three pipes
busy. T0 has a 256b internal datapath split into 8x32b slices (and
about 5-6x that number of bus wires running through the datapath).
As W/L increases, so does the instruction issue bandwidth needed to keep
your vector machine fed.
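The formula is easy to play with numerically (the F/W/L values below are just the estimates from this post):

```python
def n_ipc(F, W, L):
    """Minimum instruction issue rate (instructions/clock) to saturate a
    vector machine: F functional units x W lanes each / vector length L."""
    return F * W / L

print(n_ipc(11, 2, 128))   # Cray C90 estimate: ~0.17 instructions/clock
print(n_ipc(3, 8, 32))     # T0: 0.75 instructions/clock
```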
|> The data memory bandwidth requirements would be correspondingly higher.
Well, yes, but only because you're actually doing more work!
--
Krste Asanovic phone: +1 (510) 642-4274 x143
International Computer Science Institute fax: +1 (510) 643-7684
Suite 600, 1947 Center Street email: kr...@icsi.berkeley.edu
Berkeley, CA 94704-1198, USA http://www.icsi.berkeley.edu/~krste
>>|> what would you do with those 1024bits ?
>>|> what's the point ?
>>|>
>>|> if you can find a way to take advantage of a 1024bits datapath in more than a
>>|> few special cases (like cryptography), maybe you could convince people to start
>>|> thinking about it.
---
Brian Valters McGroarty
> In article <4eo5d5$g...@info.epfl.ch>, "Stefan Monnier"
<stefan....@lia.di.epfl.ch> writes:
>
> |> I don't think the problem is merely technological.
> |> It's much more likely to be:
> |>
> |> what would you do with those 1024bits ?
> |> what's the point ?
> |>
> |> if you can find a way to take advantage of a 1024bits datapath in more than a
> |> few special cases (like cryptography), maybe you could convince people to start
> |> thinking about it.
>
> BOTTOM LINE:
> a) 32-bit CPUs are already insufficient for some uses.
> b) 64-bit CPUs are likely to be sufficient for a *long time*;
> Note that there are already <$35 64-bit micros available, so this
> is not exotica.
>
There are a couple of apps that can use a 1024-bit ALU or FPU today.
1) Multimedia - The higher the data bandwidth, the better.
A 1024-bit ALU can help encode/decode an image.
2) 3D graphics
a) A point in 3D space is 3 * 64 bits.
b) A matrix to operate on the points is 4x4x64 = 1024 bits - this
fits well.
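The fit claimed in (b) is easy to picture: the whole 4x4 matrix of 64-bit elements occupies exactly one 1024-bit register, and one "wide" op could apply it to a homogeneous point (a hypothetical pure-Python model, not any real instruction set):

```python
def transform(matrix, point):
    """matrix: 16 floats, row-major 4x4 (16 * 64 = 1024 bits);
    point: homogeneous 4-vector (x, y, z, 1)."""
    return tuple(
        sum(matrix[4 * r + c] * point[c] for c in range(4))
        for r in range(4)
    )

# Translate by (10, 20, 30):
translate = [1.0, 0.0, 0.0, 10.0,
             0.0, 1.0, 0.0, 20.0,
             0.0, 0.0, 1.0, 30.0,
             0.0, 0.0, 0.0, 1.0]
print(transform(translate, (2.0, 3.0, 4.0, 1.0)))   # (12.0, 23.0, 34.0, 1.0)
```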
If I were designing a multimedia co-processor, I would use a 1024-bit ALU,
FPU, and internal datapath.
For the external datapath, I would use 64-bit SDRAM or Burst Mode EDO DRAM
to get the bandwidth for the system.
As for the instruction set, it would be multimedia or 3D instructions or
microcode optimized for 1024-bit datapath operation.
If you're interested in designing something like this as a garage project,
email me. We can get together and write a proposal to Xilinx to get some
funding for the hardware. Here's how I would proceed with the
project:
The software architect (me?):
1) Start gathering JPEG, MPEG, and 3D algorithms/source code.
I have them already.
2) Compile the code.
3) Profile the code and do performance analysis on it.
The hardware architect (you?):
1) Write the proposal to Xilinx to get the funding from their
re-configurable computing program.
2) Build one or more PCI-based Xilinx boards stuffed with
a bunch of high-I/O, high-CLB chips - XC4013 or better.
We would probably put some FP co-processors in the system
because the Xilinx parts are not good for FP operations.
3) (Both of us?) Design the system/environment for rapid
prototyping of the instruction sets and system for the multimedia
co-processor or a multimedia co-processor system.
At this point in the project, we will get together to analyze the
cost/performance of the system and see if it can form a technology base
for a viable business enterprise.
If yes, we can try to get some private or venture funding for
the venture and proceed with building the company.
If not, we can have a beer, pat each other on the back for a job
well done, write the experience up in a report, and attach it to our resumes.
We can also put an outline of the design on a web page or write an article
for EE Times, EDN, or Computer Design and see if any big company wants
to buy the whole design and put it in their systems.
Send me an email if you want to work with me on this.
-Tony Lee
For multimedia things, it is not uncommon for an N*k-bit ALU to also act
like k N-bit ALUs, operating in parallel ... but if something is
truly implemented as k separate N-bit ALUs, and cannot do N*k-bit arithmetic,
hardly anyone would call this a k*N-bit CPU [except for random marketing
silliness].
Like many other chips, MIPS R8000s and R10000s have 2 integer ALUs,
in this case 64 bits wide. Nobody calls them 128-bit processors...
|> There are couple of apps can use 1024 bits ALU or FP today.
|>
|> 1) Multimedia - The higher the data bandwidth, the better.
|> 1024 ALU can help on encode/decode an image.
Having a 1024-bit ALU isn't going to help bandwidth any.
Having 16 64-bit ones in parallel might :-)
|> 2) 3D graphics
|> a) A point in 3D space is 3 * 64 Bits.
|> b) A matrix to operated on the points is 4x4x64 = 1024 bits - This
|> fits well.
In the first place, most 3D graphics codes use 32-bit FP.
In the second place, people use one or more geometry engines (i.e., special
FPUs) in parallel ... but they are *not* 1024-bit FPUs.